
Managed LLM Relay for Your AI Workers

Route inference requests to your GPU workers through a secure, authenticated relay. Always-on queueing, streaming, and cancellation.

OpenAI Chat Completions · Anthropic Messages · OpenAI Responses
Open source · No vendor lock-in · Cancel anytime · Desktop app available

Why ModelRelay?

Secure WebSocket Relay

Workers connect via authenticated WebSocket from anywhere — behind NAT, in the cloud, or on bare metal. No inbound ports required.

Multi-API Compatible

Supports OpenAI Chat Completions, Anthropic Messages, and the new OpenAI Responses API. Use any SDK that speaks one of these formats; just point it at ModelRelay.

Managed Infrastructure

Hosted on reliable Kubernetes infrastructure. No ops burden — we handle uptime, TLS, scaling, and monitoring so you can focus on models.

How It Works

1. Sign Up & Get Your API Key

Create an account, subscribe, and receive your unique API key. It takes less than a minute.

2. Run a Model

Use the desktop app for a one-click tray worker, the llamafile CLI for a zero-setup start (no GPU required), or a GPU worker for maximum performance. All three connect over WebSocket, so no inbound ports are needed.

3. Point Your SDK at ModelRelay

Change one line — set your base_url to ModelRelay. Works with OpenAI, Anthropic, and any compatible SDK.

Zero-setup — no GPU required
# Install the CLI
curl -fsSL https://raw.githubusercontent.com/ericflo/modelrelay/main/extras/modelrelay-llamafile \
  -o modelrelay-llamafile && chmod +x modelrelay-llamafile

# Download and serve a model through ModelRelay
./modelrelay-llamafile config set proxy_url https://api.modelrelay.io
./modelrelay-llamafile config set worker_secret your-worker-secret
./modelrelay-llamafile serve qwen3.5-2b
Python — OpenAI SDK
# Just change the base_url to your ModelRelay endpoint
from openai import OpenAI

client = OpenAI(
    base_url="https://api.modelrelay.io/v1",
    api_key="your-modelrelay-api-key",
)

response = client.chat.completions.create(
    model="your-model",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
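
Streaming and cancellation are listed as features, but no client-side example is shown. The following is a minimal sketch, assuming ModelRelay passes the OpenAI SDK's standard `stream=True` chunks through unchanged. The helper names `collect_text` and `stream_chat` and the `max_chars` cutoff are hypothetical, and it is an assumption that closing the stream early is how a client-side cancel reaches the relay.

```python
from typing import Iterable, Optional


def collect_text(chunks: Iterable, max_chars: Optional[int] = None) -> str:
    """Concatenate delta text from Chat Completions stream chunks.

    Stops reading once at least max_chars characters have been collected.
    """
    parts = []
    total = 0
    for chunk in chunks:
        if not chunk.choices:  # some chunks (e.g. usage-only) carry no choices
            continue
        text = chunk.choices[0].delta.content
        if not text:  # role-only and final chunks have no content
            continue
        parts.append(text)
        total += len(text)
        if max_chars is not None and total >= max_chars:
            break
    return "".join(parts)


def stream_chat(prompt: str, max_chars: Optional[int] = None) -> str:
    # Imported here so collect_text can be used without the SDK installed.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.modelrelay.io/v1",
        api_key="your-modelrelay-api-key",
    )
    stream = client.chat.completions.create(
        model="your-model",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    try:
        return collect_text(stream, max_chars)
    finally:
        # Closing the stream drops the HTTP connection. Whether the relay
        # then cancels the worker's generation is an assumption here.
        stream.close()
```

Calling `stream_chat("Hello!", max_chars=200)` would return the first chunks of the reply and then close the connection instead of waiting for the full completion.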
Python — Anthropic SDK
# Point the Anthropic SDK at your ModelRelay endpoint
from anthropic import Anthropic

client = Anthropic(
    base_url="https://api.modelrelay.io",
    api_key="your-modelrelay-api-key",
)

message = client.messages.create(
    model="your-model",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello!"}],
)
print(message.content[0].text)
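
For clients that do not use an SDK, the three API formats can also be sketched as plain HTTP requests. This is an illustration under assumptions: the paths are the ones the OpenAI and Anthropic SDKs append to the base URLs shown above, the headers are the ones those SDKs normally send, and `build_request` is a hypothetical helper name. The page does not document ModelRelay's raw HTTP interface, so treat the details as unconfirmed.

```python
import json
import urllib.request

RELAY = "https://api.modelrelay.io"  # host taken from the SDK examples
API_KEY = "your-modelrelay-api-key"  # placeholder, as in the SDK examples


def build_request(api_format: str, prompt: str,
                  model: str = "your-model") -> urllib.request.Request:
    """Build (but do not send) a POST request in one of the three formats."""
    if api_format == "chat_completions":
        path = "/v1/chat/completions"
        headers = {"Authorization": f"Bearer {API_KEY}"}
        body = {"model": model,
                "messages": [{"role": "user", "content": prompt}]}
    elif api_format == "responses":
        path = "/v1/responses"
        headers = {"Authorization": f"Bearer {API_KEY}"}
        body = {"model": model, "input": prompt}
    elif api_format == "messages":
        path = "/v1/messages"
        # Header names and version string as sent by the Anthropic SDK.
        headers = {"x-api-key": API_KEY, "anthropic-version": "2023-06-01"}
        body = {"model": model, "max_tokens": 1024,
                "messages": [{"role": "user", "content": prompt}]}
    else:
        raise ValueError(f"unknown api_format: {api_format!r}")
    headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        RELAY + path,
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would send it; that step needs a valid key and a connected worker, so it is left out of the sketch.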

Download the Desktop App

Run a ModelRelay worker from your system tray. No terminal needed — just install, configure, and go.

View all downloads on GitHub →

Simple, Transparent Pricing

One plan. Everything included. No per-request fees.

$20/mo
Flat rate, no usage surprises
Unlimited workers · Unlimited requests · WebSocket streaming · All three API formats · TLS & auth included
Compared with self-hosting: skip the Nginx configs, TLS certs, WebSocket reverse proxies, and monitoring dashboards