Managed LLM Relay for Your AI Workers
Route inference requests to your GPU workers through a secure, authenticated relay. Always-on queueing, streaming, and cancellation.
Why ModelRelay?
Secure WebSocket Relay
Workers connect via authenticated WebSocket from anywhere — behind NAT, in the cloud, or on bare metal. No inbound ports required.
Multi-API Compatible
Supports OpenAI Chat Completions, Anthropic Messages, and the new OpenAI Responses API. Use any SDK — just point it at ModelRelay.
Managed Infrastructure
Hosted on reliable Kubernetes infrastructure. No ops burden — we handle uptime, TLS, scaling, and monitoring so you can focus on models.
How It Works
Sign Up & Get Your API Key
Create an account, subscribe, and receive your unique API key. It takes less than a minute.
Run a Model
Use the desktop app for a one-click tray worker, the llamafile CLI for a zero-setup start (no GPU required), or connect a GPU worker for maximum performance. All three connect over WebSocket, so no inbound ports are needed.
Point Your SDK at ModelRelay
Change one line: set your SDK's base_url to your ModelRelay endpoint. Works with the OpenAI and Anthropic SDKs and any compatible client.
# Install the CLI
curl -fsSL https://raw.githubusercontent.com/ericflo/modelrelay/main/extras/modelrelay-llamafile \
  -o modelrelay-llamafile && chmod +x modelrelay-llamafile

# Download and serve a model through ModelRelay
./modelrelay-llamafile config set proxy_url https://api.modelrelay.io
./modelrelay-llamafile config set worker_secret your-worker-secret
./modelrelay-llamafile serve qwen3.5-2b
# Just change the base_url to your ModelRelay endpoint
from openai import OpenAI

client = OpenAI(
    base_url="https://api.modelrelay.io/v1",
    api_key="your-modelrelay-api-key",
)

response = client.chat.completions.create(
    model="your-model",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
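Streaming is listed as a relay feature, so here is a rough sketch of consuming a streamed Chat Completions reply without an SDK. It assumes ModelRelay emits the standard OpenAI server-sent event format (`data: {...}` lines ending with `data: [DONE]`) at `/v1/chat/completions`; the helper names `iter_content_deltas` and `stream_chat` are hypothetical and not part of any ModelRelay client.

```python
import json
import urllib.request
from typing import Iterable, Iterator

# Base URL taken from the OpenAI SDK example above.
BASE_URL = "https://api.modelrelay.io/v1"


def iter_content_deltas(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from an OpenAI-style server-sent event stream."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel used by the OpenAI format
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


def stream_chat(prompt: str, api_key: str, model: str = "your-model") -> Iterator[str]:
    """Send a streaming Chat Completions request and yield text as it arrives.

    Assumption: the relay accepts the standard `stream: true` request field.
    """
    body = json.dumps({
        "model": model,
        "stream": True,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        # The HTTP response object iterates line by line as bytes.
        yield from iter_content_deltas(response)


if __name__ == "__main__":
    for fragment in stream_chat("Hello!", "your-modelrelay-api-key"):
        print(fragment, end="", flush=True)
    print()
```

With the OpenAI SDK, the equivalent is passing `stream=True` to `client.chat.completions.create` and iterating over the returned chunks.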
# Point the Anthropic SDK at your ModelRelay endpoint
from anthropic import Anthropic

client = Anthropic(
    base_url="https://api.modelrelay.io",
    api_key="your-modelrelay-api-key",
)

message = client.messages.create(
    model="your-model",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello!"}],
)
print(message.content[0].text)
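The OpenAI Responses API is listed as supported but has no sample here, so the following is a minimal sketch of what such a call might look like using only the standard library. It assumes ModelRelay serves the standard OpenAI route `/v1/responses` and returns the usual `output` list of message items; the helper names `build_responses_request`, `extract_output_text`, and `ask` are hypothetical.

```python
import json
import urllib.request

# Same base URL as the OpenAI SDK example.
BASE_URL = "https://api.modelrelay.io/v1"


def build_responses_request(prompt, api_key, model="your-model"):
    """Build (but do not send) a Responses API request."""
    body = json.dumps({"model": model, "input": prompt}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/responses",  # assumed path: the standard OpenAI Responses route
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_output_text(response_json):
    """Join every output_text part found in the response's message items."""
    parts = []
    for item in response_json.get("output", []):
        if item.get("type") != "message":
            continue  # skip non-message items such as reasoning or tool calls
        for part in item.get("content", []):
            if part.get("type") == "output_text":
                parts.append(part.get("text", ""))
    return "".join(parts)


def ask(prompt, api_key):
    """Send the request and return the model's text reply."""
    request = build_responses_request(prompt, api_key)
    with urllib.request.urlopen(request) as response:
        return extract_output_text(json.load(response))


if __name__ == "__main__":
    print(ask("Hello!", "your-modelrelay-api-key"))
```

With the OpenAI SDK configured as in the Chat Completions example, the same request would be `client.responses.create(model="your-model", input="Hello!")`, with the text available as `response.output_text`.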
Download the Desktop App
Run a ModelRelay worker from your system tray. No terminal needed — just install, configure, and go.
Simple, Transparent Pricing
One plan. Everything included. No per-request fees.