Multi-Provider LLM: How to Stop Depending on a Single API
What is multi-provider LLM architecture?
Multi-provider LLM architecture is an infrastructure pattern where an application routes AI requests across multiple language model providers — such as OpenAI, Anthropic, and DeepSeek — through a single proxy layer, enabling automatic failover, cost-based routing, and model swaps without code changes. It eliminates single-provider dependency that causes outages and locks in pricing.
TL;DR
- OpenAI had 22 incidents in Dec 2025, Anthropic had 20 — 99.7% uptime still means ~26 hours of downtime per year
- DeepSeek-V3 costs $0.14/1M input tokens vs GPT-4o at $2.50 — an 18x spread means routing by task type cuts costs dramatically
- LiteLLM proxy: 36,700 stars, 100+ providers, adds ~8ms P95 latency, single OpenAI-compatible endpoint for all providers
- Switching models is a one-line config change — no SDK migration, no refactoring application code
- Circuit breaker pattern prevents cascade failures: after N errors, the provider is marked unhealthy and traffic routes elsewhere
In December 2025, OpenAI had 22 incidents. Anthropic had 20. Average resolution time ran 8–9 hours per incident. Meanwhile, five other providers — Cohere, Google Gemini, Groq, Replicate, and xAI — reported zero incidents that same month.
99.7% uptime sounds solid. Still ~26 hours of downtime per year. For an application processing thousands of LLM requests daily, one bad month from a single provider means lost users.
This article covers how to build LLM infrastructure that survives outages, takes advantage of price differences, and lets you swap models without touching code.
Why Multiple Providers
Four reasons. Any one of them is enough.
Availability. Providers go down. Not occasionally — regularly. December 2025: the two market leaders accumulated 42 incidents in a single month. When your only provider is down, your app is down. With a fallback, requests route to another provider. The user doesn’t notice.
Rate limits. Providers change limits unilaterally. In the summer of 2025, Anthropic introduced weekly limits for heavy Claude Code users. OpenAI gates access through spending-based “tiers.” With a single provider, a sudden limit reduction cascades into outages with no backup plan.
Cost. The price spread across providers isn’t percentage-level — it’s orders of magnitude. DeepSeek-V3 costs $0.14 per million input tokens. GPT-4o costs $2.50. That’s 18x on input and 36x on output. Not every task needs the most expensive model. Classification, text extraction, embedding generation — all of it can route to cheap models without losing quality.
Deprecation. Models get retired. OpenAI removed the chatgpt-4o-latest snapshot from the API on February 17, 2026 and pulled GPT-4o from ChatGPT on February 13 — three months’ notice. GPT-4.5, launched in February 2025 at $75/$150 per million tokens, is gone too. Flagship model lifecycle: 12–24 months. An application locked to a specific model faces a forced migration every 1–2 years.
LiteLLM: Single Entry Point
LiteLLM is an open-source proxy that funnels calls to different LLM providers into a single OpenAI-compatible API. 36,700 GitHub stars, 100+ providers supported. The proxy itself adds about 8ms at P95 (per LiteLLM’s benchmarks).
Instead of calling provider APIs directly, every request goes through LiteLLM. It accepts standard /v1/chat/completions, routes to the right provider, and hands back the response in a unified format.
```
Application
    │
    │ POST /v1/chat/completions
    │ model: "deepseek/deepseek-chat"
    ▼
┌──────────┐
│ LiteLLM  │ → routing  → DeepSeek API
│  Proxy   │ → fallback → Google Gemini API
│          │ → fallback → Anthropic API
└──────────┘
    │
    │ OpenAI-compatible response
    ▼
Application
```
In practice:
- Switching models is one line. Change deepseek/deepseek-chat to google/gemini-2.0-flash — that's a model parameter change. No refactoring, no SDK migration.
- Unified format. No matter which provider handles the request, the response comes back as an OpenAI Chat Completion. Your client code can't tell which provider processed it.
- Centralized auth. Provider API keys live in the LiteLLM config, not in every edge function. One LiteLLM key for the client, a dozen provider keys behind the scenes.
- Proxy-level rate limiting. RPM/TPM limits, per-user quotas, budget caps — all in one place.
Configuration
LiteLLM uses a YAML config file. Minimum setup for two providers:
```yaml
model_list:
  - model_name: fast-chat
    litellm_params:
      model: google/gemini-2.0-flash
      api_key: os.environ/GOOGLE_API_KEY
  - model_name: fast-chat   # same model_name = fallback
    litellm_params:
      model: deepseek/deepseek-chat
      api_key: os.environ/DEEPSEEK_API_KEY
  - model_name: deep-analysis
    litellm_params:
      model: anthropic/claude-sonnet-4
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  routing_strategy: usage-based-routing
  enable_pre_call_checks: true   # check limits before calling
```
Two deployments with the same model_name — LiteLLM automatically routes between them and uses the second as a fallback when the first fails.
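For illustration, here is what a client request against that config can look like. This is a sketch, not LiteLLM's SDK: buildChatRequest is a hypothetical helper, and the payload shape is the standard OpenAI chat-completions format the proxy accepts.

```typescript
// The client references the logical model_name ("fast-chat");
// LiteLLM decides whether Gemini or DeepSeek actually serves it.
interface ChatMessage {
  role: string;
  content: string;
}

function buildChatRequest(model: string, userMessage: string): { model: string; messages: ChatMessage[] } {
  return {
    model, // "fast-chat" or "deep-analysis" from model_list
    messages: [{ role: "user", content: userMessage }],
  };
}

// POST this object to `${LITELLM_URL}/v1/chat/completions` with the
// LiteLLM key in the Authorization header, as shown later in this article.
```

The application never learns which provider answered; swapping the fleet behind "fast-chat" is a config-only change.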
Routing Strategies
LiteLLM supports four strategies:
| Strategy | How it works | When to use |
|---|---|---|
| simple-shuffle | Random selection | Default, when you don't care |
| least-busy | Routes to least loaded | Load balancing |
| usage-based-routing | Filters by TPM/RPM limits | Stay within provider quotas |
| latency-based-routing | Routes to fastest | Minimize response time |
usage-based-routing pulls the most weight in production. LiteLLM tracks current TPM/RPM consumption via Redis and excludes deployments approaching their limits. Each request lands on the deployment with the lowest current usage.
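When several proxy instances share traffic, those counters have to live in Redis rather than in one process. A minimal sketch of the wiring, following LiteLLM's router settings (field names and the os.environ/ syntax per its docs; verify against your version):

```yaml
router_settings:
  routing_strategy: usage-based-routing
  # Shared TPM/RPM counters across all proxy replicas
  redis_host: os.environ/REDIS_HOST
  redis_port: 6379
  redis_password: os.environ/REDIS_PASSWORD
```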
Fallback Chains: Primary → Secondary → Emergency
A fallback chain is a sequence of providers that triggers automatically on failure. First provider goes down — the request routes to the second. Second is overloaded — to the third.
Errors that trigger fallback:
- 429 — rate limit exceeded (provider overloaded)
- 500, 502, 503, 504 — server errors (provider is down)
Errors that don’t:
- 400 — invalid request (our code’s problem, not the provider’s)
- 401, 403 — key issue (fallback won’t help)
In LiteLLM, this works automatically: multiple deployments with the same model_name give you built-in fallback. For different model_name values, you configure a fallback list:
```yaml
router_settings:
  fallbacks: [
    {"fast-chat": ["backup-chat"]},
    {"deep-analysis": ["backup-analysis"]}
  ]
```
Practical Example
Three fallback levels for a chatbot:
- Primary: google/gemini-2.0-flash — fast, cheap, good quality
- Secondary: deepseek/deepseek-chat — cheaper, slightly slower
- Emergency: anthropic/claude-3-haiku — more expensive, but stable
Gemini returns 503 — the request goes to DeepSeek. DeepSeek returns 429 (rate limit) — the request goes to Claude Haiku. The user gets a response, maybe a bit slower.
But different models produce different outputs. In a chatbot, that’s fine — users don’t compare responses across models. In a pipeline with a strict JSON schema, fallback between models needs output validation on top.
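A sketch of what that validation can look like. parseTripAnalysis and its rating/summary schema are hypothetical; the point is that a value like "4.5" (string) vs 4.5 (number) gets normalized before downstream code ever sees it.

```typescript
// Hypothetical pipeline output schema: different fallback models may
// return "rating" as a number (4.5) or a string ("4.5").
interface TripAnalysis {
  rating: number;
  summary: string;
}

function parseTripAnalysis(raw: string): TripAnalysis {
  const data = JSON.parse(raw);

  // Coerce string ratings, then range-check so garbage fails loudly.
  const rating = typeof data.rating === "string" ? parseFloat(data.rating) : data.rating;
  if (typeof rating !== "number" || Number.isNaN(rating) || rating < 0 || rating > 5) {
    throw new Error(`invalid rating: ${JSON.stringify(data.rating)}`);
  }
  if (typeof data.summary !== "string") {
    throw new Error("missing summary");
  }
  return { rating, summary: data.summary };
}
```

Validation failure is itself a useful signal: log it per provider, and a model that keeps breaking the schema drops out of the fallback chain.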
Task-Based Routing: Different Tasks → Different Models
Not all tasks are equal. Generating a travel itinerary demands reasoning and large context. Generating a chat title — 10 tokens in, 5 out. POI data enrichment — structured text parsing.
Routing everything to one model means overpaying or losing quality.
Classify the task, pick the right model.
| Task | Model | Why |
|---|---|---|
| AI chat (fast replies) | Gemini 2.0 Flash | Fast, cheap, good at conversation |
| Trip analysis, data extraction | DeepSeek Chat | Cheap, strong at structured output |
| Itinerary generation (pipeline) | DeepSeek Chat + validation | Complex task, but DeepSeek handles it with the right prompts |
| Title generation | Gemini 2.0 Flash | Trivial task, not worth an expensive model |
| Orchestration (multi-step agents) | Claude Haiku | Follows instructions well, predictable |
You specify the model per request through the model parameter. Since all calls go through LiteLLM, switching models means swapping a string.
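As a sketch, that task-to-model mapping can live in a plain lookup table. The task names here are hypothetical; the model identifiers are the ones used throughout this article.

```typescript
// Task-type → model mapping, mirroring the table above.
type TaskType = "chat" | "extraction" | "itinerary" | "title" | "orchestration";

const MODEL_BY_TASK: Record<TaskType, string> = {
  chat: "google/gemini-2.0-flash",
  extraction: "deepseek/deepseek-chat",
  itinerary: "deepseek/deepseek-chat",
  title: "google/gemini-2.0-flash",
  orchestration: "anthropic/claude-3-haiku",
};

function pickModel(task: TaskType): string {
  return MODEL_BY_TASK[task];
}
```

Because every call already goes through LiteLLM, changing a row in this table is the entire "migration" for that task type.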
Managing Models Through Langfuse
You can pull the model out of code and into prompt configuration. In Langfuse, each prompt stores config.model:
```json
{
  "name": "ai-chat-travel-assistant",
  "config": {
    "model": "google/gemini-2.0-flash",
    "temperature": 0.7,
    "max_tokens": 4096
  }
}
```
Your edge function fetches the prompt from Langfuse, grabs the model from config, and passes it to LiteLLM:
```typescript
const promptTemplate = await getLangfusePrompt('ai-chat-travel-assistant', langfuseConfig);
const model = promptTemplate.config?.model || 'google/gemini-2.0-flash';

const response = await fetch(`${LITELLM_URL}/v1/chat/completions`, {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': `Bearer ${LITELLM_KEY}`,
  },
  body: JSON.stringify({
    model,
    messages: compiledMessages,
    temperature: promptTemplate.config?.temperature ?? 0.7,
  }),
});
```
Switching a model for any prompt takes a click in the Langfuse UI — no code deploy. Edit the prompt, mark it production, done.
More on Langfuse in the separate LLM observability article.
Cost: The Order of Magnitude Matters
Price gaps across providers aren’t linear. They’re orders of magnitude.
| Model | Input ($/1M) | Output ($/1M) | vs GPT-4o |
|---|---|---|---|
| DeepSeek-V3 | $0.14 | $0.28 | 18x cheaper |
| Mistral Medium 3 | $0.40 | $2.00 | 5–6x cheaper |
| Gemini 2.5 Pro | $1.25 | $10.00 | 2x cheaper input, same output |
| GPT-4o | $2.50 | $10.00 | baseline |
| Claude Sonnet 4 | $3.00 | $15.00 | 1.2–1.5x more |
| Claude Opus 4 | $15.00 | $75.00 | 6–7x more |
LMSYS researchers (RouteLLM) showed that smart routing slashes costs by 85%+ on the MT Bench benchmark with no noticeable quality loss. Their approach: 90% of “easy” requests hit a cheap model, 10% of “hard” ones hit an expensive model.
In practice, that means task-based model selection. Chat, title generation, data extraction — cheap models. Complex analysis, reasoning, multi-step agents — expensive ones.
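To make the arithmetic concrete, a small sketch using the prices from the table above. requestCost is a hypothetical helper, and prices change, so treat the numbers as a snapshot rather than a reference.

```typescript
// $ per 1M tokens, from the pricing table above (snapshot, not canonical).
interface Price { input: number; output: number }

const PRICES: Record<string, Price> = {
  "deepseek/deepseek-chat": { input: 0.14, output: 0.28 }, // DeepSeek-V3
  "openai/gpt-4o": { input: 2.5, output: 10.0 },
};

// Cost of one request in USD given its token counts.
function requestCost(model: string, inputTokens: number, outputTokens: number): number {
  const p = PRICES[model];
  return (inputTokens * p.input + outputTokens * p.output) / 1_000_000;
}
```

For a typical 1,000-in / 500-out request, GPT-4o costs $0.0075 and DeepSeek-V3 costs $0.00028: at a million requests a month, that gap is thousands of dollars.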
Monitoring: How to Detect Provider Degradation
A provider can degrade without going fully down. Latency creeps from 200ms to 5 seconds. Error rate drifts from 0.1% to 3%. The model hallucinates more often.
What to monitor:
| Metric | Alert threshold | Action |
|---|---|---|
| P95 latency | > 2x baseline | Enable fallback |
| Error rate | > 2% | Enable fallback |
| Timeout rate | > 1% | Lower timeout, enable fallback |
| Token cost | Over budget | Switch to cheaper model |
LiteLLM logs every call: provider, model, latency, status, token count. Pick any visualization stack — Grafana, Datadog, custom dashboards. Langfuse adds prompt-level tracing: which prompt, which version, what result.
The single most telling metric: fallback-to-total request ratio. Over 10% hitting fallback? Your primary provider is degrading. Over 30%? Time to pick a new primary.
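That ratio check is trivial to codify. fallbackHealth is a hypothetical helper using the 10% and 30% thresholds from the text; wire it to whatever counters your logging stack exposes.

```typescript
// Classify provider health from the fallback-to-total request ratio:
// >10% → primary degrading, >30% → time to pick a new primary.
function fallbackHealth(
  fallbackCount: number,
  totalCount: number,
): "ok" | "degrading" | "replace-primary" {
  if (totalCount === 0) return "ok"; // no traffic, nothing to judge
  const ratio = fallbackCount / totalCount;
  if (ratio > 0.3) return "replace-primary";
  if (ratio > 0.1) return "degrading";
  return "ok";
}
```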
Circuit Breaker for LLM Calls
A circuit breaker stops cascading failures. When an external service starts returning errors consistently, the breaker “opens” and blocks outgoing requests. Instead of waiting 60 seconds per call to a dead provider, the system fails fast.
Three states:
```
CLOSED (normal)        OPEN (service down)      HALF-OPEN (testing)
   │                       │                        │
   │ 3 failures            │ 60 seconds pass        │ 1 success
   │─────────────────►     │──────────────────►     │──────────────► CLOSED
   │                       │                        │
   │                       │ requests rejected      │ 1 failure
   │                       │ instantly              │──────────────► OPEN
```
LLM calls need different settings than regular APIs. Models respond slower — timeout is 60 seconds instead of 10. Failure threshold drops to 3 instead of 5, because each LLM request is expensive. Recovery takes longer too — 60 seconds instead of 30.
```typescript
const LLM_CIRCUIT_CONFIG = {
  failureThreshold: 3,            // 3 failures → circuit open
  resetTimeoutMs: 60_000,         // 60 seconds in open state
  successThreshold: 1,            // 1 success in half-open → closed
  ignoredStatusCodes: [400, 404], // client errors don't count
};
```
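For reference, a minimal in-memory sketch of the three-state machine with those settings. Class and method names are hypothetical; in serverless, this in-memory state would have to move to Redis or a database, but the state machine itself stays the same.

```typescript
type CircuitState = "CLOSED" | "OPEN" | "HALF_OPEN";

class CircuitBreaker {
  private state: CircuitState = "CLOSED";
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly resetTimeoutMs: number;
  private readonly now: () => number;

  // Clock is injectable so the reset timeout is testable.
  constructor(failureThreshold = 3, resetTimeoutMs = 60_000, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
  }

  // Call before each LLM request: false = fail fast, skip this provider.
  canRequest(): boolean {
    if (this.state === "OPEN") {
      if (this.now() - this.openedAt >= this.resetTimeoutMs) {
        this.state = "HALF_OPEN"; // let one probe request through
        return true;
      }
      return false;
    }
    return true;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "CLOSED"; // one success in half-open closes the circuit
  }

  recordFailure(): void {
    if (this.state === "HALF_OPEN") {
      this.open(); // probe failed → back to open
      return;
    }
    this.failures += 1;
    if (this.failures >= this.failureThreshold) this.open();
  }

  private open(): void {
    this.state = "OPEN";
    this.openedAt = this.now();
    this.failures = 0;
  }
}
```

Note that 400/404 responses should bypass recordFailure entirely, per ignoredStatusCodes above: a malformed request says nothing about provider health.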
Serverless throws a wrench in this. In Deno Edge Functions or AWS Lambda, each invocation can run in a fresh isolate. A circuit breaker that stores state in memory loses it when a new isolate spins up. Distributed circuit breaking needs external storage — Redis or a database table.
More on the implementation in the Circuit Breaker for Edge Functions article.
Alternatives to LiteLLM
LiteLLM isn’t the only option. Your priorities dictate the pick.
| Tool | Focus | Models | Cost | Good for |
|---|---|---|---|---|
| LiteLLM | SDK + proxy | 100+ | Open source | Developers, self-hosted |
| OpenRouter | Managed API | 500+ | 5.5% fee | Quick start, access to all models |
| Portkey | Enterprise gateway | 1,600+ | From $49/mo | Compliance, governance, teams |
| Helicone | Observability | Any | Free tier / $49 | Monitoring, caching |
OpenRouter — a managed alternative. No proxy to run yourself. 500+ models from 60+ providers, 5.5% fee on credit purchases; model prices pass through without markup. Raised $40M in June 2025; client inference run-rate topped $100M. A strong fit for prototyping and projects where self-hosted infrastructure is overkill.
Portkey — built for teams with compliance requirements. PII redaction, jailbreak detection, audit trails, SSO. If your project demands that level of security governance, start here.
Helicone — open-source, laser-focused on observability. Gateway built in Rust with ~8ms P50 latency. Ships with response caching that trims costs on repeated requests. Works well alongside LiteLLM, not as a replacement.
LiteLLM wins on control: self-hosted, full configuration access, free. For a production application juggling multiple providers, it offers the best ratio of control to operational effort.
Where This Doesn’t Work
Multi-provider isn’t free. Flexibility costs something.
Prompt caching breaks on fallback. Anthropic and OpenAI cache prompts to speed up repeated calls. When a request falls over to a backup provider, the primary’s cache sits idle. Long system prompts take a noticeable hit on both latency and cost. Advanced setups use project-level affinity — requests from the same project stick to the same provider when possible.
Response consistency. Different models produce different text. In a chatbot, that’s fine. In a pipeline with a strict JSON schema, it’s a risk. DeepSeek might return "rating": 4.5, Gemini might return "rating": "4.5". You must validate outputs.
Additional infrastructure. LiteLLM is a server you need to run, monitor, and update. With a single provider, an API key is enough. With five providers through LiteLLM — a Docker container, Redis for rate limiting, monitoring. Operational complexity stacks up.
Debugging gets harder. “Request failed” — which provider? Which fallback level? What error? You have to log every step: provider, model, latency, status, attempt number. Skip that, and you’re debugging blind.
Not all APIs are equal. The OpenAI-compatible format covers /chat/completions. Provider-specific features — vision API, function calling formats, streaming with tool use — can behave differently through a proxy. Before adding a provider to a fallback chain, test the specific scenarios you care about.
Getting Started
If your application runs on a single provider today, switching to multi-provider doesn’t require a rewrite.
Step 1: LiteLLM proxy. Spin up a Docker container. Wire up your current provider. Point all calls at the proxy. Nothing changes yet — same provider, same results. But now every LLM call flows through one chokepoint you control.
Step 2: Add a second provider as fallback. Drop in DeepSeek or Gemini Flash as a second deployment with the same model_name. LiteLLM switches to it automatically on primary failure. Test it — kill the primary provider manually.
Step 3: Task-based routing. Audit your calls. Which tasks burn through tokens? Which are trivial? Move the cheap ones to a cheap model. Title generation, classification, data extraction — DeepSeek. Chat, reasoning — Gemini or Claude.
Step 4: Monitoring. Connect Langfuse or equivalent. Trace every call: provider, model, latency, cost. Set up alerts for degradation.
The whole process — zero to production multi-provider — takes a couple of days. The LiteLLM proxy goes up in 30 minutes. Adding a provider is one line in config. Where you’ll actually spend time: testing fallback scenarios and wiring up monitoring.
FAQ
How does LiteLLM handle streaming responses during a mid-stream provider failure?
LiteLLM cannot transparently retry a streaming response that has already started — once tokens are flowing to the client, a mid-stream failure surfaces as a broken stream, not a seamless fallback. The fallback mechanism only activates before the first token is sent. For resilient streaming, the practical pattern is to set aggressive timeout values (timeout: 10) to catch slow providers early, and implement client-side reconnect logic that replays the request from scratch on stream error. Alternatively, disable streaming for critical requests where consistency matters more than time-to-first-token.
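The replay-from-scratch pattern above can be sketched as a small client-side wrapper. streamWithRetry is a hypothetical helper that treats any mid-stream error as a signal to reissue the whole request; the start callback would open the streaming call and resolve once the stream completes.

```typescript
// Replay the entire streaming request on failure: a broken stream
// cannot be resumed, only restarted from scratch.
async function streamWithRetry<T>(
  start: () => Promise<T>,
  maxAttempts = 2,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await start();
    } catch (err) {
      lastError = err; // stream broke → try again from the beginning
    }
  }
  throw lastError;
}
```

The caller must be prepared to discard partial tokens already rendered from the failed attempt, which is why this belongs on the client rather than in the proxy.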
What is the realistic P99 latency penalty of routing through LiteLLM proxy compared to direct provider calls?
LiteLLM’s own benchmarks show ~8ms P95 overhead for the proxy hop. At P99, real-world observations in production environments typically show 15–30ms overhead — mainly from connection pool management and JSON serialization of the request/response. This is negligible compared to LLM inference latency (typically 500ms–3s P50). Where overhead compounds is cold-start scenarios: when a LiteLLM Docker container restarts, the first few requests take 200–400ms extra while connection pools warm up.
Can you use provider-specific features (like Anthropic’s extended thinking or OpenAI’s o-series reasoning tokens) through LiteLLM?
Provider-specific parameters can be passed through LiteLLM using the extra_body parameter, which is forwarded as-is to the provider’s API. For Anthropic’s extended thinking, set extra_body={"thinking": {"type": "enabled", "budget_tokens": 5000}}. However, these parameters are not part of the unified OpenAI-compatible format, so they only work when the specific provider handles the request — if the request falls over to a fallback provider, the provider-specific parameters are ignored rather than causing an error. Test fallback behavior explicitly when using extended thinking or reasoning-mode models.