Multi-Provider LLM: How to Stop Depending on a Single API

What is multi-provider LLM architecture?

Multi-provider LLM architecture is an infrastructure pattern where an application routes AI requests across multiple language model providers — such as OpenAI, Anthropic, and DeepSeek — through a single proxy layer, enabling automatic failover, cost-based routing, and model swaps without code changes. It eliminates single-provider dependency that causes outages and locks in pricing.

TL;DR

  • OpenAI had 22 incidents in Dec 2025, Anthropic had 20 — 99.7% uptime still means ~26 hours of downtime per year
  • DeepSeek-V3 costs $0.14/1M input tokens vs GPT-4o at $2.50 — an 18x spread means routing by task type cuts costs dramatically
  • LiteLLM proxy: 36,700 stars, 100+ providers, adds ~8ms P95 latency, single OpenAI-compatible endpoint for all providers
  • Switching models is a one-line config change — no SDK migration, no refactoring of application code
  • The circuit breaker pattern prevents cascade failures: after N errors, the provider is marked unhealthy and traffic routes elsewhere

In December 2025, OpenAI had 22 incidents. Anthropic had 20. Average resolution time ran 8–9 hours per incident. Meanwhile, five other providers — Cohere, Google Gemini, Groq, Replicate, and xAI — reported zero incidents that same month.

99.7% uptime sounds solid. But 0.3% of 8,760 hours is still ~26 hours of downtime per year. For an application processing thousands of LLM requests daily, one bad month from a single provider means lost users.

This article covers how to build LLM infrastructure that survives outages, takes advantage of price differences, and lets you swap models without touching code.

Why Multiple Providers

Four reasons. Any one of them is enough.

Availability. Providers go down. Not occasionally — regularly. December 2025: the two market leaders accumulated 42 incidents in a single month. When your only provider is down, your app is down. With a fallback, requests route to another provider. The user doesn’t notice.

Rate limits. Providers change limits unilaterally. In the summer of 2025, Anthropic introduced weekly limits for heavy Claude Code users. OpenAI gates access through spending-based “tiers.” With a single provider, a sudden limit reduction cascades into outages with no backup plan.

Cost. The price spread across providers isn’t percentage-level — it’s orders of magnitude. DeepSeek-V3 costs $0.14 per million input tokens. GPT-4o costs $2.50. That’s 18x on input and 36x on output. Not every task needs the most expensive model. Classification, text extraction, embedding generation — all of it can route to cheap models without losing quality.

Deprecation. Models get retired. OpenAI removed the chatgpt-4o-latest snapshot from the API on February 17, 2026 and pulled GPT-4o from ChatGPT on February 13 — three months’ notice. GPT-4.5, launched in February 2025 at $75/$150 per million tokens, is gone too. Flagship model lifecycle: 12–24 months. An application locked to a specific model faces a forced migration every 1–2 years.

LiteLLM: Single Entry Point

LiteLLM is an open-source proxy that funnels calls to different LLM providers into a single OpenAI-compatible API. 36,700 GitHub stars, 100+ providers supported. The proxy itself adds about 8ms at P95 (per LiteLLM’s benchmarks).

Instead of calling provider APIs directly, every request goes through LiteLLM. It accepts standard /v1/chat/completions, routes to the right provider, and hands back the response in a unified format.

Application

    │  POST /v1/chat/completions
    │  model: "deepseek/deepseek-chat"

┌──────────┐
│  LiteLLM │ → routing → DeepSeek API
│  Proxy   │ → fallback → Google Gemini API
│          │ → fallback → Anthropic API
└──────────┘

    │  OpenAI-compatible response

Application

In practice:

  • Switching models is one line. Change deepseek/deepseek-chat to google/gemini-2.0-flash — that’s a model parameter change. No refactoring, no SDK migration.
  • Unified format. No matter which provider handles the request, the response comes back as an OpenAI Chat Completion. Your client code can’t tell which provider processed it.
  • Centralized auth. Provider API keys live in the LiteLLM config, not in every edge function. One LiteLLM key for the client, a dozen provider keys behind the scenes.
  • Proxy-level rate limiting. RPM/TPM limits, per-user quotas, budget caps — all in one place.
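The unified call above can be sketched in TypeScript. `LITELLM_URL` and `LITELLM_KEY` are placeholders for your own deployment, and `buildChatRequest` is an illustrative helper, not a LiteLLM API:

```typescript
const LITELLM_URL = "http://localhost:4000"; // placeholder: your proxy address
const LITELLM_KEY = "sk-litellm-test";       // placeholder: your virtual key

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Building the request is identical for every provider; the only thing
// that changes between DeepSeek, Gemini, or Claude is the model string.
function buildChatRequest(model: string, messages: ChatMessage[]) {
  return {
    url: `${LITELLM_URL}/v1/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${LITELLM_KEY}`,
      },
      body: JSON.stringify({ model, messages }),
    },
  };
}

// Swapping providers is a one-string change:
const req = buildChatRequest("deepseek/deepseek-chat", [
  { role: "user", content: "Summarize this trip" },
]);
// const res = await fetch(req.url, req.init);
```

Switching the whole application to Gemini means changing `"deepseek/deepseek-chat"` to `"google/gemini-2.0-flash"`; nothing else in the call site moves.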

Configuration

LiteLLM uses a YAML config file. Minimum setup for two providers:

model_list:
  - model_name: fast-chat
    litellm_params:
      model: google/gemini-2.0-flash
      api_key: os.environ/GOOGLE_API_KEY
  - model_name: fast-chat           # same model_name = fallback
    litellm_params:
      model: deepseek/deepseek-chat
      api_key: os.environ/DEEPSEEK_API_KEY
  - model_name: deep-analysis
    litellm_params:
      model: anthropic/claude-sonnet-4
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  routing_strategy: usage-based-routing
  enable_pre_call_checks: true       # check limits before calling

Two deployments with the same model_name — LiteLLM automatically routes between them and uses the second as a fallback when the first fails.

Routing Strategies

LiteLLM supports four strategies:

| Strategy | How it works | When to use |
|---|---|---|
| simple-shuffle | Random selection | Default, when you don’t care |
| least-busy | Routes to least loaded | Load balancing |
| usage-based-routing | Filters by TPM/RPM limits | Stay within provider quotas |
| latency-based-routing | Routes to fastest | Minimize response time |

usage-based-routing pulls the most weight in production. LiteLLM tracks current TPM/RPM consumption via Redis and excludes deployments approaching their limits. Each request lands on the deployment with the lowest current usage.
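As a sketch, the Redis-backed usage tracking is enabled in the same YAML config. Parameter names follow LiteLLM's documented Redis settings; verify them against your LiteLLM version:

```yaml
router_settings:
  routing_strategy: usage-based-routing
  enable_pre_call_checks: true
  redis_host: os.environ/REDIS_HOST          # shared TPM/RPM counters
  redis_port: os.environ/REDIS_PORT
  redis_password: os.environ/REDIS_PASSWORD
```

Without Redis, each proxy instance only sees its own traffic; with it, multiple instances share one view of per-deployment usage.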

Fallback Chains: Primary → Secondary → Emergency

A fallback chain is a sequence of providers that triggers automatically on failure. First provider goes down — the request routes to the second. Second is overloaded — to the third.

Errors that trigger fallback:

  • 429 — rate limit exceeded (provider overloaded)
  • 500, 502, 503, 504 — server errors (provider is down)

Errors that don’t:

  • 400 — invalid request (our code’s problem, not the provider’s)
  • 401, 403 — key issue (fallback won’t help)
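The retry policy above fits in a few lines. A minimal sketch (the function name is illustrative) of classifying which HTTP statuses justify rerouting:

```typescript
// Only provider-side failures justify a fallback; client errors
// would fail identically on every provider.
function shouldFallback(status: number): boolean {
  if (status === 429) return true;                 // rate limit: provider overloaded
  if (status >= 500 && status <= 504) return true; // server error: provider is down
  return false;                                    // 400/401/403: our problem, not theirs
}
```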

In LiteLLM, this works automatically: multiple deployments with the same model_name give you built-in fallback. For different model_name values, you configure a fallback list:

router_settings:
  fallbacks: [
    {"fast-chat": ["backup-chat"]},
    {"deep-analysis": ["backup-analysis"]}
  ]

Practical Example

Three fallback levels for a chatbot:

  1. Primary: google/gemini-2.0-flash — fast, cheap, good quality
  2. Secondary: deepseek/deepseek-chat — cheaper, slightly slower
  3. Emergency: anthropic/claude-3-haiku — more expensive, but stable

Gemini returns 503 — the request goes to DeepSeek. DeepSeek returns 429 (rate limit) — the request goes to Claude Haiku. The user gets a response, maybe a bit slower.

But different models produce different outputs. In a chatbot, that’s fine — users don’t compare responses across models. In a pipeline with a strict JSON schema, fallback between models needs output validation on top.

Task-Based Routing: Different Tasks → Different Models

Not all tasks are equal. Generating a travel itinerary demands reasoning and large context. Generating a chat title — 10 tokens in, 5 out. POI data enrichment — structured text parsing.

Routing everything to one model means overpaying or losing quality.

Classify the task, pick the right model.

| Task | Model | Why |
|---|---|---|
| AI chat (fast replies) | Gemini 2.0 Flash | Fast, cheap, good at conversation |
| Trip analysis, data extraction | DeepSeek Chat | Cheap, strong at structured output |
| Itinerary generation (pipeline) | DeepSeek Chat + validation | Complex task, but DeepSeek handles it with the right prompts |
| Title generation | Gemini 2.0 Flash | Trivial task, not worth an expensive model |
| Orchestration (multi-step agents) | Claude Haiku | Follows instructions well, predictable |

You specify the model per request through the model parameter. Since all calls go through LiteLLM, switching models means swapping a string.
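A task-to-model map mirroring the table above can be as simple as this sketch (the task names are hypothetical; the model strings are whatever your LiteLLM config exposes):

```typescript
type Task = "chat" | "extraction" | "itinerary" | "title" | "orchestration";

// Routing table: cheap models for trivial tasks, stronger models
// only where the task demands them.
const MODEL_BY_TASK: Record<Task, string> = {
  chat: "google/gemini-2.0-flash",
  extraction: "deepseek/deepseek-chat",
  itinerary: "deepseek/deepseek-chat",
  title: "google/gemini-2.0-flash",
  orchestration: "anthropic/claude-3-haiku",
};

function modelFor(task: Task): string {
  return MODEL_BY_TASK[task];
}
```

The caller passes `modelFor(task)` as the `model` parameter; reassigning a task to a different model is a one-entry edit in the map.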

Managing Models Through Langfuse

You can pull the model out of code and into prompt configuration. In Langfuse, each prompt stores config.model:

{
  "name": "ai-chat-travel-assistant",
  "config": {
    "model": "google/gemini-2.0-flash",
    "temperature": 0.7,
    "max_tokens": 4096
  }
}

Your edge function fetches the prompt from Langfuse, grabs the model from config, and passes it to LiteLLM:

const promptTemplate = await getLangfusePrompt('ai-chat-travel-assistant', langfuseConfig);
const model = promptTemplate.config?.model || 'google/gemini-2.0-flash';

const response = await fetch(`${LITELLM_URL}/v1/chat/completions`, {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': `Bearer ${LITELLM_KEY}`,
  },
  body: JSON.stringify({
    model,
    messages: compiledMessages,
    temperature: promptTemplate.config?.temperature ?? 0.7,
  }),
});

Switching a model for any prompt takes a click in the Langfuse UI — no code deploy. Edit the prompt, mark it production, done.

More on Langfuse in the separate LLM observability article.

Cost: The Order of Magnitude Matters

Price gaps across providers aren’t linear. They’re orders of magnitude.

| Model | Input ($/1M) | Output ($/1M) | vs GPT-4o |
|---|---|---|---|
| DeepSeek-V3 | $0.14 | $0.28 | 18x cheaper |
| Mistral Medium 3 | $0.40 | $2.00 | 5–6x cheaper |
| Gemini 2.5 Pro | $1.25 | $10.00 | 1–2x cheaper |
| GPT-4o | $2.50 | $10.00 | baseline |
| Claude Sonnet 4 | $3.00 | $15.00 | 1.2–1.5x more |
| Claude Opus 4 | $15.00 | $75.00 | 6–7x more |

LMSYS researchers (RouteLLM) showed that smart routing slashes costs by 85%+ on the MT Bench benchmark with no noticeable quality loss. Their approach: 90% of “easy” requests hit a cheap model, 10% of “hard” ones hit an expensive model.

In practice, that means task-based model selection. Chat, title generation, data extraction — cheap models. Complex analysis, reasoning, multi-step agents — expensive ones.
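A back-of-envelope calculator makes the spread concrete. This sketch uses the prices from the table above; the model keys are illustrative labels, not required identifiers:

```typescript
// Prices in $ per 1M tokens, taken from the comparison table.
const PRICES: Record<string, { input: number; output: number }> = {
  "deepseek/deepseek-chat": { input: 0.14, output: 0.28 },
  "openai/gpt-4o": { input: 2.5, output: 10.0 },
};

// Cost of a single request given its token counts.
function requestCost(model: string, inputTokens: number, outputTokens: number): number {
  const p = PRICES[model];
  return (inputTokens * p.input + outputTokens * p.output) / 1_000_000;
}

// For 1,000 input + 500 output tokens per request:
// GPT-4o costs about $0.0075, DeepSeek-V3 about $0.00028, roughly a 27x gap.
```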

Monitoring: How to Detect Provider Degradation

A provider can degrade without going fully down. Latency creeps from 200ms to 5 seconds. Error rate drifts from 0.1% to 3%. The model hallucinates more often.

What to monitor:

| Metric | Alert threshold | Action |
|---|---|---|
| P95 latency | > 2x baseline | Enable fallback |
| Error rate | > 2% | Enable fallback |
| Timeout rate | > 1% | Lower timeout, enable fallback |
| Token cost | Over budget | Switch to a cheaper model |

LiteLLM logs every call: provider, model, latency, status, token count. Pick any visualization stack — Grafana, Datadog, custom dashboards. Langfuse adds prompt-level tracing: which prompt, which version, what result.

The single most telling metric: fallback-to-total request ratio. Over 10% hitting fallback? Your primary provider is degrading. Over 30%? Time to pick a new primary.
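That health check is a one-liner worth automating. A sketch, using the 10% and 30% heuristics above (the function and labels are illustrative):

```typescript
// Classify primary-provider health from the fallback-to-total ratio.
function fallbackHealth(
  totalRequests: number,
  viaFallback: number,
): "ok" | "degrading" | "replace-primary" {
  const ratio = totalRequests === 0 ? 0 : viaFallback / totalRequests;
  if (ratio > 0.3) return "replace-primary"; // time to pick a new primary
  if (ratio > 0.1) return "degrading";       // primary provider is degrading
  return "ok";
}
```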

Circuit Breaker for LLM Calls

A circuit breaker stops cascading failures. When an external service starts returning errors consistently, the breaker “opens” and blocks outgoing requests. Instead of waiting 60 seconds per call to a dead provider, the system fails fast.

Three states:

CLOSED (normal)           OPEN (service down)       HALF-OPEN (testing)
    │                         │                          │
    │  3 failures             │  60 seconds pass         │  1 success
    │─────────────────►       │──────────────────►       │──────────────►  CLOSED
    │                         │                          │
    │                         │  requests rejected       │  1 failure
    │                         │  instantly               │──────────────►  OPEN

LLM calls need different settings than regular APIs. Models respond slower — timeout is 60 seconds instead of 10. Failure threshold drops to 3 instead of 5, because each LLM request is expensive. Recovery takes longer too — 60 seconds instead of 30.

const LLM_CIRCUIT_CONFIG = {
  failureThreshold: 3,       // 3 failures → circuit open
  resetTimeoutMs: 60_000,    // 60 seconds in open state
  successThreshold: 1,       // 1 success in half-open → closed
  ignoredStatusCodes: [400, 404],  // client errors don't count
};
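A minimal in-memory breaker wired to those numbers might look like this sketch (class and method names are illustrative; the clock is injected so the state machine is testable):

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private failureThreshold = 3,     // 3 failures → circuit open
    private resetTimeoutMs = 60_000,  // 60 seconds in open state
    private now: () => number = Date.now,
  ) {}

  // Ask before every call; rejects instantly while the circuit is open.
  canRequest(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = "half-open";       // timeout elapsed: let one probe through
    }
    return this.state !== "open";
  }

  onSuccess(): void {
    this.failures = 0;
    this.state = "closed";            // 1 success in half-open → closed
  }

  onFailure(): void {
    if (this.state === "half-open") {
      this.trip();                    // probe failed: straight back to open
      return;
    }
    if (++this.failures >= this.failureThreshold) this.trip();
  }

  private trip(): void {
    this.state = "open";
    this.openedAt = this.now();
  }

  get current(): BreakerState {
    return this.state;
  }
}
```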

Serverless throws a wrench in this. In Deno Edge Functions or AWS Lambda, each invocation can run in a fresh isolate. A circuit breaker that stores state in memory loses it when a new isolate spins up. Distributed circuit breaking needs external storage — Redis or a database table.

More on the implementation in the Circuit Breaker for Edge Functions article.

Alternatives to LiteLLM

LiteLLM isn’t the only option. Your priorities dictate the pick.

| Tool | Focus | Models | Cost | Good for |
|---|---|---|---|---|
| LiteLLM | SDK + proxy | 100+ | Open source | Developers, self-hosted |
| OpenRouter | Managed API | 500+ | 5.5% fee | Quick start, access to all models |
| Portkey | Enterprise gateway | 1,600+ | From $49/mo | Compliance, governance, teams |
| Helicone | Observability | Any | Free tier / $49 | Monitoring, caching |

OpenRouter — a managed alternative. No proxy to run yourself. 500+ models from 60+ providers, 5.5% fee on credit purchases; model prices pass through without markup. Raised $40M in June 2025; client inference run-rate topped $100M. A strong fit for prototyping and projects where self-hosted infrastructure is overkill.

Portkey — built for teams with compliance requirements. PII redaction, jailbreak detection, audit trails, SSO. If your project demands that level of security governance, start here.

Helicone — open-source, laser-focused on observability. Gateway built in Rust with ~8ms P50 latency. Ships with response caching that trims costs on repeated requests. Works well alongside LiteLLM, not as a replacement.

LiteLLM wins on control: self-hosted, full configuration access, free. For a production application juggling multiple providers, it offers the best ratio of control to operational effort.

Where This Doesn’t Work

Multi-provider isn’t free. Flexibility costs something.

Prompt caching breaks on fallback. Anthropic and OpenAI cache prompts to speed up repeated calls. When a request falls over to a backup provider, the primary’s cache sits idle. Long system prompts take a noticeable hit on both latency and cost. Advanced setups use project-level affinity — requests from the same project stick to the same provider when possible.

Response consistency. Different models produce different text. In a chatbot, that’s fine. In a pipeline with a strict JSON schema, it’s a risk. DeepSeek might return "rating": 4.5, Gemini might return "rating": "4.5". You must validate outputs.
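A minimal sketch of that validation step, using the rating example (the helper name and the 0–5 range are assumptions for illustration):

```typescript
// Normalize a rating field that one model returns as a number
// and another as a string; reject anything that isn't a valid rating.
function normalizeRating(raw: unknown): number {
  const n = typeof raw === "string" ? Number(raw) : raw;
  if (typeof n !== "number" || Number.isNaN(n) || n < 0 || n > 5) {
    throw new Error(`invalid rating: ${JSON.stringify(raw)}`);
  }
  return n;
}
```

Running every fallback response through this kind of normalizer keeps the pipeline's schema stable regardless of which provider answered.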

Additional infrastructure. LiteLLM is a server you need to run, monitor, and update. With a single provider, an API key is enough. With five providers through LiteLLM — a Docker container, Redis for rate limiting, monitoring. Operational complexity stacks up.

Debugging gets harder. “Request failed” — which provider? Which fallback level? What error? You have to log every step: provider, model, latency, status, attempt number. Skip that, and you’re debugging blind.

Not all APIs are equal. The OpenAI-compatible format covers /chat/completions. Provider-specific features — vision API, function calling formats, streaming with tool use — can behave differently through a proxy. Before adding a provider to a fallback chain, test the specific scenarios you care about.

Getting Started

If your application runs on a single provider today, switching to multi-provider doesn’t require a rewrite.

Step 1: LiteLLM proxy. Spin up a Docker container. Wire up your current provider. Point all calls at the proxy. Nothing changes yet — same provider, same results. But now every LLM call flows through one chokepoint you control.

Step 2: Add a second provider as fallback. Drop in DeepSeek or Gemini Flash as a second deployment with the same model_name. LiteLLM switches to it automatically on primary failure. Test it — kill the primary provider manually.

Step 3: Task-based routing. Audit your calls. Which tasks burn through tokens? Which are trivial? Move the cheap ones to a cheap model. Title generation, classification, data extraction — DeepSeek. Chat, reasoning — Gemini or Claude.

Step 4: Monitoring. Connect Langfuse or equivalent. Trace every call: provider, model, latency, cost. Set up alerts for degradation.

The whole process — zero to production multi-provider — takes a couple of days. The LiteLLM proxy goes up in 30 minutes. Adding a provider is one line in config. Where you’ll actually spend time: testing fallback scenarios and wiring up monitoring.

FAQ

How does LiteLLM handle streaming responses during a mid-stream provider failure?

LiteLLM cannot transparently retry a streaming response that has already started — once tokens are flowing to the client, a mid-stream failure surfaces as a broken stream, not a seamless fallback. The fallback mechanism only activates before the first token is sent. For resilient streaming, the practical pattern is to set aggressive timeout values (timeout: 10) to catch slow providers early, and implement client-side reconnect logic that replays the request from scratch on stream error. Alternatively, disable streaming for critical requests where consistency matters more than time-to-first-token.
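A client-side sketch of that reconnect pattern (the wrapper and its names are illustrative, not a LiteLLM API; the request is replayed from scratch, so partial output from the broken stream is discarded):

```typescript
// Retry a streaming request from the first token when the stream breaks.
async function withStreamRetry<T>(
  start: () => Promise<T>, // opens the stream / runs the full request
  maxAttempts = 2,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await start();
    } catch (err) {
      lastError = err; // broken stream: replay the whole request
    }
  }
  throw lastError;
}
```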

What is the realistic P99 latency penalty of routing through LiteLLM proxy compared to direct provider calls?

LiteLLM’s own benchmarks show ~8ms P95 overhead for the proxy hop. At P99, real-world observations in production environments typically show 15–30ms overhead — mainly from connection pool management and JSON serialization of the request/response. This is negligible compared to LLM inference latency (typically 500ms–3s P50). Where overhead compounds is cold-start scenarios: when a LiteLLM Docker container restarts, the first few requests take 200–400ms extra while connection pools warm up.

Can you use provider-specific features (like Anthropic’s extended thinking or OpenAI’s o-series reasoning tokens) through LiteLLM?

Provider-specific parameters can be passed through LiteLLM using the extra_body parameter, which is forwarded as-is to the provider’s API. For Anthropic’s extended thinking, set extra_body={"thinking": {"type": "enabled", "budget_tokens": 5000}}. However, these parameters are not part of the unified OpenAI-compatible format, so they only work when the specific provider handles the request — if the request falls over to a fallback provider, the provider-specific parameters are ignored rather than causing an error. Test fallback behavior explicitly when using extended thinking or reasoning-mode models.