JourneyBay

Circuit Breaker in Deno Edge Functions: Protecting Your AI Pipeline from Cascade Failures

What is a circuit breaker pattern?

A circuit breaker is a resilience pattern that stops sending requests to a failing service after a threshold of errors is reached, allowing the system to recover instead of amplifying load. It operates in three states — CLOSED (normal), OPEN (blocking calls), and HALF-OPEN (testing recovery) — and is used to prevent cascade failures in distributed systems.

TL;DR

  • 12-min Foursquare outage killed JourneyBay's AI chat in 6 min — classic cascade failure from retry storm
  • Deno edge functions lack shared state: each isolate retries independently, amplifying load on a failing API
  • Circuit breaker uses 3 states (CLOSED/OPEN/HALF-OPEN) stored in isolate memory — zero Redis overhead
  • Retry with exponential backoff + jitter prevents synchronized retry storms across parallel isolates
  • The zero-dependency module eliminated 95% of cascade failure incidents in production

A user taps “Find cafes nearby.” The AI assistant calls Foursquare for a list of places, an LLM for personalized descriptions, and the database for preferences. Three external calls. Three points of failure.

Foursquare responds with a 503. The function waits for the timeout, retries, waits again. Another request asks for a route, a third opens the AI chat. Each edge function launches its own retry chain. Within seconds, a single API error snowballs into a flood of hanging requests and exhausted rate limits.

We caught this during load testing of JourneyBay. Foursquare’s API went down for 12 minutes. Without any protection, the retry storm blew through the LLM provider’s rate limits in four minutes. By minute six, the AI chat stopped responding entirely. Twelve minutes of one external API being down cascaded into everything else going dark — including services that had nothing to do with Foursquare.

A textbook cascade failure. And a textbook solution: resilience patterns. Except every guide out there describes them for Java (Resilience4j), Go, or persistent Node.js processes. For serverless TypeScript on Deno — where each request can land in a fresh cold isolate — there’s no playbook.

Here’s how we adapted three patterns for Deno Edge Functions: circuit breaker, retry with exponential backoff, and fail-open rate limiting. A compact, zero-dependency module that eliminated 95% of our cascade failure problems.

Why serverless breaks differently

Deno Edge Functions work differently: each request can land in a fresh V8 isolate. There’s no long-lived process, no guarantee that the next request will hit the same instance.

Three problems that don’t exist in traditional backends.

Retry storm. When a Java server retries a request, that’s one process, one retry chain. When 50 edge functions simultaneously retry a call to a dead API, that’s 50 independent chains. Each one is blind to the others. Each one dutifully waits through exponential backoff and fires again. Instead of recovering, the API gets DDoS’d by its own client.

No shared state. A circuit breaker relies on knowledge: “out of the last N requests, M have failed.” In a monolith, that information lives in process memory. In serverless, each isolate sees only its own requests. One isolate hits five errors in a row and should open the circuit. But the next request lands in a different isolate that knows nothing about any of it.

Cascade effect. An edge function calls another edge function over HTTP. That one calls a third. If one hangs, the whole chain hangs. And Deno’s fetch has no default timeout — a request can hang until the runtime kills it.

Off-the-shelf libraries like opossum were built for persistent processes. They store state in memory and assume the process sticks around. In a Deno isolate, it doesn’t. You need an implementation that accounts for the ephemeral runtime.

Circuit Breaker: a state machine in isolate memory

A circuit breaker works like an automatic breaker in an electrical panel. When the error rate crosses a threshold, the breaker trips and stops letting requests through. After a cooldown period, it lets one test request through to check if the service has recovered.

Three states:

CLOSED ──(failures >= threshold)──> OPEN ──(cooldown elapsed)──> HALF-OPEN
  ^                                                                  |
  |                                                                  |
  └──────────(successful test request)───────────────────────────────┘

HALF-OPEN ──(failure)──> OPEN   (trip again)

CLOSED — normal operation, requests flow through. The failure counter increments on each error. A successful request resets it to zero.

OPEN — the service is down, requests are rejected instantly. No waiting for timeouts, no wasting resources. The function immediately throws a CircuitBreakerOpenError.

HALF-OPEN — probe mode. One request is allowed through. If it succeeds, the circuit closes. If it fails, back to open.
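The three states can be sketched as a pair of transition handlers over a small state record. The onSuccess/onFailure names appear later in the article; their bodies here are our reconstruction, and the field names are assumptions:

```typescript
type CircuitState = 'closed' | 'open' | 'half-open';

interface CircuitBreakerState {
  state: CircuitState;
  failureCount: number;
  successCount: number;
  openedAt?: number; // timestamp of the last transition to OPEN
}

// Hypothetical transition handlers illustrating the state machine.
function onSuccess(s: CircuitBreakerState): CircuitBreakerState {
  if (s.state === 'half-open') {
    // The probe succeeded: the service recovered, close the circuit.
    return { state: 'closed', failureCount: 0, successCount: s.successCount + 1 };
  }
  // CLOSED: a success resets the failure counter.
  return { ...s, failureCount: 0, successCount: s.successCount + 1 };
}

function onFailure(
  s: CircuitBreakerState,
  failureThreshold: number,
): CircuitBreakerState {
  if (s.state === 'half-open') {
    // The probe failed: trip again immediately.
    return { ...s, state: 'open', openedAt: Date.now() };
  }
  const failureCount = s.failureCount + 1;
  if (failureCount >= failureThreshold) {
    return { ...s, state: 'open', failureCount, openedAt: Date.now() };
  }
  return { ...s, failureCount };
}
```

Modeling the handlers as pure functions over the state record keeps the transitions easy to unit-test in isolation.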

State storage: the tradeoff

The key question: where do you store circuit breaker state when the isolate is ephemeral?

Two options:

              In-memory (Map)           External store (Redis)
Latency       0 ms                      1-5 ms
Shared        Within one isolate only   Across all isolates
Durability    Lost on cold start        Persistent
Complexity    Minimal                   Requires a Redis client

We chose in-memory. Deno isolates on Supabase are reused between requests as long as the instance stays warm. While the isolate is warm, state persists across requests. On a cold start the state resets, but that’s acceptable — if the isolate restarted, there’s a decent chance the external API has recovered too.

For critical scenarios that need cross-instance coordination, we use Redis. But for the circuit breaker, the per-request overhead isn’t worth it.

Implementation

State lives in a module-level Map:

const circuitRegistry = new Map<string, CircuitBreakerState>();

Each service gets its own circuit by name. Two CircuitBreaker.getOrCreate('foursquare') instances share the same state within an isolate:

export class CircuitBreaker {
  private constructor(
    private readonly name: string,
    private readonly config: CircuitBreakerConfig
  ) {}

  static getOrCreate(
    name: string,
    config: CircuitBreakerConfig
  ): CircuitBreaker {
    if (!circuitRegistry.has(name)) {
      circuitRegistry.set(name, {
        state: 'closed',
        failureCount: 0,
        successCount: 0,
      });
    }
    return new CircuitBreaker(name, config);
  }
}

The core of the circuit breaker is the call method, which wraps an operation:

async call<T>(operation: () => Promise<T>): Promise<T> {
  this.checkStateTransition();

  const currentState = this.getState();

  if (currentState.state === 'open') {
    throw new CircuitBreakerOpenError(
      this.name,
      this.remainingTimeoutMs
    );
  }

  try {
    const result = await operation();
    this.onSuccess();
    return result;
  } catch (error) {
    this.onFailure(error);
    throw error;
  }
}

When the circuit is open, it rejects the request without calling the operation. No timeout waits. No retries against a dead API.

Lazy state transition

Java circuit breaker implementations typically use a timer to transition from OPEN to HALF-OPEN. In serverless, you don’t need a timer. Instead, the check happens lazily on each call() invocation:

private checkStateTransition(): void {
  const { state, openedAt } = this.getState();

  if (state === 'open' && openedAt) {
    const elapsed = Date.now() - openedAt;
    if (elapsed >= this.config.resetTimeoutMs) {
      this.transitionTo('half-open');
    }
  }
}

If the cooldown period has passed, the circuit transitions to HALF-OPEN right when the next call comes in. No setInterval, no resource leaks.

4xx errors don’t trip the circuit

A common mistake: counting all errors as failures. But a 404 “Place not found” isn’t a Foursquare outage. It’s a perfectly normal response to a specific query. If the circuit opens because of 404s, all Foursquare requests get blocked even though the API is fully operational.

private shouldIgnoreError(error: unknown): boolean {
  const ignoredCodes = this.config.ignoredStatusCodes || [];

  if (error && typeof error === 'object') {
    const err = error as Record<string, unknown>;

    if (typeof err.statusCode === 'number'
        && ignoredCodes.includes(err.statusCode)) {
      return true;
    }

    if (typeof err.status === 'number'
        && ignoredCodes.includes(err.status)) {
      return true;
    }
  }

  return false;
}

Client errors (400, 401, 403, 404) are excluded from the failure count. Only server errors (5xx), timeouts, and network problems trip the circuit.
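How this check gates the counter can be sketched as a small pure function. The recordFailure name is ours, not the production code's:

```typescript
// Hypothetical: only non-ignored errors advance the failure counter.
function recordFailure(
  failureCount: number,
  error: unknown,
  ignoredStatusCodes: number[],
): number {
  const status =
    error && typeof error === 'object'
      ? (error as { statusCode?: number; status?: number }).statusCode ??
        (error as { status?: number }).status
      : undefined;
  if (typeof status === 'number' && ignoredStatusCodes.includes(status)) {
    return failureCount; // 4xx client error: not the service's fault
  }
  return failureCount + 1; // 5xx / timeout / network: counts toward tripping
}
```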

Profiles: one size doesn’t fit all

Different services fail differently and need different treatment. We created pre-configured profiles:

static forExternalApi(name: string): CircuitBreaker {
  return CircuitBreaker.getOrCreate(name, CONFIGS.EXTERNAL_API);
}

static forLLM(name: string): CircuitBreaker {
  return CircuitBreaker.getOrCreate(name, CONFIGS.LLM);
}

static forPayment(name: string): CircuitBreaker {
  return CircuitBreaker.getOrCreate(name, CONFIGS.PAYMENT);
}

External API (Foursquare, Google Places)

Tolerant. These APIs occasionally return 5xx on individual requests due to load balancing. We tolerate several failures before tripping and recover quickly. Client errors are fully ignored: 400, 401, 403, 404.

LLM (LightRAG, AI chat)

LLM services recover slowly, so this profile has a longer cooldown and a lower trip threshold. If the LLM starts throwing errors, it’s better to stop early and save tokens.

Payment (Tinkoff)

Low threshold, long cooldown. But the real difference is the retry strategy (more on that below). You can’t retry payments aggressively — a double charge is worse than a declined transaction. Fail fast, let the user try again.
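Putting the three profiles side by side, the config object could look roughly like this. All numeric values are illustrative assumptions, not JourneyBay's production settings; only resetTimeoutMs and ignoredStatusCodes are field names taken from the article's code:

```typescript
interface CircuitBreakerConfig {
  failureThreshold: number;     // failures before CLOSED -> OPEN (assumed name)
  resetTimeoutMs: number;       // cooldown before OPEN -> HALF-OPEN
  ignoredStatusCodes: number[]; // client errors that never trip the circuit
}

// Illustrative values only.
const CONFIGS: Record<string, CircuitBreakerConfig> = {
  EXTERNAL_API: {
    failureThreshold: 5,     // tolerant: occasional 5xx is normal
    resetTimeoutMs: 30_000,  // recover quickly
    ignoredStatusCodes: [400, 401, 403, 404],
  },
  LLM: {
    failureThreshold: 3,     // trip early, save tokens
    resetTimeoutMs: 60_000,  // LLM services recover slowly
    ignoredStatusCodes: [400, 401, 403, 404],
  },
  PAYMENT: {
    failureThreshold: 2,     // low threshold
    resetTimeoutMs: 120_000, // long cooldown
    ignoredStatusCodes: [400, 401, 403, 404],
  },
};
```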

Retry with jitter: how not to DDoS yourself

Unlimited retry is a DDoS on your own provider. Retry with a fixed interval is a coordinated DDoS — everyone retries at one second, then two, then four, in lockstep.

The solution: exponential backoff with jitter.

The formula

function calculateRetryDelay(
  attempt: number,
  config: RetryConfig
): number {
  const multiplier = config.backoffMultiplier ?? 2;

  // base * 2^(attempt-1): 1s, 2s, 4s, 8s...
  let delay = config.baseDelayMs * Math.pow(multiplier, attempt - 1);

  // Cap at maximum
  delay = Math.min(delay, config.maxDelayMs);

  // Jitter: random deviation (×0.5..1.5)
  if (config.jitter) {
    const jitterFactor = 0.5 + Math.random(); // 0.5 to 1.5
    delay = Math.round(delay * jitterFactor);
  }

  return delay;
}

The jitter multiplier ranges from 0.5 to 1.5: the delay can shrink to half the computed value or stretch to one and a half times it. This is "proportional jitter." AWS recommends full jitter (a uniform draw from zero up to the capped delay), but we deliberately chose proportional: with full jitter the delay can collapse to near zero, effectively making the retry instant.
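For contrast, the full-jitter variant we rejected can be sketched in a few lines (the function name is ours):

```typescript
// Full jitter: delay drawn uniformly from [0, cappedDelay].
// Spreads retries maximally, but a draw near zero retries almost instantly.
function fullJitterDelay(
  attempt: number,
  baseDelayMs: number,
  maxDelayMs: number,
): number {
  const capped = Math.min(baseDelayMs * 2 ** (attempt - 1), maxDelayMs);
  return Math.round(Math.random() * capped);
}
```

With proportional jitter the third attempt waits somewhere in 2-6 s; with full jitter it can fire after 0 ms, which is exactly the instant-retry behavior we wanted to avoid.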

isRetryable: the error decides

Not every error is worth retrying. A 400 Bad Request won’t magically fix itself — the request body hasn’t changed. A 429 Rate Limit is worth retrying if you wait.

Instead of a sprawling switch/case, we introduced a contract: the error itself declares whether it can be retried.

export class TimeoutError extends Error {
  get isRetryable(): boolean {
    return true; // Timeout is a transient problem
  }
}

export class CircuitBreakerOpenError extends Error {
  get isRetryable(): boolean {
    return false; // Retrying is pointless, circuit is open
  }
}

The retry module checks multiple levels: HTTP status codes (429, 500, 502, 503, 504), string error codes ('TIMEOUT', 'RATE_LIMITED'), the isRetryable property, error class name, and message patterns. If any level says “retryable” — we retry. If none do — the error propagates immediately.
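The withRetry wrapper itself is used throughout the article but never shown in full. A minimal sketch of the loop, with the multi-level retryability check collapsed to the isRetryable property alone (the inline backoff mirrors the calculateRetryDelay formula above; any other names are ours):

```typescript
interface RetryConfig {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  backoffMultiplier?: number;
  jitter?: boolean;
}

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

// Simplified single-level check; the real module also inspects HTTP status
// codes, string error codes, class names, and message patterns.
function isErrorRetryable(error: unknown): boolean {
  return Boolean((error as { isRetryable?: boolean } | null)?.isRetryable);
}

async function withRetry<T>(
  operation: () => Promise<T>,
  config: RetryConfig,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      // Non-retryable error, or attempts exhausted: propagate immediately.
      if (!isErrorRetryable(error) || attempt >= config.maxAttempts) {
        throw error;
      }
      const multiplier = config.backoffMultiplier ?? 2;
      let delay = Math.min(
        config.baseDelayMs * Math.pow(multiplier, attempt - 1),
        config.maxDelayMs,
      );
      if (config.jitter) delay = Math.round(delay * (0.5 + Math.random()));
      await sleep(delay);
    }
  }
}
```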

Strategies by criticality

A single retry strategy for everything is too blunt. You can’t retry a payment the same way you’d retry a place search.

We defined four profiles:

EXTERNAL_API — for Foursquare, Google Places. Moderate parameters: several attempts, delay starting from one second, jitter enabled. Retry codes: 429, 500, 502, 503, 504.

LLM — for LLM provider calls. Fewer attempts, longer initial delay (LLM services recover slowly). Jitter enabled.

CRITICAL — for payments and authentication. Minimal attempts, short delays, jitter disabled (deterministic behavior for auditing). Retry only on 503 and 504.

AGGRESSIVE — for idempotent operations (RAG indexing, cache updates). Many attempts, jitter enabled. Safe to retry because a repeated call won’t create duplicates.
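The four profiles might be encoded as a config map like the one below. The article gives the qualitative shape but not the exact numbers, so every value here is an illustrative assumption except the retryable status codes, which it states explicitly:

```typescript
interface RetryProfile {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  jitter: boolean;
  retryableStatusCodes: number[];
}

// Numeric values are illustrative, not production settings.
const RETRY_CONFIGS: Record<string, RetryProfile> = {
  EXTERNAL_API: {
    maxAttempts: 3, baseDelayMs: 1000, maxDelayMs: 8000, jitter: true,
    retryableStatusCodes: [429, 500, 502, 503, 504],
  },
  LLM: {
    maxAttempts: 2, baseDelayMs: 2000, maxDelayMs: 16000, jitter: true,
    retryableStatusCodes: [429, 500, 502, 503, 504],
  },
  CRITICAL: {
    maxAttempts: 2, baseDelayMs: 500, maxDelayMs: 1000, jitter: false,
    retryableStatusCodes: [503, 504], // only these, per the profile rules
  },
  AGGRESSIVE: {
    maxAttempts: 5, baseDelayMs: 500, maxDelayMs: 30000, jitter: true,
    retryableStatusCodes: [429, 500, 502, 503, 504],
  },
};
```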

How a client picks its strategy:

// Foursquare: moderate retry, reduced attempts (paid API)
return withRetry(operation, {
  ...RETRY_CONFIGS.EXTERNAL_API,
  maxAttempts: 2,
});

// LightRAG insert: idempotent operation, safe to retry aggressively
return withRetry(operation, RETRY_CONFIGS.AGGRESSIVE);

// Tinkoff payment: fail fast, minimal retries
return withRetry(operation, RETRY_CONFIGS.CRITICAL);

Foursquare deliberately reduces the attempt count from the default — every API call costs money. LightRAG insert uses aggressive retry — inserting the same document into the knowledge base is idempotent, so a duplicate write won’t break anything.

Timeout: AbortController over Promise.race

Deno’s fetch has no built-in timeout. Without an explicit limit, a request can hang until the platform kills it (Supabase’s request idle timeout is 150 seconds; the wall-clock limit goes up to 400 seconds). A single stuck call blocks the entire function.

Two implementations for different cases.

fetchWithTimeout: proper HTTP request cancellation

async function fetchWithTimeout(
  url: string | URL,
  options: RequestInit = {},
  timeoutMs: number = DEFAULT_TIMEOUTS.EXTERNAL_API
): Promise<Response> {
  const controller = new AbortController();
  const timeoutId = setTimeout(
    () => controller.abort(),
    timeoutMs
  );

  try {
    const response = await fetch(url, {
      ...options,
      signal: controller.signal,
    });
    return response;
  } catch (error) {
    if (error instanceof Error && error.name === 'AbortError') {
      throw new TimeoutError('fetch', timeoutMs);
    }
    throw error;
  } finally {
    clearTimeout(timeoutId);
  }
}

AbortController doesn’t just stop waiting — it tears down the TCP connection. Promise.race, by contrast, only stops listening: the fetch keeps hanging in the background, consuming isolate resources.

withTimeout: for non-fetch operations

For operations that don’t use fetch (database queries, internal calls), we use Promise.race:

async function withTimeout<T>(
  promise: Promise<T>,
  timeoutMs: number,
  operation = 'unknown'
): Promise<T> {
  let timeoutId: number | undefined;

  const timeoutPromise = new Promise<never>((_, reject) => {
    timeoutId = setTimeout(
      () => reject(new TimeoutError(operation, timeoutMs)),
      timeoutMs
    );
  });

  try {
    return await Promise.race([promise, timeoutPromise]);
  } finally {
    if (timeoutId !== undefined) {
      clearTimeout(timeoutId);
    }
  }
}

The finally block cleans up the timer. Without it, a successful operation would leave the timer alive in the isolate’s memory — a leak.

Timeouts by service type

Type          Value   Why
FAST          3s      Geocoding, cache lookups
DATABASE      5s      Supabase Postgres queries
EXTERNAL_API  10s     Foursquare, Google, Tinkoff
LLM           60s     Text generation, AI chat

LLMs get a full minute: generating a long response takes 10-30 seconds, and under load the provider responds even slower. For geocoding, three seconds is enough — if Mapbox hasn’t answered in three seconds, it probably won’t.

TimeoutError is marked isRetryable = true. After a timeout, the retry module automatically retries if the strategy allows it.

Three layers combined: resilientFetch

Each pattern works on its own, but only together do they cover every scenario. resilientFetch combines all three layers in a single call:

async function resilientFetch(
  url: string | URL,
  init?: RequestInit,
  config?: Partial<ResilientFetchConfig>
): Promise<Response> {
  const serviceName = config?.serviceName ?? 'unknown';

  // Build circuit breaker and retry config
  const breaker = CircuitBreaker.getOrCreate(serviceName, cbConfig);

  const operation = async (): Promise<Response> => {
    const response = await fetchWithTimeout(url, init, timeoutMs);

    if (response.status >= 500) {
      const error = new Error(`HTTP ${response.status}`);
      (error as any).statusCode = response.status;
      throw error;
    }

    return response;
  };

  const withCircuitBreaker = () => breaker.call(operation);

  return withRetry(withCircuitBreaker, retryConfig);
}

Layer order matters: retry on the outside, circuit breaker in the middle, timeout at the bottom.

  1. fetchWithTimeout caps the time for a single HTTP request
  2. circuitBreaker.call() wraps the fetch: if the circuit is open, it throws CircuitBreakerOpenError instantly
  3. withRetry wraps the circuit breaker: if the operation fails with a retryable error, it tries again

The critical detail: CircuitBreakerOpenError.isRetryable = false. When the circuit is open, retry doesn’t bother. The error bubbles up to the caller immediately. Without this linkage, retry could invoke the open circuit breaker multiple times, getting an instant error each time — a pointless waste.

And TimeoutError.isRetryable = true. If a request times out (the service might be overloaded), retry will try again after a backoff. But if the circuit has already opened by then (other requests failed too) — instant rejection.

One call instead of manually wiring three layers:

// Without resilientFetch
const response = await withRetry(
  () => circuitBreaker.call(
    () => fetchWithTimeout(url, options, 10000)
  ),
  RETRY_CONFIGS.EXTERNAL_API
);

// With resilientFetch
const response = await resilientFetch(url, options, {
  serviceName: 'foursquare',
});

Rate Limiting: fail-open design

Circuit breakers protect against cascade failures. Rate limiters protect against overload. Two different problems, two different tools.

Our rate limiter operates on three levels:

  1. IP RPM — per-IP limit for anonymous requests
  2. User RPM — requests per minute for authenticated users (value comes from their subscription plan)
  3. User TPM — tokens per minute for AI endpoints

We store User RPM and TPM limits in the database (the plans table); they vary by subscription tier. The IP limit is fixed and identical for all anonymous requests.

Redis as the store

Unlike the circuit breaker, the rate limiter needs to be shared: there’s no point limiting a user within one isolate if they can send requests to another. So here, Redis.

Keys follow this pattern:

rate_limit:ip:{ip}:rpm
rate_limit:user:{userId}:rpm
rate_limit:user:{userId}:tpm

The window is fixed (60 seconds), not sliding. Sliding is more precise but requires storing every timestamp. For rate limiting, a fixed window with INCR + EXPIRE is good enough.
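The fixed window can be sketched against a minimal Redis-like interface (checkLimit is the name the article's client uses; the interface and counter logic here are our reconstruction):

```typescript
// Minimal Redis-like interface: only the two commands the window needs.
interface RedisLike {
  incr(key: string): Promise<number>;
  expire(key: string, seconds: number): Promise<void>;
}

// Fixed-window check: the first INCR in a window sets the TTL; once the
// counter passes the limit, requests are rejected until the key expires.
async function checkLimit(
  redis: RedisLike,
  key: string,
  limit: number,
  windowSeconds: number,
): Promise<{ exceeded: boolean; count: number }> {
  const count = await redis.incr(key);
  if (count === 1) {
    // New window: start the countdown (60 s in our setup).
    await redis.expire(key, windowSeconds);
  }
  return { exceeded: count > limit, count };
}
```

Two commands per request, no stored timestamps: that's the entire cost advantage over a sliding window.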

Fail-open: Redis goes down, requests go through

The most important architectural decision in the rate limiter: what happens when Redis is unavailable?

Two options:

  • Fail-closed: Redis is down — block all requests. Safe, but users see errors even though the service works.
  • Fail-open: Redis is down — let all requests through. Risky (limits can be exceeded), but the service stays up.

We chose fail-open:

async checkIpLimit(ipAddress: string): Promise<RateLimitCheckResult> {
  try {
    const { exceeded } = await this.redis.checkLimit(key, limit, window);
    return { allowed: !exceeded, /* ... */ };
  } catch (error) {
    console.error('[RateLimiter] IP limit check failed:', error);
    // Redis unavailable — let the request through
    return { allowed: true, /* ... */ };
  }
}

The reasoning: rate limiting protects against abuse, not normal traffic. If Redis goes down for a minute, users might exceed their limit for 60 seconds at most. That’s unfortunate, but it beats a complete halt for everyone. Blocking legitimate users because of an infrastructure hiccup isn’t rate limiting — it’s a self-inflicted denial of service.

Fail-open applies at every level. If the IP check fails — let it through. If plan limits can’t be read from the database — let it through. If Redis doesn’t respond to the INCR — let it through.

Sliding window for client-side API limits

External APIs (Foursquare, Google Places) have their own rate limits. We don’t want to hit them and eat 429s. Instead, we throttle on the client side.

The client uses an in-memory sliding window — an array of timestamps:

class RateLimiter {
  private timestamps: number[] = [];

  async checkAndWait(): Promise<void> {
    const now = Date.now();

    // Remove requests outside the window
    this.timestamps = this.timestamps.filter(
      ts => now - ts < this.windowMs
    );

    if (this.timestamps.length >= this.maxRequests) {
      const oldestTs = this.timestamps[0];
      const waitTime = this.windowMs - (now - oldestTs) + 100;

      if (waitTime > 0) {
        await new Promise(r => setTimeout(r, waitTime));
      }
    }

    this.timestamps.push(Date.now());
  }
}

This isn’t Redis — it’s an in-memory limiter living inside the isolate. It doesn’t replace server-side rate limiting; it complements it. If the client knows Foursquare allows N requests per minute, it’s better to throttle locally than to hit 429s and burn retries.

How it all comes together: client example

The Foursquare client puts all three patterns together:

export class FoursquareClient {
  private readonly rateLimiter: RateLimiter;
  private readonly circuitBreaker: CircuitBreaker;

  constructor() {
    this.rateLimiter = new RateLimiter(MAX_REQUESTS_PER_MINUTE);
    this.circuitBreaker = CircuitBreaker.forExternalApi('foursquare');
  }

  private async request<T>(
    endpoint: string,
    params?: Record<string, string>
  ): Promise<T> {
    // 1. Local rate limit — don't exceed the API's limit
    await this.rateLimiter.checkAndWait();

    // 2. Circuit breaker — fail fast if API is down
    return this.circuitBreaker.call(async () => {

      // 3. Retry with backoff — retry on transient errors
      return withRetry(async () => {

        // 4. Timeout — don't wait forever
        const response = await fetchWithTimeout(
          url,
          { headers: { Authorization: `Bearer ${apiKey}` } },
          DEFAULT_TIMEOUTS.EXTERNAL_API
        );

        if (!response.ok) {
          this.handleError(response);
        }

        return response.json();
      }, {
        ...RETRY_CONFIGS.EXTERNAL_API,
        maxAttempts: 2,
      });
    });
  }
}

The request sequence:

  1. rateLimiter.checkAndWait() — if over the limit, wait
  2. circuitBreaker.call() — if the circuit is open, instant error
  3. withRetry() — if the request fails with a retryable error, retry with backoff
  4. fetchWithTimeout() — if no response within N seconds, timeout

On error, handleError throws typed exceptions. A FoursquareException with code 'RATE_LIMITED' or 'SERVER_ERROR' is marked isRetryable = true. With 'NOT_FOUND' or 'BAD_REQUEST' — no retry. And thanks to ignoredStatusCodes, they don’t trip the circuit breaker.
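A sketch of how such a typed exception could carry its own retry contract; the class shape is assumed, not copied from the codebase, though the codes and the isRetryable property come from the article:

```typescript
type FoursquareErrorCode =
  | 'RATE_LIMITED' | 'SERVER_ERROR' | 'NOT_FOUND' | 'BAD_REQUEST';

// Hypothetical shape: the error itself declares whether a retry makes sense.
class FoursquareException extends Error {
  constructor(
    readonly code: FoursquareErrorCode,
    readonly statusCode: number,
  ) {
    super(`Foursquare error ${statusCode} (${code})`);
    this.name = 'FoursquareException';
  }

  get isRetryable(): boolean {
    // Transient conditions only; client errors won't fix themselves.
    return this.code === 'RATE_LIMITED' || this.code === 'SERVER_ERROR';
  }
}
```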

What we learned

In-memory state in isolates works better than expected. Under stable load, the isolate stays warm, and the circuit breaker accumulates state. Traffic spikes cause Supabase to spin up multiple isolates, each with its own state — but that’s fine. Worst case: a few extra requests to an already-dead API before the circuit trips.

Our first version of the circuit breaker counted all errors equally. The circuit would open when users searched for nonexistent places (404) or sent malformed requests (400). We had to distinguish “the service is broken” from “the client sent a bad request.”

Jitter is critical under high concurrency. Without it, we saw spikes: all retries landed at the same time, creating mini-overloads every 2-4 seconds. With jitter, the load spreads evenly.

Fail-open felt wrong at first. Your instinct is to lock everything down when things go wrong. But in serverless, “lock down” means rejecting all users. Fail-open allows a temporary limit overshoot but keeps the service running.

Every circuit breaker transition gets logged: closed → open, open → half-open, half-open → closed. Monitoring these transitions tells you more than monitoring errors. You see when the service started degrading, how fast it recovered, how many times it bounced.

The entire resilience module — circuit breaker, retry, timeout, types — is four files with zero external dependencies. Deno gives you native AbortController, native fetch, native setTimeout. That’s all you need for a full implementation running across 70+ edge functions.

Wrapping up

Cascade failures in serverless are routine. One dead API plus a dozen parallel retries takes down the whole system. Circuit breaker, retry with jitter, and fail-open rate limiting solve this at the architecture level — not through heroic on-call shifts.

Resilience patterns from the Java world work in Deno if you account for ephemeral isolates. In-memory state for the circuit breaker. Lazy transitions instead of timers. Fail-open instead of fail-closed. An isRetryable contract instead of hardcoded status codes.

After implementation, we reran the same scenario in a load test: a twelve-minute Foursquare outage now means one minute of an open circuit and instant failures instead of hanging requests. Instead of an endless spinner, users see “Service temporarily unavailable.” The AI chat keeps working because its LLM circuit is independent of the Foursquare circuit. The rest of the functions don’t even notice.

Four TypeScript files. Zero dependencies. Instead of 30 minutes of degradation — one minute of an open circuit and clean fail-fast.

FAQ

How many edge function instances can share a single circuit breaker state simultaneously?

None — in-memory state is scoped to one Deno isolate. Supabase may spin up 5, 10, or 50 parallel instances under load, and each starts with its own fresh circuit. In practice this means a failing API receives a short burst of extra requests before each new isolate trips its own circuit, typically within 3–5 errors per instance. The tradeoff is acceptable because the alternative (per-request Redis lookups at 1–5 ms overhead) adds latency to every call even when nothing is failing.

Why does the CRITICAL retry profile disable jitter for payment operations?

Jitter introduces non-determinism into the retry timeline, which complicates auditing and reconciliation. With jitter enabled, you cannot predict exactly when the second attempt fires — logs show retry at “~2–4 seconds” instead of a precise “2.0 seconds.” For payment systems, audit trails must be reproducible, and a deterministic retry sequence (1 s, 2 s) is easier to correlate with bank transaction logs. Disabling jitter is a deliberate tradeoff of load-distribution benefits for auditability.

What is the practical difference between monitoring circuit breaker transitions versus monitoring raw error rates?

Error rate spikes tell you something is broken right now. Transition logs tell you a timeline: when degradation started (closed → open), how long recovery took (open → half-open), and whether the service stabilized or bounced (half-open → open repeated). In JourneyBay’s Foursquare incident, transition logs showed the circuit opened 47 seconds before error rate crossed the alerting threshold — giving a 47-second head start on incident response that raw error monitoring would have missed.