LLM Observability with Langfuse: Why You Need It and How to Set It Up
What is LLM observability?
LLM observability is the practice of monitoring, tracing, and evaluating AI language model calls in production across four dimensions: request tracing, cost tracking, prompt versioning, and output quality assessment. Unlike traditional APM, it accounts for non-deterministic responses and per-call costs that standard metrics like HTTP status codes cannot capture.
TL;DR
- Langfuse is MIT-licensed, self-hosted with no trace limits, 22,000+ GitHub stars, 23M SDK installs/month
- A single bad prompt vs. a good one can cost hundreds of dollars/month at 1,000 daily users — cost tracking is non-optional
- Self-hosted setup takes 15 minutes with Docker Compose; minimum requirements are Docker and 2 GB RAM
- SDK v3 is built on OpenTelemetry — native compatibility with any existing OTEL instrumentation, no lock-in
- Covers all four LLM observability pillars: tracing, cost tracking, prompt management, and output evaluation
LLM calls in production need tracing, prompts need versioning, costs need tracking. Standard APM tools won’t help here: responses are non-deterministic, every call costs money, and a 200 OK says nothing about output quality.
Langfuse is an open-source platform (MIT) for LLM observability. Self-hosted with no limits, 22,000+ GitHub stars, 23 million SDK installs per month. This guide covers installation, tracing, prompt management, cost tracking, and evaluations.
LLM Observability: Four Components
If you’ve worked with APM (Application Performance Monitoring), LLM observability will feel familiar. But there are three important differences:
Non-determinism. The same prompt with temperature > 0 produces different outputs. You can’t simply compare expected vs actual — quality has to be evaluated statistically.
Per-call cost. An HTTP request to your own API costs fractions of a cent. A single GPT-4o call with 10,000 tokens of context costs 3-5 cents. With 1,000 daily users, the difference between a good and bad prompt is hundreds of dollars per month.
Output quality. A 200 OK tells you nothing about the response quality. The model can return grammatically correct but factually useless text. You need quality metrics, not just availability metrics.
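The per-call cost difference is easy to make concrete. A back-of-envelope sketch, assuming GPT-4o pricing of $2.50 per million input tokens and $10 per million output tokens (check current rates, these change):

```python
# Approximate GPT-4o pricing (assumed; verify against current rates)
INPUT_PRICE = 2.50 / 1_000_000   # $ per input token
OUTPUT_PRICE = 10.00 / 1_000_000  # $ per output token

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single LLM call at the rates above."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

# A bloated 10,000-token prompt vs. a trimmed 2,000-token one,
# 1,000 users x 1 call/day x 30 days, 500 output tokens per call:
bad = call_cost(10_000, 500) * 1_000 * 30
good = call_cost(2_000, 500) * 1_000 * 30
print(round(bad - good))  # monthly difference in dollars
```

At these rates the single call with 10,000 tokens of context lands at about 3 cents, and the prompt difference alone is hundreds of dollars a month — exactly the gap cost tracking is meant to surface.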
Four components of LLM observability:
```
+---------------------------------------------------------+
|                    LLM Observability                    |
+----------+----------+--------------+--------------------+
| Tracing  |   Cost   |    Prompt    |     Evaluation     |
|          | Tracking |  Management  |                    |
+----------+----------+--------------+--------------------+
|   What   | How much | Which prompt |      How good      |
| happened | it cost  |  is in prod  |   is the output    |
+----------+----------+--------------+--------------------+
```
Langfuse covers all four. You can start with one (usually tracing) and add the rest as you grow.
Why Langfuse
There are five or six LLM observability platforms on the market. Here’s what sets Langfuse apart:
Open-source (MIT). Code on GitHub, self-hosted version with no limits. No vendor lock-in — you always have the source code and data.
Self-hosted for free. No caps on trace count. For small and mid-size teams, self-hosting on a single VM is cheaper than any SaaS.
Not tied to a framework. LangSmith is built for LangChain — it’s harder to use with other stacks. Langfuse works with any LLM provider: OpenAI, Anthropic, open-source models via LiteLLM, Vercel AI SDK.
Actively developed. Over the past year: SDK v3 on OpenTelemetry (native compatibility with any OTEL instrumentation), built-in MCP server (prompt management right from your IDE), observation-level evaluations, dataset versioning. Growth from 10,000 to 22,000 stars in 10 months.
Comparison (brief):
| | Langfuse | LangSmith | Helicone |
|---|---|---|---|
| Open-source | MIT | No | Apache 2.0 |
| Self-hosted | Free, no limits | Enterprise only | Yes |
| Framework lock-in | No | LangChain | OpenAI-first |
| Prompt management | Yes + MCP | Yes | Yes (beta) |
| Free tier (cloud) | 50k obs/mo | 5k traces/mo | 100k req/mo |
Installation in 15 Minutes: Self-Hosted
Minimum requirements: Docker and 2 GB of RAM. Langfuse bundles everything into a single docker-compose.
```yaml
# docker-compose.yml
services:
  langfuse:
    image: langfuse/langfuse:2
    ports:
      - "3000:3000"
    environment:
      DATABASE_URL: postgresql://postgres:postgres@db:5432/langfuse
      NEXTAUTH_SECRET: your-secret-key-change-me
      SALT: your-salt-change-me
      NEXTAUTH_URL: http://localhost:3000
    depends_on:
      - db
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: langfuse
    volumes:
      - langfuse_data:/var/lib/postgresql/data
volumes:
  langfuse_data:
```

```bash
docker compose up -d
```
Langfuse will be available at localhost:3000 within a minute. Create an account, a project, and copy the API keys.
First Traces via Python SDK
```bash
pip install langfuse
```

```python
from langfuse import Langfuse

langfuse = Langfuse(
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    host="http://localhost:3000",  # your self-hosted URL
)

# Create a trace
trace = langfuse.trace(name="chat-response", user_id="user-123")

# Log an LLM call
generation = trace.generation(
    name="gpt-4o-response",
    model="gpt-4o",
    input=[{"role": "user", "content": "Recommend cafes in Moscow"}],
    output="Here are some options...",
    usage={"input": 42, "output": 128},
)

langfuse.flush()
```
Open the Langfuse UI — the trace will appear in the list with model, token, and latency data.
Automatic Instrumentation via LiteLLM
Manual logging is for understanding the mechanics. In production, automation is better. If you use LiteLLM as a proxy to LLM providers, Langfuse hooks in with a single line:
```python
import litellm

litellm.success_callback = ["langfuse"]

# Every call is automatically sent to Langfuse
response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
    metadata={
        "trace_name": "chat-response",
        "trace_user_id": "user-123",
        "generation_name": "greeting",
        "tags": ["production", "chat"],
    },
)
```
LiteLLM sends everything to Langfuse: model, tokens, cost, latency, input/output. You write a regular LLM call, Langfuse fills up automatically.
For OpenAI SDK, there’s a drop-in replacement:
```python
from langfuse.openai import openai

# Use it like the regular OpenAI SDK
# All calls are automatically traced
client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)
```
Tracing: What’s Inside an LLM Call
A trace in Langfuse is a tree of operations. At the top level is the trace (one user request). Inside are generations (LLM calls) and spans (intermediate steps: retrieval, preprocessing, postprocessing).
```
Trace: "generate-itinerary"
|
+-- Span: "validate-input" (12ms)
|
+-- Generation: "analyze-request" (GPT-4o, 340 tokens, $0.003)
|
+-- Span: "search-places" (Foursquare API, 800ms)
|
+-- Generation: "build-itinerary" (GPT-4o, 2100 tokens, $0.018)
|
+-- Generation: "validate-result" (GPT-4o-mini, 450 tokens, $0.001)
```
What you see in the UI for each generation:
- Input/Output — the full prompt and response
- Model — which model was used
- Tokens — input, output, total
- Cost — dollar amount (calculated automatically by model)
- Latency — response time
- Metadata — arbitrary fields (user ID, feature, session)
Three things tracing reveals immediately:
Pattern 1: hidden retries. Retry logic in code can call the model 2-3 times per user request. Without tracing, you only see the final response. With tracing — every call with its cost.
Pattern 2: model mismatch. GPT-4o-mini is expected for quick responses, but one endpoint pulls GPT-4o. Tracing shows the model per call — filtering by the model field immediately surfaces the discrepancy.
Pattern 3: latency bottleneck. In a chain of four LLM calls, one takes 80% of the time. Without spans, you only know the total. With spans — you see exactly where to optimize.
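Pattern 3 is simple arithmetic once spans report their latencies. A toy sketch over hypothetical timings for an itinerary-style chain (the step names and numbers are illustrative, not real measurements):

```python
# Hypothetical per-step latencies in milliseconds
spans = {
    "validate-input": 12,
    "analyze-request": 650,
    "search-places": 800,
    "build-itinerary": 6200,
    "validate-result": 400,
}

total = sum(spans.values())        # end-to-end latency
worst = max(spans, key=spans.get)  # the step to optimize first
print(worst, f"{spans[worst] / total:.0%}")
```

Without spans you only know `total`; with spans, the one line identifying `worst` tells you where the optimization effort pays off.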
Prompt Management: Versioning Without Deploys
The standard approach: prompts are hardcoded. A change requires PR, code review, merge, deploy. A/B testing two prompts means two deploys.
Langfuse externalizes prompts: your code loads them by name through an API. Changing a prompt in the UI takes effect on the next request — no deploy needed.
```python
# Load a prompt from Langfuse
prompt = langfuse.get_prompt("travel-assistant")

# Compile with variables
compiled = prompt.compile(
    destination="Moscow",
    preferences="vegetarian cuisine",
)

# Use in an LLM call
response = litellm.completion(
    model="gpt-4o",
    messages=compiled,
    metadata={"langfuse_prompt_name": "travel-assistant"},
)
```
Every prompt change creates a new version. Langfuse stores the full history: who changed what and when. Roll back with a single click.
Fallback Pattern
Langfuse is an external service. It can go down. The prompt is always needed.
```python
try:
    prompt = langfuse.get_prompt("travel-assistant", label="production")
    messages = prompt.compile(destination=city)
except Exception:
    # If Langfuse is unavailable, fall back to a hardcoded prompt
    messages = [
        {"role": "system", "content": FALLBACK_SYSTEM_PROMPT},
        {"role": "user", "content": f"Recommend cafes in {city}"},
    ]
```
This pattern is mandatory for production. Langfuse speeds up prompt iteration, but it shouldn’t be a single point of failure.
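The same pattern can be wrapped in a reusable helper with a small in-process cache, so an outage doesn't become a request storm. A sketch: `get_prompt_safe`, the fetcher, and the TTL are all illustrative names (note the SDK also has its own client-side cache, `cache_ttl_seconds`; this adds the fallback layer on top):

```python
import time

_cache: dict = {}  # prompt name -> (fetched_at, prompt)

def get_prompt_safe(name, fetch, fallback, ttl=300.0):
    """Fetch a prompt with a small in-process cache; on failure,
    serve the last cached copy, then the hardcoded fallback."""
    now = time.monotonic()
    hit = _cache.get(name)
    if hit and now - hit[0] < ttl:
        return hit[1]  # fresh cache hit, skip the network round-trip
    try:
        # e.g. fetch = lambda n: langfuse.get_prompt(n, label="production")
        prompt = fetch(name)
        _cache[name] = (now, prompt)
        return prompt
    except Exception:
        return hit[1] if hit else fallback  # stale cache, then fallback

# With Langfuse down, the hardcoded fallback is served:
def broken_fetch(name):
    raise ConnectionError("langfuse unreachable")

print(get_prompt_safe("travel-assistant", broken_fetch, "FALLBACK_PROMPT"))
```

The ordering matters: a stale cached prompt is still better than the hardcoded fallback, because it reflects the last version someone deliberately published.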
MCP Server: Prompts from Your IDE
Since November 2025, Langfuse supports MCP (Model Context Protocol). By connecting the MCP server to Claude Code, Cursor, or another AI assistant, you can read and edit prompts directly from your IDE.
```json
{
  "mcpServers": {
    "langfuse": {
      "type": "http",
      "url": "https://your-langfuse.com/api/public/mcp",
      "headers": {
        "Authorization": "Basic base64(publicKey:secretKey)"
      }
    }
  }
}
```
Instead of switching between your IDE and a browser, the AI assistant sees the current prompts, suggests changes, and applies them via MCP — all without leaving the editor.
Cost Tracking: How Much Each Feature Costs
Langfuse automatically calculates the cost of each LLM call based on the model and token count. The built-in pricing table updates with every release (GPT-5.2, Claude Opus 4 — supported on launch day).
In the dashboard you see:
- Total cost for a given period
- Cost per trace — average cost of a single request
- Cost per user — how much a specific user costs
- Cost per model — cost distribution across models
Pattern: Per-Feature Cost Tracking
By adding tags or metadata to traces, you group costs by feature:
```python
trace = langfuse.trace(
    name="itinerary-generation",
    tags=["feature:itinerary", "tier:premium"],
    user_id="user-123",
)
```
Filtering by the feature:itinerary tag in the dashboard shows what itinerary generation specifically costs. Chat costs go separately, recommendations separately, summarization separately.
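The same grouping works outside the dashboard too, once traces are pulled via the API. A sketch over hypothetical trace dicts with `tags` and `total_cost` fields (the helper name is illustrative):

```python
from collections import defaultdict

def cost_by_feature(traces):
    """Sum trace costs per feature:* tag."""
    totals = defaultdict(float)
    for t in traces:
        for tag in t["tags"]:
            if tag.startswith("feature:"):
                totals[tag] += t["total_cost"]
    return dict(totals)

# Hypothetical traces as returned by your API-fetching code
traces = [
    {"tags": ["feature:itinerary", "tier:premium"], "total_cost": 0.018},
    {"tags": ["feature:chat"], "total_cost": 0.003},
    {"tags": ["feature:itinerary"], "total_cost": 0.022},
]
print(cost_by_feature(traces))
```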
Pattern: Syncing Costs to Your Database
The Langfuse API lets you programmatically retrieve costs:
```python
# Get all traces for a user
traces = langfuse.fetch_traces(user_id="user-123")

# Sum the cost
total_cost = sum(t.total_cost or 0 for t in traces.data)

# Write to your database for limits, billing, analytics
db.update_user_spending(user_id="user-123", amount=total_cost)
```
Useful if you have per-user LLM spending limits (freemium, credits) or want to show users their consumption.
Evaluations: Output Quality in Numbers
Tracing shows what happened. Evaluations show how well it went.
LLM-as-a-Judge
Automated evaluation via LLM: one model scores the outputs of another. Langfuse supports this out of the box.
Configure it in the UI: pick an evaluator template (relevance, helpfulness, toxicity), select the target set of traces — Langfuse runs the evaluation on each one. The result is a score from 0 to 1 attached to the trace.
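Scores don't have to come from the UI evaluators; you can compute and attach them yourself, for example by mapping a judge's textual verdict onto the same 0-to-1 scale. A sketch (the verdict labels are hypothetical):

```python
def verdict_to_score(verdict: str) -> float:
    """Map an LLM judge's textual verdict to a 0..1 score.
    Unknown verdicts default to 0.0 so they surface as failures."""
    scale = {"bad": 0.0, "acceptable": 0.5, "good": 1.0}
    return scale.get(verdict.strip().lower(), 0.0)

score = verdict_to_score("Good")
# Attach it to the trace, e.g. with the v2 SDK:
# langfuse.score(trace_id=trace.id, name="relevance", value=score)
print(score)
```

Defaulting unknown verdicts to 0.0 is a deliberate choice: a judge that goes off-script should look like a quality regression, not pass silently.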
Observation-Level Evaluations (New in 2026)
Previously, scores could only be attached to the trace as a whole. But in a chain of four LLM calls, the problem might be in one specific step. Since February 2026, you can evaluate each observation (generation, span) individually.
Example: in a pipeline “analyze, search, generate, validate” the evaluator for generate checks factual accuracy, while the one for validate checks format compliance. Different evaluators for different steps.
Datasets: Regression Testing for Prompts
A dataset in Langfuse is a set of input/expected_output pairs. Changed a prompt — run the dataset, compare scores with the previous version. If quality dropped — roll back.
Since December 2025, Langfuse versions datasets: each change (adding, deleting, updating an item) creates a new version. You can run an experiment on a specific historical version for reproducibility.
```python
# Create or update a dataset
langfuse.create_dataset(name="travel-queries")

langfuse.create_dataset_item(
    dataset_name="travel-queries",
    input={"query": "Cafes in central Moscow"},
    expected_output="A list of 5+ cafes with addresses",
)

# Run an experiment
dataset = langfuse.get_dataset("travel-queries")
for item in dataset.items:
    response = run_my_pipeline(item.input)
    item.link(
        trace_id=response.trace_id,
        run_name="prompt-v3-test",
    )
```
In the Langfuse UI, you’ll see a comparison of runs: prompt v2 vs v3, with scores per item.
Production Checklist
Self-Hosted vs Cloud
| Criterion | Self-hosted | Cloud |
|---|---|---|
| Cost | Infrastructure only (~$10-20/mo VPS) | From $59/mo (Pro) |
| Data | Stays with you | On Langfuse servers |
| Maintenance | You update, back up, and monitor | All included |
| Limits | None | 50k obs/mo (free), then by plan |
| Best for | Teams with DevOps, compliance requirements | Quick start, small teams |
For most production projects, self-hosted is cheaper: $15/mo for a VPS vs $59+/mo for cloud. But self-hosted means you’re responsible for uptime and backups.
Monitoring Langfuse Itself
Langfuse sits in the critical path of your LLM calls if you load prompts from it. If it goes down, prompts don't load. Two patterns:
- Health check. `/api/public/health` returns the status. Add it to your monitoring (Zabbix, Uptime Robot, Grafana).
- Fallback prompts. Every `get_prompt()` call is wrapped in try/except with a hardcoded fallback. Langfuse can be down — the app keeps running.
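If you don't want a full monitoring agent, the health check is a few lines of stdlib Python (a sketch; point `base_url` at your own instance):

```python
from urllib import request

def langfuse_healthy(base_url: str, timeout: float = 3.0) -> bool:
    """Return True if the Langfuse health endpoint answers HTTP 200."""
    try:
        url = f"{base_url}/api/public/health"
        with request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        # Connection refused, timeout, DNS failure -> treat as down
        return False

print(langfuse_healthy("http://localhost:3000"))
```

Wire the boolean into whatever alerting you already have; the point is that "Langfuse is down" should page someone before users notice slower fallback prompts.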
Retention and Cleanup
Traces take up space in PostgreSQL. 1,000 traces per day means a few hundred MB to roughly 1 GB per month, depending on prompt and response size. Set up automatic deletion of old traces:
```sql
-- Delete traces older than 90 days
DELETE FROM traces WHERE created_at < NOW() - INTERVAL '90 days';
```
Or use the built-in retention settings in Langfuse Cloud.
Minimum Production Setup
```
+-------------------------------------------+
|             Your Application              |
|                                           |
|  LLM call -> LiteLLM -> OpenAI/Anthropic  |
|                |                          |
|                | metadata (trace_name,    |
|                |   user_id, tags)         |
|                v                          |
|      LiteLLM callback -> Langfuse         |
+-------------------------------------------+
                     |
                     v
           +--------------------+
           | Langfuse (Docker)  |
           |                    |
           |  - Traces          |  <- automatic
           |  - Cost            |  <- automatic
           |  - Prompts         |  <- managed in UI
           |  - Evaluations     |  <- configured in UI
           +--------------------+
```
Prompts are loaded from Langfuse on each call (with caching). Traces are sent asynchronously — they don’t block the main thread. Cost is calculated automatically by model and token count.
Conclusion
Langfuse covers four tasks with a single tool: tracing, cost tracking, prompt management, and evaluations. The self-hosted version is free and deploys in 15 minutes. Integration via LiteLLM or OpenAI drop-in takes one line of code.
The minimum path: hook up tracing to a single endpoint, look at the data after a week. The first call chain with a cost breakdown will make the value obvious.
Links:
- Langfuse GitHub — source code and self-hosted setup
- Langfuse Docs — documentation
- LiteLLM + Langfuse — integration
- Langfuse Changelog — all updates
FAQ
What is the actual storage overhead of keeping traces in PostgreSQL long-term?
A typical trace with a 1,000-token prompt and 500-token response takes roughly 8–12 KB in PostgreSQL, including indexes. At 1,000 traces/day that is about 300–400 MB/month. The official retention recommendation is 90 days, which puts steady-state storage around 1–1.5 GB — comfortably within a $15/month VPS. If prompt and response content is large (RAG with long contexts), multiply by 3–5x and plan accordingly.
Does Langfuse support multi-tenant setups where different teams have isolated data?
Yes. Langfuse has Organizations → Projects hierarchy. Each project has its own API keys, prompt namespace, and dataset isolation. A single self-hosted instance can serve multiple teams with full data separation at the project level — no cross-project data leakage. RBAC (role-based access control) is available in the cloud Pro plan and in the self-hosted Enterprise edition.
How does Langfuse’s prompt caching interact with the cache_ttl_seconds parameter?
cache_ttl_seconds controls client-side in-memory caching of the fetched prompt object — it prevents a round-trip to the Langfuse server on every LLM call. It does not affect provider-side prompt caching (Anthropic or OpenAI cache headers). The two mechanisms are independent: you can have a 5-minute Langfuse cache TTL while still benefiting from Anthropic’s prompt caching for the actual LLM call, as long as your application passes the correct cache-control headers downstream.