LLM Observability with Langfuse: Why You Need It and How to Set It Up
What is LLM observability?
LLM observability is the practice of monitoring, tracing, and evaluating AI language model calls in production across four dimensions: request tracing, cost tracking, prompt versioning, and output quality assessment. Unlike traditional APM, it accounts for non-deterministic responses and per-call costs that standard metrics like HTTP status codes cannot capture.
TL;DR
- Langfuse is MIT-licensed, self-hosted with no trace limits, 22,000+ GitHub stars, 23M SDK installs/month
- A single bad prompt vs. a good one can cost hundreds of dollars/month at 1,000 daily users — cost tracking is non-optional
- Self-hosted setup takes 15 minutes with Docker Compose; minimum requirements are Docker and 2 GB RAM
- SDK v3 is built on OpenTelemetry — native compatibility with any existing OTEL instrumentation, no lock-in
- Covers all four LLM observability pillars: tracing, cost tracking, prompt management, and output evaluation
LLM calls in production need tracing, prompts need versioning, costs need tracking. Standard APM tools won’t help here: responses are non-deterministic, every call costs money, and a 200 OK says nothing about output quality.
Langfuse is an open-source platform (MIT) for LLM observability. Self-hosted with no limits, 22,000+ GitHub stars, 23 million SDK installs per month. This guide covers installation, tracing, prompt management, cost tracking, and evaluations.
LLM Observability: Four Components
If you’ve worked with APM (Application Performance Monitoring), LLM observability will feel familiar. But there are three important differences:
Non-determinism. The same prompt with temperature > 0 produces different outputs. You can’t simply compare expected vs actual — quality has to be evaluated statistically.
Per-call cost. An HTTP request to your own API costs fractions of a cent. A single GPT-4o call with 10,000 tokens of context costs 3-5 cents. With 1,000 daily users, the difference between a good and bad prompt is hundreds of dollars per month.
Output quality. A 200 OK tells you nothing about the response quality. The model can return grammatically correct but factually useless text. You need quality metrics, not just availability metrics.
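The per-call cost difference is easy to make concrete. A back-of-envelope sketch, assuming GPT-4o pricing of $2.50 per million input tokens and $10 per million output tokens (check current rates, these change):

```python
# Approximate GPT-4o pricing (assumed; verify against current rates)
INPUT_PRICE = 2.50 / 1_000_000   # $ per input token
OUTPUT_PRICE = 10.00 / 1_000_000  # $ per output token

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single LLM call at the rates above."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

# A bloated 10,000-token prompt vs. a trimmed 2,000-token one,
# 1,000 users x 1 call/day x 30 days, 500 output tokens per call:
bad = call_cost(10_000, 500) * 1_000 * 30
good = call_cost(2_000, 500) * 1_000 * 30
print(round(bad - good))  # monthly difference in dollars
```

At these rates the single call with 10,000 tokens of context lands at about 3 cents, and the prompt difference alone is hundreds of dollars a month — exactly the gap cost tracking is meant to surface.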
Four components of LLM observability:
```
+---------------------------------------------------------+
|                    LLM Observability                    |
+----------+----------+--------------+--------------------+
| Tracing  |   Cost   |    Prompt    |     Evaluation     |
|          | Tracking |  Management  |                    |
+----------+----------+--------------+--------------------+
|   What   | How much | Which prompt |      How good      |
| happened | it cost  |  is in prod  |   is the output    |
+----------+----------+--------------+--------------------+
```
Langfuse covers all four. You can start with one (usually tracing) and add the rest as you grow.
Why Langfuse
There are five or six LLM observability platforms on the market. Here’s what sets Langfuse apart:
Open-source (MIT). Code on GitHub, self-hosted version with no limits. No vendor lock-in — you always have the source code and data.
Self-hosted for free. No caps on trace count. For small and mid-size teams, self-hosting on a single VM is cheaper than any SaaS.
Not tied to a framework. LangSmith is built for LangChain — it’s harder to use with other stacks. Langfuse works with any LLM provider: OpenAI, Anthropic, open-source models via LiteLLM, Vercel AI SDK.
Actively developed. Over the past year: SDK v3 on OpenTelemetry (native compatibility with any OTEL instrumentation), built-in MCP server (prompt management right from your IDE), observation-level evaluations, dataset versioning. Growth from 10,000 to 22,000 stars in 10 months.
Comparison (brief):
| | Langfuse | LangSmith | Helicone |
|---|---|---|---|
| Open-source | MIT | No | Apache 2.0 |
| Self-hosted | Free, no limits | Enterprise only | Yes |
| Framework lock-in | No | LangChain | OpenAI-first |
| Prompt management | Yes + MCP | Yes | Yes (beta) |
| Free tier (cloud) | 50k obs/mo | 5k traces/mo | 100k req/mo |
Installation in 15 Minutes: Self-Hosted
Minimum requirements: Docker and 2 GB of RAM. Langfuse bundles everything into a single docker-compose.
```yaml
# docker-compose.yml
services:
  langfuse:
    image: langfuse/langfuse:2
    ports:
      - "3000:3000"
    environment:
      DATABASE_URL: postgresql://postgres:postgres@db:5432/langfuse
      NEXTAUTH_SECRET: your-secret-key-change-me
      SALT: your-salt-change-me
      NEXTAUTH_URL: http://localhost:3000
    depends_on:
      - db
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: langfuse
    volumes:
      - langfuse_data:/var/lib/postgresql/data
volumes:
  langfuse_data:
```

```bash
docker compose up -d
```
Langfuse will be available at localhost:3000 within a minute. Create an account, a project, and copy the API keys.
First Traces via Python SDK
```bash
pip install langfuse
```

```python
from langfuse import Langfuse

langfuse = Langfuse(
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    host="http://localhost:3000",  # your self-hosted URL
)

# Create a trace
trace = langfuse.trace(name="chat-response", user_id="user-123")

# Log an LLM call
generation = trace.generation(
    name="gpt-4o-response",
    model="gpt-4o",
    input=[{"role": "user", "content": "Recommend cafes in Moscow"}],
    output="Here are some options...",
    usage={"input": 42, "output": 128},
)

langfuse.flush()
```
Open the Langfuse UI — the trace will appear in the list with model, token, and latency data.
Automatic Instrumentation via LiteLLM
Manual logging is for understanding the mechanics. In production, automation is better. If you use LiteLLM as a proxy to LLM providers, Langfuse hooks in with a single line:
```python
import litellm

litellm.success_callback = ["langfuse"]

# Every call is automatically sent to Langfuse
response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
    metadata={
        "trace_name": "chat-response",
        "trace_user_id": "user-123",
        "generation_name": "greeting",
        "tags": ["production", "chat"],
    },
)
```
LiteLLM sends everything to Langfuse: model, tokens, cost, latency, input/output. You write a regular LLM call, Langfuse fills up automatically.
For OpenAI SDK, there’s a drop-in replacement:
```python
from langfuse.openai import openai

# Use it like the regular OpenAI SDK
# All calls are automatically traced
client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)
```
Tracing: What’s Inside an LLM Call
A trace in Langfuse is a tree of operations. At the top level is the trace (one user request). Inside are generations (LLM calls) and spans (intermediate steps: retrieval, preprocessing, postprocessing).
```
Trace: "generate-itinerary"
|
+-- Span: "validate-input" (12ms)
|
+-- Generation: "analyze-request" (GPT-4o, 340 tokens, $0.003)
|
+-- Span: "search-places" (Foursquare API, 800ms)
|
+-- Generation: "build-itinerary" (GPT-4o, 2100 tokens, $0.018)
|
+-- Generation: "validate-result" (GPT-4o-mini, 450 tokens, $0.001)
```
What you see in the UI for each generation:
- Input/Output — the full prompt and response
- Model — which model was used
- Tokens — input, output, total
- Cost — dollar amount (calculated automatically by model)
- Latency — response time
- Metadata — arbitrary fields (user ID, feature, session)
Three things tracing reveals immediately:
Pattern 1: hidden retries. Retry logic in code can call the model 2-3 times per user request. Without tracing, you only see the final response. With tracing — every call with its cost.
Pattern 2: model mismatch. GPT-4o-mini is expected for quick responses, but one endpoint pulls GPT-4o. Tracing shows the model per call — filtering by the model field immediately surfaces the discrepancy.
Pattern 3: latency bottleneck. In a chain of four LLM calls, one takes 80% of the time. Without spans, you only know the total. With spans — you see exactly where to optimize.
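Pattern 3 is simple arithmetic once spans report their latencies. A toy sketch over hypothetical timings for an itinerary-style chain (the step names and numbers are illustrative, not real measurements):

```python
# Hypothetical per-step latencies in milliseconds
spans = {
    "validate-input": 12,
    "analyze-request": 650,
    "search-places": 800,
    "build-itinerary": 6200,
    "validate-result": 400,
}

total = sum(spans.values())        # end-to-end latency
worst = max(spans, key=spans.get)  # the step to optimize first
print(worst, f"{spans[worst] / total:.0%}")
```

Without spans you only know `total`; with spans, the one line identifying `worst` tells you where the optimization effort pays off.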
Prompt Management: Versioning Without Deploys
The standard approach: prompts are hardcoded. A change requires PR, code review, merge, deploy. A/B testing two prompts means two deploys.
Langfuse externalizes prompts: your code loads them by name through an API. Changing a prompt in the UI takes effect on the next request — no deploy needed.
```python
# Load a prompt from Langfuse
prompt = langfuse.get_prompt("travel-assistant")

# Compile with variables
compiled = prompt.compile(
    destination="Moscow",
    preferences="vegetarian cuisine",
)

# Use in an LLM call
response = litellm.completion(
    model="gpt-4o",
    messages=compiled,
    metadata={"langfuse_prompt_name": "travel-assistant"},
)
```
Every prompt change creates a new version. Langfuse stores the full history: who changed what and when. Roll back with a single click.
Fallback Pattern
Langfuse is an external service. It can go down. The prompt is always needed.
```python
try:
    prompt = langfuse.get_prompt("travel-assistant", label="production")
    messages = prompt.compile(destination=city)
except Exception:
    # If Langfuse is unavailable, fall back to a hardcoded prompt
    messages = [
        {"role": "system", "content": FALLBACK_SYSTEM_PROMPT},
        {"role": "user", "content": f"Recommend cafes in {city}"},
    ]
```
This pattern is mandatory for production. Langfuse speeds up prompt iteration, but it shouldn’t be a single point of failure.
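The same pattern can be wrapped in a reusable helper with a small in-process cache, so an outage doesn't become a request storm. A sketch: `get_prompt_safe`, the fetcher, and the TTL are all illustrative names (note the SDK also has its own client-side cache, `cache_ttl_seconds`; this adds the fallback layer on top):

```python
import time

_cache: dict = {}  # prompt name -> (fetched_at, prompt)

def get_prompt_safe(name, fetch, fallback, ttl=300.0):
    """Fetch a prompt with a small in-process cache; on failure,
    serve the last cached copy, then the hardcoded fallback."""
    now = time.monotonic()
    hit = _cache.get(name)
    if hit and now - hit[0] < ttl:
        return hit[1]  # fresh cache hit, skip the network round-trip
    try:
        # e.g. fetch = lambda n: langfuse.get_prompt(n, label="production")
        prompt = fetch(name)
        _cache[name] = (now, prompt)
        return prompt
    except Exception:
        return hit[1] if hit else fallback  # stale cache, then fallback

# With Langfuse down, the hardcoded fallback is served:
def broken_fetch(name):
    raise ConnectionError("langfuse unreachable")

print(get_prompt_safe("travel-assistant", broken_fetch, "FALLBACK_PROMPT"))
```

The ordering matters: a stale cached prompt is still better than the hardcoded fallback, because it reflects the last version someone deliberately published.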
MCP Server: Prompts from Your IDE
Since November 2025, Langfuse supports MCP (Model Context Protocol). By connecting the MCP server to Claude Code, Cursor, or another AI assistant, you can read and edit prompts directly from your IDE.
```json
{
  "mcpServers": {
    "langfuse": {
      "type": "http",
      "url": "https://your-langfuse.com/api/public/mcp",
      "headers": {
        "Authorization": "Basic base64(publicKey:secretKey)"
      }
    }
  }
}
```
Instead of switching between your IDE and a browser, the AI assistant sees the current prompts, suggests changes, and applies them via MCP — all without leaving the editor.
Cost Tracking: How Much Each Feature Costs
Langfuse automatically calculates the cost of each LLM call based on the model and token count. The built-in pricing table updates with every release (GPT-5.2, Claude Opus 4 — supported on launch day).
In the dashboard you see:
- Total cost for a given period
- Cost per trace — average cost of a single request
- Cost per user — how much a specific user costs
- Cost per model — cost distribution across models
Pattern: Per-Feature Cost Tracking
By adding tags or metadata to traces, you group costs by feature:
```python
trace = langfuse.trace(
    name="itinerary-generation",
    tags=["feature:itinerary", "tier:premium"],
    user_id="user-123",
)
```
Filtering by the feature:itinerary tag in the dashboard shows what itinerary generation specifically costs. Chat costs go separately, recommendations separately, summarization separately.
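The same grouping works outside the dashboard too, once traces are pulled via the API. A sketch over hypothetical trace dicts with `tags` and `total_cost` fields (the helper name is illustrative):

```python
from collections import defaultdict

def cost_by_feature(traces):
    """Sum trace costs per feature:* tag."""
    totals = defaultdict(float)
    for t in traces:
        for tag in t["tags"]:
            if tag.startswith("feature:"):
                totals[tag] += t["total_cost"]
    return dict(totals)

# Hypothetical traces as returned by your API-fetching code
traces = [
    {"tags": ["feature:itinerary", "tier:premium"], "total_cost": 0.018},
    {"tags": ["feature:chat"], "total_cost": 0.003},
    {"tags": ["feature:itinerary"], "total_cost": 0.022},
]
print(cost_by_feature(traces))
```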
Pattern: Syncing Costs to Your Database
The Langfuse API lets you programmatically retrieve costs:
```python
# Get all traces for a user
traces = langfuse.fetch_traces(user_id="user-123")

# Sum the cost
total_cost = sum(t.total_cost or 0 for t in traces.data)

# Write to your database for limits, billing, analytics
db.update_user_spending(user_id="user-123", amount=total_cost)
```
Useful if you have per-user LLM spending limits (freemium, credits) or want to show users their consumption.
Evaluations: Output Quality in Numbers
Tracing shows what happened. Evaluations show how well it went.
LLM-as-a-Judge
Automated evaluation via LLM: one model scores the outputs of another. Langfuse supports this out of the box.
Configure it in the UI: pick an evaluator template (relevance, helpfulness, toxicity), select the target set of traces — Langfuse runs the evaluation on each one. The result is a score from 0 to 1 attached to the trace.
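Scores don't have to come from the UI evaluators; you can compute and attach them yourself, for example by mapping a judge's textual verdict onto the same 0-to-1 scale. A sketch (the verdict labels are hypothetical):

```python
def verdict_to_score(verdict: str) -> float:
    """Map an LLM judge's textual verdict to a 0..1 score.
    Unknown verdicts default to 0.0 so they surface as failures."""
    scale = {"bad": 0.0, "acceptable": 0.5, "good": 1.0}
    return scale.get(verdict.strip().lower(), 0.0)

score = verdict_to_score("Good")
# Attach it to the trace, e.g. with the v2 SDK:
# langfuse.score(trace_id=trace.id, name="relevance", value=score)
print(score)
```

Defaulting unknown verdicts to 0.0 is a deliberate choice: a judge that goes off-script should look like a quality regression, not pass silently.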
Observation-Level Evaluations (New in 2026)
Previously, scores could only be attached to the trace as a whole. But in a chain of four LLM calls, the problem might be in one specific step. Since February 2026, you can evaluate each observation (generation, span) individually.
Example: in a pipeline “analyze, search, generate, validate” the evaluator for generate checks factual accuracy, while the one for validate checks format compliance. Different evaluators for different steps.
Datasets: Regression Testing for Prompts
A dataset in Langfuse is a set of input/expected_output pairs. Changed a prompt — run the dataset, compare scores with the previous version. If quality dropped — roll back.
Since December 2025, Langfuse versions datasets: each change (adding, deleting, updating an item) creates a new version. You can run an experiment on a specific historical version for reproducibility.
```python
# Create or update a dataset
langfuse.create_dataset(name="travel-queries")

langfuse.create_dataset_item(
    dataset_name="travel-queries",
    input={"query": "Cafes in central Moscow"},
    expected_output="A list of 5+ cafes with addresses",
)

# Run an experiment
dataset = langfuse.get_dataset("travel-queries")
for item in dataset.items:
    response = run_my_pipeline(item.input)
    item.link(
        trace_id=response.trace_id,
        run_name="prompt-v3-test",
    )
```
In the Langfuse UI, you’ll see a comparison of runs: prompt v2 vs v3, with scores per item.
Production Checklist
Self-Hosted vs Cloud
| Criterion | Self-hosted | Cloud |
|---|---|---|
| Cost | Infrastructure only (~$10-20/mo VPS) | From $59/mo (Pro) |
| Data | Stays with you | On Langfuse servers |
| Maintenance | You update, back up, and monitor | All included |
| Limits | None | 50k obs/mo (free), then by plan |
| Best for | Teams with DevOps, compliance requirements | Quick start, small teams |
For most production projects, self-hosted is cheaper: $15/mo for a VPS vs $59+/mo for cloud. But self-hosted means you’re responsible for uptime and backups.
Monitoring Langfuse Itself
Langfuse sits in the critical path of your LLM calls if you load prompts from it. If it goes down, prompts don't load. Two patterns:
- Health check. `/api/public/health` returns the status. Add it to your monitoring (Zabbix, Uptime Robot, Grafana).
- Fallback prompts. Every `get_prompt()` call is wrapped in try/except with a hardcoded fallback. Langfuse can be down — the app keeps running.
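If you don't want a full monitoring agent, the health check is a few lines of stdlib Python (a sketch; point `base_url` at your own instance):

```python
from urllib import request

def langfuse_healthy(base_url: str, timeout: float = 3.0) -> bool:
    """Return True if the Langfuse health endpoint answers HTTP 200."""
    try:
        url = f"{base_url}/api/public/health"
        with request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        # Connection refused, timeout, DNS failure -> treat as down
        return False

print(langfuse_healthy("http://localhost:3000"))
```

Wire the boolean into whatever alerting you already have; the point is that "Langfuse is down" should page someone before users notice slower fallback prompts.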
Retention and Cleanup
Traces take up space in PostgreSQL. 1,000 traces per day means a few hundred MB to roughly 1 GB per month, depending on prompt and response size. Set up automatic deletion of old traces:
```sql
-- Delete traces older than 90 days
DELETE FROM traces WHERE created_at < NOW() - INTERVAL '90 days';
```
Or use the built-in retention settings in Langfuse Cloud.
Minimum Production Setup
```
+-------------------------------------------+
|             Your Application              |
|                                           |
|  LLM call -> LiteLLM -> OpenAI/Anthropic  |
|                |                          |
|                | metadata (trace_name,    |
|                |   user_id, tags)         |
|                v                          |
|      LiteLLM callback -> Langfuse         |
+-------------------------------------------+
                     |
                     v
           +--------------------+
           | Langfuse (Docker)  |
           |                    |
           |  - Traces          |  <- automatic
           |  - Cost            |  <- automatic
           |  - Prompts         |  <- managed in UI
           |  - Evaluations     |  <- configured in UI
           +--------------------+
```
Prompts are loaded from Langfuse on each call (with caching). Traces are sent asynchronously — they don’t block the main thread. Cost is calculated automatically by model and token count.
Conclusion
Langfuse covers four tasks with a single tool: tracing, cost tracking, prompt management, and evaluations. The self-hosted version is free and deploys in 15 minutes. Integration via LiteLLM or OpenAI drop-in takes one line of code.
The minimum path: hook up tracing to a single endpoint, look at the data after a week. The first call chain with a cost breakdown will make the value obvious.
Links:
- Langfuse GitHub — source code and self-hosted setup
- Langfuse Docs — documentation
- LiteLLM + Langfuse — integration
- Langfuse Changelog — all updates
FAQ
What is the actual storage overhead of keeping traces in PostgreSQL long-term?
A typical trace with a 1,000-token prompt and 500-token response takes roughly 8–12 KB in PostgreSQL, including indexes. At 1,000 traces/day that is about 300–400 MB/month. The official retention recommendation is 90 days, which puts steady-state storage around 1–1.5 GB — comfortably within a $15/month VPS. If prompt and response content is large (RAG with long contexts), multiply by 3–5x and plan accordingly.
Does Langfuse support multi-tenant setups where different teams have isolated data?
Yes. Langfuse has Organizations → Projects hierarchy. Each project has its own API keys, prompt namespace, and dataset isolation. A single self-hosted instance can serve multiple teams with full data separation at the project level — no cross-project data leakage. RBAC (role-based access control) is available in the cloud Pro plan and in the self-hosted Enterprise edition.
How does Langfuse’s prompt caching interact with the cache_ttl_seconds parameter?
cache_ttl_seconds controls client-side in-memory caching of the fetched prompt object — it prevents a round-trip to the Langfuse server on every LLM call. It does not affect provider-side prompt caching (Anthropic or OpenAI cache headers). The two mechanisms are independent: you can have a 5-minute Langfuse cache TTL while still benefiting from Anthropic’s prompt caching for the actual LLM call, as long as your application passes the correct cache-control headers downstream.