MCP in Production: From Setup to Custom Servers
What is MCP (Model Context Protocol)?
Model Context Protocol (MCP) is an open standard for connecting AI agents to external tools, APIs, and data sources through a unified protocol. Backed by Anthropic, OpenAI, and Microsoft via the Linux Foundation, it lets AI models call tools on local or remote MCP servers over a structured JSON-RPC interface, with over 10,000 public servers available.
TL;DR
- MCP ecosystem: 10,000+ public servers, 97M SDK downloads/month, backed by Anthropic + OpenAI + Microsoft via the Linux Foundation
- Custom servers solve three real problems: off-the-shelf servers work poorly, don't fit your workflow, or update unpredictably
- stdio transport is the right choice for local servers — Streamable HTTP only for multi-tenant cloud deployments
- Critical rule: a stdio server must NEVER write to stdout except JSON-RPC — `console.log()` breaks the entire protocol
- Production patterns covered: error handling with typed errors, fallback chains, health checks, and graceful shutdown
10,000 MCP servers in public registries. 97 million SDK downloads per month. Anthropic, OpenAI, and Microsoft backing the protocol through the Linux Foundation. MCP has become the standard for connecting AI agents to external systems.
But most guides stop at “install the npm package.” You installed it, it works. Now what?
Then comes production. Servers crash silently, processes hang, quotas run out with zero warning. My project JourneyBay runs a stack of custom MCP servers: CLI wrappers, bridges to internal APIs, specialized search indexes. This article covers how to build them, debug them, and keep your sanity.
Why Build Custom MCP Servers
There’s no shortage of public servers. npm, PyPI, and Anthropic’s official registry have hundreds of ready-made packages: GitHub, Slack, PostgreSQL, Notion. Plug it in, add it to your config, done.
Three situations where off-the-shelf servers fall short:
The existing server works poorly. OpenAI and Gemini both ship their own MCP servers. On paper, they look great. In practice: I couldn’t get the Gemini server to work at all — the docs didn’t match the actual behavior. OpenAI’s Codex server did work, but every call spun up a new session with all connected MCP tools and full context loaded. The result: bloated context, slow startup, unpredictable behavior. Writing a 250-line CLI wrapper that does exactly one thing turned out to be simpler and faster.
The existing server doesn’t cover your workflow. Substack has unofficial MCP servers, but I needed semantic search across my own articles with embeddings and caching. The existing servers talk to Substack’s API, but they don’t build indexes or search by meaning. When you need custom logic on top of a standard API, it’s easier to write your own.
Control over reliability. Public servers update without warning. A new version can break tool descriptions, change response formats, or add dependencies. A custom server works exactly as you wrote it and only breaks when you change something.
The MCP Ecosystem in February 2026
The protocol is just over a year old. In that time:
- SDK downloads hit 97 million per month (npmjs + PyPI combined)
- Active servers in registries passed 10,000 (up from 5,800 in mid-2025)
- MCP clients (IDEs, chatbots, agents) number over 300
- In December 2025, MCP was transferred to the Linux Foundation (Agentic AI Foundation), co-founded by Anthropic, OpenAI, and Block
The protocol is no longer Anthropic’s experiment. It’s an industry standard.
MCP Protocol Architecture: The Minimum You Need
Before writing your own server, you need to understand three things: what a server exposes, how it talks to clients, and how it connects.
Three Primitives
An MCP server can provide three types of data:
Tools — functions the LLM calls with parameters. “Find files matching a pattern,” “run an SQL query,” “send a message.” This is the main primitive — the reason people build servers. Each tool has a name, a description, and a JSON Schema for input parameters.
Resources — read-only data addressed by URI. “File contents,” “database schema,” “current project context.” The LLM reads data but doesn’t modify anything. Useful for context that doesn’t require action.
Prompts — instruction templates with variables. Stored on the server, updated without client changes. Handy for managing prompts through Langfuse or similar tools.
90% of the time, a custom server only needs Tools.
Transport: stdio vs Streamable HTTP
stdio — the client launches the server as a child process. Communication happens through stdin/stdout, line-delimited JSON-RPC. No network, no auth, the process dies with the client. This is the standard for local tools.
Streamable HTTP — a single HTTP endpoint where clients send POST requests. Supports OAuth 2.1 (recommended by the spec but not required), sessions, and horizontal scaling. Replaced the deprecated SSE transport in March 2025. Use it for cloud servers and multi-tenant systems.
Rule of thumb: if the server runs locally on a developer’s machine — stdio. If it runs in the cloud for multiple users — Streamable HTTP.
All my custom servers use stdio. They run on my machine, launched by Claude Code as child processes.
Lifecycle
Every session goes through three phases:
1. Initialize — the client sends `initialize` with its protocol version and capabilities. The server responds with its own capabilities: which tools, resources, and prompts it provides. The client confirms with an `initialized` notification.
2. Operation — request-response exchange. The client calls `tools/call` and reads `resources/read`. The server responds with JSON-RPC results.
3. Shutdown — the client closes stdin. The server exits. If it hangs — SIGTERM, then SIGKILL (the exact timeout is implementation-defined; the spec says "a reasonable time").
Critical for stdio: the server MUST NOT write anything to stdout except JSON-RPC messages. `console.log("Server started")` — and everything breaks. Logs go to stderr via `console.error()` only.
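One defensive measure (a sketch under the assumption that you control server startup; this is not an SDK feature) is to patch the console methods before anything else runs, so a stray `console.log` from your own code or a dependency lands on stderr instead of corrupting the JSON-RPC stream:

```typescript
// Patch console methods at startup: anything that would have gone to stdout
// is rerouted to stderr, keeping stdout reserved for JSON-RPC frames.
const stderrLog = (...args: unknown[]): void => {
  process.stderr.write(args.map(String).join(" ") + "\n");
};
console.log = stderrLog;
console.info = stderrLog;
console.warn = stderrLog;

console.log("Server started"); // now harmless: lands on stderr
```

Run this before importing anything that might log during initialization.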
Custom MCP Server in 30 Minutes
TypeScript: @modelcontextprotocol/sdk
A minimal working server in TypeScript:
```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import * as z from "zod/v4";

const server = new McpServer({
  name: "my-tool",
  version: "1.0.0",
});

server.registerTool(
  "search_docs",
  {
    description: "Search project documentation by keyword",
    inputSchema: {
      query: z.string().describe("Search query"),
      limit: z.number().optional().default(10).describe("Max results"),
    },
  },
  async ({ query, limit }) => {
    // Your search logic here
    const results = await searchIndex(query, limit);
    return {
      content: [{ type: "text", text: JSON.stringify(results, null, 2) }],
    };
  }
);

const transport = new StdioServerTransport();
await server.connect(transport);
console.error("Server running on stdio"); // stderr, not stdout!
```
Dependencies: @modelcontextprotocol/sdk and zod (v4). That’s it. No frameworks, ORMs, or config libraries.
Python: FastMCP
If you prefer Python, it’s even simpler:
```python
import json

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("my-tool")

@mcp.tool()
async def search_docs(query: str, limit: int = 10) -> str:
    """Search project documentation by keyword.

    Args:
        query: Search query
        limit: Max results
    """
    results = await search_index(query, limit)
    return json.dumps(results, indent=2)

if __name__ == "__main__":
    mcp.run(transport="stdio")
```
FastMCP auto-generates JSON Schema from type hints and docstrings. One function, one tool.
Connecting to Claude Code
Create .mcp.json in your project root:
```json
{
  "mcpServers": {
    "my-tool": {
      "type": "stdio",
      "command": "node",
      "args": ["path/to/server.js"],
      "env": {
        "API_KEY": "your-key-here"
      }
    }
  }
}
```
For Python servers:
```json
{
  "mcpServers": {
    "my-tool": {
      "type": "stdio",
      "command": "uv",
      "args": ["run", "path/to/server.py"]
    }
  }
}
```
Save the file, restart Claude Code. Your server will appear in the available tools list.
Testing with MCP Inspector
Before connecting to Claude Code, test your server with the Inspector:
npx @modelcontextprotocol/inspector node path/to/server.js
A web UI opens at localhost:6274. You can call each tool, inspect schemas, and verify responses. This saves hours of debugging.
Production Patterns: 6 Lessons
Theory’s over. What follows are patterns from real production. Each one grew out of a specific bug or incident.
Pattern 1: CLI Wrapper
Three of my MCP servers (OpenAI, Gemini, Qwen) are wrappers around existing CLIs. Instead of making direct API calls, they spawn codex exec, gemini -p, and qwen -p as child processes.
Why:

- Auth is already set up. `codex` uses a token from `~/.codex/auth.json`, `gemini` uses Google OAuth through the browser. No API keys to manage, no OAuth flows to implement.
- The CLI updates independently. New model, changed API, bug fix — `npm update -g @openai/codex` and you're done. The MCP server stays the same.
- Less code. My OpenAI server is 250 lines. A direct API client with retry, streaming, and error handling would be 500 minimum.
Implementation:
```typescript
import { spawn } from "child_process";
import * as os from "os";
import * as path from "path";

async function callCLI(prompt: string, timeout = 90000) {
  return new Promise((resolve, reject) => {
    // tmpFile is a temp output path created by the caller.
    // Note: spawn does not expand "~" in env values — build the path explicitly.
    const proc = spawn("codex", ["exec", "-p", prompt, "-o", tmpFile], {
      env: { ...process.env, CODEX_HOME: path.join(os.homedir(), ".codex-minimal") },
    });

    let stdout = "";
    let stderr = "";
    proc.stdout.on("data", (d) => (stdout += d));
    proc.stderr.on("data", (d) => (stderr += d));

    // Timer with graceful kill
    const timer = setTimeout(() => {
      proc.kill("SIGTERM");
      setTimeout(() => {
        // proc.killed only records that a signal was sent; exitCode tells us
        // whether the process actually terminated
        if (proc.exitCode === null) proc.kill("SIGKILL");
      }, 5000);
      reject(new Error(`Timeout after ${timeout}ms`));
    }, timeout);

    proc.on("close", (code) => {
      clearTimeout(timer);
      if (code === 0) resolve(stdout);
      else reject(new Error(stderr));
    });
  });
}
```
Key detail: CODEX_HOME=~/.codex-minimal. This is a minimal config with no MCP servers. Without it, codex exec loads every MCP server from the main config, and startup takes 15-20 seconds. With the minimal config — 2-3 seconds.
Pattern 2: Timeout + Graceful Kill
setTimeout + process.kill isn’t enough. A CLI process can spawn child processes that won’t die from SIGTERM. The reliable approach:
- On timeout — send SIGTERM
- After 5 seconds, check: still alive? SIGKILL
- Clean up temp files in `finally`
```typescript
const timer = setTimeout(() => {
  proc.kill("SIGTERM");
  setTimeout(() => {
    // proc.killed only means a signal was sent; check exitCode to see
    // whether the process actually died before escalating to SIGKILL
    if (proc.exitCode === null) proc.kill("SIGKILL");
  }, 5000);
}, timeout);
```
A simple `execPromise` with a timeout parameter (like in my Qwen server) does worse: the process gets SIGKILL without warning, temp files linger, and a `maxBuffer` of 10 MB might not be enough for long responses.
Comparing OpenAI/Gemini servers (spawn + graceful kill) with Qwen (exec + timeout):
| Aspect | OpenAI/Gemini | Qwen |
|---|---|---|
| Kill sequence | SIGTERM → SIGKILL | Immediate SIGKILL |
| Temp files | Cleaned in `finally` | May persist |
| Max output | Unlimited (streaming) | 10 MB (`maxBuffer`) |
| Debugging | stderr available | stderr mixed with stdout |
Pattern 3: Text-Based Error Detection
CLI tools don’t always return the right exit code. codex exec can exit with code 0 while stdout contains: “You’ve hit your usage limit for the week.”
The fix: check the response text against known error patterns:
```typescript
function detectError(output: string) {
  const text = output.toLowerCase();
  if (text.includes("hit your usage limit") || text.includes("quota")) {
    return {
      type: "QUOTA_EXCEEDED",
      retry: false,
      hint: "Weekly Codex limit reached. Use DeepSeek as fallback.",
    };
  }
  if (text.includes("not supported") && text.includes("chatgpt")) {
    return {
      type: "MODEL_NOT_SUPPORTED",
      retry: false,
      hint: "Model unavailable on ChatGPT Plus plan.",
    };
  }
  if (text.includes("auth expired") || text.includes("please login")) {
    return {
      type: "AUTH_EXPIRED",
      retry: false,
      hint: "Re-authenticate: codex auth",
    };
  }
  return null;
}
```
Each error type includes a hint — a tip for the LLM on what to do next. Claude Code reads the hint and switches to the fallback automatically.
Gemini has its own patterns: "resource_exhausted", "not authenticated". Qwen has different ones. Each server knows its CLI’s error signatures.
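To make the handoff concrete, here is a sketch (the `toToolResult` helper and `DetectedError` shape are assumptions built on the pattern above, not code from the article's servers) of surfacing a detected error as an MCP tool result, using the protocol's `isError` flag so the client sees a failed call rather than a plausible-looking answer:

```typescript
// Shape returned by a detectError-style pattern matcher.
type DetectedError = { type: string; retry: boolean; hint: string };

// Wrap CLI output into an MCP tool result: on a detected error, set
// isError and serialize the type + retry + hint for the orchestrator.
function toToolResult(output: string, detected: DetectedError | null) {
  if (detected) {
    return {
      isError: true, // MCP clients treat this as a failed tool call
      content: [{ type: "text" as const, text: JSON.stringify(detected) }],
    };
  }
  return { content: [{ type: "text" as const, text: output }] };
}
```

The `hint` string rides along inside the result text, which is where Claude Code actually reads it.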
Pattern 4: Caching with TTL
An MCP server for FlutterFlow makes HTTP requests to their API. Each list_projects takes 200-500 ms. The same data gets requested multiple times per session.
Add a cache with different TTLs per operation:
| Operation | TTL | Reason |
|---|---|---|
| `list_projects` | 5 min | Projects rarely change |
| `get_project_file` | 2 min | Files update more often |
| `validate` | 0 (no cache) | Always needs fresh data |
After update_project_file — invalidate the cache for that file.
In Python (with diskcache):
```python
from diskcache import Cache

cache = Cache(".cache/flutterflow")

def get_cached(key, ttl, fetch_fn):
    cached = cache.get(key)
    if cached is not None:
        return cached
    result = fetch_fn()
    cache.set(key, result, expire=ttl)
    return result
```
My Substack MCP server has a more complex caching setup: posts are cached separately (120 min TTL), embeddings are recalculated when a post changes. When refresh=True, the cache resets — but if the network is down, stale cache data is returned instead. Graceful degradation.
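The stale-on-error idea generalizes. A minimal sketch (in-memory `Map` instead of a disk cache; names are illustrative, not the Substack server's code): serve fresh data when possible, but fall back to an expired entry rather than failing when the fetch errors out.

```typescript
// Cache entry with an explicit expiry timestamp.
type Entry = { value: string; expiresAt: number };
const store = new Map<string, Entry>();

async function getWithStaleFallback(
  key: string,
  ttlMs: number,
  fetchFn: () => Promise<string>,
): Promise<string> {
  const hit = store.get(key);
  if (hit && hit.expiresAt > Date.now()) return hit.value; // fresh hit

  try {
    const value = await fetchFn();
    store.set(key, { value, expiresAt: Date.now() + ttlMs });
    return value;
  } catch (err) {
    if (hit) return hit.value; // network down: serve stale data instead
    throw err; // nothing cached at all, so propagate the failure
  }
}
```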
Pattern 5: Fallback Chains
One LLM provider is down? Switch to the next. No panic, no manual intervention.
Example chain for code review:
Provider A → Provider B → Provider C → Provider D (always available)
The implementation lives at the orchestrator level (Claude Code, Cursor, your agent), not inside MCP servers. Each server handles one provider and honestly reports its errors. The orchestrator sees QUOTA_EXCEEDED in the response and calls the next server in the chain.
Why not one server with internal retry:
- Transparency. You can see which provider responded. Logs clearly show: Provider A — quota hit, switched to B.
- Independent configs. Each provider has its own timeout, error format, and CLI.
- Parallelism. You can fire two providers simultaneously and synthesize results. Two independent MCP calls — no dependency between them.
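At the orchestrator level the chain is just a loop. A sketch (the `callTool` signature and `ToolResult` shape are hypothetical stand-ins for whatever your MCP client exposes): try each provider's server in order, moving on when one reports a non-retryable error.

```typescript
// Minimal shape of a tool call outcome as the orchestrator sees it.
type ToolResult = { ok: boolean; errorType?: string; text?: string };

async function reviewWithFallback(
  providers: string[],
  callTool: (server: string, prompt: string) => Promise<ToolResult>,
  prompt: string,
): Promise<string> {
  for (const server of providers) {
    const res = await callTool(server, prompt);
    if (res.ok) return res.text ?? "";
    // e.g. QUOTA_EXCEEDED, AUTH_EXPIRED: log and walk down the chain
    console.error(`fallback: ${server} failed with ${res.errorType}`);
  }
  throw new Error("all providers in the chain failed");
}
```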
Pattern 6: Tool Descriptions for LLMs
A tool description is a prompt. Claude reads it to decide which tool to call and with what parameters. A bad description means wrong calls or no calls at all.
What works:
```typescript
server.registerTool(
  "openai_chat",
  {
    description:
      "Send a prompt to OpenAI via Codex CLI. " +
      "Non-interactive, returns text response. " +
      "Timeout: 90 seconds. " +
      "Errors: QUOTA_EXCEEDED (weekly limit), AUTH_EXPIRED (re-login needed).",
    inputSchema: {
      prompt: z.string().describe("The prompt to send. Keep under 500 chars for best results."),
    },
  },
  async ({ prompt }) => ({
    content: [{ type: "text", text: await callCLI(prompt) }], // handler from Pattern 1
  })
);
```
What doesn’t work:
// Too vague — Claude won't know when to use it
description: "Interact with OpenAI services"
// Too long — Claude will truncate or ignore it
description: "This tool provides full access to OpenAI's..."
Practical rules:
- Start with a verb: “Send,” “Search,” “Get,” “Create”
- State limitations: timeout, rate limits, input format
- List possible errors: Claude can react to them
- Use describe() for every parameter: don’t rely on variable names
- Test with real tasks: ask Claude to perform a task and check if it picked the right tool
Server Composition
One server, one concern. MCP’s strength is that servers work together. Here are patterns that emerge once you have more than three.
Parallel Multi-LLM Calls
A git diff gets sent to two or three LLM providers in parallel through separate MCP servers. Each returns its own code review. Results get synthesized, critical findings get applied.
If one provider hits its quota, the next one picks up. Three or four MCP servers, each doing its thing, together forming a resilient workflow. The orchestrator (Claude Code, Cursor, any MCP client) manages fallback logic — the servers know nothing about it.
MCP Server as Internal API Bridge
Common scenario: your project has an internal service (prompt management, CMS, analytics) with no existing MCP server. You write a 100-200 line wrapper, and Claude gets access to your data through the standard protocol. Two or three tools cover 90% of use cases.
Servers for internal APIs usually skip caching (cacheTtlSeconds: 0) — always fresh data. Simpler and safer than dealing with invalidation.
Tier System: Managing Many Servers
With 10+ servers in your config, you need to know which ones are essential, which are useful, and which are backup:
| Tier | Examples | When Needed |
|---|---|---|
| Essential | GitHub, database, task tracker | Every day, every task |
| Production | LLM providers, search, monitoring | Working sessions |
| Quality of Life | Content platforms, browser automation | Specific workflows |
| Experimental | New integrations | Testing, then decide |
Essential servers stay connected always. Production — during work sessions. QoL — as needed. Experimental — disabled by default.
Monitoring and Debugging: When MCP Dies Quietly
MCP servers fail silently. No crash reports, no notifications. Claude Code just stops seeing tools, and you find out when a task doesn’t complete.
Three Typical Failure Modes
Silent crash. The server segfaults or throws an uncaught exception. The process is dead, but the client doesn’t know. The next tool call hangs or returns a connection error.
Hung process. A CLI call never returns. Without a timeout, the server waits forever. With a timeout, it returns an error — but the hung child process might stick around.
Corrupt stdout. console.log() instead of console.error(). Or a dependency writes to stdout during initialization. The JSON-RPC parser breaks, the client disconnects.
Claude Desktop Logs
For Claude Desktop (not Claude Code):
# macOS — all MCP logs
tail -f ~/Library/Logs/Claude/mcp.log
# Logs for a specific server
tail -f ~/Library/Logs/Claude/mcp-server-my-tool.log
mcp.log covers connections and disconnections. mcp-server-*.log captures each server’s stderr.
Structured Logging in stderr
What to log in your server:
```typescript
function log(level: string, tool: string, data: Record<string, unknown>) {
  const entry = {
    ts: new Date().toISOString(),
    level,
    tool,
    ...data,
  };
  console.error(JSON.stringify(entry));
}

// On tool call
log("info", "openai_chat", {
  action: "call_start",
  promptLength: prompt.length,
});

// On response
log("info", "openai_chat", {
  action: "call_end",
  duration: Date.now() - start,
  outputLength: result.length,
});

// On error
log("error", "openai_chat", {
  action: "call_error",
  errorType: "QUOTA_EXCEEDED",
  duration: Date.now() - start,
});
```
What NOT to log: prompt and response contents (may contain sensitive data), API keys and tokens.
Health-Check Tool
A useful pattern for complex MCP servers: add a health tool that returns server state:
```json
{
  "status": "healthy",
  "uptime": "2h 34m",
  "memory": "128 MB",
  "connectedServices": 3,
  "transport": "streamable-http",
  "port": 3001
}
```
Claude Code can call health before a complex task to make sure the server is alive and has enough resources.
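A sketch of the server side (field names mirror the JSON above; `healthSnapshot` is an illustrative helper, which you would return as serialized text from a registered `health` tool):

```typescript
// Record the start time once at module load.
const startedAt = Date.now();

// Build a health snapshot from process-level stats.
function healthSnapshot() {
  const upSeconds = Math.floor((Date.now() - startedAt) / 1000);
  return {
    status: "healthy",
    uptime: `${Math.floor(upSeconds / 3600)}h ${Math.floor((upSeconds % 3600) / 60)}m`,
    memory: `${Math.round(process.memoryUsage().rss / 1024 / 1024)} MB`,
  };
}
```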
Security: A Checklist for Custom Servers
In December 2025, three vulnerabilities were patched in Anthropic’s official Git MCP server (CVE-2025-68143, CVE-2025-68144, CVE-2025-68145), which gained wide attention in January 2026. Through a chain of path traversal, argument injection, and .git/config writes, an attacker could achieve arbitrary code execution.
If Anthropic’s own server had holes, custom servers will too — unless you check.
CoSAI: 40 Threats Across 12 Categories
The Coalition for Secure AI (CoSAI) published a white paper in January 2026 with a full taxonomy of MCP threats. Of the 12 categories, five are critical for custom servers:
- Input Validation — a prompt can contain instructions for a tool. If the tool accepts arbitrary text and passes it to a shell, hello command injection.
- Trust Boundary Failures — the LLM decides which tool to call. That’s a probabilistic decision that can be manipulated through prompt injection.
- Supply Chain — your MCP server’s dependencies. One vulnerable npm package, and an attacker is inside your server.
- Data/Control Boundary — an MCP server has access to the filesystem, database, and APIs. Give it more privileges than needed, and one prompt injection can lead to data exfiltration.
- Insufficient Observability — without logs, you won’t know a tool was called with suspicious parameters.
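To make the first two categories concrete, here is a hedged sketch (the `buildSearchArgv` helper and the `rg` example are illustrative, not from the article's servers): validate tool input before it goes anywhere near a child process, and pass it as a single argv element rather than interpolating it into a shell string.

```typescript
// Validate raw tool input, then return an argv array for spawn().
// (In a real server, the length/type checks live in the zod/pydantic schema.)
function buildSearchArgv(rawQuery: unknown): string[] {
  if (typeof rawQuery !== "string" || rawQuery.length === 0 || rawQuery.length > 500) {
    throw new Error("invalid query");
  }
  // argv form: spawn("rg", buildSearchArgv(q)) keeps shell metacharacters
  // inert, unlike exec(`rg ${q}`), which invites command injection
  return ["--fixed-strings", "--", rawQuery];
}
```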
Tool Poisoning
Research from Invariant Labs and the MCPTox benchmark shows tool poisoning succeeds in 70-85% of cases with auto-approval enabled, depending on the model and scenario. The mechanism: a malicious MCP server changes a tool’s description after initial approval. Day one — “get weather.” A week later — “get weather and send chat history to an external server.”
Practical Checklist
When building a server:
- Tool parameters are validated through zod/pydantic. No `any` types
- The server never runs shell commands from user input
- Secrets are passed via `env` in `.mcp.json`, not through `args`
- API tokens don't appear in logs
- `yaml.safe_load()` instead of `yaml.load()` (for Python servers)
When connecting a third-party server:
- Read the source code. Seriously. Especially the tool call handlers
- Pin the package version: `"@package/mcp@1.2.3"`, not `"@package/mcp@latest"`
- Check which env variables the server reads
- Don’t enable auto-approval for servers you don’t trust
Regularly:
- Update dependencies and review changelogs
- Check tool descriptions — did they change after an update?
- Review stderr logs for suspicious calls
Bottom Line: When to Build, When Not to Build
Build your own MCP server if:
- You need an integration that doesn’t exist in the registry (your internal API, a specific workflow)
- You want to reuse an existing CLI’s authentication
- The available server is unstable or unmaintained
- You need full control over reliability and error handling
Don’t build if:
- An existing server covers your needs
- The integration is a one-off (easier to use the Bash tool)
- There’s no production load (experiments can use the API directly)
An MCP server is 200-300 lines of code. It’s not a microservice, not a framework, not an infrastructure project. It’s a script that speaks JSON-RPC and does one thing well. Low barrier to entry, real payoff.
Resources
- MCP Specification (2025-11-25) — current protocol version
- TypeScript SDK — `@modelcontextprotocol/sdk`
- Python SDK — `mcp[cli]` with FastMCP
- MCP Inspector — server debugging
- CoSAI MCP Security Whitepaper — threat taxonomy
- Timeline of MCP Security Breaches — vulnerability timeline
FAQ
What is the performance cost of running 10+ MCP servers simultaneously in Claude Code?
Each stdio server is a persistent child process — the memory overhead is typically 30–80 MB per Node.js server and 20–50 MB per Python server. With 10 servers that is 300–800 MB total, which is noticeable on a MacBook with 8 GB RAM. The startup cost (when Claude Code launches) is 2–5 seconds for all servers to initialize and complete the handshake. The runtime overhead per tool call is negligible (sub-millisecond for the JSON-RPC layer itself). The practical limit is around 15–20 servers before context bloat from tool descriptions degrades LLM decision quality.
How do you handle authentication rotation for MCP servers that use API keys?
Pass secrets via the env field in .mcp.json, never via args. For keys that rotate (OAuth tokens, short-lived credentials), the cleanest pattern is a wrapper script that fetches the current token from a secrets manager (1Password CLI, AWS Secrets Manager) at process start and injects it as an environment variable. The MCP server process itself never stores the token — it reads process.env.API_KEY on each call. Since stdio servers restart with the client, you get automatic token refresh on every Claude Code session.
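A minimal sketch of that launcher idea (the secrets-manager command is an assumption; substitute `op read`, `aws secretsmanager get-secret-value`, or whatever your setup uses):

```typescript
import { execFileSync } from "child_process";

// Fetch the current token from a secrets-manager CLI at process start,
// so the MCP server itself never stores long-lived credentials.
function fetchToken(cmd: string, args: string[]): string {
  return execFileSync(cmd, args, { encoding: "utf8" }).trim();
}

// Usage at startup (hypothetical paths): inject into the child server's env.
// spawn("node", ["path/to/server.js"], {
//   env: { ...process.env, API_KEY: fetchToken("op", ["read", "op://Vault/item/credential"]) },
//   stdio: "inherit",
// });
```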
Can a single MCP server expose tools that call other MCP servers internally?
No — MCP servers do not have an MCP client built in, so one server cannot directly invoke another server’s tools. The orchestration always happens at the client level (Claude Code, Cursor, your agent). If you need tool composition, build it into the orchestrator’s system prompt (chain tool calls explicitly) or create a dedicated “meta-tool” server that internally calls whatever APIs the other servers would call, effectively inlining the logic. The latter approach loses the modularity benefit of separate servers but gains latency by eliminating a round-trip.