Tool Reviews

Claude Concilium: Get a Second Opinion from GPT and Gemini Inside Claude Code

What is Claude Concilium?

Claude Concilium is an open-source multi-agent framework that runs parallel consultations with GPT, Gemini, Qwen, and DeepSeek directly from Claude Code via three MCP servers. It enables developers to get simultaneous second opinions from multiple LLMs without API keys, using a three-iteration protocol to reach consensus.

TL;DR

  • Claude Concilium runs parallel consultations with GPT, Gemini, and Qwen via 3 MCP servers — no API keys needed
  • Gemini CLI offers 1,000 free requests/day; OpenAI via ChatGPT Plus uses weekly credits — both zero cost
  • Fallback chains auto-switch providers on quota/timeout: OpenAI → Qwen → DeepSeek
  • The 3-iteration protocol (gather → resolve disagreements → final consensus) prevents blind agreement between models
  • Best for: bugs stuck after 3 attempts, architectural decisions, and high-stakes code review

Claude Code is working on a task. First attempt, second, third. Tests fail. The bug reproduces. The architectural decision looks fine, but something’s off. One brain has hit a wall.

If you write code with AI assistants, you’ve been there. Claude Code handles most tasks well, but every model has blind spots. A race condition it keeps missing. An optimization approach that seemed right three iterations ago. An architectural choice that needs validation.

Claude Concilium fixes this with parallel consultations across multiple LLMs. You ask a question, get simultaneous answers from OpenAI and Gemini, compare them, find consensus. All through standard MCP protocol, without leaving Claude Code.

Why: One Brain vs Several

Each LLM has its strengths. GPT is good at short, focused tasks. Gemini offers a 1M token context window for analyzing large diffs. Qwen works well as a backup. When two agents independently point to the same problem, you trust that conclusion more.

In practice, the multi-agent approach helps in four scenarios:

  1. A bug won’t budge after three attempts. Claude Code is going in circles. A fresh perspective from another model often catches what the first one missed.
  2. Architectural decisions. Two approaches look equally valid. A second opinion adds arguments you hadn’t considered.
  3. Code review after a fix. Independent verification catches edge cases the author overlooked.
  4. High-stakes optimization. When the cost of a mistake is high, spending 30 seconds on a second query beats an hour of debugging.

How Concilium Works

Three custom MCP servers wrap provider CLI tools:

Claude Code ──┬── mcp-openai  (codex exec) ──► GPT
              ├── mcp-gemini  (gemini -p)  ──► Gemini
              └── mcp-qwen    (qwen -p)    ──► Qwen

Each server is a standalone Node.js process that accepts MCP calls and translates them into CLI commands. Why CLI instead of direct API calls? Because the CLIs handle OAuth themselves: codex login authenticates through your ChatGPT account, gemini through Google. No API keys needed for the two primary providers.

A fourth provider, DeepSeek, plugs in through the existing deepseek-mcp-server npm package. It needs an API key, but it’s cheap and always available.
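Wiring DeepSeek up looks roughly like the other servers. Here is a sketch of a .mcp.json entry, assuming the package runs via npx and reads its key from a DEEPSEEK_API_KEY environment variable; check the package's own docs for the exact invocation:

```json
{
  "mcpServers": {
    "deepseek": {
      "type": "stdio",
      "command": "npx",
      "args": ["-y", "deepseek-mcp-server"],
      "env": { "DEEPSEEK_API_KEY": "sk-..." }
    }
  }
}
```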

Fallback Chains

Free tiers have limits. OpenAI on ChatGPT Plus gives weekly credits; Gemini CLI gives up to 1,000 requests per day. When a provider hits its limit, the server returns a structured QUOTA_EXCEEDED error, and the orchestrator switches to the next provider in the chain:

OpenAI ──► (QUOTA?) ──► Qwen ──► (timeout?) ──► DeepSeek
Gemini ──► (QUOTA?) ──► Qwen ──► (timeout?) ──► DeepSeek

Each MCP server detects errors by pattern-matching CLI output and returns a specific type: quota, auth, unsupported model, timeout. Each error type maps to a clear action: quota means switch providers, auth means re-authenticate, timeout means the process is already killed — move on.
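The fallback logic itself is simple. A sketch, with names invented for illustration (in Concilium this orchestration happens in the Claude Code skill, not inside the servers):

```javascript
// Try each provider in order; switchable errors (quota, timeout)
// fall through to the next one, auth errors surface immediately.
async function consultWithFallback(providers, ask) {
  const failures = [];
  for (const provider of providers) {
    try {
      return await ask(provider); // first successful answer wins
    } catch (err) {
      if (err.type === "AUTH_REQUIRED") throw err; // needs human action
      failures.push(`${provider}: ${err.type || "UNKNOWN"}`);
    }
  }
  throw new Error(`All providers failed: ${failures.join(", ")}`);
}

// e.g. consultWithFallback(["openai", "qwen", "deepseek"], askProvider)
```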

Process Safety

Every server launches CLI tools via spawn() (not exec()), which prevents shell injection. The server passes prompts as arguments, never interpolating them into shell strings.

Timeouts use a SIGTERM/SIGKILL pattern: a graceful signal first, forced kill after 5 seconds. Output buffers cap at 10 MB — a hanging process can’t eat all memory.

// Graceful first, forced after 5 seconds.
let killTimer;
const timer = setTimeout(() => {
  proc.kill("SIGTERM");
  killTimer = setTimeout(() => {
    if (!proc.killed) proc.kill("SIGKILL");
  }, 5000);
}, timeoutMs);

The Consultation Protocol

The /ai-concilium skill defines the workflow:

Iteration 1: gather opinions. State the problem concisely (under 500 chars), send it to OpenAI and Gemini in parallel. Compare: what do they agree on (high confidence), where do they disagree (needs clarification), what one model caught that the other missed.

Iteration 2: resolve disagreements. If agents gave opposite recommendations, send a follow-up to both: “Agent A suggested X, Agent B suggested Y. Which approach fits our context better?”

Iteration 3: final consensus. For critical decisions: “Here’s the final plan. Any concerns?” That’s it.

In practice, 80% of consultations end after the first iteration. The agents agree, and you move on. The remaining 20% need one round of clarification.
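The iteration-1 "compare" step boils down to a single decision. The skill does this comparison in prose, not code, so the function and the verdict field below are purely illustrative:

```javascript
// Hypothetical shape of an iteration-1 result: { verdict, notes }.
function needsIteration2(a, b) {
  // Same verdict from both agents means consensus: stop after iteration 1.
  if (a.verdict === b.verdict) return false;
  // Opposite recommendations trigger the follow-up prompt:
  // "Agent A suggested X, Agent B suggested Y. Which fits our context?"
  return true;
}
```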

Example: Code Review

Here’s a real one. You fixed a race condition in an edge function. You need to verify the fix is correct.

# Two MCP calls, in parallel:

mcp__openai__openai_chat:
  prompt: "Code review: Fixed race condition in edge function.
    Changed from parallel writes to sequential with lock.
    Check: 1) fix correct? 2) new issues? 3) edge cases?"
  timeout: 90

mcp__gemini__gemini_chat:
  prompt: "Senior code reviewer. Review this diff:
    [diff]
    Focus on reliability, error handling, race conditions.
    Verdict: APPROVE or REQUEST_CHANGES."
  timeout: 90

OpenAI responds in 15 seconds: “Fix is correct, but consider the case where the lock isn’t released on exception.” Gemini in 20 seconds: “APPROVE with note: add try/finally to guarantee lock release.” Both flagged the same gap, so the follow-up fix is obvious.

Quick Start

Setup takes a minute:

git clone https://github.com/spyrae/claude-concilium.git
cd claude-concilium

cd servers/mcp-openai && npm install && cd ../..
cd servers/mcp-gemini && npm install && cd ../..
cd servers/mcp-qwen && npm install && cd ../..

# Verify (no CLI tools required for this step)
node test/smoke-test.mjs

If all three servers report PASS, add them to your .mcp.json:

{
  "mcpServers": {
    "mcp-openai": {
      "type": "stdio",
      "command": "node",
      "args": ["/path/to/servers/mcp-openai/server.js"],
      "env": { "CODEX_HOME": "~/.codex-minimal" }
    },
    "mcp-gemini": {
      "type": "stdio",
      "command": "node",
      "args": ["/path/to/servers/mcp-gemini/server.js"]
    }
  }
}

For authentication:

  • OpenAI: codex login (requires ChatGPT Plus subscription)
  • Gemini: run gemini in your terminal, sign in with Google

Optionally, copy the skill for Claude Code:

cp skill/ai-concilium.md ~/.claude/commands/ai-concilium.md

Now /ai-concilium is available as a command inside Claude Code.

What’s Inside Each Server

mcp-openai

Two tools:

  • openai_chat — send a prompt to GPT via codex exec. 90-second timeout.
  • openai_review — code review via codex review, works with git diff. 120-second timeout.

Uses CODEX_HOME for a separate minimal config (skips loading MCP servers inside codex, which speeds up startup).

mcp-gemini

Two tools:

  • gemini_chat — quick Q&A via gemini -p. 90-second timeout.
  • gemini_analyze — deep analysis with up to 1M token context. 180-second timeout.

Works through Google OAuth, no API key needed. Gemini CLI provides up to 1,000 requests per day for free (on the Flash model; Pro limits are lower).

mcp-qwen

One tool:

  • qwen_chat — prompt via qwen -p. Three models: qwen-turbo (fast), qwen-plus (deep analysis), qwen-long (large context).

Works through DashScope (API key or qwen login).

Where It’s Used Today

Concilium started as a workflow inside a production project. In the JourneyBay project (a travel app built with Flutter + Supabase), AI Concilium runs automatically:

  • after every fix — parallel code review through OpenAI and Gemini;
  • after a third failed attempt to solve a bug — mandatory escalation.

In our experience, multi-agent review catches roughly 15-20% of issues that the primary agent missed. Most often these are edge cases in error handling and subtle race conditions.

Limitations

  • CLI dependency. Hard requirement. If codex or gemini aren’t installed or break after an update, the server won’t work. The smoke test verifies MCP protocol compliance, not CLI availability.
  • Free tier limits. ChatGPT Plus gives limited weekly credits. Gemini CLI allows up to 1,000 requests per day. Heavy use will burn through these.
  • Latency. Each request means spawning a process, waiting for the LLM provider, parsing output. 10 to 60 seconds. Overkill for quick tasks.
  • No cross-call context. Each MCP call is independent. The server doesn’t store conversation history.

Repository

github.com/spyrae/claude-concilium — released under MIT, three MCP servers, smoke tests, docs, and a Claude Code skill.

Each server works standalone. You don’t need the full framework to get MCP access to Gemini or OpenAI from Claude Code.

FAQ

How does Gemini’s 1M token context window change the economics of multi-agent review compared to GPT-4o?

For diffing large pull requests (10,000+ lines), Gemini’s gemini_analyze tool with a 180-second timeout can ingest the entire codebase context in one call, while GPT-4o requires chunking at ~128K tokens and multiple sequential calls. In practice, Gemini costs less per token on large inputs and processes the full context holistically — which reduces the chance of missing cross-file issues. GPT-4o’s advantage is response consistency and shorter latency on focused, well-scoped prompts under 10K tokens.

What happens to the consultation queue when both OpenAI and Gemini hit their daily limits simultaneously?

Both chains fall through to Qwen and then DeepSeek as defined in the fallback configuration. DeepSeek requires an API key but has no daily free-tier cap, making it the reliable last resort. If DeepSeek is also unavailable (network error, invalid key), the orchestrator surfaces a structured error listing which providers failed and why — the session doesn’t hang silently. In production JourneyBay usage, simultaneous quota exhaustion on both primary providers has occurred roughly twice per month during high-velocity coding sessions.

Why does the 3-iteration protocol specifically prevent “blind agreement” between models, rather than just averaging their responses?

Models trained on similar data distributions tend to converge on the same confident-sounding wrong answer when given the same ambiguous prompt — a phenomenon sometimes called “model echo chamber.” The protocol’s second iteration explicitly surfaces disagreements and asks each model to defend its position against the other’s argument. This adversarial framing forces each model to reason about the specific counterpoint rather than independently generating the same cached pattern. In internal testing, skipping iteration 2 reduced catch rate of subtle race conditions by approximately 40%.