
In September, we introduced MCP-Atlas, a benchmark measuring how well LLMs handle real tool use via the Model Context Protocol. We showed that even top models failed nearly half of realistic multi-tool tasks. Today, we're open-sourcing the benchmark so you can measure this yourself. As agents increasingly rely on tools to act in the real world—querying databases, calling APIs, and coordinating multi-step workflows—evaluating tool use has become as important as evaluating reasoning or language generation.
We're releasing three artifacts—everything you need to run MCP-Atlas evaluations:
The remaining 500 tasks are held out as a private validation set to preserve leaderboard integrity and prevent overfitting.
MCP-Atlas is designed to measure agent performance on realistic, multi-step workflows that exercise real-world tool use through the Model Context Protocol (MCP).
- **Real MCP servers:** All tasks run against real MCP implementations. Agents must handle real schemas, real errors, and real API behavior, not simplified mocks.
- **Natural language prompts without tool names:** Prompts describe goals in plain language and avoid naming tools or servers. Agents must discover what to use rather than follow explicit instructions.
- **Controlled tool exposure with distractors:** Each task exposes a limited set of tools that includes both required tools and distractors. This tests whether agents can select and parameterize the right tools under noise.
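To make the third point concrete, a task in this style pairs a goal-oriented prompt with an exposed tool set that mixes required tools and distractors. The sketch below is purely illustrative; the field names, tool names, and structure are our assumptions, not the actual MCP-Atlas task schema.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a tool-use task record in the spirit of MCP-Atlas;
# the benchmark's real schema may differ.
@dataclass
class ToolTask:
    prompt: str                           # plain-language goal, no tool names
    exposed_tools: list[str] = field(default_factory=list)  # required + distractors
    required_tools: list[str] = field(default_factory=list) # what the agent must call

    def distractors(self) -> list[str]:
        """Tools exposed only as noise the agent should ignore."""
        return [t for t in self.exposed_tools if t not in self.required_tools]

task = ToolTask(
    prompt="Find last quarter's top-selling product and email a summary to the sales lead.",
    exposed_tools=["sql_query", "send_email", "create_calendar_event", "fetch_weather"],
    required_tools=["sql_query", "send_email"],
)
print(task.distractors())  # -> ['create_calendar_event', 'fetch_weather']
```

Note that the prompt never names `sql_query` or `send_email`; the agent must infer the right tools from the goal while ignoring the two distractors.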
Since September, we've evaluated additional models. The leaderboard has shifted significantly (see full results here):
| Model | Pass Rate | Mean Coverage |
|---|---|---|
| claude-opus-4-5-2025-1101 | 62.30% | 78.5% |
| gemini-3-pro-preview | 54.10% | 73.2% |
| GPT-5 | 44.50% | 61.75% |
| claude-4-5-sonnet | 43.80% | 62.17% |
| o3 | 43.60% | 66.91% |
| Claude Opus 4.1 | 40.90% | 64.99% |
| Claude Sonnet 4 | 35.60% | 57.35% |
| GLM 4.5 Air* | 34.00% | 60.59% |
| Kimi K2 Instruct* | 23.90% | 50.41% |
| Qwen3-235B-A22B* | 12.00% | 29.06% |
| Gemini 2.5 Pro | 8.80% | 30.77% |
| GPT-4o | 7.20% | 28.53% |
| Gemini 2.5 Flash | 3.40% | 17.83% |
| Llama 4 Maverick* | 0.80% | 13.03% |
*Evaluations for these models were run using Fireworks AI for inference
Claude Opus 4.5 now leads at 62.3%, compared to GPT-5’s 44.5% at launch. But the broader pattern holds: even top models fail roughly 40% of tasks that require coordinating multiple tools.
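For readers reproducing the leaderboard columns, the two aggregate metrics can be computed from per-task results as follows. This is a minimal sketch under our own assumptions: we take "pass" to mean full task success and "coverage" to be the fraction of a task's required steps completed; MCP-Atlas may define these differently.

```python
# Hypothetical per-task results: (passed, coverage), where coverage is the
# fraction of the task's required steps the agent completed. These metric
# definitions are assumptions, not the official MCP-Atlas specification.
results = [
    (True, 1.0),
    (False, 0.5),
    (False, 0.25),
    (True, 1.0),
]

# Pass rate: fraction of tasks fully solved.
pass_rate = sum(passed for passed, _ in results) / len(results)

# Mean coverage: average per-task completion, which credits partial progress.
mean_coverage = sum(cov for _, cov in results) / len(results)

print(f"Pass rate: {pass_rate:.1%}")          # -> Pass rate: 50.0%
print(f"Mean coverage: {mean_coverage:.1%}")  # -> Mean coverage: 68.8%
```

Under these definitions, mean coverage is always at least the pass rate, since passed tasks contribute full coverage while failed tasks can still contribute partial credit, which matches the pattern in the table above.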
Baseline results reveal consistent failure patterns across models:
MCP-Atlas is designed to make these behaviors measurable and comparable across agents.
You can use MCP-Atlas to:
We’re releasing MCP-Atlas to help establish a shared, realistic standard for evaluating tool-using agents. We welcome feedback and contributions as the benchmark evolves, and we’re especially interested in seeing how new planning, retrieval, and execution strategies perform under MCP-Atlas.