
In September, we introduced MCP-Atlas, a benchmark measuring how well LLMs handle real tool use via the Model Context Protocol. We showed that even top models failed nearly half of realistic multi-tool tasks. Today, we're open-sourcing the benchmark so you can measure this yourself. As agents increasingly rely on tools to act in the real world—querying databases, calling APIs, and coordinating multi-step workflows—evaluating tool use has become as important as evaluating reasoning or language generation.
We're releasing three artifacts that together give you everything you need to run MCP-Atlas evaluations:
- **Research Paper:** Full methodology and detailed analysis.
- **Public Dataset:** 500 human-authored tasks on Hugging Face (a loading snippet follows below).
- **Evaluation Environment:** A containerized setup exposing real MCP servers.
The remaining 500 tasks are held out as a private validation set to preserve leaderboard integrity and prevent overfitting.
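For a quick look at what the public split contains, the snippet below loads it with the Hugging Face `datasets` library. The dataset ID, split name, and record layout shown here are placeholders, so substitute the values from the dataset card.

```python
# Sketch: load the public MCP-Atlas split and inspect one task.
# NOTE: "your-org/mcp-atlas" and the split name are placeholders --
# use the dataset ID and split listed on the Hugging Face dataset card.
from datasets import load_dataset

tasks = load_dataset("your-org/mcp-atlas", split="train")
print(len(tasks))   # expect 500 public tasks
print(tasks[0])     # one task record, as published
```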
MCP-Atlas is designed to measure performance on realistic, multi-step workflows that exercise real-world tool use through the Model Context Protocol (MCP). Three design choices set it apart:
- **Real MCP servers:** All tasks run against real MCP implementations. Agents must handle real schemas, real errors, and real API behavior, not simplified mocks.
- **Natural language prompts without tool names:** Prompts describe goals in plain language and avoid naming tools or servers. Agents must discover which tools to use rather than follow explicit instructions.
- **Controlled tool exposure with distractors:** Each task exposes a limited set of tools that includes both the required tools and distractors, testing whether agents can select and parameterize the right tools under noise. An illustrative task record follows below.
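To make these properties concrete, here is what a single task might look like. The field names and scoring checklist below are illustrative assumptions rather than the benchmark's actual schema; see the dataset card for the real format.

```python
# Illustrative task record (field names and structure are assumptions,
# not the published MCP-Atlas schema).
example_task = {
    # Goal stated in plain language -- no tool or server names mentioned.
    "prompt": (
        "Find the three most recent unresolved support tickets for Acme Corp "
        "and summarize their status in a short report."
    ),
    # Tools exposed for this task: the required ones plus distractors.
    "exposed_tools": [
        "crm.search_accounts",    # required
        "tickets.list_tickets",   # required
        "tickets.get_ticket",     # required
        "calendar.create_event",  # distractor
        "billing.get_invoice",    # distractor
    ],
    # One plausible scoring checklist: pass could require every item,
    # with coverage as the fraction of items satisfied.
    "rubric": [
        "Identified the correct Acme Corp account",
        "Retrieved only unresolved tickets",
        "Reported the three most recent tickets",
        "Summarized the status of each ticket",
    ],
}
```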
Since September, we've evaluated additional models. The leaderboard has shifted significantly (see full results here):
| Model | Pass Rate | Mean Coverage |
|---|---|---|
| Claude Opus 4.5 | 62.30% | 78.50% |
| Gemini 3 Pro Preview | 54.10% | 73.20% |
| GPT-5 | 44.50% | 61.75% |
| Claude Sonnet 4.5 | 43.80% | 62.17% |
| o3 | 43.60% | 66.91% |
| Claude Opus 4.1 | 40.90% | 64.99% |
| Claude Sonnet 4 | 35.60% | 57.35% |
| GLM 4.5 Air* | 34.00% | 60.59% |
| Kimi K2 Instruct* | 23.90% | 50.41% |
| Qwen3-235B-A22B* | 12.00% | 29.06% |
| Gemini 2.5 Pro | 8.80% | 30.77% |
| GPT-4o | 7.20% | 28.53% |
| Gemini 2.5 Flash | 3.40% | 17.83% |
| Llama 4 Maverick* | 0.80% | 13.03% |

*Evaluations for these models were run using Fireworks AI for inference.
Claude Opus 4.5 now leads at 62.3%, compared to GPT-5’s 44.5% at launch. But the broader pattern holds: even top models fail roughly 40% of tasks that require coordinating multiple tools.
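As a point of reference, here is one plausible way the two leaderboard metrics could be aggregated from per-task results, assuming each task is scored against a checklist like the illustrative rubric above; the benchmark's actual scoring rules may differ.

```python
# Assumed aggregation: a task "passes" if every rubric item is satisfied,
# and its coverage is the fraction of items satisfied. This mirrors the
# illustrative rubric above, not necessarily the official scoring rules.
def aggregate(results: list[dict]) -> tuple[float, float]:
    """results: one dict per task, e.g. {"items_passed": 3, "items_total": 4}."""
    coverages = [r["items_passed"] / r["items_total"] for r in results]
    pass_rate = sum(c == 1.0 for c in coverages) / len(coverages)
    mean_coverage = sum(coverages) / len(coverages)
    return pass_rate, mean_coverage

# Example: three tasks -> pass rate ~0.33, mean coverage 0.75
print(aggregate([
    {"items_passed": 4, "items_total": 4},
    {"items_passed": 2, "items_total": 4},
    {"items_passed": 3, "items_total": 4},
]))
```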
Baseline results reveal consistent failure patterns across models:
- **Tool usage is the dominant bottleneck.** Errors most often come from selecting the wrong tool, passing incorrect parameters, or sequencing calls incorrectly.
- **Many failures involve not using tools at all.** Agents frequently stop early or fail to recognize that tools are required to solve the task.
- **When tools are used correctly, synthesis is rarely the issue.** The main challenges are discovery, parameterization, and reliably completing multi-step workflows.
MCP-Atlas is designed to make these behaviors measurable and comparable across agents.
You can use MCP-Atlas to:
- Reproduce published baseline results
- Evaluate your own MCP-capable agent (a minimal client sketch follows after this list)
- Compare models under identical tool conditions
- Track regressions as you improve planning, tool retrieval, or execution
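If you want to plug in your own agent, the core of any MCP-capable harness is a client session against each task's servers: connect, discover the exposed tools, and call them with structured arguments. Below is a minimal sketch using the official `mcp` Python SDK over a stdio transport; the server command, tool name, and arguments are placeholders, and the released harness handles tool exposure and scoring for you.

```python
# Minimal MCP client sketch (official `mcp` Python SDK, stdio transport).
# The server command, tool name, and arguments are placeholders.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    server = StdioServerParameters(command="python", args=["example_server.py"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Discover what the server exposes: names, descriptions, and schemas.
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])
            # Invoke one tool with structured arguments.
            result = await session.call_tool("example_tool", arguments={"query": "Acme Corp"})
            print(result.content)

asyncio.run(main())
```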
To get started:
- Read the paper
- Explore the dataset on Hugging Face
- Run the environment and harness on GitHub
- View rankings on the leaderboard
We’re releasing MCP-Atlas to help establish a shared, realistic standard for evaluating tool-using agents. We welcome feedback and contributions as the benchmark evolves, and we’re especially interested in seeing how new planning, retrieval, and execution strategies perform under MCP-Atlas.