62.3% Top Pass Rate | 1,000 Tasks (500 public + 500 held-out) | 36 MCP Servers | 220 Tools | 3–6 Tool Calls/Task
MCP-Atlas evaluates how well language models handle real-world tool use through the Model Context Protocol (MCP). Unlike benchmarks that focus on simple function calling, small tool sets, or simulated APIs, MCP-Atlas measures performance on realistic, multi-step workflows where models must:
discover the right tools from a noisy tool menu,
call tools with correct parameters and types,
recover from errors,
and synthesize tool outputs into an accurate final answer.
This benchmark targets a core deployment gap: models can be strong at reasoning and conversation, but still fail at reliable end-to-end tool use. Even the best-performing models still fail a large fraction of tasks, leaving meaningful headroom for improvement. The Scale Research team has published the paper, the dataset on Hugging Face, and the code repository on GitHub.
This leaderboard is part of the open-source MCP-Atlas release:
Paper: benchmark design, scoring methodology, and baseline results
Dataset (Hugging Face): 500-task public subset
Environment + harness (GitHub): containerized setup that runs the real MCP servers and enforces controlled tool exposure
The other 500 tasks are held out to preserve leaderboard integrity and reduce overfitting.
The MCP-Atlas dataset consists of 1,000 human-authored tasks, spanning 36 real MCP servers and 220 tools. The public leaderboard subset contains 500 tasks, designed to be representative of the full benchmark. Each task is written to require real tool use and to reflect realistic “agent” workflows:
3–6 tool calls per task
Cross-server orchestration is required for the vast majority of tasks
Approximately one-third of tasks include conditional branching, where later actions depend on earlier tool outputs
The evaluation covers MCP servers across several broad buckets:
| Category | Description |
|---|---|
| ENV-BASIC | Search, fetch data, maps, and weather. |
| ENV-ANALYTICS | Query and aggregate structured data. |
| ENV-PRODUCTIVITY | Work with files, notes, email, and docs. |
| ENV-FINANCIAL | Manipulate financial data. |
| ENV-CODING | Manage repos and run code. |
To prevent brute-force tool calling and to stress tool discovery, each task exposes a limited tool surface:
10–25 tools exposed per task
typically 3–7 required tools
plus 5–10 plausible distractors
Distractors are typically sampled from the same servers to ensure that success depends on accurate discovery and parameterization rather than simple server name recognition. This directly tests tool discovery capabilities that prior work identifies as a primary failure mode.
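For illustration, here is a minimal sketch of how a per-task tool surface could be assembled under these constraints. The function and argument names are assumptions for the example, not the released harness API.

```python
import random

def build_tool_surface(required_tools, server_tools, n_distractors=8, seed=0):
    """Hypothetical sketch: expose 10-25 tools per task by mixing the required
    tools with distractors drawn from the same servers.

    required_tools: list of (server, tool) pairs the task needs (typically 3-7).
    server_tools:   dict mapping server name -> list of all its tool names.
    """
    rng = random.Random(seed)
    required = set(required_tools)

    # Sample distractors from the same servers as the required tools, so that
    # success depends on tool-level discovery rather than server-name matching.
    candidates = [
        (server, tool)
        for server in {s for s, _ in required}
        for tool in server_tools[server]
        if (server, tool) not in required
    ]
    distractors = rng.sample(candidates, min(n_distractors, len(candidates)))

    surface = list(required) + distractors
    rng.shuffle(surface)
    return surface
```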
Tasks run against real MCP servers hosted in Docker containers, not mocked endpoints or simulations. This ensures models face authentic challenges including actual API latency, real error messages, and genuine data formats.
Infrastructure:
Containerized environment with the benchmark’s MCP servers
Per-task controlled tool exposure (targets + distractors)
Execution trace logging to support diagnostic analysis
State and Time-Invariance:
For stateful tools (e.g., docs/databases/files), the environment is seeded with fixed datasets so tasks remain time-invariant and reproducible.
Every task is designed to be solvable only through tool use:
Tool-dependent: cannot be answered correctly from parametric knowledge alone
Time-invariant: stable ground-truth answers over time
Unambiguous: a single correct outcome, verifiable via claim-level checks
A typical task might ask:
“Look up Microsoft's IPO price in 1986, find the IPO prices of Apple, Amazon, and Google, then calculate which company had the lowest IPO price and by what percentage it was lower than the highest.”
This requires data retrieval via financial tools and search engines followed by arithmetic computation – all common in real applications but challenging to orchestrate correctly.
Please scroll to the bottom for an actual task from the dataset, along with evaluations of two sample models.
MCP-Atlas uses an evaluation approach that prioritizes end-task success while capturing diagnostic information about tool use patterns.
The metric that determines leaderboard ranking is the pass rate: the percentage of tasks for which the model produces a sufficiently correct final answer.
We evaluate the model's response against a ground-truth answer that is split into a list of claims for easier verification via an LLM judge. To allow partial credit, each claim can be given a score of 1, 0.5 or 0.
1.0: Fully correct
0.5: Partially correct
0: Incorrect or missing
We define coverage as the average claim score for the task. For example, if a ground-truth answer contains 4 claims and the model's response receives scores of [1, 0.5, 0.5, 0], its coverage would be:
(1 + 0.5 + 0.5 + 0) / 4 = 50%
A task passes if the model achieves a coverage score of 75% or higher.
Since 50% < 75%, this task would be marked as failed.
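As a minimal sketch of this arithmetic (illustrative code, not the released evaluation harness):

```python
PASS_THRESHOLD = 0.75  # a task passes at coverage >= 75%

def coverage(claim_scores):
    """Mean of per-claim scores, each in {0, 0.5, 1}."""
    return sum(claim_scores) / len(claim_scores)

def passes(claim_scores, threshold=PASS_THRESHOLD):
    return coverage(claim_scores) >= threshold

# The worked example above: four claims scored [1, 0.5, 0.5, 0]
scores = [1, 0.5, 0.5, 0]
print(coverage(scores))  # 0.5, i.e. 50% coverage
print(passes(scores))    # False, since 50% < 75%
```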
We evaluated each model in its default configuration without adding any behavior-shaping system prompt. Each leaderboard model was provided with the user prompt and the associated set of tools as input.
We score each task by decomposing the ground-truth answer into constituent claims and evaluating the model’s final response against those claims.
Judge model. Gemini-2.5-Pro, temperature 0.0.
One judgment per claim. For a task with N claims, the judge runs N times.
What the judge sees.
The original task prompt
The task’s claim (one at a time)
The final answer from the model being evaluated
The claim’s scoring rubric (0 / 0.5 / 1 guidance)
What the judge returns.
A categorization in {incorrect, partially_correct, correct} for each claim, which is mapped to a numerical score in {0, 0.5, 1}
A brief, free-text justification for the score
Aggregation to task score.
Coverage = mean of per-claim scores for the task
Pass = 1 if Coverage ≥ 0.75, else 0
Pass Rate (leaderboard metric) = mean of Pass across all tasks
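Putting these pieces together, the scoring pipeline can be sketched roughly as follows. This is illustrative Python, not the actual harness code; `judge_claim` is a hypothetical stand-in for the Gemini-2.5-Pro judge call.

```python
# Judge categories map to numerical claim scores.
CATEGORY_TO_SCORE = {"incorrect": 0.0, "partially_correct": 0.5, "correct": 1.0}

def score_task(task_prompt, claims, final_answer, judge_claim):
    """One judge call per claim; returns (coverage, pass) for the task."""
    scores = []
    for claim in claims:
        # judge_claim sees the task prompt, one claim, the model's final
        # answer, and the claim's rubric; it returns a category plus a
        # free-text justification.
        category, _justification = judge_claim(task_prompt, claim, final_answer)
        scores.append(CATEGORY_TO_SCORE[category])
    task_coverage = sum(scores) / len(scores)
    return task_coverage, task_coverage >= 0.75

def pass_rate(task_results):
    """Leaderboard metric: mean of per-task pass indicators."""
    return sum(passed for _cov, passed in task_results) / len(task_results)
```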
Current results reveal that frontier models struggle with real-world tool use:
Top performer: Claude Opus 4.5 with 62.3% pass rate and 78.5% mean coverage.
Interquartile range (IQR):
Pass Rate: 8.4% – 43.98%
Coverage: 28.93% – 65.47%
Interquartile range (IQR) is the spread of the middle 50% of results – the range from the 25th to the 75th percentile. Here, it means that half of the evaluated models achieve a pass rate between roughly 8% and 44%.
Coverage measures how close a model’s answers get to the ground truth across all tasks, including failures. As defined above, we compute per-task coverage via claim-level scoring (0 / 0.5 / 1) and then average across tasks. Coverage is useful for analysis, for example in distinguishing models that narrowly miss the pass threshold from those that are far off.
As the table below shows, coverage and pass rate are closely related: higher coverage generally corresponds to a higher pass rate.
| Model | Pass Rate | Average Coverage |
|---|---|---|
| claude-opus-4-5-20251101 | 62.30% | 78.5% |
| gpt-5.2-2025-12-11 | 60.57% | 80.35% |
| gemini-3-pro-preview | 54.10% | 73.2% |
| gpt-5.1-2025-11-13 | 44.54% | 64.65% |
| gpt-5-2025-08-07 | 44.50% | 61.75% |
| claude-4-5-sonnet | 43.80% | 62.17% |
| O3 | 43.60% | 66.91% |
| Claude Opus 4.1 | 40.90% | 64.99% |
| Claude Sonnet 4 | 35.60% | 57.35% |
| GLM 4.5 Air* | 34.00% | 60.59% |
| Kimi K2 Instruct* | 23.90% | 50.41% |
| Qwen3-235B-A22B* | 12.00% | 29.06% |
| Gemini 2.5 Pro | 8.80% | 30.77% |
| GPT-4o | 7.20% | 28.53% |
| Gemini 2.5 Flash | 3.40% | 17.83% |
| Llama 4 Maverick* | 0.80% | 13.03% |
*Evaluations for these models were run using Fireworks AI for inference
We categorized every failed task to understand where models struggle most:
| Failure Category | Share of Failed Tasks* (typical range across models) | Primary Issues |
|---|---|---|
| Tool Usage | ~47.5%–68.5% | Wrong tool selection, incorrect parameters, schema violations, sequencing mistakes |
| Task Understanding | ~22.5%–36.0% | Premature stopping, missed subgoals, incomplete interpretation |
| Response Quality | ~5.5%–11.0% | Incorrect synthesis despite correct tool usage, hallucinated conclusions |
| Logical Errors | ~0.5%–5.5% | Incorrect conditional logic or flawed reasoning about tool outputs |
*Share of Failed Tasks: each range reflects the proportion of a model’s failed tasks attributed to that category, observed across all models tested.
The table above groups errors at a high level. Below, we surface the most common sub-patterns that roll up into those categories, using averages across all models.
| Category → Pattern | Avg. share of failures | What it looks like (at a glance) |
|---|---|---|
| Tool Usage → No tools called | 36.0% | No tool-call trajectory; gives up without realizing tools are available to solve the task. |
| Task Understanding → Partial task completion | 25.8% | Starts correctly but stops early; answers only part of a multi-step/multi-part prompt (not syntactic/tool errors). |
| Tool Usage → Incorrect tool parameters | 14.2% | Right tool, wrong/missing args; required vs. optional confusion; type/format mismatches (dates, units, locales). |
| Response Quality → Incorrect conclusion | 8.5% | Intermediate steps are correct, but final synthesis/claim is wrong (misinterpreted results). |
Here's a real task from MCP-Atlas that demonstrates how models are evaluated.
I'm researching papers on advertisement effectiveness and comparing it to our own online database advertising data. There's a 2024 paper by jane castleman that deals with ad control effectiveness, can you get me the abstract? I believe it mentions ad locality, for which I will also need to ask you for the date of our campaign with the biggest engagement rate, started during the 2015-2023 period, and its locality.
| # | Claim |
|---|---|
| 1 | There's a 2024 paper by Jane Castleman with the title "Why am I Still Seeing This: Measuring the Effectiveness Of Ad Controls and Explanations in AI-Mediated Ad Targeting Systems". |
| 2 | The abstract of the paper with title "Why am I Still Seeing This: Measuring the Effectiveness Of Ad Controls and Explanations in AI-Mediated Ad Targeting Systems" is: "Recently, Meta has shifted towards AI-mediated ad targeting mechanisms that do not require advertisers to provide detailed targeting criteria, likely driven by excitement over AI capabilities as well …" [truncated for brevity for this webpage] |
| 3 | There's a tie between three advertising campaigns with an engagement rate of 15%. |
| 4 | The starting dates of the three winning advertising campaigns are: 2022-06-24, 2019-09-20 and 2017-09-09 |
| 5 | The localities of the three winning advertisement campaigns are: "National", "International" and "International". |
Coverage Score: 0.4
Claim-level Scores: [1, 1, 0, 0, 0] = 2/5 = 0.4
Reason: The model retrieves the paper title and abstract but fails to retrieve the requested information about the user's advertising campaign history.
Model Response:
I found the 2024 paper by Jane Castleman that you mentioned. Here's the abstract:
**"Why am I Still Seeing This: Measuring the Effectiveness Of Ad Controls and Explanations in AI-Mediated Ad Targeting Systems"** by Jane Castleman and Aleksandra Korolova (2024):
"Recently, Meta has shifted towards AI-mediated ad targeting mechanisms that do not require advertisers to provide detailed targeting criteria, likely driven by excitement over AI capabilities as well as new data privacy policies and targeting changes agreed upon in civil rights settlements. At the same time, Meta has touted their ad preference controls as an effective mechanism for users to control the ads they see. Furthermore, Meta markets their targeting explanations as a transparency tool that allows users to understand why they saw certain ads and inform actions to control future ads.
Our study evaluates the effectiveness of Meta's "See less" ad control and the actionability of ad targeting explanations following the shift to AI-mediated targeting. We conduct a large-scale study, randomly assigning participants to mark "See less" to Body Weight Control or Parenting topics, and collecting the ads and targeting explanations Meta shows to participants before and after the intervention. We find that utilizing the "See less" ad control for the topics we study does not significantly reduce the number of ads shown by Meta on these topics, and that the control is less effective for some users whose demographics are correlated with the topic. Furthermore, we find that the majority of ad targeting explanations for local ads made no reference to location-specific targeting criteria, and did not inform users why ads related to the topics they marked to "See less" of continued to be delivered. We hypothesize that the poor effectiveness of controls and lack of actionability in explanations are the result of the shift to AI-mediated targeting, for which explainability and transparency tools have not yet been developed. Our work thus provides evidence for the need of new methods for transparency and user control, suitable and reflective of increasingly complex AI-mediated ad delivery systems."
The paper does indeed discuss ad locality, specifically noting that "the majority of ad targeting explanations for local ads made no reference to location-specific targeting criteria."
However, regarding your internal campaign data from 2015-2023, I searched the available knowledge graph but couldn't find any information about advertising campaigns, engagement rates, or locality data. The knowledge graph appears to contain only construction company information and related personnel data. You may need to access your advertising database directly or provide additional context about where this campaign data might be stored.
Coverage Score: 1.0
Claim-level Scores: [1, 1, 1, 1, 1] = 5/5 = 1
Reason: The model retrieves the paper title, abstract, and all requested information about the user's past advertising campaigns. It recognizes that three campaigns fulfill the criteria and provides details for each, rather than arbitrarily picking just one.
Model Response:
- Paper abstract (Castleman, 2024):
“Recently, Meta has shifted towards AI-mediated ad targeting mechanisms that do not require advertisers to provide detailed targeting criteria, likely driven by excitement over AI capabilities as well as new data privacy policies and targeting changes agreed upon in civil rights settlements. At the same time, Meta has touted their ad preference controls as an effective mechanism for users to control the ads they see. Furthermore, Meta markets their targeting explanations as a transparency tool that allows users to understand why they saw certain ads and inform actions to control future ads.
Our study evaluates the effectiveness of Meta's "See less" ad control and the actionability of ad targeting explanations following the shift to AI-mediated targeting. We conduct a large-scale study, randomly assigning participants to mark "See less" to Body Weight Control or Parenting topics, and collecting the ads and targeting explanations Meta shows to participants before and after the intervention. We find that utilizing the "See less" ad control for the topics we study does not significantly reduce the number of ads shown by Meta on these topics, and that the control is less effective for some users whose demographics are correlated with the topic. Furthermore, we find that the majority of ad targeting explanations for local ads made no reference to location-specific targeting criteria, and did not inform users why ads related to the topics they marked to "See less" of continued to be delivered. We hypothesize that the poor effectiveness of controls and lack of actionability in explanations are the result of the shift to AI-mediated targeting, for which explainability and transparency tools have not yet been developed. Our work thus provides evidence for the need of new methods for transparency and user control, suitable and reflective of increasingly complex AI-mediated ad delivery systems.”
- Your top-engagement campaign (started 2015–2023):
• Max EngagementRate: 15.00
• Campaign: CAMP9412
• Start date: 2017-09-09
• Locality (TargetLocation): International
Details and sources:
- Abstract pulled from arXiv (ID: 2408.11910; “Why am I Still Seeing This: Measuring the Effectiveness Of Ad Controls and Explanations in AI-Mediated Ad Targeting Systems,” Jane Castleman; Aleksandra Korolova; published 2024-08-21).
- Campaign data queried from your Notion “Advertising” database, sorted by EngagementRate and checked for StartDate in 2015–2023.
- Note: multiple campaigns tie at 15.00; CAMP9412 (Start 2017-09-09, International) is the earliest with the max engagement rate in that window. Others include CAMP1895 (2019-09-20, International) and CAMP9257 (2022-06-24, National).
| Model | Pass Rate (%) |
|---|---|
| claude-opus-4-5-20251101 | 62.30±1.76 |
| gpt-5.2-2025-12-11 | 60.57±1.62 |
| gemini-3-flash-preview | 57.40±1.48 |
| gemini-3-pro-preview | 54.10±1.59 |
| gpt-5.1-2025-11-13 | 44.54±1.49 |
| gpt-5-2025-08-07 | 44.50±1.69 |
| claude-4-5-sonnet | 43.80±1.57 |
| o3-pro-2025-06-10-high | 43.60±1.56 |
| claude-opus-4-1-20250805 | 40.90±1.49 |
| claude-4-Sonnet-20250514 | 35.60±1.57 |
| claude-haiku-4-5-20251001 | 34.60±1.47 |
| glm-4p5-air* | 34.00±1.32 |
| kimi-k2-instruct* | 23.90±1.14 |
| qwen3-235B-A22B* | 12.00±0.94 |
| gemini-2.5-Pro-Preview-06-05 | 8.80±0.62 |
| gpt-4o-2024-11-20 | 7.20±0.51 |
| gemini-2.5-flash-preview-05-20 | 3.40±0.33 |
| Llama-4-Maverick* | 0.80±0.08 |
Rank (UB): 1 + the number of models whose lower CI bound exceeds this model’s upper CI bound (see the sketch below the footnote).
*Evaluations for these models were run using Fireworks AI for inference
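For reference, here is a minimal sketch of how Rank (UB) can be computed from the reported pass rates and confidence-interval half-widths (illustrative code, not the leaderboard's own implementation):

```python
def rank_ub(models):
    """Rank (UB): 1 + the number of models whose lower CI bound exceeds this
    model's upper CI bound. `models` maps name -> (pass_rate, ci_half_width)."""
    ranks = {}
    for name, (mean, ci) in models.items():
        upper = mean + ci
        strictly_better = sum(
            1 for other, (m, c) in models.items()
            if other != name and (m - c) > upper
        )
        ranks[name] = 1 + strictly_better
    return ranks

# Example with the top three rows of the table above:
print(rank_ub({
    "claude-opus-4-5-20251101": (62.30, 1.76),
    "gpt-5.2-2025-12-11": (60.57, 1.62),
    "gemini-3-flash-preview": (57.40, 1.48),
}))
# -> {'claude-opus-4-5-20251101': 1, 'gpt-5.2-2025-12-11': 1,
#     'gemini-3-flash-preview': 3}
```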