44.5% Top Pass Rate | 1,000 Tasks | 40+ MCP Servers | 300+ Tools | 3-6 Tool Calls/Task
MCP-Atlas evaluates how well language models handle real-world tool use through the Model Context Protocol (MCP). Unlike benchmarks that test simple function calling or simulated APIs, MCP-Atlas measures performance on multi-step workflows that require discovering the right tools, calling them correctly, and synthesizing results into accurate answers.
The benchmark addresses a critical gap: while LLMs excel at conversation and reasoning, they struggle with practical tool use. The best models today pass fewer than half of our tasks, revealing massive headroom for improvement in this essential capability for AI agents.
Each task in MCP-Atlas requires several tool calls across real MCP servers running in production-like environments. Models must navigate authentic APIs with actual data, handle errors and retries, and coordinate tools across different servers – exactly what they'll face in real deployments.
The MCP-Atlas dataset consists of 1,000 human-authored tasks, each requiring multiple tool calls to solve, drawing on 40+ MCP servers and 300+ tools. This scale enables statistically meaningful comparisons, while human authoring ensures a level of quality and realism that synthetic generation cannot match.
The evaluation includes over 40 MCP servers spanning diverse domains:
Category | Representative servers (examples) | Description |
---|---|---|
ENV-BASIC | | Search, fetch data, maps, and weather. |
ENV-ANALYTICS | | Query and aggregate structured data. |
ENV-PRODUCTIVITY | | Work with files, notes, email, and docs. |
ENV-FINANCIAL | | Manipulate financial data. |
ENV-CODING | | Manage repos and run code. |
Tasks are designed to require multiple tool calls and cross-server orchestration:
65% of tasks require multiple servers: Testing cross-system orchestration capabilities
40% include conditional logic: Requiring dynamic strategy adjustment based on intermediate results
3-6 tool calls per task: Reflecting realistic workflow complexity
12-18 tools exposed per task: Including 3-7 required tools and 5-10 plausible distractors
Tasks range from single-domain queries requiring 2-3 tools with straightforward chaining to complex workflows requiring 5+ tools with conditional branches and error handling.
Each task includes distractor tools that are plausible but incorrect choices. Data annotators select these distractors from the same categories as the required tools:
Semantic neighbors: Brave Search vs DuckDuckGo Search vs Exa Search
Similar functionality: MongoDB vs Airtable for data operations
Related but distinct: Git vs GitHub for repository operations
The harness exposes 12-18 tools per task (3-7 required tools plus 5-10 distractors), forcing agents to reason over tool descriptions rather than brute-force invocation.
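To make the setup concrete, the sketch below shows what a single task's tool-exposure configuration could look like. The schema, tool identifiers, and counts in this toy example are hypothetical; they only illustrate the required-plus-distractor split, not the benchmark's actual format.

```python
# Hypothetical illustration of a task's tool exposure; the field names and
# tool identifiers are illustrative, not MCP-Atlas's actual schema.
task_config = {
    "task_id": "example-0001",
    "prompt": (
        "Find the advertising campaigns with the highest engagement rate "
        "and summarize their start dates and localities."
    ),
    # 3-7 tools the reference solution actually needs
    "required_tools": [
        "notion.query_database",
        "arxiv.get_abstract",
        "exa_search.search",
    ],
    # 5-10 plausible but unnecessary tools drawn from the same categories
    "distractor_tools": [
        "brave_search.web_search",
        "duckduckgo.web_search",
        "mongodb.find",
        "airtable.list_records",
        "filesystem.read_file",
    ],
}

# The harness exposes the union (12-18 tools in real tasks), so the agent must
# choose among them based on tool descriptions rather than by brute force.
exposed_tools = task_config["required_tools"] + task_config["distractor_tools"]
print(len(exposed_tools), "tools exposed:", exposed_tools)
```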
Tasks run against real MCP servers hosted in Docker containers, not mocked endpoints or simulations. This ensures models face authentic challenges including actual API latency, real error messages, and genuine data formats.
Infrastructure:
40+ MCP servers run in standalone Docker containers
Controlled tool exposure per task (12-18 tools including targets and distractors)
Complete execution trace logging for diagnostic analysis
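For readers who want to see what talking to one of these containerized servers looks like mechanically, here is a minimal sketch using the open-source MCP Python SDK. The Docker image, tool name, and arguments are placeholders, and this is not the MCP-Atlas harness code.

```python
# Minimal sketch using the open-source MCP Python SDK (pip install mcp).
# The Docker image, tool name, and arguments below are placeholders.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def list_and_call() -> None:
    # Launch an MCP server inside a Docker container and speak MCP over stdio.
    params = StdioServerParameters(
        command="docker",
        args=["run", "-i", "--rm", "example/filesystem-mcp-server"],  # placeholder image
    )
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([t.name for t in tools.tools])  # tool descriptions drive selection
            result = await session.call_tool("read_file", arguments={"path": "/data/readme.txt"})
            print(result.content)


asyncio.run(list_and_call())
```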
Data Sources:
The benchmark includes both stateful and stateless servers to test different interaction patterns:
Real-World Datasets: Servers like Notion, Airtable, MongoDB, and filesystem are populated with authentic datasets that Scale owns, including project documentation, CRM records, and inventory data.
Synthetic Datasets: Servers like Slack and Google Calendar are seeded with synthetic data, such as conversation histories and meeting schedules that simulate typical organizational usage patterns.
Coming Soon: We will open-source a complete benchmark dataset, an MCP environment with an evaluation harness, and a research paper. These are separate from the private dataset used for this leaderboard, which we keep held out to prevent contamination.
Every task in MCP-Atlas is designed to be solvable only through tool use. Models cannot succeed by relying on training data or general knowledge; they must actually discover appropriate tools, call them with correct parameters, and ground their answers in the results.
Tasks follow strict authoring guidelines:
Tool-dependent: The answer cannot be determined without calling specific tools, while ensuring the prompt doesn’t telegraph which tools to use
Time-invariant: Tasks remain valid over time with stable ground-truth answers
Unambiguous: Each task has exactly one correct answer that can be objectively verified
A typical task might ask:
“Look up Microsoft's IPO price in 1986, find the IPO prices of Apple, Amazon, and Google, then calculate which company had the lowest IPO price and by what percentage it was lower than the highest.”
This requires data retrieval via financial tools and search engines followed by arithmetic computation – all common in real applications but challenging to orchestrate correctly.
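A hedged sketch of the retrieval-then-arithmetic pattern such a task requires is shown below. The lookup function and the prices it returns are placeholders standing in for real financial-data or search tool calls; only the percentage computation itself is the point.

```python
# Illustrative only: the lookup and the prices are placeholders, not benchmark tools or data.
def get_ipo_price(company: str) -> float:
    # In the benchmark, this step would be a call to a financial-data or search MCP tool.
    illustrative_prices = {"Microsoft": 21.0, "Apple": 22.0, "Amazon": 18.0, "Google": 85.0}
    return illustrative_prices[company]


prices = {c: get_ipo_price(c) for c in ["Microsoft", "Apple", "Amazon", "Google"]}
lowest_company = min(prices, key=prices.get)
highest_price = max(prices.values())
pct_lower = (highest_price - prices[lowest_company]) / highest_price * 100
print(f"{lowest_company} had the lowest IPO price, {pct_lower:.1f}% below the highest.")
```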
MCP-Atlas uses an evaluation approach that prioritizes end-task success while capturing diagnostic information about tool use patterns.
The metric that determines leaderboard ranking is the pass rate - the percentage of all tasks where the model produces a sufficiently correct final answer.
We evaluate the model's response against a ground-truth answer that is split into a list of claims for easier verification via an LLM judge. To allow partial credit, each claim can be given a score of 1, 0.5 or 0.
1.0: Fully correct
0.5: Partially correct
0: Incorrect or missing
We define coverage as the average claim score for the task. For example, if a ground-truth answer contains 4 claims and the model's response receives scores of [1, 0.5, 0.5, 0], its coverage would be:
(1 + 0.5 + 0.5 + 0) / 4 = 50%
A task passes if the model achieves a coverage score of 75% or higher.
Since 50% < 75%, this task would be marked as failed.
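In code, the coverage and pass computation reduces to a few lines. The sketch below is an illustrative reimplementation of the scoring rule described above, not the benchmark's actual harness code.

```python
# Illustrative reimplementation of the coverage and pass rule described above.
def coverage(claim_scores: list[float]) -> float:
    """Mean of per-claim scores (each 0, 0.5, or 1)."""
    return sum(claim_scores) / len(claim_scores)


def passes(claim_scores: list[float], threshold: float = 0.75) -> bool:
    """A task passes if its coverage meets or exceeds the 75% threshold."""
    return coverage(claim_scores) >= threshold


scores = [1, 0.5, 0.5, 0]    # the worked example above
print(coverage(scores))      # 0.5  -> 50% coverage
print(passes(scores))        # False -> 50% < 75%, so the task fails
```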
We evaluated each model in its default configuration without adding any behavior-shaping system prompt. Each leaderboard model was provided with the user prompt and the associated set of tools as input.
We score each task by decomposing the ground-truth answer into constituent claims and evaluating the model’s final response against those claims.
Judge model. Gemini-2.5-Pro, temperature 0.0.
One judgment per claim. For a task with N claims, the judge runs N times.
What the judge sees.
The original task prompt
The task’s claim (one at a time)
The final answer from the model being evaluated
The claim’s scoring rubric (0 / 0.5 / 1 guidance)
What the judge returns.
A categorization in {incorrect, partially_correct, correct} for each claim, which is mapped to a numerical score in {0, 0.5, 1}
A brief, free-text justification for the score
Aggregation to task score.
Coverage = mean of per-claim scores for the task
Pass = 1 if Coverage ≥ 0.75, else 0
Pass Rate (leaderboard metric) = mean of Pass across all tasks
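Put together, the judging and aggregation pipeline can be sketched as follows. The `judge_claim` function is a stand-in for the Gemini-2.5-Pro judge call, and the task/claim data structures are hypothetical, not the benchmark's actual schema.

```python
# Sketch of the per-claim judging loop and the aggregation to the leaderboard metric.
CATEGORY_TO_SCORE = {"incorrect": 0.0, "partially_correct": 0.5, "correct": 1.0}


def judge_claim(prompt: str, claim: str, answer: str, rubric: str) -> str:
    """Stand-in for one judge call; returns one of the three categories."""
    # In the real harness this would call the judge model with the task prompt,
    # the claim, the model's final answer, and the claim's rubric, then parse
    # its categorization and brief justification.
    return "partially_correct"  # placeholder so the sketch runs end to end


def score_task(task: dict, model_answer: str) -> tuple[float, int]:
    claim_scores = []
    for claim in task["claims"]:                      # one judgment per claim
        category = judge_claim(task["prompt"], claim["text"], model_answer, claim["rubric"])
        claim_scores.append(CATEGORY_TO_SCORE[category])
    cov = sum(claim_scores) / len(claim_scores)       # coverage
    return cov, int(cov >= 0.75)                      # pass if coverage >= 75%


def pass_rate(tasks: list[dict], answers: list[str]) -> float:
    passes = [score_task(t, a)[1] for t, a in zip(tasks, answers)]
    return sum(passes) / len(passes)                  # leaderboard metric


task = {"prompt": "…", "claims": [{"text": "claim 1", "rubric": "…"},
                                  {"text": "claim 2", "rubric": "…"}]}
print(score_task(task, model_answer="…"))  # (0.5, 0) with the placeholder judge
```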
Current results reveal that frontier models struggle with real-world tool use:
Top performer: GPT-5 with a 44.5% pass rate and 61.75% average coverage
Interquartile range (IQR):
Pass Rate: 8.0% – 38.3%
Coverage: 28.8% – 61.2%
Interquartile range (IQR) is the spread of the middle 50% of results – the range from the 25th to the 75th percentile. Here, it means that the middle half of the models have pass rates between 8.0% and 38.3%.
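A minimal sketch of that computation, using the per-model pass rates from the results table below, is:

```python
# Minimal sketch: IQR of the per-model pass rates reported in the table below.
import numpy as np

pass_rates = [44.5, 43.6, 40.9, 35.6, 34.0, 23.9, 12.0, 8.8, 7.2, 3.4, 0.8]
q25, q75 = np.percentile(pass_rates, [25, 75])
print(q25, q75)  # approximately 8.0 and 38.3: the middle 50% of models fall in this range
```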
Coverage measures how close a model’s answers get to the ground truth across all tasks, including failures. As defined above, we compute per-task coverage via claim-level scoring (0 / 0.5 / 1) and then average across tasks. Coverage is useful for analysis, for example in distinguishing models that narrowly miss passes from those that are far off.
As shown in the table below, coverage and pass rate are closely related: higher coverage generally corresponds to a higher pass rate.
Model | Pass Rate | Average Coverage |
---|---|---|
GPT-5 | 44.50% | 61.75% |
O3 | 43.60% | 66.91% |
Claude Opus 4.1 | 40.90% | 64.99% |
Claude Sonnet 4 | 35.60% | 57.35% |
GLM 4.5 | 34.00% | 60.59% |
Kimi K2 | 23.90% | 50.41% |
Qwen3 | 12.00% | 29.06% |
Gemini 2.5 Pro | 8.80% | 30.77% |
GPT-4o | 7.20% | 28.53% |
Gemini 2.5 Flash | 3.40% | 17.83% |
Maverick | 0.80% | 13.03% |
We categorized every failed task to understand where models struggle most:
Failure Category | % of All Task Failures* | Primary Issues |
---|---|---|
Tool Usage | 34-52% | Wrong tool selection, incorrect parameters, schema violations, improper sequencing |
Task Understanding | 23-38% | Premature stopping, missing subgoals, incomplete requirement interpretation |
Response Quality | 3-8% | Inaccurate synthesis despite correct tool use, hallucinated information |
Logical Errors | 1-2% | Flawed reasoning about tool outputs, incorrect conditional logic |
*Percentage of All Task Failures: the range reflects how the share of failures falling into each category varies across the models tested.
The table above groups errors at a high level. Below, we surface the most common sub-patterns that roll up into those categories, using averages across all models.
Category → Pattern | Avg. share of failures | What it looks like (at a glance) |
---|---|---|
Tool Usage → No tools called | 36.31% | No tool call trajectory; gives up without realizing tools are available to solve the task. |
Task Understanding → Partial task completion | 29.75% | Starts correctly but stops early; answers only part of a multi-step/multi-part prompt (not syntactic/tool errors). |
Tool Usage → Incorrect tool parameters | 12.75% | Right tool, wrong/missing args; required vs optional confusion; type/format mismatches (dates, units, locales). |
Response Quality → Incorrect conclusion | 12.26% | Intermediate steps are correct, but final synthesis/claim is wrong (misinterpreted results). |
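The averages above can be produced from per-task failure labels with a simple aggregation. The sketch below assumes a hypothetical table of (model, failure pattern) labels and is not the actual analysis code; the two models and handful of rows are only there to make it runnable.

```python
# Hypothetical sketch: averaging failure-pattern shares across models.
import pandas as pd

# One row per failed task: which model failed it and the assigned failure pattern.
failures = pd.DataFrame({
    "model":   ["model-a", "model-a", "model-b", "model-b", "model-b"],
    "pattern": ["No tools called", "Incorrect conclusion",
                "No tools called", "Partial task completion", "No tools called"],
})

# Share of each pattern within each model's failures, then averaged across models.
per_model = failures.groupby("model")["pattern"].value_counts(normalize=True)
avg_share = per_model.groupby(level="pattern").mean().sort_values(ascending=False)
print(avg_share)
```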
Below is a sample failure categorization for GPT-5.
Failure mode | % of GPT-5 failures |
---|---|
Partial task completion | 35.07% |
No tools called | 20.91% |
Incorrect conclusion | 16.77% |
Incorrect tool parameters | 15.03% |
Skipped steps | 7.69% |
Missing tool calls | 2.37% |
Misunderstood task | 1.38% |
Insufficient exploration | 0.79% |
Capability discovery missing | 0.00% |
Environment constraint violation | 0.00% |
Wrong tool selection | 0.00% |
Across 1,000 private tasks, the best models pass fewer than 50% of tasks. The dominant failure modes are tool-use mistakes and premature stopping, not raw knowledge gaps. Together, these findings point to bottlenecks in reliable tool discovery, parameterization, and multi-step completion.
Here's a real task from MCP-Atlas that demonstrates how models are evaluated.
I'm researching papers on advertisement effectiveness and comparing it to our own online database advertising data. There's a 2024 paper by jane castleman that deals with ad control effectiveness, can you get me the abstract? I believe it mentions ad locality, for which I will also need to ask you for the date of our campaign with the biggest engagement rate, started during the 2015-2023 period, and its locality.
# | Claim |
---|---|
1 | There's a 2024 paper by Jane Castleman with the title "Why am I Still Seeing This: Measuring the Effectiveness Of Ad Controls and Explanations in AI-Mediated Ad Targeting Systems". |
2 | The abstract of the paper with title "Why am I Still Seeing This: Measuring the Effectiveness Of Ad Controls and Explanations in AI-Mediated Ad Targeting Systems" is: "Recently, Meta has shifted towards AI-mediated ad targeting mechanisms that do not require advertisers to provide detailed targeting criteria, likely driven by excitement over AI capabilities as well …" [truncated for brevity for this webpage] |
3 | There's a tie between three advertising campaigns with an engagement rate of 15%. |
4 | The starting dates of the three winning advertising campaigns are: 2022-06-24, 2019-09-20 and 2017-09-09 |
5 | The localities of the three winning advertisement campaigns are: "National", "International" and "International". |
Coverage Score: 0.4
Claim-level Scores: [1, 1, 0, 0, 0] = 2/5 = 0.4
Reason: The model retrieves the paper title and abstract but fails to retrieve the requested information about the user's advertising campaign history.
Model Response:
I found the 2024 paper by Jane Castleman that you mentioned. Here's the abstract:
**"Why am I Still Seeing This: Measuring the Effectiveness Of Ad Controls and Explanations in AI-Mediated Ad Targeting Systems"** by Jane Castleman and Aleksandra Korolova (2024):
"Recently, Meta has shifted towards AI-mediated ad targeting mechanisms that do not require advertisers to provide detailed targeting criteria, likely driven by excitement over AI capabilities as well as new data privacy policies and targeting changes agreed upon in civil rights settlements. At the same time, Meta has touted their ad preference controls as an effective mechanism for users to control the ads they see. Furthermore, Meta markets their targeting explanations as a transparency tool that allows users to understand why they saw certain ads and inform actions to control future ads.
Our study evaluates the effectiveness of Meta's "See less" ad control and the actionability of ad targeting explanations following the shift to AI-mediated targeting. We conduct a large-scale study, randomly assigning participants to mark "See less" to Body Weight Control or Parenting topics, and collecting the ads and targeting explanations Meta shows to participants before and after the intervention. We find that utilizing the "See less" ad control for the topics we study does not significantly reduce the number of ads shown by Meta on these topics, and that the control is less effective for some users whose demographics are correlated with the topic. Furthermore, we find that the majority of ad targeting explanations for local ads made no reference to location-specific targeting criteria, and did not inform users why ads related to the topics they marked to "See less" of continued to be delivered. We hypothesize that the poor effectiveness of controls and lack of actionability in explanations are the result of the shift to AI-mediated targeting, for which explainability and transparency tools have not yet been developed. Our work thus provides evidence for the need of new methods for transparency and user control, suitable and reflective of increasingly complex AI-mediated ad delivery systems."
The paper does indeed discuss ad locality, specifically noting that "the majority of ad targeting explanations for local ads made no reference to location-specific targeting criteria."
However, regarding your internal campaign data from 2015-2023, I searched the available knowledge graph but couldn't find any information about advertising campaigns, engagement rates, or locality data. The knowledge graph appears to contain only construction company information and related personnel data. You may need to access your advertising database directly or provide additional context about where this campaign data might be stored.
Coverage Score: 1.0
Claim-level Scores: [1, 1, 1, 1, 1] = 5/5 = 1
Reason: The model retrieves the paper title, abstract, and all requested information about the user's past advertising campaigns. It recognizes that three campaigns fulfill the criteria and provides details for each, instead of arbitrarily picking just one.
Model Response:
- Paper abstract (Castleman, 2024):
“Recently, Meta has shifted towards AI-mediated ad targeting mechanisms that do not require advertisers to provide detailed targeting criteria, likely driven by excitement over AI capabilities as well as new data privacy policies and targeting changes agreed upon in civil rights settlements. At the same time, Meta has touted their ad preference controls as an effective mechanism for users to control the ads they see. Furthermore, Meta markets their targeting explanations as a transparency tool that allows users to understand why they saw certain ads and inform actions to control future ads.
Our study evaluates the effectiveness of Meta's "See less" ad control and the actionability of ad targeting explanations following the shift to AI-mediated targeting. We conduct a large-scale study, randomly assigning participants to mark "See less" to Body Weight Control or Parenting topics, and collecting the ads and targeting explanations Meta shows to participants before and after the intervention. We find that utilizing the "See less" ad control for the topics we study does not significantly reduce the number of ads shown by Meta on these topics, and that the control is less effective for some users whose demographics are correlated with the topic. Furthermore, we find that the majority of ad targeting explanations for local ads made no reference to location-specific targeting criteria, and did not inform users why ads related to the topics they marked to "See less" of continued to be delivered. We hypothesize that the poor effectiveness of controls and lack of actionability in explanations are the result of the shift to AI-mediated targeting, for which explainability and transparency tools have not yet been developed. Our work thus provides evidence for the need of new methods for transparency and user control, suitable and reflective of increasingly complex AI-mediated ad delivery systems.”
- Your top-engagement campaign (started 2015–2023):
• Max EngagementRate: 15.00
• Campaign: CAMP9412
• Start date: 2017-09-09
• Locality (TargetLocation): International
Details and sources:
- Abstract pulled from arXiv (ID: 2408.11910; “Why am I Still Seeing This: Measuring the Effectiveness Of Ad Controls and Explanations in AI-Mediated Ad Targeting Systems,” Jane Castleman; Aleksandra Korolova; published 2024-08-21).
- Campaign data queried from your Notion “Advertising” database, sorted by EngagementRate and checked for StartDate in 2015–2023.
- Note: multiple campaigns tie at 15.00; CAMP9412 (Start 2017-09-09, International) is the earliest with the max engagement rate in that window. Others include CAMP1895 (2019-09-20, International) and CAMP9257 (2022-06-24, National).
Model | Pass Rate (%) ± CI |
---|---|
gpt-5-2025-08-07 | 44.50 ± 1.69 |
o3-pro-2025-06-10-high | 43.60 ± 1.56 |
claude-opus-4-1-20250805 | 40.90 ± 1.49 |
claude-4-Sonnet-20250514 | 35.60 ± 1.57 |
glm-4p5-air | 34.00 ± 1.32 |
kimi-k2-instruct | 23.90 ± 1.14 |
qwen3-235B-A22B | 12.00 ± 0.94 |
gemini-2.5-Pro-Preview-06-05 | 8.80 ± 0.62 |
gpt-4o-2024-11-20 | 7.20 ± 0.51 |
gemini-2.5-flash-preview-05-20 | 3.40 ± 0.33 |
Llama-4-Maverick | 0.80 ± 0.08 |
Rank (UB): 1 + the number of models whose lower CI bound exceeds this model’s upper CI bound.
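As an illustration, this ranking rule can be computed directly from the (pass rate, CI) pairs in the table above; the sketch below applies it to the top three rows.

```python
# Sketch of the Rank (UB) rule: 1 + the number of models whose lower CI bound
# exceeds this model's upper CI bound. Values taken from the table above.
results = {
    "gpt-5-2025-08-07": (44.50, 1.69),
    "o3-pro-2025-06-10-high": (43.60, 1.56),
    "claude-opus-4-1-20250805": (40.90, 1.49),
}


def rank_ub(model: str) -> int:
    upper = results[model][0] + results[model][1]
    return 1 + sum(
        1 for other, (mean, ci) in results.items()
        if other != model and mean - ci > upper
    )


print({m: rank_ub(m) for m in results})
# gpt-5's upper bound is 46.19; no other model's lower bound exceeds it, so its Rank (UB) is 1.
```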