
MCP Atlas

Key Metrics at a Glance

62.3% Top Pass Rate | 1,000 Tasks (500 public + 500 held-out) | 36 MCP Servers | 220 Tools | 3–6 Tool Calls/Task


Introduction

MCP-Atlas evaluates how well language models handle real-world tool use through the Model Context Protocol (MCP). Unlike benchmarks that focus on simple function calling, small tool sets, or simulated APIs, MCP-Atlas measures performance on realistic, multi-step workflows where models must:

  • discover the right tools from a noisy tool menu,

  • call tools with correct parameters and types,

  • recover from errors,

  • and synthesize tool outputs into an accurate final answer.

This benchmark targets a core deployment gap: models can be strong at reasoning and conversation, but still fail at reliable end-to-end tool use. Even the best-performing models still fail a large fraction of tasks, leaving meaningful headroom for improvement. The Scale Research team published the paper, the dataset on Hugging Face, and the code repository on GitHub.


Release Artifacts

This leaderboard is part of the open-source MCP-Atlas release:

  • Paper: benchmark design, scoring methodology, and baseline results

  • Dataset (Hugging Face): 500-task public subset

  • Environment + harness (GitHub): containerized setup that runs the real MCP servers and enforces controlled tool exposure

The other 500 tasks are held out to preserve leaderboard integrity and reduce overfitting.


Dataset Overview

Scale and Composition

The MCP-Atlas dataset consists of 1,000 human-authored tasks, spanning 36 real MCP servers and 220 tools. The public leaderboard subset contains 500 tasks, designed to be representative of the full benchmark. Each task is written to require real tool use and to reflect realistic “agent” workflows:

  • 3–6 tool calls per task

  • Cross-server orchestration is required for the vast majority of tasks

  • Approximately one-third of tasks include conditional branching, where later actions depend on earlier tool outputs

MCP Server Distribution

The evaluation covers MCP servers across several broad buckets:

| Category | Representative servers (examples) | Description |
| --- | --- | --- |
| ENV-BASIC | brave_search, ddg_search, exa, weather, google-maps, fetch | Search, fetch data, maps, and weather. |
| ENV-ANALYTICS | mongodb, airtable, calculator | Query and aggregate structured data. |
| ENV-PRODUCTIVITY | filesystem, notion, slack, google-workspace, arxiv, pubmed | Work with files, notes, email, and docs. |
| ENV-FINANCIAL | twelvedata, alchemy | Manipulate financial data. |
| ENV-CODING | git, github, mcp-code-executor, cli-mcp-server, e2b-server | Manage repos and run code. |

Tool Exposure and Distractors

To prevent brute-force tool calling and to stress tool discovery, each task exposes a limited tool surface:

  • 10–25 tools exposed per task

  • typically 3–7 required tools

  • plus 5–10 plausible distractors

Distractors are typically sampled from the same servers to ensure that success depends on accurate discovery and parameterization rather than simple server name recognition. This directly tests tool discovery capabilities that prior work identifies as a primary failure mode.
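
To illustrate the setup (this is our own sketch, not the released harness; the function and field names are hypothetical), the per-task tool menu can be thought of as the required tools plus distractors sampled from the same servers:

```python
import random

def build_tool_menu(required_tools, server_tool_index, n_distractors=8, seed=0):
    """Assemble a per-task tool menu: the required tools plus plausible
    distractors drawn from the same MCP servers (hypothetical helper).

    required_tools    : list of (server, tool_name) pairs the task needs
    server_tool_index : dict mapping each server to all tool names it exposes
    """
    rng = random.Random(seed)
    menu = list(required_tools)

    # Candidate distractors: other tools on the servers already in play,
    # so success hinges on discovery, not server-name recognition.
    candidates = [
        (server, tool)
        for server, _ in required_tools
        for tool in server_tool_index[server]
        if (server, tool) not in required_tools
    ]
    menu += rng.sample(candidates, min(n_distractors, len(candidates)))

    rng.shuffle(menu)  # no positional hints about which tools are required
    return menu
```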


MCP Environment

Tasks run against real MCP servers hosted in Docker containers, not mocked endpoints or simulations. This ensures models face authentic challenges including actual API latency, real error messages, and genuine data formats.

Infrastructure:

  • Containerized environment with the benchmark’s MCP servers

  • Per-task controlled tool exposure (targets + distractors)

  • Execution trace logging to support diagnostic analysis

State and Time-Invariance:

  • For stateful tools (e.g., docs/databases/files), the environment is seeded with fixed datasets so tasks remain time-invariant and reproducible.
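
As a concrete, purely illustrative example of what running against real MCP servers looks like, the sketch below connects to a containerized server and invokes a tool, assuming the official MCP Python SDK (`mcp` package); the Docker image, tool name, and arguments are placeholders:

```python
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Placeholder launch command for a containerized MCP server.
    params = StdioServerParameters(
        command="docker",
        args=["run", "-i", "--rm", "example/mcp-filesystem-server"],
    )
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # The harness controls which of these tools the model actually
            # sees (targets + distractors, per task).
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

            # Illustrative tool invocation against the seeded, fixed dataset.
            result = await session.call_tool(
                "read_file", arguments={"path": "/data/notes.txt"}
            )
            print(result.content)

asyncio.run(main())
```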


Evaluation Methodology

Task Design Principles

Every task is designed to be solvable only through tool use:

  • Tool-dependent: cannot be answered correctly from parametric knowledge alone

  • Time-invariant: stable ground-truth answers over time

  • Unambiguous: a single correct outcome, verifiable by claims

A typical task might ask:

“Look up Microsoft's IPO price in 1986, find the IPO prices of Apple, Amazon, and Google, then calculate which company had the lowest IPO price and by what percentage it was lower than the highest.”

This requires data retrieval via financial tools and search engines followed by arithmetic computation – all common in real applications but challenging to orchestrate correctly.
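
Once the IPO prices have been retrieved, the final synthesis step reduces to a short computation; a sketch with placeholder values (illustrative only, not the benchmark's ground truth):

```python
# Placeholder IPO prices; in the task these come from financial/search tools.
ipo_prices = {"Microsoft": 21.0, "Apple": 22.0, "Amazon": 18.0, "Google": 85.0}

lowest = min(ipo_prices, key=ipo_prices.get)
highest = max(ipo_prices, key=ipo_prices.get)

# Percentage by which the lowest IPO price is below the highest.
pct_lower = (ipo_prices[highest] - ipo_prices[lowest]) / ipo_prices[highest] * 100
print(f"{lowest} had the lowest IPO price, {pct_lower:.1f}% lower than {highest}'s.")
```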

See the Example Task & Evaluation section at the bottom of this page for an actual task from the dataset, with evaluations of two sample models.

Scoring Framework

MCP-Atlas uses an evaluation approach that prioritizes end-task success while capturing diagnostic information about tool use patterns.

Pass Rate

The metric that determines leaderboard ranking is the pass rate - the percentage of all tasks where the model produces a sufficiently correct final answer.

We evaluate the model's response against a ground-truth answer that is split into a list of claims for easier verification via an LLM judge. To allow partial credit, each claim can be given a score of 1, 0.5 or 0.

  • 1.0: Fully correct

  • 0.5: Partially correct

  • 0: Incorrect or missing

We define coverage as the average claim score for the task. For example, if a ground-truth answer contains 4 claims and the model's response receives scores of [1, 0.5, 0.5, 0], its coverage would be:

(1 + 0.5 + 0.5 + 0) / 4 = 50%

A task passes if the model achieves a coverage score of 75% or higher.

Since 50% < 75%, this task would be marked as failed.
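
A minimal sketch of this aggregation (the helper names are ours, not the released harness's):

```python
def coverage(claim_scores):
    """Mean of the per-claim scores (each 0, 0.5, or 1) for a single task."""
    return sum(claim_scores) / len(claim_scores)

def passes(claim_scores, threshold=0.75):
    """A task passes if its coverage reaches the 75% threshold."""
    return coverage(claim_scores) >= threshold

scores = [1, 0.5, 0.5, 0]
print(coverage(scores))  # 0.5
print(passes(scores))    # False -> the task is marked as failed
```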

We evaluated each model in its default configuration without adding any behavior-shaping system prompt. Each leaderboard model was provided with the user prompt and the associated set of tools as input.

Judging Protocol

We score each task by decomposing the ground-truth answer into constituent claims and evaluating the model’s final response against those claims.

Judge model. Gemini-2.5-Pro, temperature 0.0.

One judgment per claim. For a task with N claims, the judge runs N times.

What the judge sees.

  • The original task prompt

  • The task’s claim (one at a time)

  • The final answer from the model being evaluated

  • The claim’s scoring rubric (0 / 0.5 / 1 guidance)

What the judge returns.

  • A categorization in {incorrect, partially_correct, correct} for each claim, which is mapped to a numerical score in {0, 0.5, 1}

  • A brief, free-text justification for the score

Aggregation to task score.

  • Coverage = mean of per-claim scores for the task

  • Pass = 1 if Coverage ≥ 0.75, else 0

  • Pass Rate (leaderboard metric) = mean of Pass across all tasks
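
Putting the protocol together, a hedged sketch of the judging loop (the `judge` callable stands in for the Gemini-2.5-Pro judge at temperature 0.0; the field names are hypothetical, not the released implementation):

```python
CATEGORY_TO_SCORE = {"incorrect": 0.0, "partially_correct": 0.5, "correct": 1.0}

def judge_task(task_prompt, claims, model_answer, judge):
    """Run one judgment per claim and aggregate to a task-level result."""
    claim_scores = []
    for claim in claims:
        # The judge sees the task prompt, one claim at a time, the model's
        # final answer, and the claim's 0 / 0.5 / 1 rubric.
        category, _justification = judge(
            task_prompt=task_prompt,
            claim=claim["text"],
            rubric=claim["rubric"],
            answer=model_answer,
        )
        claim_scores.append(CATEGORY_TO_SCORE[category])

    cov = sum(claim_scores) / len(claim_scores)
    return {"coverage": cov, "pass": cov >= 0.75}

def pass_rate(task_results):
    """Leaderboard metric: mean of the per-task pass indicators."""
    return sum(r["pass"] for r in task_results) / len(task_results)
```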


Key Findings

Performance Overview

Current results reveal that frontier models struggle with real-world tool use:

  • Top performer: Claude Opus 4.5 with 62.3% pass rate and 78.5% mean coverage.

  • Interquartile range (IQR):

    • Pass Rate: 8.4% – 43.98%

    • Coverage: 28.93% – 65.47%

Interquartile range (IQR) is the spread of the middle 50% of results – the range from the 25th to the 75th percentile. Here, it means that half of the evaluated models have pass rates between roughly 8% and 44%.

A Note on Coverage

Coverage measures how close a model’s answers get to the ground-truth across all tasks, including failures. As defined before, we compute per-task coverage via claims-level scoring (0 / 0.5 / 1) and then average across tasks. Coverage is useful for analysis, for example in distinguishing models that narrowly miss passes from those that are far off.

As shown in the table below, coverage and pass rate are closely related: higher coverage generally corresponds to a higher pass rate.

| Model | Pass Rate | Average Coverage |
| --- | --- | --- |
| claude-opus-4-5-20251101 | 62.30% | 78.5% |
| gpt-5.2-2025-12-11 | 60.57% | 80.35% |
| gemini-3-pro-preview | 54.10% | 73.2% |
| gpt-5.1-2025-11-13 | 44.54% | 64.65% |
| gpt-5-2025-08-07 | 44.50% | 61.75% |
| claude-4-5-sonnet | 43.80% | 62.17% |
| O3 | 43.60% | 66.91% |
| Claude Opus 4.1 | 40.90% | 64.99% |
| Claude Sonnet 4 | 35.60% | 57.35% |
| GLM 4.5 Air* | 34.00% | 60.59% |
| Kimi K2 Instruct* | 23.90% | 50.41% |
| Qwen3-235B-A22B* | 12.00% | 29.06% |
| Gemini 2.5 Pro | 8.80% | 30.77% |
| GPT-4o | 7.20% | 28.53% |
| Gemini 2.5 Flash | 3.40% | 17.83% |
| Llama 4 Maverick* | 0.80% | 13.03% |

*Evaluations for these models were run using Fireworks AI for inference

Failure Analysis

We categorized every failed task to understand where models struggle most:

| Failure Category | Share of Failed Tasks (typical range across models) | Primary Issues |
| --- | --- | --- |
| Tool Usage | ~47.5%–68.5% | Wrong tool selection, incorrect parameters, schema violations, sequencing mistakes |
| Task Understanding | ~22.5%–36.0% | Premature stopping, missed subgoals, incomplete interpretation |
| Response Quality | ~5.5%–11.0% | Incorrect synthesis despite correct tool usage, hallucinated conclusions |
| Logical Errors | ~0.5%–5.5% | Incorrect conditional logic or flawed reasoning about tool outputs |

*Share of Failed Tasks: each range shows, across the models tested, the fraction of a model's failed tasks that fall into that category.

Critical Failure Patterns

The table above groups errors at a high level. Below, we surface the most common sub-patterns that roll up into those categories, using averages across all models.

| Category → Pattern | Avg. share of failures | What it looks like (at a glance) |
| --- | --- | --- |
| Tool Usage → No tools called | 36.0% | No tool-call trajectory; gives up without realizing tools are available to solve the task. |
| Task Understanding → Partial task completion | 25.8% | Starts correctly but stops early; answers only part of a multi-step/multi-part prompt (not syntactic/tool errors). |
| Tool Usage → Incorrect tool parameters | 14.2% | Right tool, wrong/missing args; required vs. optional confusion; type/format mismatches (dates, units, locales). |
| Response Quality → Incorrect conclusion | 8.5% | Intermediate steps are correct, but final synthesis/claim is wrong (misinterpreted results). |


Example Task & Evaluation

Here's a real task from MCP-Atlas that demonstrates how models are evaluated.

Task Prompt

I'm researching papers on advertisement effectiveness and comparing it to our own online database advertising data. There's a 2024 paper by jane castleman that deals with ad control effectiveness, can you get me the abstract? I believe it mentions ad locality, for which I will also need to ask you for the date of our campaign with the biggest engagement rate, started during the 2015-2023 period, and its locality.

Ground Truth Claims

Claim 1: There's a 2024 paper by Jane Castleman with the title "Why am I Still Seeing This: Measuring the Effectiveness Of Ad Controls and Explanations in AI-Mediated Ad Targeting Systems".

Claim 2: The abstract of the paper with title "Why am I Still Seeing This: Measuring the Effectiveness Of Ad Controls and Explanations in AI-Mediated Ad Targeting Systems" is: "Recently, Meta has shifted towards AI-mediated ad targeting mechanisms that do not require advertisers to provide detailed targeting criteria, likely driven by excitement over AI capabilities as well …" [truncated for brevity for this webpage]

Claim 3: There's a tie between three advertising campaigns with an engagement rate of 15%.

Claim 4: The starting dates of the three winning advertising campaigns are: 2022-06-24, 2019-09-20 and 2017-09-09.

Claim 5: The localities of the three winning advertisement campaigns are: "National", "International" and "International".


Model Response – GLM 4.5 (Failed)

Coverage Score: 0.4

Claim-level Scores: [1, 1, 0, 0, 0] = 2/5 = 0.4

Reason: The model retrieves the paper title and abstract but fails to retrieve the requested information about the user's advertising campaign history.

Model Response:

I found the 2024 paper by Jane Castleman that you mentioned. Here's the abstract:

**"Why am I Still Seeing This: Measuring the Effectiveness Of Ad Controls and Explanations in AI-Mediated Ad Targeting Systems"** by Jane Castleman and Aleksandra Korolova (2024):

"Recently, Meta has shifted towards AI-mediated ad targeting mechanisms that do not require advertisers to provide detailed targeting criteria, likely driven by excitement over AI capabilities as well as new data privacy policies and targeting changes agreed upon in civil rights settlements. At the same time, Meta has touted their ad preference controls as an effective mechanism for users to control the ads they see. Furthermore, Meta markets their targeting explanations as a transparency tool that allows users to understand why they saw certain ads and inform actions to control future ads.

Our study evaluates the effectiveness of Meta's "See less" ad control and the actionability of ad targeting explanations following the shift to AI-mediated targeting. We conduct a large-scale study, randomly assigning participants to mark "See less" to Body Weight Control or Parenting topics, and collecting the ads and targeting explanations Meta shows to participants before and after the intervention. We find that utilizing the "See less" ad control for the topics we study does not significantly reduce the number of ads shown by Meta on these topics, and that the control is less effective for some users whose demographics are correlated with the topic. Furthermore, we find that the majority of ad targeting explanations for local ads made no reference to location-specific targeting criteria, and did not inform users why ads related to the topics they marked to "See less" of continued to be delivered. We hypothesize that the poor effectiveness of controls and lack of actionability in explanations are the result of the shift to AI-mediated targeting, for which explainability and transparency tools have not yet been developed. Our work thus provides evidence for the need of new methods for transparency and user control, suitable and reflective of increasingly complex AI-mediated ad delivery systems."

The paper does indeed discuss ad locality, specifically noting that "the majority of ad targeting explanations for local ads made no reference to location-specific targeting criteria."

However, regarding your internal campaign data from 2015-2023, I searched the available knowledge graph but couldn't find any information about advertising campaigns, engagement rates, or locality data. The knowledge graph appears to contain only construction company information and related personnel data. You may need to access your advertising database directly or provide additional context about where this campaign data might be stored.

Model Response – GPT-5 (Passed)

Coverage Score: 1.0

Claim-level Scores: [1, 1, 1, 1, 1] = 5/5 = 1

Reason: The model retrieves the paper title, the abstract, and all requested information about the user's past advertising campaigns. It recognizes that three campaigns fulfill the criteria and provides details for each instead of arbitrarily picking just one.

Model Response:

- Paper abstract (Castleman, 2024):

“Recently, Meta has shifted towards AI-mediated ad targeting mechanisms that do not require advertisers to provide detailed targeting criteria, likely driven by excitement over AI capabilities as well as new data privacy policies and targeting changes agreed upon in civil rights settlements. At the same time, Meta has touted their ad preference controls as an effective mechanism for users to control the ads they see. Furthermore, Meta markets their targeting explanations as a transparency tool that allows users to understand why they saw certain ads and inform actions to control future ads.

Our study evaluates the effectiveness of Meta's "See less" ad control and the actionability of ad targeting explanations following the shift to AI-mediated targeting. We conduct a large-scale study, randomly assigning participants to mark "See less" to Body Weight Control or Parenting topics, and collecting the ads and targeting explanations Meta shows to participants before and after the intervention. We find that utilizing the "See less" ad control for the topics we study does not significantly reduce the number of ads shown by Meta on these topics, and that the control is less effective for some users whose demographics are correlated with the topic. Furthermore, we find that the majority of ad targeting explanations for local ads made no reference to location-specific targeting criteria, and did not inform users why ads related to the topics they marked to "See less" of continued to be delivered. We hypothesize that the poor effectiveness of controls and lack of actionability in explanations are the result of the shift to AI-mediated targeting, for which explainability and transparency tools have not yet been developed. Our work thus provides evidence for the need of new methods for transparency and user control, suitable and reflective of increasingly complex AI-mediated ad delivery systems.”

- Your top-engagement campaign (started 2015–2023):

• Max EngagementRate: 15.00

• Campaign: CAMP9412

• Start date: 2017-09-09

• Locality (TargetLocation): International

Details and sources:

- Abstract pulled from arXiv (ID: 2408.11910; “Why am I Still Seeing This: Measuring the Effectiveness Of Ad Controls and Explanations in AI-Mediated Ad Targeting Systems,” Jane Castleman; Aleksandra Korolova; published 2024-08-21).

- Campaign data queried from your Notion “Advertising” database, sorted by EngagementRate and checked for StartDate in 2015–2023.

- Note: multiple campaigns tie at 15.00; CAMP9412 (Start 2017-09-09, International) is the earliest with the max engagement rate in that window. Others include CAMP1895 (2019-09-20, International) and CAMP9257 (2022-06-24, National).


Last updated: December 18, 2025

Performance Comparison

| Rank (UB) | Model | Pass Rate (± CI) |
| --- | --- | --- |
| 1 | claude-opus-4-5-20251101 | 62.30 ± 1.76 |
| 1 | gpt-5.2-2025-12-11 (NEW) | 60.57 ± 1.62 |
| 3 | gemini-3-flash-preview (NEW) | 57.40 ± 1.48 |
| 4 | gemini-3-pro-preview (NEW) | 54.10 ± 1.59 |
| 5 | gpt-5.1-2025-11-13 | 44.54 ± 1.49 |
| 5 | gpt-5-2025-08-07 | 44.50 ± 1.69 |
| 5 | claude-4-5-sonnet | 43.80 ± 1.57 |
| 5 | o3-pro-2025-06-10-high | 43.60 ± 1.56 |
| 7 | claude-opus-4-1-20250805 | 40.90 ± 1.49 |
| 10 | claude-4-Sonnet-20250514 | 35.60 ± 1.57 |
| 10 | claude-haiku-4-5-20251001 | 34.60 ± 1.47 |
| 10 | glm-4p5-air* | 34.00 ± 1.32 |
| 13 | kimi-k2-instruct* | 23.90 ± 1.14 |
| 14 | qwen3-235B-A22B* | 12.00 ± 0.94 |
| 15 | gemini-2.5-Pro-Preview-06-05 | 8.80 ± 0.62 |
| 16 | gpt-4o-2024-11-20 | 7.20 ± 0.51 |
| 17 | gemini-2.5-flash-preview-05-20 | 3.40 ± 0.33 |
| 18 | Llama-4-Maverick* | 0.80 ± 0.08 |

Rank (UB): 1 + the number of models whose lower CI bound exceeds this model’s upper CI bound.

*Evaluations for these models were run using Fireworks AI for inference
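
For reference, a minimal sketch of how the Rank (UB) definition above can be computed from the reported means and confidence-interval half-widths (the function is ours, not the leaderboard code):

```python
def rank_ub(scores):
    """scores: list of (model, mean, ci_half_width) tuples.

    Rank (UB) = 1 + the number of models whose lower CI bound exceeds
    this model's upper CI bound.
    """
    ranks = {}
    for model, mean, ci in scores:
        upper = mean + ci
        strictly_better = sum(1 for _, m, c in scores if m - c > upper)
        ranks[model] = 1 + strictly_better
    return ranks

leaderboard = [
    ("claude-opus-4-5-20251101", 62.30, 1.76),
    ("gpt-5.2-2025-12-11", 60.57, 1.62),
    ("gemini-3-flash-preview", 57.40, 1.48),
]
print(rank_ub(leaderboard))  # -> ranks 1, 1, 3, matching the table above
```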