
MCP Atlas

Key Metrics at a Glance

44.5% Top Pass Rate | 1,000 Tasks | 40+ MCP Servers | 300+ Tools | 3-6 Tool Calls/Task


Introduction

MCP-Atlas evaluates how well language models handle real-world tool use through the Model Context Protocol (MCP). Unlike benchmarks that test simple function calling or simulated APIs, MCP-Atlas measures performance on multi-step workflows that require discovering the right tools, calling them correctly, and synthesizing results into accurate answers.

The benchmark addresses a critical gap: while LLMs excel at conversation and reasoning, they struggle with practical tool use. The best models today pass fewer than half of our tasks, revealing massive headroom for improvement in this essential capability for AI agents.

Each task in MCP-Atlas requires several tool calls across real MCP servers running in production-like environments. Models must navigate authentic APIs with actual data, handle errors and retries, and coordinate tools across different servers – exactly what they'll face in real deployments.


Dataset Overview

Scale and Composition

The MCP-Atlas dataset consists of 1,000 human-authored tasks, each requiring multiple tool calls to solve, drawing on 40+ MCP servers and 300+ tools. This scale enables statistically meaningful comparisons, while human authoring ensures a level of quality and realism that synthetic generation cannot match.

MCP Server Distribution

The evaluation includes over 40 MCP servers spanning diverse domains:

| Category | Representative servers (examples) | Description |
| --- | --- | --- |
| ENV-BASIC | brave_search, ddg_search, exa, weather, google-maps, fetch | Search, fetch data, maps, and weather. |
| ENV-ANALYTICS | mongodb, airtable, calculator | Query and aggregate structured data. |
| ENV-PRODUCTIVITY | filesystem, notion, slack, google-workspace, arxiv, pubmed | Work with files, notes, email, and docs. |
| ENV-FINANCIAL | twelvedata, alchemy, yfmcp | Manipulate financial data. |
| ENV-CODING | git, github, mcp-code-executor, cli-mcp-server, e2b-server, context7 | Manage repos and run code. |

Task Complexity Distribution

Tasks are designed to require multiple tool calls and cross-server orchestration:

  • 65% of tasks require multiple servers: Testing cross-system orchestration capabilities

  • 40% include conditional logic: Requiring dynamic strategy adjustment based on intermediate results

  • 3-6 tool calls per task: Reflecting realistic workflow complexity

  • 12-18 tools exposed per task: Including 3-7 required tools and 5-10 plausible distractors

Tasks range from single-domain queries requiring 2-3 tools with straightforward chaining to complex workflows requiring 5+ tools with conditional branches and error handling.

Distractor Strategy

Each task includes carefully selected distractor tools that are plausible but incorrect choices. These distractors are chosen by data annotators from the same category as the required tools:

  • Semantic neighbors: Brave Search vs DuckDuckGo Search vs Exa Search

  • Similar functionality: MongoDB vs Airtable for data operations

  • Related but distinct: Git vs GitHub for repository operations

The harness exposes 12-18 tools per task (3-7 required tools plus 5-10 distractors), forcing agents to reason over tool descriptions rather than brute-force invocation.
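
To make the exposure mechanics concrete, here is a minimal sketch of how a per-task tool set could be assembled. It is illustrative only: the function, the random sampling, and the example tool identifiers are assumptions, whereas in MCP-Atlas the distractors are hand-picked by annotators from the same categories as the required tools.

```python
import random

def build_tool_set(required_tools, category_pool, n_distractors_range=(5, 10)):
    """Assemble the tool list exposed to the model for one task:
    the required tools plus plausible distractors drawn from the same
    categories. (In MCP-Atlas distractors are hand-picked by annotators;
    random sampling here is purely illustrative.)"""
    n_distractors = random.randint(*n_distractors_range)
    candidates = [t for t in category_pool if t not in required_tools]
    distractors = random.sample(candidates, k=min(n_distractors, len(candidates)))
    exposed = required_tools + distractors
    random.shuffle(exposed)  # avoid positional hints about which tools matter
    return exposed           # 12-18 tools per task in the benchmark (fewer in this toy example)

# Hypothetical tool identifiers for a financial-analysis task
required = ["twelvedata.get_quote", "calculator.evaluate", "brave_search.search"]
category_pool = required + [
    "yfmcp.get_quote", "alchemy.get_price", "ddg_search.search",
    "exa.search", "airtable.query", "mongodb.aggregate",
]
print(build_tool_set(required, category_pool))
```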


MCP Environment

Tasks run against real MCP servers hosted in Docker containers, not mocked endpoints or simulations. This ensures models face authentic challenges including actual API latency, real error messages, and genuine data formats.

Infrastructure:

  • 40+ MCP servers run in a standalone Docker container

  • Controlled tool exposure per task (12-18 tools including targets and distractors)

  • Complete execution trace logging for diagnostic analysis (a minimal logging sketch follows this list)
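
The trace-logging sketch below shows one simple way such per-call records could be captured. The wrapper, its arguments, and the JSONL schema are assumptions for illustration; the actual harness implementation is not public.

```python
import json
import time

def log_tool_call(trace_path, server, tool, arguments, call_fn):
    """Invoke one tool via `call_fn(server, tool, arguments)` and append a
    JSONL trace record with the result, any error, and the latency.
    Illustrative only: the real harness's trace schema is not public."""
    start = time.time()
    result, error = None, None
    try:
        result = call_fn(server, tool, arguments)
    except Exception as exc:  # record failures for diagnostics, then re-raise
        error = repr(exc)
        raise
    finally:
        record = {
            "server": server,
            "tool": tool,
            "arguments": arguments,
            "result": result,
            "error": error,
            "latency_s": round(time.time() - start, 3),
        }
        with open(trace_path, "a") as f:
            f.write(json.dumps(record, default=str) + "\n")
    return result
```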

Data Sources:

The benchmark includes both stateful and stateless servers to test different interaction patterns:

  • Real-World Datasets: Servers like Notion, Airtable, MongoDB, and filesystem are populated with authentic datasets that Scale owns, including project documentation, CRM records, and inventory data.

  • Synthetic Datasets: Servers like Slack and Google Calendar are seeded with synthetic data, such as conversation histories and meeting schedules that simulate typical organizational usage patterns.

Coming Soon: We'll be open-sourcing a complete benchmark dataset, an MCP environment with an evaluation harness, and a research paper. These are separate from the private dataset used for this leaderboard, which we keep held out to prevent contamination.

Evaluation Methodology

Task Design Principles

Every task in MCP-Atlas is designed to be solvable only through tool use. Models cannot succeed by relying on training data or general knowledge; they must actually discover appropriate tools, call them with correct parameters, and ground their answers in the results.

Tasks follow strict authoring guidelines:

  • Tool-dependent: The answer cannot be determined without calling specific tools, while ensuring the prompt doesn’t telegraph which tools to use

  • Time-invariant: Tasks remain valid over time with stable ground-truth answers

  • Unambiguous: Each task has exactly one correct answer that can be objectively verified

A typical task might ask:

“Look up Microsoft's IPO price in 1986, find the IPO prices of Apple, Amazon, and Google, then calculate which company had the lowest IPO price and by what percentage it was lower than the highest.”

This requires data retrieval via financial tools and search engines followed by arithmetic computation – all common in real applications but challenging to orchestrate correctly.
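
The retrieval half of that workflow needs live tool calls, but the final arithmetic step is easy to pin down. The sketch below uses illustrative placeholder prices (assumed values, not benchmark ground truth) to show the computation an agent must perform after its tool calls return.

```python
# Placeholder IPO prices, standing in for values the agent would retrieve
# via its financial-data and search tools (not ground truth from the benchmark).
ipo_prices = {"Microsoft": 21.0, "Apple": 22.0, "Amazon": 18.0, "Google": 85.0}

lowest_company = min(ipo_prices, key=ipo_prices.get)
highest_company = max(ipo_prices, key=ipo_prices.get)
lowest, highest = ipo_prices[lowest_company], ipo_prices[highest_company]

# Percentage by which the lowest IPO price is below the highest one
pct_lower = (highest - lowest) / highest * 100
print(f"{lowest_company} had the lowest IPO price (${lowest:.2f}), "
      f"{pct_lower:.1f}% lower than {highest_company}'s ${highest:.2f}.")
```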

Scoring Framework

MCP-Atlas uses an evaluation approach that prioritizes end-task success while capturing diagnostic information about tool use patterns.

Pass Rate

The metric that determines leaderboard ranking is the pass rate - the percentage of all tasks where the model produces a sufficiently correct final answer.

We evaluate the model's response against a ground-truth answer that is split into a list of claims so that an LLM judge can verify each one individually. To allow partial credit, each claim is given a score of 1, 0.5, or 0:

  • 1.0: Fully correct

  • 0.5: Partially correct

  • 0.0: Incorrect or missing

We define coverage as the average claim score for the task. For example, if a ground-truth answer contains 4 claims and the model's response receives scores of [1, 0.5, 0.5, 0], its coverage would be:

(1 + 0.5 + 0.5 + 0) / 4 = 50%

A task passes if the model achieves a coverage score of 75% or higher.

Since 50% < 75%, this task would be marked as failed.
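
In code, this rule is just an average and a threshold; the short sketch below reproduces the worked example above.

```python
def coverage(claim_scores):
    """Average of per-claim scores (each 0, 0.5, or 1.0)."""
    return sum(claim_scores) / len(claim_scores)

def passes(claim_scores, threshold=0.75):
    """A task passes when coverage reaches the 75% threshold."""
    return coverage(claim_scores) >= threshold

# Worked example from above: four claims scored [1, 0.5, 0.5, 0]
scores = [1, 0.5, 0.5, 0]
print(coverage(scores))  # 0.5   -> 50% coverage
print(passes(scores))    # False -> below the 75% bar, task fails
```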

We evaluated each model in its default configuration without adding any behavior-shaping system prompt. Each leaderboard model was provided with the user prompt and the associated set of tools as input.

Judging Protocol

We score each task by decomposing the ground-truth answer into constituent claims and evaluating the model’s final response against those claims.

Judge model. Gemini-2.5-Pro, temperature 0.0.

One judgment per claim. For a task with N claims, the judge runs N times.

What the judge sees.

  • The original task prompt

  • The task’s claim (one at a time)

  • The final answer from the model being evaluated

  • The claim’s scoring rubric (0 / 0.5 / 1 guidance)

What the judge returns.

  • A categorization in {incorrect, partially_correct, correct} for the claim, which is mapped to a numerical score in {0, 0.5, 1}

  • A brief, free-text justification for the score

Aggregation to task score.

  • Coverage = mean of per-claim scores for the task

  • Pass = 1 if Coverage ≥ 0.75, else 0

  • Pass Rate (leaderboard metric) = mean of Pass across all tasks
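
Putting the protocol together, the sketch below shows the aggregation from per-claim judge categories to the leaderboard metric. The `judge_claim` callable and the task/claim field names are assumptions; the actual judge prompt and harness code are not public.

```python
CATEGORY_TO_SCORE = {"incorrect": 0.0, "partially_correct": 0.5, "correct": 1.0}

def score_task(task, judge_claim):
    """Aggregate per-claim judgments into (coverage, pass) for one task.
    `judge_claim` wraps one call to the judge model (Gemini-2.5-Pro,
    temperature 0.0) and returns (category, justification)."""
    scores = []
    for claim in task["claims"]:  # one judgment per claim
        category, _justification = judge_claim(
            task["prompt"], claim["text"], task["final_answer"], claim["rubric"]
        )
        scores.append(CATEGORY_TO_SCORE[category])
    coverage = sum(scores) / len(scores)
    return coverage, coverage >= 0.75

def pass_rate(tasks, judge_claim):
    """Leaderboard metric: fraction of tasks whose coverage clears the 75% bar."""
    results = [score_task(task, judge_claim)[1] for task in tasks]
    return sum(results) / len(results)
```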

Key Findings

Performance Overview

Current results reveal that frontier models struggle with real-world tool use:

  • Top performer: GPT-5, with a 44.5% pass rate and 61.75% average coverage

  • Interquartile range (IQR):

    • Pass Rate: 8.0% – 38.3%

    • Coverage: 28.8% – 61.2%

Interquartile range (IQR) is the spread of the middle 50% of results – the range from the 25th to the 75th percentile. Here, it means that half of the models have a pass rate between 8.0% and 38.3%.
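
As a sanity check, the quoted IQR values can be recomputed from the per-model results listed in the table below, assuming a standard linear-interpolation percentile (NumPy's default):

```python
import numpy as np

# Pass rates and average coverage for the 11 evaluated models (see table below).
pass_rates = [44.50, 43.60, 40.90, 35.60, 34.00, 23.90, 12.00, 8.80, 7.20, 3.40, 0.80]
coverages  = [61.75, 66.91, 64.99, 57.35, 60.59, 50.41, 29.06, 30.77, 28.53, 17.83, 13.03]

for name, values in [("Pass Rate", pass_rates), ("Coverage", coverages)]:
    q1, q3 = np.percentile(values, [25, 75])  # default linear interpolation
    print(f"{name} IQR: {q1:.1f}% - {q3:.1f}%")

# Prints roughly: Pass Rate IQR: 8.0% - 38.3%
#                 Coverage IQR: 28.8% - 61.2%
```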

A Note on Coverage

Coverage measures how close a model's answers get to the ground truth across all tasks, including failures. As defined above, we compute per-task coverage via claim-level scoring (0 / 0.5 / 1) and then average across tasks. Coverage is useful for analysis, for example in distinguishing models that narrowly miss passes from those that are far off.

As shown in the table below, coverage and pass rate are tightly related: higher coverage generally means a higher pass rate.

| Model | Pass Rate | Average Coverage |
| --- | --- | --- |
| GPT-5 | 44.50% | 61.75% |
| O3 | 43.60% | 66.91% |
| Claude Opus 4.1 | 40.90% | 64.99% |
| Claude Sonnet 4 | 35.60% | 57.35% |
| GLM 4.5 | 34.00% | 60.59% |
| Kimi K2 | 23.90% | 50.41% |
| Qwen3 | 12.00% | 29.06% |
| Gemini 2.5 Pro | 8.80% | 30.77% |
| GPT-4o | 7.20% | 28.53% |
| Gemini 2.5 Flash | 3.40% | 17.83% |
| Maverick | 0.80% | 13.03% |

Failure Analysis

We categorized every failed task to understand where models struggle most:

| Failure Category | % of All Task Failures* | Primary Issues |
| --- | --- | --- |
| Tool Usage | 34-52% | Wrong tool selection, incorrect parameters, schema violations, improper sequencing |
| Task Understanding | 23-38% | Premature stopping, missing subgoals, incomplete requirement interpretation |
| Response Quality | 3-8% | Inaccurate synthesis despite correct tool use, hallucinated information |
| Logical Errors | 1-2% | Flawed reasoning about tool outputs, incorrect conditional logic |

*Percentage of All Task Failures: the range reflects the share of failures that fall into each category across all models tested.

Critical Failure Patterns

The table above groups errors at a high level. Below, we surface the most common sub-patterns that roll up into those categories, using averages across all models.

| Category → Pattern | Avg. share of failures | What it looks like (at a glance) |
| --- | --- | --- |
| Tool Usage → No tools called | 36.31% | No tool call trajectory; gives up without realizing tools are available to solve the task. |
| Task Understanding → Partial task completion | 29.75% | Starts correctly but stops early; answers only part of a multi-step/multi-part prompt (not syntactic/tool errors). |
| Tool Usage → Incorrect tool parameters | 12.75% | Right tool, wrong/missing args; required vs optional confusion; type/format mismatches (dates, units, locales). |
| Response Quality → Incorrect conclusion | 12.26% | Intermediate steps are correct, but final synthesis/claim is wrong (misinterpreted results). |

Below is a sample failure categorization for GPT-5.

| Failure mode | % of GPT-5 failures |
| --- | --- |
| Partial task completion | 35.07% |
| No tools called | 20.91% |
| Incorrect conclusion | 16.77% |
| Incorrect tool parameters | 15.03% |
| Skipped steps | 7.69% |
| Missing tool calls | 2.37% |
| Misunderstood task | 1.38% |
| Insufficient exploration | 0.79% |
| Capability discovery missing | 0.00% |
| Environment constraint violation | 0.00% |
| Wrong tool selection | 0.00% |

Across 1,000 private tasks, the best models pass fewer than 50% of tasks. The dominant failure modes are tool-use mistakes and premature stopping, not raw knowledge gaps. Together, these findings point to bottlenecks in reliable tool discovery, parameterization, and multi-step completion.


    Example Task & Evaluation

    Here's a real task from MCP-Atlas that demonstrates how models are evaluated.

    Task Prompt

    I'm researching papers on advertisement effectiveness and comparing it to our own online database advertising data. There's a 2024 paper by jane castleman that deals with ad control effectiveness, can you get me the abstract? I believe it mentions ad locality, for which I will also need to ask you for the date of our campaign with the biggest engagement rate, started during the 2015-2023 period, and its locality.

    Ground Truth Claims

    | # | Claim |
    | --- | --- |
    | 1 | There's a 2024 paper by Jane Castleman with the title "Why am I Still Seeing This: Measuring the Effectiveness Of Ad Controls and Explanations in AI-Mediated Ad Targeting Systems". |
    | 2 | The abstract of the paper with title "Why am I Still Seeing This: Measuring the Effectiveness Of Ad Controls and Explanations in AI-Mediated Ad Targeting Systems" is: "Recently, Meta has shifted towards AI-mediated ad targeting mechanisms that do not require advertisers to provide detailed targeting criteria, likely driven by excitement over AI capabilities as well …" [truncated for brevity for this webpage] |
    | 3 | There's a tie between three advertising campaigns with an engagement rate of 15%. |
    | 4 | The starting dates of the three winning advertising campaigns are: 2022-06-24, 2019-09-20 and 2017-09-09. |
    | 5 | The localities of the three winning advertisement campaigns are: "National", "International" and "International". |


    Model Response – GLM 4.5 (Failed)


    Coverage Score: 0.4

    Claim-level Scores: [1, 1, 0, 0, 0] = 2/5 = 0.4

    Reason: The model retrieves the paper title and abstract but fails to retrieve the requested information about the user's advertising campaign history.

    Model Response:

    I found the 2024 paper by Jane Castleman that you mentioned. Here's the abstract:

    **"Why am I Still Seeing This: Measuring the Effectiveness Of Ad Controls and Explanations in AI-Mediated Ad Targeting Systems"** by Jane Castleman and Aleksandra Korolova (2024):

    "Recently, Meta has shifted towards AI-mediated ad targeting mechanisms that do not require advertisers to provide detailed targeting criteria, likely driven by excitement over AI capabilities as well as new data privacy policies and targeting changes agreed upon in civil rights settlements. At the same time, Meta has touted their ad preference controls as an effective mechanism for users to control the ads they see. Furthermore, Meta markets their targeting explanations as a transparency tool that allows users to understand why they saw certain ads and inform actions to control future ads.

    Our study evaluates the effectiveness of Meta's "See less" ad control and the actionability of ad targeting explanations following the shift to AI-mediated targeting. We conduct a large-scale study, randomly assigning participants to mark "See less" to Body Weight Control or Parenting topics, and collecting the ads and targeting explanations Meta shows to participants before and after the intervention. We find that utilizing the "See less" ad control for the topics we study does not significantly reduce the number of ads shown by Meta on these topics, and that the control is less effective for some users whose demographics are correlated with the topic. Furthermore, we find that the majority of ad targeting explanations for local ads made no reference to location-specific targeting criteria, and did not inform users why ads related to the topics they marked to "See less" of continued to be delivered. We hypothesize that the poor effectiveness of controls and lack of actionability in explanations are the result of the shift to AI-mediated targeting, for which explainability and transparency tools have not yet been developed. Our work thus provides evidence for the need of new methods for transparency and user control, suitable and reflective of increasingly complex AI-mediated ad delivery systems."

    The paper does indeed discuss ad locality, specifically noting that "the majority of ad targeting explanations for local ads made no reference to location-specific targeting criteria."

    However, regarding your internal campaign data from 2015-2023, I searched the available knowledge graph but couldn't find any information about advertising campaigns, engagement rates, or locality data. The knowledge graph appears to contain only construction company information and related personnel data. You may need to access your advertising database directly or provide additional context about where this campaign data might be stored.

    Model Response – GPT-5 (Passed)

    Coverage Score: 1.0

    Claim-level Scores: [1, 1, 1, 1, 1] = 5/5 = 1

    Reason: The model retrieves the paper title, the abstract, and all requested information about the user's past advertising campaigns. It recognizes that three campaigns satisfy the criteria and provides details for each, rather than assuming it should pick just one.

    Model Response:

    - Paper abstract (Castleman, 2024):

    “Recently, Meta has shifted towards AI-mediated ad targeting mechanisms that do not require advertisers to provide detailed targeting criteria, likely driven by excitement over AI capabilities as well as new data privacy policies and targeting changes agreed upon in civil rights settlements. At the same time, Meta has touted their ad preference controls as an effective mechanism for users to control the ads they see. Furthermore, Meta markets their targeting explanations as a transparency tool that allows users to understand why they saw certain ads and inform actions to control future ads.

    Our study evaluates the effectiveness of Meta's "See less" ad control and the actionability of ad targeting explanations following the shift to AI-mediated targeting. We conduct a large-scale study, randomly assigning participants to mark "See less" to Body Weight Control or Parenting topics, and collecting the ads and targeting explanations Meta shows to participants before and after the intervention. We find that utilizing the "See less" ad control for the topics we study does not significantly reduce the number of ads shown by Meta on these topics, and that the control is less effective for some users whose demographics are correlated with the topic. Furthermore, we find that the majority of ad targeting explanations for local ads made no reference to location-specific targeting criteria, and did not inform users why ads related to the topics they marked to "See less" of continued to be delivered. We hypothesize that the poor effectiveness of controls and lack of actionability in explanations are the result of the shift to AI-mediated targeting, for which explainability and transparency tools have not yet been developed. Our work thus provides evidence for the need of new methods for transparency and user control, suitable and reflective of increasingly complex AI-mediated ad delivery systems.”

    - Your top-engagement campaign (started 2015–2023):

    • Max EngagementRate: 15.00

    • Campaign: CAMP9412

    • Start date: 2017-09-09

    • Locality (TargetLocation): International

    Details and sources:

    - Abstract pulled from arXiv (ID: 2408.11910; “Why am I Still Seeing This: Measuring the Effectiveness Of Ad Controls and Explanations in AI-Mediated Ad Targeting Systems,” Jane Castleman; Aleksandra Korolova; published 2024-08-21).

    - Campaign data queried from your Notion “Advertising” database, sorted by EngagementRate and checked for StartDate in 2015–2023.

    - Note: multiple campaigns tie at 15.00; CAMP9412 (Start 2017-09-09, International) is the earliest with the max engagement rate in that window. Others include CAMP1895 (2019-09-20, International) and CAMP9257 (2022-06-24, National).


Last updated: September 19, 2025

Performance Comparison

| Rank (UB) | Model | Pass Rate (%) |
| --- | --- | --- |
| 1 | gpt-5-2025-08-07 | 44.50 ± 1.69 |
| 2 | o3-pro-2025-06-10-high | 43.60 ± 1.56 |
| 3 | claude-opus-4-1-20250805 | 40.90 ± 1.49 |
| 4 | claude-4-Sonnet-20250514 | 35.60 ± 1.57 |
| 4 | glm-4p5-air | 34.00 ± 1.32 |
| 6 | kimi-k2-instruct | 23.90 ± 1.14 |
| 7 | qwen3-235B-A22B | 12.00 ± 0.94 |
| 8 | gemini-2.5-Pro-Preview-06-05 | 8.80 ± 0.62 |
| 9 | gpt-4o-2024-11-20 | 7.20 ± 0.51 |
| 10 | gemini-2.5-flash-preview-05-20 | 3.40 ± 0.33 |
| 11 | Llama-4-Maverick | 0.80 ± 0.08 |

Rank (UB): 1 + the number of models whose lower CI bound exceeds this model’s upper CI bound.
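
For clarity, here is a minimal sketch of that ranking rule, assuming each model's interval is score ± the reported margin. The model names and values in the example are hypothetical; the leaderboard's exact confidence-interval computation is not specified here.

```python
def rank_ub(scores):
    """Rank (UB): 1 + the number of models whose lower confidence bound
    exceeds this model's upper confidence bound.
    `scores` maps model name -> (mean, margin)."""
    ranks = {}
    for name, (mean, margin) in scores.items():
        upper = mean + margin
        beaten_by = sum(
            1 for other, (m, c) in scores.items()
            if other != name and (m - c) > upper
        )
        ranks[name] = 1 + beaten_by
    return ranks

# Hypothetical scores (mean, margin): model_b and model_c tie at rank 2 because
# each is beaten only by model_a, whose lower bound exceeds both of their upper bounds.
scores = {"model_a": (44.5, 1.7), "model_b": (35.6, 1.6), "model_c": (34.0, 1.3)}
print(rank_ub(scores))  # {'model_a': 1, 'model_b': 2, 'model_c': 2}
```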