
MCP Atlas

Key Metrics at a Glance

44.5% Top Pass Rate | 1,000 Tasks | 40+ MCP Servers | 300+ Tools | 3-6 Tool Calls/Task


Introduction

MCP-Atlas evaluates how well language models handle real-world tool use through the Model Context Protocol (MCP). Unlike benchmarks that test simple function calling or simulated APIs, MCP-Atlas measures performance on multi-step workflows that require discovering the right tools, calling them correctly, and synthesizing results into accurate answers.

The benchmark addresses a critical gap: while LLMs excel at conversation and reasoning, they struggle with practical tool use. The best models today pass fewer than half of our tasks, revealing massive headroom for improvement in this essential capability for AI agents.

Each task in MCP-Atlas requires several tool calls across real MCP servers running in production-like environments. Models must navigate authentic APIs with actual data, handle errors and retries, and coordinate tools across different servers – exactly what they'll face in real deployments.


Dataset Overview

Scale and Composition

The MCP-Atlas dataset consists of 1,000 human-authored tasks, each requiring multiple tool calls to solve, drawing on 40+ MCP servers and 300+ tools. This scale enables statistically meaningful comparisons, while human authoring ensures a level of quality and realism that synthetic generation cannot match.

MCP Server Distribution

The evaluation includes over 40 MCP servers spanning diverse domains:

| Category | Representative servers (examples) | Description |
| --- | --- | --- |
| ENV-BASIC | brave_search, ddg_search, exa, weather, google-maps, fetch | Search, fetch data, maps, and weather. |
| ENV-ANALYTICS | mongodb, airtable, calculator | Query and aggregate structured data. |
| ENV-PRODUCTIVITY | filesystem, notion, slack, google-workspace, arxiv, pubmed | Work with files, notes, email, and docs. |
| ENV-FINANCIAL | twelvedata, alchemy, yfmcp | Manipulate financial data. |
| ENV-CODING | git, github, mcp-code-executor, cli-mcp-server, e2b-server, context7 | Manage repos and run code. |

Task Complexity Distribution

Tasks are designed to require multiple tool calls and cross-server orchestration:

  • 65% of tasks require multiple servers: Testing cross-system orchestration capabilities

  • 40% include conditional logic: Requiring dynamic strategy adjustment based on intermediate results

  • 3-6 tool calls per task: Reflecting realistic workflow complexity

  • 12-18 tools exposed per task: Including 3-7 required tools and 5-10 plausible distractors

Tasks range from single-domain queries requiring 2-3 tools with straightforward chaining to complex workflows requiring 5+ tools with conditional branches and error handling.

Distractor Strategy

Each task includes carefully selected distractor tools that are plausible but incorrect choices. These distractors are chosen by data annotators from the same category as the required tools:

  • Semantic neighbors: Brave Search vs DuckDuckGo Search vs Exa Search

  • Similar functionality: MongoDB vs Airtable for data operations

  • Related but distinct: Git vs GitHub for repository operations

The harness exposes 12-18 tools per task (3-7 required tools plus 5-10 distractors), forcing agents to reason over tool descriptions rather than brute-force invocation.
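
To make the exposure mechanics concrete, here is a minimal sketch of how a per-task tool set could be assembled. It is illustrative only: the function, the random sampling, and the example tool identifiers are assumptions, whereas in MCP-Atlas the distractors are hand-picked by annotators from the same categories as the required tools.

```python
import random

def build_tool_set(required_tools, category_pool, n_distractors_range=(5, 10)):
    """Assemble the tool list exposed to the model for one task:
    the required tools plus plausible distractors drawn from the same
    categories. (In MCP-Atlas distractors are hand-picked by annotators;
    random sampling here is purely illustrative.)"""
    n_distractors = random.randint(*n_distractors_range)
    candidates = [t for t in category_pool if t not in required_tools]
    distractors = random.sample(candidates, k=min(n_distractors, len(candidates)))
    exposed = required_tools + distractors
    random.shuffle(exposed)  # avoid positional hints about which tools matter
    return exposed           # 12-18 tools per task in the benchmark (fewer in this toy example)

# Hypothetical tool identifiers for a financial-analysis task
required = ["twelvedata.get_quote", "calculator.evaluate", "brave_search.search"]
category_pool = required + [
    "yfmcp.get_quote", "alchemy.get_price", "ddg_search.search",
    "exa.search", "airtable.query", "mongodb.aggregate",
]
print(build_tool_set(required, category_pool))
```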


MCP Environment

Tasks run against real MCP servers hosted in Docker containers, not mocked endpoints or simulations. This ensures models face authentic challenges including actual API latency, real error messages, and genuine data formats.

Infrastructure:

  • 40+ MCP servers run in a standalone Docker container

  • Controlled tool exposure per task (12-18 tools including targets and distractors)

  • Complete execution trace logging for diagnostic analysis (a minimal logging sketch follows this list)
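
The trace-logging sketch below shows one simple way such per-call records could be captured. The wrapper, its arguments, and the JSONL schema are assumptions for illustration; the actual harness implementation is not public.

```python
import json
import time

def log_tool_call(trace_path, server, tool, arguments, call_fn):
    """Invoke one tool via `call_fn(server, tool, arguments)` and append a
    JSONL trace record with the result, any error, and the latency.
    Illustrative only: the real harness's trace schema is not public."""
    start = time.time()
    result, error = None, None
    try:
        result = call_fn(server, tool, arguments)
    except Exception as exc:  # record failures for diagnostics, then re-raise
        error = repr(exc)
        raise
    finally:
        record = {
            "server": server,
            "tool": tool,
            "arguments": arguments,
            "result": result,
            "error": error,
            "latency_s": round(time.time() - start, 3),
        }
        with open(trace_path, "a") as f:
            f.write(json.dumps(record, default=str) + "\n")
    return result
```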

Data Sources:

The benchmark includes both stateful and stateless servers to test different interaction patterns:

  • Real-World Datasets: Servers like Notion, Airtable, MongoDB, and filesystem are populated with authentic datasets that Scale owns, including project documentation, CRM records, and inventory data.

  • Synthetic Datasets: Servers like Slack and Google Calendar are seeded with synthetic data, such as conversation histories and meeting schedules that simulate typical organizational usage patterns.

Coming Soon: We'll be open-sourcing a complete benchmark dataset, an MCP environment with an evaluation harness, and a research paper. These are separate from the private dataset used for this leaderboard, which we keep held out to prevent contamination.

Evaluation Methodology

Task Design Principles

Every task in MCP-Atlas is designed to be solvable only through tool use. Models cannot succeed by relying on training data or general knowledge; they must actually discover appropriate tools, call them with correct parameters, and ground their answers in the results.

Tasks follow strict authoring guidelines:

  • Tool-dependent: The answer cannot be determined without calling specific tools, while ensuring the prompt doesn’t telegraph which tools to use

  • Time-invariant: Tasks remain valid over time with stable ground-truth answers

  • Unambiguous: Each task has exactly one correct answer that can be objectively verified

A typical task might ask:

“Look up Microsoft's IPO price in 1986, find the IPO prices of Apple, Amazon, and Google, then calculate which company had the lowest IPO price and by what percentage it was lower than the highest.”

This requires data retrieval via financial tools and search engines followed by arithmetic computation – all common in real applications but challenging to orchestrate correctly.
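
The retrieval half of that workflow needs live tool calls, but the final arithmetic step is easy to pin down. The sketch below uses illustrative placeholder prices (assumed values, not benchmark ground truth) to show the computation an agent must perform after its tool calls return.

```python
# Placeholder IPO prices, standing in for values the agent would retrieve
# via its financial-data and search tools (not ground truth from the benchmark).
ipo_prices = {"Microsoft": 21.0, "Apple": 22.0, "Amazon": 18.0, "Google": 85.0}

lowest_company = min(ipo_prices, key=ipo_prices.get)
highest_company = max(ipo_prices, key=ipo_prices.get)
lowest, highest = ipo_prices[lowest_company], ipo_prices[highest_company]

# Percentage by which the lowest IPO price is below the highest one
pct_lower = (highest - lowest) / highest * 100
print(f"{lowest_company} had the lowest IPO price (${lowest:.2f}), "
      f"{pct_lower:.1f}% lower than {highest_company}'s ${highest:.2f}.")
```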

Scoring Framework

MCP-Atlas uses an evaluation approach that prioritizes end-task success while capturing diagnostic information about tool use patterns.

Pass Rate

The metric that determines leaderboard ranking is the pass rate - the percentage of all tasks where the model produces a sufficiently correct final answer.

We evaluate the model's response against a ground-truth answer that is split into a list of claims so that an LLM judge can verify each one individually. To allow partial credit, each claim is given a score of 1, 0.5, or 0:

  • 1.0: Fully correct

  • 0.5: Partially correct

  • 0.0: Incorrect or missing

We define coverage as the average claim score for the task. For example, if a ground-truth answer contains 4 claims and the model's response receives scores of [1, 0.5, 0.5, 0], its coverage would be:

(1 + 0.5 + 0.5 + 0) / 4 = 50%

A task passes if the model achieves a coverage score of 75% or higher.

Since 50% < 75%, this task would be marked as failed.
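
In code, this rule is just an average and a threshold; the short sketch below reproduces the worked example above.

```python
def coverage(claim_scores):
    """Average of per-claim scores (each 0, 0.5, or 1.0)."""
    return sum(claim_scores) / len(claim_scores)

def passes(claim_scores, threshold=0.75):
    """A task passes when coverage reaches the 75% threshold."""
    return coverage(claim_scores) >= threshold

# Worked example from above: four claims scored [1, 0.5, 0.5, 0]
scores = [1, 0.5, 0.5, 0]
print(coverage(scores))  # 0.5   -> 50% coverage
print(passes(scores))    # False -> below the 75% bar, task fails
```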

We evaluated each model in its default configuration without adding any behavior-shaping system prompt. Each leaderboard model was provided with the user prompt and the associated set of tools as input.

Judging Protocol

We score each task by decomposing the ground-truth answer into constituent claims and evaluating the model’s final response against those claims.

Judge model. Gemini-2.5-Pro, temperature 0.0.

One judgment per claim. For a task with N claims, the judge runs N times.

What the judge sees.

  • The original task prompt

  • The task’s claim (one at a time)

  • The final answer from the model being evaluated

  • The claim’s scoring rubric (0 / 0.5 / 1 guidance)

What the judge returns.

  • A categorization in {incorrect, partially_correct, correct} for the claim, which is mapped to a numerical score in {0, 0.5, 1}

  • A brief, free-text justification for the score

Aggregation to task score.

  • Coverage = mean of per-claim scores for the task

  • Pass = 1 if Coverage ≥ 0.75, else 0

  • Pass Rate (leaderboard metric) = mean of Pass across all tasks
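
Putting the protocol together, the sketch below shows the aggregation from per-claim judge categories to the leaderboard metric. The `judge_claim` callable and the task/claim field names are assumptions; the actual judge prompt and harness code are not public.

```python
CATEGORY_TO_SCORE = {"incorrect": 0.0, "partially_correct": 0.5, "correct": 1.0}

def score_task(task, judge_claim):
    """Aggregate per-claim judgments into (coverage, pass) for one task.
    `judge_claim` wraps one call to the judge model (Gemini-2.5-Pro,
    temperature 0.0) and returns (category, justification)."""
    scores = []
    for claim in task["claims"]:  # one judgment per claim
        category, _justification = judge_claim(
            task["prompt"], claim["text"], task["final_answer"], claim["rubric"]
        )
        scores.append(CATEGORY_TO_SCORE[category])
    coverage = sum(scores) / len(scores)
    return coverage, coverage >= 0.75

def pass_rate(tasks, judge_claim):
    """Leaderboard metric: fraction of tasks whose coverage clears the 75% bar."""
    results = [score_task(task, judge_claim)[1] for task in tasks]
    return sum(results) / len(results)
```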

Key Findings

Performance Overview

Current results reveal that frontier models struggle with real-world tool use:

  • Top performer: GPT-5, with a 44.5% pass rate and 61.75% average coverage

  • Interquartile range (IQR):

    • Pass Rate: 8.0% – 38.3%

    • Coverage: 28.8% – 61.2%

Interquartile range (IQR) is the spread of the middle 50% of results – the range from the 25th to the 75th percentile. Here, it means that half of the models have a pass rate between 8.0% and 38.3%.
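
As a sanity check, the quoted IQR values can be recomputed from the per-model results listed in the table below, assuming a standard linear-interpolation percentile (NumPy's default):

```python
import numpy as np

# Pass rates and average coverage for the 11 evaluated models (see table below).
pass_rates = [44.50, 43.60, 40.90, 35.60, 34.00, 23.90, 12.00, 8.80, 7.20, 3.40, 0.80]
coverages  = [61.75, 66.91, 64.99, 57.35, 60.59, 50.41, 29.06, 30.77, 28.53, 17.83, 13.03]

for name, values in [("Pass Rate", pass_rates), ("Coverage", coverages)]:
    q1, q3 = np.percentile(values, [25, 75])  # default linear interpolation
    print(f"{name} IQR: {q1:.1f}% - {q3:.1f}%")

# Prints roughly: Pass Rate IQR: 8.0% - 38.3%
#                 Coverage IQR: 28.8% - 61.2%
```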

A Note on Coverage

Coverage measures how close a model's answers get to the ground truth across all tasks, including failures. As defined above, we compute per-task coverage via claim-level scoring (0 / 0.5 / 1) and then average across tasks. Coverage is useful for analysis, for example in distinguishing models that narrowly miss passes from those that are far off.

As shown in the table below, coverage and pass rate are tightly related: higher coverage generally means a higher pass rate.

| Model | Pass Rate | Average Coverage |
| --- | --- | --- |
| GPT-5 | 44.50% | 61.75% |
| O3 | 43.60% | 66.91% |
| Claude Opus 4.1 | 40.90% | 64.99% |
| Claude Sonnet 4 | 35.60% | 57.35% |
| GLM 4.5 | 34.00% | 60.59% |
| Kimi K2 | 23.90% | 50.41% |
| Qwen3 | 12.00% | 29.06% |
| Gemini 2.5 Pro | 8.80% | 30.77% |
| GPT-4o | 7.20% | 28.53% |
| Gemini 2.5 Flash | 3.40% | 17.83% |
| Maverick | 0.80% | 13.03% |

Failure Analysis

We categorized every failed task to understand where models struggle most:

| Failure Category | % of All Task Failures* | Primary Issues |
| --- | --- | --- |
| Tool Usage | 34-52% | Wrong tool selection, incorrect parameters, schema violations, improper sequencing |
| Task Understanding | 23-38% | Premature stopping, missing subgoals, incomplete requirement interpretation |
| Response Quality | 3-8% | Inaccurate synthesis despite correct tool use, hallucinated information |
| Logical Errors | 1-2% | Flawed reasoning about tool outputs, incorrect conditional logic |

*Percentage of All Task Failures: the range reflects the share of failures that fall into each category across all models tested.

Critical Failure Patterns

The table above groups errors at a high level. Below, we surface the most common sub-patterns that roll up into those categories, using averages across all models.

| Category → Pattern | Avg. share of failures | What it looks like (at a glance) |
| --- | --- | --- |
| Tool Usage → No tools called | 36.31% | No tool call trajectory; gives up without realizing tools are available to solve the task. |
| Task Understanding → Partial task completion | 29.75% | Starts correctly but stops early; answers only part of a multi-step/multi-part prompt (not syntactic/tool errors). |
| Tool Usage → Incorrect tool parameters | 12.75% | Right tool, wrong/missing args; required vs optional confusion; type/format mismatches (dates, units, locales). |
| Response Quality → Incorrect conclusion | 12.26% | Intermediate steps are correct, but final synthesis/claim is wrong (misinterpreted results). |

Below is a sample failure categorization for GPT-5.

| Failure mode | % of GPT-5 failures |
| --- | --- |
| Partial task completion | 35.07% |
| No tools called | 20.91% |
| Incorrect conclusion | 16.77% |
| Incorrect tool parameters | 15.03% |
| Skipped steps | 7.69% |
| Missing tool calls | 2.37% |
| Misunderstood task | 1.38% |
| Insufficient exploration | 0.79% |
| Capability discovery missing | 0.00% |
| Environment constraint violation | 0.00% |
| Wrong tool selection | 0.00% |

Across 1,000 private tasks, the best models pass fewer than 50% of tasks. The dominant failure modes are tool-use mistakes and premature stopping, not raw knowledge gaps. Together, these findings point to bottlenecks in reliable tool discovery, parameterization, and multi-step completion.


    Example Task & Evaluation

    Here's a real task from MCP-Atlas that demonstrates how models are evaluated.

    Task Prompt

    I'm researching papers on advertisement effectiveness and comparing it to our own online database advertising data. There's a 2024 paper by jane castleman that deals with ad control effectiveness, can you get me the abstract? I believe it mentions ad locality, for which I will also need to ask you for the date of our campaign with the biggest engagement rate, started during the 2015-2023 period, and its locality.

    Ground Truth Claims

    | # | Claim |
    | --- | --- |
    | 1 | There's a 2024 paper by Jane Castleman with the title "Why am I Still Seeing This: Measuring the Effectiveness Of Ad Controls and Explanations in AI-Mediated Ad Targeting Systems". |
    | 2 | The abstract of the paper with title "Why am I Still Seeing This: Measuring the Effectiveness Of Ad Controls and Explanations in AI-Mediated Ad Targeting Systems" is: "Recently, Meta has shifted towards AI-mediated ad targeting mechanisms that do not require advertisers to provide detailed targeting criteria, likely driven by excitement over AI capabilities as well …" [truncated for brevity for this webpage] |
    | 3 | There's a tie between three advertising campaigns with an engagement rate of 15%. |
    | 4 | The starting dates of the three winning advertising campaigns are: 2022-06-24, 2019-09-20 and 2017-09-09. |
    | 5 | The localities of the three winning advertisement campaigns are: "National", "International" and "International". |


    Model Response – GLM 4.5 (Failed)


    Coverage Score: 0.4

    Claim-level Scores: [1, 1, 0, 0, 0] = 2/5 = 0.4

    Reason: The model retrieves the paper title and abstract but fails to retrieve the requested information about the user's advertising campaign history.

    Model Response:

    I found the 2024 paper by Jane Castleman that you mentioned. Here's the abstract:

    **"Why am I Still Seeing This: Measuring the Effectiveness Of Ad Controls and Explanations in AI-Mediated Ad Targeting Systems"** by Jane Castleman and Aleksandra Korolova (2024):

    "Recently, Meta has shifted towards AI-mediated ad targeting mechanisms that do not require advertisers to provide detailed targeting criteria, likely driven by excitement over AI capabilities as well as new data privacy policies and targeting changes agreed upon in civil rights settlements. At the same time, Meta has touted their ad preference controls as an effective mechanism for users to control the ads they see. Furthermore, Meta markets their targeting explanations as a transparency tool that allows users to understand why they saw certain ads and inform actions to control future ads.

    Our study evaluates the effectiveness of Meta's "See less" ad control and the actionability of ad targeting explanations following the shift to AI-mediated targeting. We conduct a large-scale study, randomly assigning participants to mark "See less" to Body Weight Control or Parenting topics, and collecting the ads and targeting explanations Meta shows to participants before and after the intervention. We find that utilizing the "See less" ad control for the topics we study does not significantly reduce the number of ads shown by Meta on these topics, and that the control is less effective for some users whose demographics are correlated with the topic. Furthermore, we find that the majority of ad targeting explanations for local ads made no reference to location-specific targeting criteria, and did not inform users why ads related to the topics they marked to "See less" of continued to be delivered. We hypothesize that the poor effectiveness of controls and lack of actionability in explanations are the result of the shift to AI-mediated targeting, for which explainability and transparency tools have not yet been developed. Our work thus provides evidence for the need of new methods for transparency and user control, suitable and reflective of increasingly complex AI-mediated ad delivery systems."

    The paper does indeed discuss ad locality, specifically noting that "the majority of ad targeting explanations for local ads made no reference to location-specific targeting criteria."

    However, regarding your internal campaign data from 2015-2023, I searched the available knowledge graph but couldn't find any information about advertising campaigns, engagement rates, or locality data. The knowledge graph appears to contain only construction company information and related personnel data. You may need to access your advertising database directly or provide additional context about where this campaign data might be stored.

    Model Response – GPT-5 (Passed)

    Coverage Score: 1.0

    Claim-level Scores: [1, 1, 1, 1, 1] = 5/5 = 1

    Reason: The model retrieves the paper title, the abstract, and all requested information about the user's past advertising campaigns. It recognizes that three campaigns satisfy the criteria and provides details for each, rather than assuming it should pick just one.

    Model Response:

    - Paper abstract (Castleman, 2024):

    “Recently, Meta has shifted towards AI-mediated ad targeting mechanisms that do not require advertisers to provide detailed targeting criteria, likely driven by excitement over AI capabilities as well as new data privacy policies and targeting changes agreed upon in civil rights settlements. At the same time, Meta has touted their ad preference controls as an effective mechanism for users to control the ads they see. Furthermore, Meta markets their targeting explanations as a transparency tool that allows users to understand why they saw certain ads and inform actions to control future ads.

    Our study evaluates the effectiveness of Meta's "See less" ad control and the actionability of ad targeting explanations following the shift to AI-mediated targeting. We conduct a large-scale study, randomly assigning participants to mark "See less" to Body Weight Control or Parenting topics, and collecting the ads and targeting explanations Meta shows to participants before and after the intervention. We find that utilizing the "See less" ad control for the topics we study does not significantly reduce the number of ads shown by Meta on these topics, and that the control is less effective for some users whose demographics are correlated with the topic. Furthermore, we find that the majority of ad targeting explanations for local ads made no reference to location-specific targeting criteria, and did not inform users why ads related to the topics they marked to "See less" of continued to be delivered. We hypothesize that the poor effectiveness of controls and lack of actionability in explanations are the result of the shift to AI-mediated targeting, for which explainability and transparency tools have not yet been developed. Our work thus provides evidence for the need of new methods for transparency and user control, suitable and reflective of increasingly complex AI-mediated ad delivery systems.”

    - Your top-engagement campaign (started 2015–2023):

    • Max EngagementRate: 15.00

    • Campaign: CAMP9412

    • Start date: 2017-09-09

    • Locality (TargetLocation): International

    Details and sources:

    - Abstract pulled from arXiv (ID: 2408.11910; “Why am I Still Seeing This: Measuring the Effectiveness Of Ad Controls and Explanations in AI-Mediated Ad Targeting Systems,” Jane Castleman; Aleksandra Korolova; published 2024-08-21).

    - Campaign data queried from your Notion “Advertising” database, sorted by EngagementRate and checked for StartDate in 2015–2023.

    - Note: multiple campaigns tie at 15.00; CAMP9412 (Start 2017-09-09, International) is the earliest with the max engagement rate in that window. Others include CAMP1895 (2019-09-20, International) and CAMP9257 (2022-06-24, National).


Last updated: September 19, 2025

Performance Comparison

| Rank (UB) | Model | Pass Rate (%) |
| --- | --- | --- |
| 1 | gpt-5-2025-08-07 | 44.50 ± 1.69 |
| 2 | o3-pro-2025-06-10-high | 43.60 ± 1.56 |
| 3 | claude-opus-4-1-20250805 | 40.90 ± 1.49 |
| 4 | claude-4-Sonnet-20250514 | 35.60 ± 1.57 |
| 4 | glm-4p5-air | 34.00 ± 1.32 |
| 6 | kimi-k2-instruct | 23.90 ± 1.14 |
| 7 | qwen3-235B-A22B | 12.00 ± 0.94 |
| 8 | gemini-2.5-Pro-Preview-06-05 | 8.80 ± 0.62 |
| 9 | gpt-4o-2024-11-20 | 7.20 ± 0.51 |
| 10 | gemini-2.5-flash-preview-05-20 | 3.40 ± 0.33 |
| 11 | Llama-4-Maverick | 0.80 ± 0.08 |

Rank (UB): 1 + the number of models whose lower CI bound exceeds this model’s upper CI bound.
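
For clarity, here is a minimal sketch of that ranking rule, assuming each model's interval is score ± the reported margin. The model names and values in the example are hypothetical; the leaderboard's exact confidence-interval computation is not specified here.

```python
def rank_ub(scores):
    """Rank (UB): 1 + the number of models whose lower confidence bound
    exceeds this model's upper confidence bound.
    `scores` maps model name -> (mean, margin)."""
    ranks = {}
    for name, (mean, margin) in scores.items():
        upper = mean + margin
        beaten_by = sum(
            1 for other, (m, c) in scores.items()
            if other != name and (m - c) > upper
        )
        ranks[name] = 1 + beaten_by
    return ranks

# Hypothetical scores (mean, margin): model_b and model_c tie at rank 2 because
# each is beaten only by model_a, whose lower bound exceeds both of their upper bounds.
scores = {"model_a": (44.5, 1.7), "model_b": (35.6, 1.6), "model_c": (34.0, 1.3)}
print(rank_ub(scores))  # {'model_a': 1, 'model_b': 2, 'model_c': 2}
```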