
Advancing Agents: Introducing Scale’s Agentic Leaderboards

September 19, 2025

The promise of agentic AI is end-to-end automation across our entire digital infrastructure. Today’s AI agents, however, aren’t yet capable of completing the long, complex tasks that humans do on a daily basis. To help close that gap, Scale AI is launching a series of agentic leaderboards, with new benchmarks that evaluate agent performance in complex, real-world environments. We’re starting with SWE-Bench Pro and MCP Atlas.

  • SWE-Bench Pro measures an agent's ability to perform the work of a professional software engineer: solving challenging, real-world bug fixes and feature requests whose code changes span multiple files and average over 105 lines. The benchmark draws on complex, niche, and proprietary codebases to prevent data contamination and increase realism.

  • MCP Atlas evaluates an agent's practical ability to orchestrate multiple tools to solve real-world problems using MCP servers (a standard interface that delivers data or tools to AI). It challenges agents with complex, realistic tasks that require skillfully combining tools from over 40 real servers and over 300 tools, including search engines, databases, and coding environments; a minimal sketch of what a single MCP tool call looks like follows this list.
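
To make the tool-orchestration setting concrete, here is a minimal sketch of a single MCP tool call, assuming the official MCP Python SDK (the `mcp` package) as we understand its API; the server command and the `web_search` tool below are hypothetical placeholders, not tools from the benchmark.

```python
# Minimal sketch: connect to one (hypothetical) MCP server over stdio,
# list its tools, and call one of them. Assumes the official MCP Python SDK.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def run_single_tool_call() -> None:
    # Hypothetical server command; a real MCP Atlas task involves many servers.
    server = StdioServerParameters(command="example-search-server", args=[])

    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Discover the tools this server exposes...
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

            # ...then invoke one with structured arguments.
            result = await session.call_tool(
                "web_search", arguments={"query": "MCP Atlas benchmark"}
            )
            print(result.content)


if __name__ == "__main__":
    asyncio.run(run_single_tool_call())
```

An MCP Atlas task stresses this step at scale: the agent must decide which of hundreds of tools, spread across dozens of servers, to call, in what order, and with what arguments.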

In this post, we’ll explore agent evaluation, the challenge of replicating digital environments, and what’s to come for Scale.

Agent Evaluation: Foundational Skills and End-to-End Task Completion

The quality of an agent is determined by examining its ability to demonstrate foundational skills and execute complete, end-to-end tasks within digital environments. 

Foundational skills are the universal building blocks of agent intelligence, with current R&D focused on three core areas: tool use, coding, and GUI interaction. Existing benchmarks typically isolate these skills. While essential, isolated-skill evaluations alone don't capture real-world work. To truly measure an agent's capability, evaluation must advance on two fronts: making foundational-skill tests far more realistic and rigorous, and assessing an agent's ability to integrate those skills to complete complex, end-to-end tasks.

For end-to-end task completion, an agent must intelligently combine its capabilities, such as using tools, editing code, and navigating UIs, to achieve a single, complex objective. Our new benchmarks, starting with SWE-Bench Pro and MCP Atlas, tackle the challenge of foundational skills, while future benchmarks will focus on end-to-end task completion.

The Challenge of Environments

The fundamental constraint in evaluating agents is the digital environment, and this applies to both foundational and end-to-end tasks. For foundational tasks, the challenge is building realistic, high‑complexity environments in focused domains—for example, a complex codebase or a limited set of tools—that reflect real work. Many existing benchmarks are too simple or artificial. Our new benchmarks make these domains more realistic and challenging.

For true end-to-end tasks, the challenge expands to environments that span multiple systems in a single workflow. Preparing a business report, for example, may require pulling data from a CRM, checking a chat app for context, and compiling the results in an email. Evaluating such workflows requires realistic simulations of a professional’s software stack. Over time, Scale’s benchmarks on the agentic leaderboards will extend into these multi‑system environments, mirroring an enterprise technology stack and providing more comprehensive coverage of cross‑functional, end‑to‑end tasks. 
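
As a purely illustrative sketch of what such a simulated stack could expose to an agent, the toy environment below wires a mock CRM, chat app, and email outbox into one workflow; every system, tool name, and record here is invented for the example rather than drawn from Scale's benchmarks.

```python
# Toy, in-memory simulation of a cross-system workflow (CRM -> chat -> email).
# All names and data are invented for illustration.
from dataclasses import dataclass, field


@dataclass
class SimulatedWorkspace:
    crm_deals: list = field(default_factory=lambda: [
        {"account": "Acme Co", "stage": "negotiation", "value": 120_000},
        {"account": "Globex", "stage": "closed-won", "value": 80_000},
    ])
    chat_messages: list = field(default_factory=lambda: [
        "PM: Acme wants the revised quote before Friday.",
    ])
    outbox: list = field(default_factory=list)

    # Each method stands in for a tool the agent would call.
    def query_crm(self, stage: str) -> list:
        return [deal for deal in self.crm_deals if deal["stage"] == stage]

    def read_chat(self) -> list:
        return self.chat_messages

    def send_email(self, to: str, subject: str, body: str) -> None:
        self.outbox.append({"to": to, "subject": subject, "body": body})


# A scripted stand-in for an agent completing the report workflow end to end.
workspace = SimulatedWorkspace()
open_deals = workspace.query_crm("negotiation")
context = workspace.read_chat()
workspace.send_email(
    to="vp-sales@example.com",
    subject="Weekly pipeline report",
    body=f"Open deals: {open_deals}\nContext: {context}",
)
assert workspace.outbox, "the end-to-end task should leave an email artifact"
```

One natural way to grade an agent in an environment like this is to check the artifacts it leaves behind, such as the contents of the outbox, rather than any single intermediate tool call.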

Coming Soon

The direction for agents is clear: longer tasks, more complex environments, and greater economic value. As agentic capabilities continue to expand, evaluating them rigorously at every step is essential for turning their immense promise into a viable reality. We’re excited to launch our new benchmarks and agentic leaderboards in the coming days. Keep an eye on our SEAL Leaderboards for more details.

 

