
The promise of agentic AI is end-to-end automation across our entire digital infrastructure. Today’s AI agents, however, aren’t yet capable of completing the long, complex tasks that humans do on a daily basis. To help close that gap, Scale AI is launching a series of agentic leaderboards, with new benchmarks that evaluate agent performance in complex, real-world environments. We’re starting with SWE-Bench Pro and MCP Atlas.
In this post, we’ll explore agent evaluation, the challenge of replicating digital environments, and what’s to come for Scale.
An agent’s quality is measured along two dimensions: its command of foundational skills, and its ability to execute complete, end-to-end tasks within digital environments.
Foundational skills are the universal building blocks of agent intelligence, with current R&D focused on three core areas: tool use, coding, and GUI interaction. Existing benchmarks typically isolate these skills, and while such isolated-skill evaluations are essential, they alone don’t capture real-world work. To truly measure an agent’s capability, evaluation must advance on two fronts: making foundational-skill tests far more realistic and rigorous, and assessing an agent’s ability to integrate those skills to complete complex, end-to-end tasks.
For end-to-end task completion, an agent must intelligently combine its capabilities, such as using tools, editing code, and navigating UIs, to achieve a single, complex objective. Our new benchmarks, starting with SWE-Bench Pro and MCP Atlas, tackle the challenge of foundational skills, while future benchmarks will focus on end-to-end task completion.
The fundamental constraint in evaluating agents is the digital environment, and this applies to both foundational and end-to-end tasks. For foundational tasks, the challenge is building realistic, high‑complexity environments in focused domains—for example, a complex codebase or a limited set of tools—that reflect real work. Many existing benchmarks are too simple or artificial. Our new benchmarks make these domains more realistic and challenging.
For true end-to-end tasks, the challenge expands to environments that span multiple systems in a single workflow. Preparing a business report, for example, may require pulling data from a CRM, checking a chat app for context, and compiling the results in an email. Evaluating such workflows requires realistic simulations of a professional’s software stack. Over time, Scale’s benchmarks on the agentic leaderboards will extend into these multi‑system environments, mirroring an enterprise technology stack and providing more comprehensive coverage of cross‑functional, end‑to‑end tasks.
The direction for agents is clear: longer tasks, more complex environments, and greater economic value. As agentic capabilities continue to expand, rigorous evaluation at every step is essential for turning their immense promise into reality. We’re excited to launch our new benchmarks and agentic leaderboards in the coming days. Keep an eye on our SEAL Leaderboards for more details.