December 22, 2025
MoReBench: Evaluating the Process of AI Moral Reasoning
MoReBench is a large-scale benchmark for evaluating AI moral reasoning beyond final outcomes. Instead of scoring answers alone, it assesses the intermediate reasoning traces models produce when navigating 1,000 morally ambiguous, real-world scenarios. Our findings show that moral reasoning is a distinct and underdeveloped capability, largely uncorrelated with performance on traditional math and coding benchmarks.
Read more
December 19, 2025
Open-Sourcing MCP-Atlas: A Benchmark for Real Tool Use
We’re open-sourcing MCP-Atlas, our benchmark for measuring how reliably AI agents use real tools, along with its dataset, evaluation environment, and updated results. MCP-Atlas evaluates realistic, multi-step workflows that run against real Model Context Protocol servers, exposing where agents succeed—and where they still fail—when tool discovery, parameterization, and execution must work together.
Read more
December 18, 2025
Real Speech Breaks AI (And What We're Doing to Fix It)
Audio MultiChallenge is a new benchmark designed to stress-test native Speech-to-Speech models on what actually makes voice hard: mid-sentence corrections, audio-only cues, instruction drift, and long-horizon self-consistency. By evaluating real human conversations rather than synthetic text-to-speech, we uncover where current audio systems still fail and what it will take to build voice agents that truly listen.
Read more
November 19, 2025
The Limits of Data Filtering in Bio-Foundation Models
In collaboration with Princeton University, UMD, SecureBio, and the Center for AI Safety, we introduce BioRiskEval, the first comprehensive framework for assessing dual-use risks in open-weight bio-foundation models. Our stress tests on the Evo 2 model reveal a critical vulnerability: dangerous knowledge removed via data filtering often persists in hidden layers or can be rapidly restored with minimal compute. These findings challenge the reliance on simple data curation and underscore the urgent need for "defense-in-depth" strategies to secure the future of biological AI.
Read more
November 14, 2025
Breaking Out of the Lab: Testing AI in Professional Domains
AI excels on academic tests, but it fails at real professional jobs. That's the stark finding from PRBench, our new benchmark series designed to move AI testing out of the lab and into the real world. We're launching the series with two of the most complex domains: Law and Finance. Using 1,100 high-stakes tasks sourced from 182 professionals, we tested how today's frontier models handle the nuanced reasoning that defines these fields. While models are great at following instructions, they fall short on the expert judgment, auditable reasoning, and deep diligence required for tasks with real economic consequences.
Read more
October 29, 2025
The Remote Labor Index: Measuring the Automation of Work
Can AI actually automate complex, professional jobs? The new Remote Labor Index (RLI) from Scale and the Center for AI Safety (CAIS) provides the first data-driven answer. By testing AI agents against 240 real-world, paid freelance projects, the RLI found that the best-performing agents successfully automated only 2.5% of them. This new benchmark reveals a critical gap between AI's generative skill and the end-to-end reliability required for professional work, showing that the immediate impact is augmentation, not mass automation.
Read more
October 16, 2025
VisualToolBench: Testing the Limits of AI Vision
Our new benchmark, VisualToolBench, reveals a striking limitation in today's most advanced AI: models are much better at "thinking about images" than "thinking with them." While AI can describe what's in a picture, it fails when asked to manipulate an image by cropping, editing, or enhancing it to solve complex, real-world problems. The results are stark, with no model scoring above 19% correctness. Dive into our breakdown of why even the best models fail, what this reveals about the core bottleneck of visual perception, and how these findings create a new roadmap for the future of AI.
Read more
September 19, 2025
SWE-Bench Pro: Raising the Bar for Agentic Coding
Benchmarks play a critical role in measuring the progress of AI coding agents, but most fall short by relying on contaminated training data, oversimplified bug fixes, or narrow task coverage. SWE-Bench Pro solves these problems with contamination-resistant repositories, diverse and industrially relevant codebases, and human-in-the-loop curation that preserves real-world difficulty. With reproducible, end-to-end evaluation, SWE-Bench Pro sets a new gold standard for testing advanced AI coding agents.
Read more
September 19, 2025
Advancing Agents: Introducing Scale’s Agentic Leaderboards
While today's agents show promise, the benchmarks used to evaluate them often test simple, isolated skills that don't reflect real-world work. To close this gap, Scale is launching a new suite of evaluations designed to measure an agent's ability to perform complex, end-to-end tasks. Our first two leaderboards set a new, more difficult standard for the industry. SWE-Bench Pro challenges agents with professional software engineering tasks in complex, proprietary codebases they've never seen before. MCP-Atlas measures an agent's ability to skillfully orchestrate over 300 real-world digital tools to solve a single problem. Read the full post to learn about our framework for building a more reliable yardstick for the future of AI.
Read more
September 19, 2025
Actions, Not Words: MCP-Atlas Raises the Bar for Agentic Evaluation
MCP-Atlas is a real-world leaderboard for agentic tool use via the Model Context Protocol. It runs 1,000 single-turn tasks across 40+ servers and 300+ tools—search, databases, filesystems, APIs, and dev tools—each requiring 3–6 calls with distractors. We score exact-answer pass rate and provide diagnostics. Early results: even the top model completes fewer than half of the tasks, with failures concentrated in tool selection, parameter construction, and orchestration. Built for model and product teams, MCP-Atlas pinpoints what to fix.
Read more
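The headline MCP-Atlas metric is an exact-answer pass rate over single-turn tool-use tasks. As a rough illustration only (this is not the MCP-Atlas harness, and every name below is hypothetical), a minimal sketch of that metric in Python could look like this:

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    """Outcome of one tool-use task (hypothetical structure)."""
    task_id: str
    expected_answer: str  # gold final answer for the task
    model_answer: str     # final answer the agent produced
    tool_calls_made: int  # tool calls the agent issued along the way

def normalize(answer: str) -> str:
    # Light normalization before exact-match comparison; the real
    # benchmark may normalize differently (assumption).
    return " ".join(answer.strip().lower().split())

def exact_answer_pass_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks whose final answer exactly matches the gold answer."""
    if not results:
        return 0.0
    passed = sum(
        normalize(r.model_answer) == normalize(r.expected_answer)
        for r in results
    )
    return passed / len(results)

# Made-up usage: one of two tasks matches, so the pass rate is 50%.
results = [
    TaskResult("t1", "42 rows", "42 rows", tool_calls_made=4),
    TaskResult("t2", "Paris", "Lyon", tool_calls_made=3),
]
print(f"exact-answer pass rate: {exact_answer_pass_rate(results):.2%}")
```

Because a score like this checks only the final answer, the diagnostics mentioned above (tool selection, parameter construction, orchestration) would presumably be derived separately from the per-task tool-call traces.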