April 16, 2026

Can AI turn a novice into a biosecurity expert? Our study with SecureBio found that non-experts using LLMs outperformed trained biologists, and current safeguards barely slowed them down.
Read more
March 4, 2026

As LLMs and coding agents take on professional coding work, evaluations must assess them the way we assess junior engineers: by how they investigate a system, gather evidence, and explain what they observe. SWE Atlas, the first evaluation suite of its kind, does just that.
Read more
December 22, 2025

MoReBench is a large-scale benchmark for evaluating AI moral reasoning beyond final outcomes. Instead of scoring answers alone, it assesses the intermediate reasoning traces models produce when navigating 1,000 morally ambiguous, real-world scenarios. Our findings show that moral reasoning is a distinct and underdeveloped capability, largely uncorrelated with performance on traditional math and coding benchmarks.
Read more
December 19, 2025

We’re open-sourcing MCP-Atlas, including the dataset, evaluation environment, and updated results for a benchmark designed to measure how reliably AI agents use real tools. MCP-Atlas evaluates realistic, multi-step workflows that run against real Model Context Protocol servers, exposing where agents succeed—and where they still fail—when tool discovery, parameterization, and execution must work together.
Read more
December 17, 2025

Audio MultiChallenge is a new benchmark designed to stress-test native Speech-to-Speech models on what actually makes voice hard: mid-sentence corrections, audio-only cues, instruction drift, and long-horizon self-consistency. By evaluating real human conversations rather than synthetic text-to-speech, we uncover where current audio systems still fail, and what it will take to build voice agents that truly listen.
Read more
November 19, 2025

In collaboration with Princeton University, UMD, SecureBio, and the Center for AI Safety, we introduce BioRiskEval, the first comprehensive framework for assessing dual-use risks in open-weight bio-foundation models. Our stress tests on the Evo 2 model reveal a critical vulnerability: dangerous knowledge removed via data filtering often persists in hidden layers or can be rapidly restored with minimal compute. These findings challenge the reliance on simple data curation and underscore the urgent need for "defense-in-depth" strategies to secure the future of biological AI.
Read more
November 14, 2025

AI excels on academic tests, but it fails at real professional jobs. That's the stark finding from PRBench, our new benchmark series designed to move AI testing out of the lab and into the real world. We're launching the series with two of the most complex domains: Law and Finance. Using 1,100 high-stakes tasks sourced from 182 professionals, we tested how today's frontier models handle the nuanced reasoning that defines these fields. While models are good at following instructions, they fall short on the expert judgment, auditable reasoning, and deep diligence required for tasks with real economic consequences.
Read more
October 29, 2025

Can AI actually automate complex, professional jobs? The new Remote Labor Index (RLI) from Scale and the Center for AI Safety (CAIS) provides the first data-driven answer. By testing AI agents against 240 real-world, paid freelance projects, the RLI found that the best-performing agents successfully automated only 2.5% of them. This new benchmark reveals a critical gap between AI's generative skill and the end-to-end reliability required for professional work, showing that the immediate impact is augmentation, not mass automation.
Read more
October 16, 2025

Our new benchmark, VisualToolBench, reveals a striking limitation in today's most advanced AI: models are much better at "thinking about images" than "thinking with them." While AI can describe what's in a picture, it fails when asked to manipulate an image by cropping, editing, or enhancing it to solve complex, real-world problems. The results are stark, with no model scoring above 19% correctness. Dive into our breakdown of why even the best models fail, what this reveals about the core bottleneck of visual perception, and how these findings create a new roadmap for the future of AI.
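
For a concrete sense of what "thinking with images" demands, here is a minimal sketch in Python (using Pillow, with a hypothetical file name and tool wrapper, not VisualToolBench's actual interface) of the kind of crop operation a model must choose to invoke before it can answer, rather than simply describing the full image:

```python
from PIL import Image  # Pillow; pip install Pillow

def crop_tool(image_path: str, box: tuple[int, int, int, int]) -> Image.Image:
    """Hypothetical 'crop' tool an agent could call; box is (left, upper, right, lower) in pixels."""
    img = Image.open(image_path)
    return img.crop(box)

# The agent has to decide which region matters, e.g. zooming in on a label
# in the top-right corner of a photo before it can answer the question.
patch = crop_tool("device_photo.jpg", (800, 0, 1200, 300))  # hypothetical input file
patch.save("device_serial_patch.png")
```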
Read more
September 19, 2025

Benchmarks play a critical role in measuring the progress of AI coding agents, but most fall short by relying on contaminated training data, oversimplified bug fixes, or narrow task coverage. SWE-Bench Pro solves these problems with contamination-resistant repositories, diverse and industrially relevant codebases, and human-in-the-loop curation that preserves real-world difficulty. With reproducible, end-to-end evaluation, SWE-Bench Pro sets a new gold standard for testing advanced AI developers.
Read more
September 19, 2025

While today’s agents show promise, the benchmarks used to evaluate them often test simple, isolated skills that don’t reflect real-world work. To close this gap, Scale is launching a new suite of evaluations designed to measure an agent’s ability to perform complex, end-to-end tasks. Our first two leaderboards set a new, more difficult standard for the industry. SWE-Bench Pro challenges agents with professional software engineering tasks in complex, proprietary codebases they’ve never seen before. MCP-Atlas measures an agent’s ability to skillfully orchestrate over 300 real-world digital tools to solve a single problem. Read the full post to learn about our framework for building a more reliable yardstick for the future of AI.
Read more
September 19, 2025

MCP-Atlas is a real-world leaderboard for agentic tool use via the Model Context Protocol. It runs 1,000 single-turn tasks across 40+ servers and 300+ tools (search, databases, filesystems, APIs, and dev tools), each requiring 3–6 calls with distractors. We score exact-answer pass rate and provide diagnostics. Early results: even the top model completes fewer than half of the tasks, with failures concentrated in tool selection, parameter construction, and orchestration. Built for model and product teams, MCP-Atlas pinpoints what to fix.
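
As a rough illustration of the headline metric, here is a minimal sketch, with invented record fields and toy data, of how an exact-answer pass rate and a per-category failure breakdown could be computed from task results (the actual MCP-Atlas harness and schema may differ):

```python
from collections import Counter

# Hypothetical per-task records; real MCP-Atlas transcripts will look different.
results = [
    {"task_id": "t1", "predicted": "42 rows", "expected": "42 rows", "failure": None},
    {"task_id": "t2", "predicted": "oops",    "expected": "$1,204",  "failure": "tool_selection"},
    {"task_id": "t3", "predicted": "",        "expected": "ok",      "failure": "parameter_construction"},
]

def exact_answer_pass_rate(records):
    """Fraction of tasks whose final answer matches the reference exactly (after trimming whitespace)."""
    passed = sum(r["predicted"].strip() == r["expected"].strip() for r in records)
    return passed / len(records)

def failure_breakdown(records):
    """Count failed tasks by diagnostic category (tool selection, parameter construction, orchestration, ...)."""
    return Counter(r["failure"] for r in records if r["failure"] is not None)

print(f"pass rate: {exact_answer_pass_rate(results):.1%}")
print(failure_breakdown(results))
```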
Read more
September 12, 2025

Can an AI be a great tutor? TutorBench is a new, challenging benchmark from Scale designed to find out. Moving beyond right or wrong answers, it grades today's leading AI models on their ability to actually teach: evaluating crucial skills like adaptive explanation, constructive feedback, and active learning support. Using 1,500 multimodal conversations across STEM subjects, many including images of handwritten work, TutorBench reveals that even the most advanced models still have a long way to go to master the nuanced art of tutoring, paving the way for the next generation of AI in education.
Read more
September 2, 2025

How do you know if an AI model is actually learning, or just getting better at faking it? A new paper from researchers at Scale introduces Rubrics as Rewards (RaR), a framework that solves this problem by training models with structured, expert-designed checklists instead of simple preference scores. This approach moves the human role from a simple preference labeler to an expert architect of the AI's values, resulting in up to a 28% performance leap on challenging benchmarks and providing a more transparent, effective path toward reliable AI.
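
For intuition, here is a minimal sketch, with an invented rubric format and a stubbed judge, of how a structured checklist could be collapsed into a scalar reward in place of a preference score (the paper's actual rubric schema, judging model, and aggregation may differ):

```python
# Hypothetical rubric: each criterion carries a weight and is judged pass/fail on the model's response.
rubric = [
    {"criterion": "States the final dosage with correct units", "weight": 3.0},
    {"criterion": "Cites the relevant guideline before recommending", "weight": 2.0},
    {"criterion": "Flags contraindications the prompt mentions", "weight": 2.0},
]

def judge(criterion: str, response: str) -> bool:
    """Stub for an LLM or human judge deciding whether the response satisfies one criterion."""
    return criterion.split()[0].lower() in response.lower()  # placeholder heuristic only

def rubric_reward(response: str) -> float:
    """Weighted fraction of satisfied criteria, used as the reward signal instead of a raw preference score."""
    total = sum(item["weight"] for item in rubric)
    earned = sum(item["weight"] for item in rubric if judge(item["criterion"], response))
    return earned / total

print(rubric_reward("States 500 mg twice daily; cites WHO guideline; flags renal impairment."))
```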
Read more
July 23, 2025

Building truly intelligent and equitable multilingual AI requires a new way to measure cultural reasoning. Scale's new Multilingual Native Reasoning Challenge (MultiNRC) is designed to do just that. Created from scratch by native speakers, this benchmark tests for deep linguistic and cultural understanding beyond simple translation, providing a clear path for the AI community to accelerate progress.
Read more
July 23, 2025

As AI agents become more powerful, ensuring their safety is the most critical challenge for deployment. This post explores WebGuard, a new benchmark from researchers at Scale, UC Berkeley, and The Ohio State University that reveals a significant safety gap in current models. Learn how high-quality, human-in-the-loop data provides a path forward, dramatically improving a model's ability to avoid risky behavior.
Read more
June 9, 2025

At Scale, operations, engineering, and research teams work together to ensure the quality of our data. To do this, we rely on a combination of human review, automated linters, data distribution analyses, and model training experiments. In this post, we will focus on the last category and introduce Precog, our platform for running data quality experiments by training models on our own datasets.
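
As a simplified stand-in for what such an experiment looks like (not the Precog platform itself), the sketch below trains the same model on two labeling variants of a synthetic dataset and compares held-out accuracy, which is the basic shape of a data quality experiment:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic stand-in for a labeled dataset; a real experiment would use actual annotation batches.
X = rng.normal(size=(2000, 20))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

def eval_quality(y_train_variant, label=""):
    """Train the same model on one labeling variant and report held-out accuracy."""
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train_variant)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{label}: held-out accuracy = {acc:.3f}")
    return acc

# Variant A: labels as delivered; Variant B: simulate 20% label noise (a quality regression).
noisy = y_train.copy()
flip = rng.random(len(noisy)) < 0.2
noisy[flip] = 1 - noisy[flip]

eval_quality(y_train, "clean labels")
eval_quality(noisy, "20% label noise")
```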
Read more
June 5, 2025

As advanced AI rapidly evolves, red teaming needs an updated approach. Scale researchers propose a shift toward testing AI systems, not just models, in real-world contexts, with a focus on product safety and realistic threats.
Read more