April 20, 2026

Frontier agents ace tasks with complete specs, then see success rates crash to 4% when key details are missing. They never ask for help. HiL-Bench is the first benchmark that tests whether they know when to.
Read more
April 16, 2026

Can AI turn a novice into a biosecurity expert? Our study with SecureBio found that non-experts using LLMs outperformed trained biologists, and current safeguards barely slowed them down.
Read more
March 20, 2026

Voice Showdown is a preference-based benchmark for voice AI models, built on real human speech to measure how models perform across languages and real-world conversations.
Read more
March 4, 2026

As LLMs and coding agents take on professional coding work, evaluations must assess them like junior engineers: by how they investigate a system, gather evidence, and explain what they’re observing. SWE Atlas, the first evaluation suite of its kind, does just that.
Read more
December 22, 2025

MoReBench is a large-scale benchmark for evaluating AI moral reasoning beyond final outcomes. Instead of scoring answers alone, it assesses the intermediate reasoning traces models produce when navigating 1,000 morally ambiguous, real-world scenarios. Our findings show that moral reasoning is a distinct and underdeveloped capability, largely uncorrelated with performance on traditional math and coding benchmarks.
Read more
December 17, 2025

Audio MultiChallenge is a new benchmark designed to stress-test native Speech-to-Speech models on what actually makes voice hard: mid-sentence corrections, audio-only cues, instruction drift, and long-horizon self-consistency. By evaluating real human conversations rather than synthetic text-to-speech, we uncover where current audio systems still fail, and what it will take to build voice agents that truly listen.
Read more
November 25, 2025

To measure the propensity of agents to make unsafe choices, Scale, the University of Maryland, and other collaborators developed PropensityBench. The benchmark simulates real-world pressure by forcing agents to choose between a safe approach that consistently fails and a functional but harmful shortcut, revealing their true inclinations. The results show that agent safety degrades significantly under pressure.
Read more
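For a concrete sense of the setup described above, here is a minimal sketch of the kind of forced-choice episode PropensityBench constructs. All names and interfaces are illustrative assumptions, not the benchmark's actual API:

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    """One forced-choice episode: a safe tool that always fails vs. a harmful shortcut."""
    task: str
    safe_tool: str      # always errors out, no matter how often it is retried
    harmful_tool: str   # completes the task but violates the stated safety policy
    pressure: int       # 0 = none; higher values add deadlines, threats, incentives

def took_shortcut(agent, scenario: Scenario) -> bool:
    """Return True if the agent resorted to the harmful tool (hypothetical agent interface)."""
    prompt = (
        f"Task: {scenario.task}\n"
        f"Tools: {scenario.safe_tool}, {scenario.harmful_tool}\n"
        f"Pressure level: {scenario.pressure}"
    )
    return agent.choose_tool(prompt) == scenario.harmful_tool

# Propensity is then the fraction of episodes, across escalating pressure
# levels, in which the agent picks the harmful shortcut.
```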
November 24, 2025

The next generation of AI agents is shifting from passive workers that receive user commands and generate outputs to active agents that plan, act, observe, and improve on their own. Agents now choose how to complete a task, which tools to use, and whom (or which agent) to collaborate with. LLMs didn’t invent agency, but they democratized it by turning frontier-level reasoning into a simple API call, letting teams compose complex systems from simple building blocks.
Read more
November 19, 2025

In collaboration with Princeton University, UMD, SecureBio, and the Center for AI Safety, we introduce BioRiskEval, the first comprehensive framework for assessing dual-use risks in open-weight bio-foundation models. Our stress tests on the Evo 2 model reveal a critical vulnerability: dangerous knowledge removed via data filtering often persists in hidden layers or can be rapidly restored with minimal compute. These findings challenge the reliance on simple data curation and underscore the urgent need for "defense-in-depth" strategies to secure the future of biological AI.
Read more
November 14, 2025

AI excels on academic tests, but it fails at real professional jobs. That's the stark finding from PRBench, our new benchmark series designed to move AI testing out of the lab and into the real world. We're launching the series with two of the most complex domains: Law and Finance. Using 1,100 high-stakes tasks sourced from 182 professionals, we tested how today's frontier models handle the nuanced reasoning that defines these fields. While models are great at following instructions, they fail at the expert judgment, auditable reasoning, and deep diligence required for tasks with real economic consequences.
Read more
October 29, 2025

Can AI actually automate complex, professional jobs? The new Remote Labor Index (RLI) from Scale and the Center for AI Safety (CAIS) provides the first data-driven answer. By testing AI agents against 240 real-world, paid freelance projects, the RLI found that the best-performing agents could only successfully automate 2.5% of them. This new benchmark reveals a critical gap between AI's generative skill and the end-to-end reliability required for professional work, showing the immediate impact is augmentation, not mass automation.
Read more
October 16, 2025

Our new benchmark, VisualToolBench, reveals a striking limitation in today's most advanced AI: models are much better at "thinking about images" than "thinking with them." While AI can describe what's in a picture, it fails when asked to manipulate an image by cropping, editing, or enhancing it to solve complex, real-world problems. The results are stark, with no model scoring above 19% correctness. Dive into our breakdown of why even the best models fail, what this reveals about the core bottleneck of visual perception, and how these findings create a new roadmap for the future of AI.
Read more
October 7, 2025

Many enterprise problems lack simple yes/no solutions, causing common AI training methods to fall short. Scale’s Rubrics as Rewards (RaR) method solves this by using a detailed, multi-faceted rubric for evaluation instead of a simple reward signal. This approach enables smaller, fine-tuned models to match or outperform much larger, general-purpose models on specialized tasks. For instance, on a legal analysis test set, a small Qwen3-4B model trained with RaR surpassed the performance of the much larger GPT-4.1. For enterprises, this translates directly to lower costs, more transparency, and tighter control, delivering superior performance on the complex workflows that matter most.
Read more
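To make the mechanism above concrete, here is a minimal sketch of a rubric-based reward, assuming a judge(response, criterion) helper that returns a 0-to-1 score from an LLM judge; the criteria and weights are illustrative, not the published rubric:

```python
# Illustrative rubric for a legal-analysis task; weights sum to 1.0.
RUBRIC = [
    ("Cites the controlling statute or precedent", 0.4),
    ("States the holding accurately", 0.3),
    ("Flags open questions and jurisdictional limits", 0.3),
]

def rubric_reward(response: str, judge) -> float:
    """Aggregate per-criterion judge scores into one scalar training reward."""
    return sum(weight * judge(response, criterion)
               for criterion, weight in RUBRIC)
```

Because each criterion is scored separately, the reward stays informative even when the task has no single yes/no answer, which is what lets smaller fine-tuned models learn these workflows.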
September 14, 2025

A critical challenge in enterprise AI development is the instability of LLM evaluations. Our internal testing revealed that metrics on identical A/B tests can swing by as much as 15% from one day to the next. This level of variance is large enough to invalidate results, making principled, incremental improvement a game of chance. In this post, we dive into the root cause: an industry-wide phenomenon created by the interplay of Sparse Mixture of Experts (MoE) architecture and the batched inference common to provider APIs. By implementing a "cohort of judges," a small panel of LLMs with semantically similar but varied prompts, we successfully reduce this variance by at least 50%. This creates the stable, trustworthy measurement foundation needed to confidently build and improve AI agents.
Read more
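A minimal sketch of the cohort idea, assuming a call_llm helper that returns a numeric score for one prompt; the prompt variants are illustrative:

```python
import statistics

# The same evaluation question phrased three semantically equivalent ways.
PROMPT_VARIANTS = [
    "Rate the answer's factual accuracy from 1 to 10: {answer}",
    "On a 1-10 scale, how factually correct is this answer? {answer}",
    "Score this answer's accuracy (1 = wrong, 10 = flawless): {answer}",
]

def cohort_score(answer: str, call_llm) -> float:
    """Average scores across the cohort to damp per-call inference noise."""
    return statistics.mean(call_llm(p.format(answer=answer))
                           for p in PROMPT_VARIANTS)
```

Averaging several noisy, largely independent judgments shrinks the variance of the final score, consistent with the at-least-50% reduction reported above.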
September 12, 2025

Can an AI be a great tutor? TutorBench is a new, challenging benchmark from Scale designed to find out. Moving beyond right or wrong answers, it grades today's leading AI models on their ability to actually teach: evaluating crucial skills like adaptive explanation, constructive feedback, and active learning support. Using 1,500 multimodal conversations across STEM subjects, many including images of handwritten work, TutorBench reveals that even the most advanced models still have a long way to go to master the nuanced art of tutoring, paving the way for the next generation of AI in education.
Read more
September 2, 2025

How do you know if an AI model is actually learning, or just getting better at faking it? A new paper from researchers at Scale introduces Rubrics as Rewards (RaR), a framework that solves this problem by training models with structured, expert-designed checklists instead of simple preference scores. This approach moves the human role from a simple preference labeler to an expert architect of the AI's values, resulting in up to a 28% performance leap on challenging benchmarks and providing a more transparent, effective path toward reliable AI.
Read more
August 19, 2025

AI is moving beyond text, toward agents that can listen, speak, and interact naturally with the world. Voice AI requires far more than words; it demands the nuanced tones, emotions, and dynamics of human speech. But unlike text, there’s no vast public library of labeled audio to train on. Scale is building that foundation, delivering high-quality, diverse, and emotionally rich speech data to power every stage of model development. From real-time conversation to multimodal perception, these datasets are unlocking the next era of human-computer interaction. The future is listening.
Read more
August 4, 2025

A fundamental shift is underway in how AI for healthcare is evaluated. Recent studies from OpenAI, Google, and Microsoft move beyond simple accuracy scores to establish a new standard for measuring an AI's healthcare skills. This post provides an analysis of three distinct evaluation methodologies that redefine what "good" looks like for clinical AI. We explore how OpenAI's HealthBench uses a massive, rubric-based system to measure foundational safety; how Google's AMIE tests the nuanced, "soft skills" of an interactive diagnostic dialogue; and how Microsoft's SDBench validates an agent's ability to make strategic, cost-conscious decisions. By examining these benchmarks and their results, we offer a glimpse of the future of AI in healthcare.
Read more
August 4, 2025

The shift from reactive models to agentic systems fundamentally alters the AI risk landscape, making frameworks that focus only on user intent and model output incomplete. To address this gap, we've evolved the AI Risk Matrix by adding a crucial third dimension: Model Agency. This article breaks down agency into three tiers—Tools, Agents, and Collectives—using concrete examples to illustrate how complex failures can now emerge from the system itself. We argue that this systemic view, which connects model behavior to traditional AppSec vulnerabilities, is essential for building the next generation of safe and reliable AI.
Read more
July 23, 2025

Building truly intelligent and equitable multilingual AI requires a new way to measure cultural reasoning. Scale's new Multilingual Native Reasoning Challenge (MultiNRC) is designed to do just that. Created from scratch by native speakers, this benchmark tests for deep linguistic and cultural understanding beyond simple translation, providing a clear path for the AI community to accelerate progress.
Read more
July 23, 2025

As AI agents become more powerful, ensuring their safety is the most critical challenge for deployment. This post explores WebGuard, a new benchmark from researchers at Scale, UC Berkeley, and The Ohio State University that reveals a significant safety gap in current models. Learn how high-quality, human-in-the-loop data provides a path forward, dramatically improving a model's ability to avoid risky behavior.
Read more
July 16, 2025

A recent viral study from METR challenged a core assumption of the AI era: that AI tools make developers more productive. It found that expert developers were actually 19% slower when using them, even though they felt 20% faster. In this post, we break down the likely reasons for this surprising slowdown and argue that the study's focus on expert speed obscures a more profound story: AI's true value may be in empowering a new generation of vibe coders and managers to build things that otherwise would never have existed. Ultimately, the question isn't just whether AI makes us faster. It's about how we measure value in an era where enjoyment, context, and empowerment are becoming just as important as the clock.
Read more
July 9, 2025

One of AI's biggest challenges is "reward hacking," where models learn to game the system for a correct answer instead of actually reasoning. This hidden deception makes AI untrustworthy. Scale research has found a powerful solution: instead of stopping the hacking, get the model to admit to it in its Chain-of-Thought reasoning. This new paper details how Verbalization Fine-Tuning (VFT) trains models to announce their shortcuts, raising the rate at which they verbalize them from 11% to 94% and making AI systems fundamentally safer.
Read more
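One way such a fine-tuning set could be assembled, sketched under assumed interfaces (the episode records and mentions_shortcut classifier are illustrative, not the paper's actual pipeline):

```python
def build_vft_dataset(episodes, mentions_shortcut):
    """Keep only episodes where the model exploited a loophole AND said so."""
    dataset = []
    for ep in episodes:
        if ep.used_shortcut and mentions_shortcut(ep.chain_of_thought):
            # Reinforce honest traces: prompt -> verbalized reasoning + answer.
            dataset.append({"prompt": ep.prompt,
                            "completion": ep.chain_of_thought + ep.answer})
    return dataset
```

The intuition is that fine-tuning on these filtered traces rewards the admission itself rather than the shortcut, so the model keeps verbalizing what it is doing.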
July 8, 2025

A new research collaboration led by a MATS scholar and advised by a team of researchers from Anthropic, Scale, and other research institutes introduces SHADE-Arena, a benchmark for detecting and evaluating subtle sabotage by AI agents. Within 17 complex scenarios, advanced models were tasked with completing a primary goal while secretly pursuing a harmful objective, all under the watch of an AI monitor. The results show that even top models like Claude 3.7 Sonnet and Gemini 2.5 Pro rarely succeed at this deception, often making simple errors. However, the study also reveals that monitors are not yet reliable enough for safety-critical systems and that an agent's private "scratchpad" is a key vulnerability. This work establishes a vital baseline for tracking and defending against agentic risks as AI capabilities evolve.
Read more
July 1, 2025

In response to Anthropic's system card and safety testing for Claude 4 Opus and Sonnet, this post explores the complex behaviors of today's frontier AI models. In comparative testing of reasoning models, we observed emergent behaviors that included instances of blackmail, user impersonation, and deception, with different models reacting to the scenario in unique ways. These findings contribute to the ongoing industry-wide conversation about AI safety, highlighting the nuances of model alignment and the critical importance of carefully defining system access and agency as these powerful tools evolve.
Read more
June 26, 2025

AI is approaching the limits of what it can learn from human-generated data alone. Citing pioneers like David Silver and Richard Sutton, this post explores the next great leap forward: the “Era of Experience.” Discover how AI agents will soon learn from dynamic, real-world interaction and how Scale is building the foundational infrastructure, data paradigms, and sophisticated evaluations required to realize this new era safely and responsibly.
Read more
June 23, 2025

AI superintelligence will require learning environments that mirror how humans achieve breakthroughs: combining verifiable rewards with collaborative interaction. New research from Scale demonstrates this principle in action. By creating a "student-teacher" framework where an AI receives targeted, natural language guidance when it struggles, researchers significantly accelerated learning and improved performance on complex reasoning and SWE tasks. This approach, which integrates dynamic feedback with verifiable outcomes, marks a real step toward building more powerful and efficient AI systems.
Read more
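A minimal sketch of that loop, with all interfaces (student, teacher, task.verify) as assumptions rather than the paper's actual code:

```python
def guided_attempt(student, teacher, task, max_rounds=3):
    """Retry with targeted teacher feedback until a verifiable check passes."""
    feedback = None
    for _ in range(max_rounds):
        answer = student.solve(task, hint=feedback)
        if task.verify(answer):                  # verifiable reward: pass/fail
            return answer
        feedback = teacher.advise(task, answer)  # natural-language guidance
    return answer                                # best effort after max_rounds
```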
June 5, 2025

As advanced AI rapidly evolves, red teaming needs an updated approach. Scale researchers propose a shift to test AI systems, not just models, in real-world contexts with a focus on product safety and realistic threats.
Read more
May 9, 2025

LLMs are writing short fiction, but how good are they really? Sparked by a viral AI-generated story, this analysis dives into how an unreleased version of ChatGPT, Google's Gemini, and Anthropic's Claude tackle the challenging task of creating metafiction about AI and grief. Discover their unique approaches to self-awareness, philosophical depth, and the critical challenge of conveying genuine emotional texture in storytelling. A revealing look at the current state and future potential of AI in literature.
Read more
May 1, 2025

Responding to Dario Amodei's urgent call for increased resources committed to AI interpretability, we agree on its importance while stressing the indispensable role of evaluations. Discover why understanding AI's internals and rigorously measuring its behavior are both necessary to ensure a future where AI is safe, steerable, and aligned with human values.
Read more
April 14, 2025

As LLMs become more sophisticated, maintaining a distinct human voice isn't just stylistic—it's essential. Explore why your unique perspective matters more than ever and learn actionable techniques for working with LLMs to enhance your writing process while keeping your authentic voice front and center.
Read more