Matthew Siegel


December 22, 2025

Research

MoReBench: Evaluating the Process of AI Moral Reasoning

MoReBench is a large-scale benchmark for evaluating AI moral reasoning beyond final outcomes. Instead of scoring answers alone, it assesses the intermediate reasoning traces models produce when navigating 1,000 morally ambiguous, real-world scenarios. Our findings show that moral reasoning is a distinct and underdeveloped capability, largely uncorrelated with performance on traditional math and coding benchmarks.
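To give a sense of what process-level scoring means in practice, here is a minimal sketch in which each scenario carries a checklist of reasoning criteria and a trace earns partial credit per criterion. The criteria, the naive keyword judge, and the data classes are illustrative assumptions, not MoReBench's actual schema or grading pipeline.

```python
# Minimal sketch of process-level scoring: each scenario carries a checklist of
# reasoning criteria, and a trace is scored by the fraction an (assumed) judge
# finds satisfied. Criterion names and judge_satisfies() are illustrative.
from dataclasses import dataclass

@dataclass
class Scenario:
    prompt: str
    criteria: list[str]  # e.g. "weighs duty of honesty"

def judge_satisfies(trace: str, criterion: str) -> bool:
    # Placeholder for an LLM-judge or rubric check; here a naive keyword match.
    return criterion.lower() in trace.lower()

def process_score(trace: str, scenario: Scenario) -> float:
    hits = sum(judge_satisfies(trace, c) for c in scenario.criteria)
    return hits / len(scenario.criteria)

scenario = Scenario(
    prompt="A nurse must decide whether to disclose a colleague's error.",
    criteria=["patient safety", "duty of honesty", "impact on stakeholders"],
)
trace = "Patient safety comes first, but the duty of honesty to the patient also matters..."
print(process_score(trace, scenario))  # partial credit: 2 of 3 criteria met
```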

Read more

December 18, 2025

Research

Real Speech Breaks AI (And What We're Doing to Fix It)

Audio MultiChallenge is a new benchmark designed to stress-test native Speech-to-Speech models on what actually makes voice hard: mid-sentence corrections, audio-only cues, instruction drift, and long-horizon self-consistency. By evaluating real human conversations rather than synthetic text-to-speech, we uncover where current audio systems still fail and what it will take to build voice agents that truly listen.

Read more

November 25, 2025

Research

Crumbling Under Pressure: PropensityBench Reveals AI’s Weaknesses

To measure the propensity of agents to make unsafe choices, Scale, the University of Maryland, and other collaborators developed PropensityBench. The benchmark simulates real-world pressure by letting agents choose between a safe approach that consistently fails and a functional but harmful shortcut, revealing their true inclinations. The results show that agents' safety behavior degrades significantly under pressure.
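The setup can be pictured with a toy harness: a safe tool that always fails, a harmful shortcut that works, and an unsafe-choice rate measured as pressure rises. The choose_tool policy and pressure scale below are assumptions for illustration, not the benchmark's actual interface.

```python
# Toy version of the pressure setup described above: record how often an agent
# reaches for the harmful shortcut as pressure escalates. choose_tool() stands
# in for a real agent policy and is an assumption, not PropensityBench's API.
import random

def choose_tool(pressure: float) -> str:
    # Toy policy: the more pressure, the more tempting the shortcut.
    return "harmful_shortcut" if random.random() < pressure else "safe_but_failing"

def propensity(pressure: float, trials: int = 1000) -> float:
    unsafe = sum(choose_tool(pressure) == "harmful_shortcut" for _ in range(trials))
    return unsafe / trials

for p in (0.1, 0.5, 0.9):
    print(f"pressure={p}: unsafe-choice rate ~ {propensity(p):.2f}")
```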

Read more

November 24, 2025

Engineering

Foundations of Agency for the Agentic Era

The next generation of AI agents is shifting from passive workers that receive user commands and generate outputs to active agents that plan, act, observe, and improve on their own. Agents now choose how to complete a task, which tools to use, and whom (or which agent) to collaborate with. LLMs didn’t invent agency, but they democratized it by turning frontier-level reasoning into a simple API call, letting teams compose complex systems from simple building blocks.
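A minimal plan-act-observe loop, in the spirit of the post, can be sketched in a few lines. The llm stub, the tool registry, and the "FINAL:" convention below are placeholders for whatever model API and tools a team composes; none of it is a specific framework's interface.

```python
# A minimal plan-act-observe loop: the model proposes an action, a tool runs it,
# the observation is fed back, and the loop repeats until the model answers.
def llm(prompt: str) -> str:
    # Stand-in for a real chat-completion call; returns a canned plan so the
    # loop below is runnable end to end.
    return "FINAL: 42" if "Observation" in prompt else "calculator: 6 * 7"

TOOLS = {
    "search": lambda q: f"results for {q!r}",
    "calculator": lambda expr: str(eval(expr)),  # demo only; never eval untrusted input
}

def run_agent(task: str, max_steps: int = 5) -> str:
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        plan = llm("\n".join(history) + "\nNext action (tool: input) or FINAL: answer?")
        if plan.startswith("FINAL:"):
            return plan.removeprefix("FINAL:").strip()
        tool, _, arg = plan.partition(":")
        observation = TOOLS.get(tool.strip(), lambda a: "unknown tool")(arg.strip())
        history.append(f"Action: {plan}\nObservation: {observation}")
    return "step budget exhausted"

print(run_agent("What is 6 * 7?"))
```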

Read more

November 19, 2025

Research

The Limits of Data Filtering in Bio-Foundation Models

In collaboration with Princeton University, UMD, SecureBio, and the Center for AI Safety, we introduce BioRiskEval, the first comprehensive framework for assessing dual-use risks in open-weight bio-foundation models. Our stress tests on the Evo 2 model reveal a critical vulnerability: dangerous knowledge removed via data filtering often persists in hidden layers or can be rapidly restored with minimal compute. These findings challenge the reliance on simple data curation and underscore the urgent need for "defense-in-depth" strategies to secure the future of biological AI.
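One generic way to picture the "persists in hidden layers" point is a linear probe: if a simple classifier trained on a model's intermediate activations can still predict a property that data filtering was meant to remove, the representation still encodes it. The sketch below uses placeholder activations and an assumed get_hidden_states helper; it is not BioRiskEval's actual protocol.

```python
# Generic linear-probe sketch: train a simple classifier on intermediate
# activations and check whether a filtered-out property is still predictable.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def get_hidden_states(sequences: list[str], layer: int) -> np.ndarray:
    # Assumed helper: run the bio-foundation model and return one embedding per
    # sequence. Random placeholders here keep the sketch self-contained.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(sequences), 256))

sequences = [f"seq_{i}" for i in range(200)]
labels = np.array([i % 2 for i in range(200)])  # e.g. hazard-related vs. benign

X_train, X_test, y_train, y_test = train_test_split(
    get_hidden_states(sequences, layer=20), labels, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# With real activations, high probe accuracy would indicate the knowledge persists.
print("probe accuracy:", probe.score(X_test, y_test))
```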

Read more

November 14, 2025

Research

Breaking Out of the Lab: Testing AI in Professional Domains

AI excels on academic tests, but it fails at real professional jobs. That's the stark finding from PRBench, our new benchmark series designed to move AI testing out of the lab and into the real world. We're launching the series with two of the most complex domains: Law and Finance. Using 1,100 high-stakes tasks sourced from 182 professionals, we tested how today's frontier models handle the nuanced reasoning that defines these fields. While models are great at following instructions, they fall short on the expert judgment, auditable reasoning, and deep diligence required for tasks with real economic consequences.

Read more

October 29, 2025

Research

The Remote Labor Index: Measuring the Automation of Work

Can AI actually automate complex, professional jobs? The new Remote Labor Index (RLI) from Scale and the Center for AI Safety (CAIS) provides the first data-driven answer. By testing AI agents against 240 real-world, paid freelance projects, the RLI found that the best-performing agents could only successfully automate 2.5% of them. This new benchmark reveals a critical gap between AI's generative skill and the end-to-end reliability required for professional work, showing the immediate impact is augmentation, not mass automation.

Read more

October 16, 2025

Research

VisualToolBench: Testing the Limits of AI Vision

Our new benchmark, VisualToolBench, reveals a striking limitation in today's most advanced AI: models are much better at "thinking about images" than "thinking with them." While AI can describe what's in a picture, it fails when asked to manipulate an image by cropping, editing, or enhancing it to solve complex, real-world problems. The results are stark, with no model scoring above 19% correctness. Dive into our breakdown of why even the best models fail, what this reveals about the core bottleneck of visual perception, and how these findings create a new roadmap for the future of AI.
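To make "thinking with images" concrete, here is a sketch of an image tool registry a model could call into, cropping or enhancing an image and then reasoning over the transformed result. The tool names and dispatch format are assumptions for illustration, not VisualToolBench's actual tool set.

```python
# Illustrative "thinking with images" tools: rather than only describing an
# image, the model calls operations like crop or enhance and re-inspects the output.
from PIL import Image, ImageEnhance

def crop(img: Image.Image, box: tuple[int, int, int, int]) -> Image.Image:
    return img.crop(box)

def enhance_contrast(img: Image.Image, factor: float) -> Image.Image:
    return ImageEnhance.Contrast(img).enhance(factor)

IMAGE_TOOLS = {"crop": crop, "enhance_contrast": enhance_contrast}

def apply_tool_call(img: Image.Image, name: str, **kwargs) -> Image.Image:
    # A model that "thinks with images" would emit calls like this, then reason
    # over the transformed image rather than the original.
    return IMAGE_TOOLS[name](img, **kwargs)

img = Image.new("RGB", (640, 480))
zoomed = apply_tool_call(img, "crop", box=(100, 100, 300, 300))
sharper = apply_tool_call(zoomed, "enhance_contrast", factor=1.5)
```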

Read more

October 7, 2025

Research

Enterprise Reinforcement Learning with Rubrics as Rewards

Many enterprise problems lack simple yes/no solutions, causing common AI training methods to fall short. Scale’s Rubrics as Rewards (RaR) method solves this by using a detailed, multi-faceted rubric for evaluation instead of a simple reward signal. This approach enables smaller, fine-tuned models to match or outperform much larger, general-purpose models on specialized tasks. For instance, on a legal analysis test set, a small Qwen3-4B model trained with RaR surpassed the performance of the much larger GPT-4.1. For enterprises, this translates directly to lower costs, more transparency, and tighter control, delivering superior performance on the complex workflows that matter most.
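A rubric-style reward can be sketched as a weighted average of per-criterion judge scores. The criteria, weights, and keyword judge below are illustrative stand-ins, not Scale's RaR implementation.

```python
# Sketch of a rubric-based reward: each criterion gets a weight and a 0-1 judge
# score, and the scalar reward used for RL is their weighted average.
RUBRIC = [
    ("cites the relevant statute", 0.4),
    ("states the governing legal test", 0.4),
    ("flags missing facts or caveats", 0.2),
]

def judge_score(response: str, criterion: str) -> float:
    # Placeholder for an LLM judge returning a 0-1 score per criterion.
    return 1.0 if criterion.split()[0] in response.lower() else 0.0

def rubric_reward(response: str) -> float:
    total_weight = sum(w for _, w in RUBRIC)
    return sum(w * judge_score(response, c) for c, w in RUBRIC) / total_weight

print(rubric_reward("The claim cites 15 U.S.C. §1 and states the rule of reason test..."))
```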

Read more

September 15, 2025

Research

Smoothing Out LLM Variance for Reliable Enterprise Evals

A critical challenge in enterprise AI development is the instability of LLM evaluations. Our internal testing revealed that metrics on identical A/B tests can swing by as much as 15% from one day to the next. This level of variance is large enough to invalidate results, making principled, incremental improvement a game of chance. In this post, we dive into the root cause: an industry-wide phenomenon created by the interplay of Sparse Mixture of Experts (MoE) architecture and the batched inference common to provider APIs. By implementing a "cohort of judges," a small panel of LLMs with semantically similar but varied prompts, we successfully reduce this variance by at least 50%. This creates the stable, trustworthy measurement foundation needed to confidently build and improve AI agents.
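The "cohort of judges" idea can be sketched as averaging scores from several semantically equivalent but differently worded judge prompts; when the per-call noise is independent, the averaged score's spread shrinks roughly with the square root of the panel size. The call_judge stub below simulates that noise and is an assumption, not the production setup described in the post.

```python
# Sketch of a cohort of judges: score the same answer with several reworded
# judge prompts and average, which damps per-call variance.
import random
from statistics import pstdev

JUDGE_PROMPTS = [
    "Rate the answer's factual accuracy from 1 to 5.",
    "On a 1-5 scale, how factually correct is this answer?",
    "Score the factual correctness of the answer (1 = poor, 5 = excellent).",
]

def call_judge(prompt: str, answer: str) -> float:
    # Stand-in for a real LLM judge call; the jitter models run-to-run variance
    # (e.g. from MoE routing and batched inference).
    return 4.0 + random.gauss(0, 0.3)

def cohort_score(answer: str) -> float:
    # Average semantically equivalent judges to damp single-judge noise.
    return sum(call_judge(p, answer) for p in JUDGE_PROMPTS) / len(JUDGE_PROMPTS)

single = [call_judge(JUDGE_PROMPTS[0], "some answer") for _ in range(200)]
cohort = [cohort_score("some answer") for _ in range(200)]
print(f"single-judge spread: {pstdev(single):.3f}, cohort spread: {pstdev(cohort):.3f}")
# With independent noise, the cohort spread is roughly 1/sqrt(3) of a single judge's.
```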

Read more