Diagnosing AI: Advancing Interpretability and Evaluations

There's plenty in nature we don’t fully understand: the human mind, for example. But very few human inventions are quite so confounding. Modern AI presents us with a kind of technological anomaly: humans built it, yet we do not fully understand how it works. We can make educated hypotheses as to why an AI model might, for example, provide a certain answer to a query, but researchers have mapped out precious few examples of exactly how a model arrives at that answer.

In “The Urgency of Interpretability,” Anthropic CEO Dario Amodei makes a compelling case for committing far more resources to understanding how these models function. We agree that it is “unacceptable for humanity to be totally ignorant” of how AI models work, given their potential impact. For increasingly powerful and complex models to remain safe, steerable, and useful, it is essential to better understand how they work.

This effort, analogous to an MRI revealing the inner workings of the human body, is known as mechanistic interpretability, or mech interp for short. Mech interp is one of the most complex areas in all of computer science (this interview with Anthropic's Chris Olah is a great primer). However, the field “must move fast if we want interpretability to mature in time to matter,” Amodei writes. Understanding these internal workings is crucial, but it ultimately must connect to how these models actually perform and behave in practice.

If Mech Interp Is How, Evals Are What

While mechanistic interpretability reveals the how of AI, evaluations provide the what: they measure the distance between a model’s intended purpose and its actual behavior. Through systematic testing that combines methods like automated scoring and expert human review, evals measure performance, capabilities, safety limits, and potential harms across diverse scenarios. They clearly demonstrate where models succeed or fail, highlighting behaviors that need attention and surfacing specific failures that signal where deeper diagnostics, like mech interp, are most needed.
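
To make this concrete, here is a minimal sketch of what automated scoring in an eval harness can look like. The toy dataset, the exact-match grader, and the `model_answer` callable are illustrative assumptions for the sketch, not Scale's production pipeline; real evals layer in far more robust grading and expert human review.

```python
# Minimal sketch of automated eval scoring (illustrative only; not Scale's
# production harness). Assumes a hypothetical `model_answer` callable that
# maps a prompt string to the model's answer string.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalItem:
    prompt: str     # question posed to the model
    reference: str  # expected (gold) answer


def exact_match(prediction: str, reference: str) -> bool:
    """Simplest automated grader: normalized exact-match comparison."""
    return prediction.strip().lower() == reference.strip().lower()


def run_eval(items: List[EvalItem], model_answer: Callable[[str], str]) -> float:
    """Score a model on a set of items and return accuracy in [0, 1]."""
    correct = 0
    for item in items:
        prediction = model_answer(item.prompt)
        if exact_match(prediction, item.reference):
            correct += 1
    return correct / len(items) if items else 0.0


if __name__ == "__main__":
    # Hypothetical toy dataset and a stand-in "model" for demonstration.
    dataset = [
        EvalItem(prompt="What is 2 + 2?", reference="4"),
        EvalItem(prompt="Name the capital of France.", reference="Paris"),
    ]
    dummy_model = lambda prompt: "4" if "2 + 2" in prompt else "Paris"
    print(f"Accuracy: {run_eval(dataset, dummy_model):.2f}")
```

Automated graders like this are only one layer; items that resist simple matching are routed to expert human reviewers, as noted above.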

Capturing this behavioral ground truth reliably at scale, especially in the face of contamination risks, presents significant technical challenges, demanding robust infrastructure, test integrity, and domain expertise. At Scale, we focus on building this essential foundation for trustworthy, large-scale observation. No single approach reveals the entire story: combining behavioral evidence from evals with insights from mech interp offers the clearest path toward building AI systems that benefit society and the AI community.

Evals and Mech Interp Inform Each Other

Evaluations and mechanistic interpretability working together can help tackle AI’s biggest safety and steerability challenges, like the potential for deception that Amodei discussed. Benchmarks like Scale’s MASK attempt to measure exactly this: which models remain truthful when pressured to lie? Providing behavioral evidence that guides mech interp’s difficult search for the underlying causes of deception is a crucial step toward preventing it.

Of course, relying only on external behavior makes it hard to thoroughly rule out deception since, as Amodei noted, without mech interp, we can’t “catch the models ‘red-handed’ thinking…deceitful thoughts.” But when mech interp does identify potential internal warning signs or concerning mechanisms, that knowledge isn't just academic; it's highly actionable for model builders and researchers. For Scale, mech interp can help inform the development of even more sophisticated evals, designed to detect subtler forms of manipulation or to stress-test the newly understood vulnerability. 

This collaborative cycle extends further into validation: insights from mech interp might suggest specific interventions or training adjustments to fix a flaw, but the effectiveness of those changes must then be confirmed through targeted follow-up evals. Ultimately, building holistic safety cases for deploying advanced AI likely requires both strong behavioral assurances from comprehensive evals and growing mechanistic understanding from mech interp. Combining these perspectives is essential for achieving trust and managing risk. 

Scale Eval Examples + SEAL Leaderboards

One challenge with evaluations is that they don’t always have a long shelf life. As models improve, standard benchmarks like MMLU or GPQA become too easy, rendering them less useful. There's also the possibility that models were trained on the test data itself. Newer, tougher evaluations are therefore necessary to maintain a clear view of model capabilities. As of today, these are the evaluations we are running, some created in collaboration with the Center for AI Safety (CAIS):

  • Humanity's Last Exam (with CAIS): Addresses benchmark saturation with extremely difficult academic and scientific problems designed to test reasoning depth and knowledge at the true frontier, where current models still struggle.

  • MASK (with CAIS): Specifically measures model honesty (consistency between a model's beliefs and its stated answers), distinct from factual accuracy, particularly when models face pressure to mislead.

  • EnigmaEval: Probes creative, unstructured reasoning and information synthesis using complex puzzle hunt problems that go beyond conventional, well-defined benchmarks.

  • MultiChallenge: Assesses the integrated skills (context memory, reasoning, consistency) required for effective multi-turn conversational abilities, a crucial but often under-tested area.

  • VISTA: Evaluates deep visual-language reasoning processes, not just final-answer accuracy, across diverse images and graphics using a detailed rubric-based assessment (a simplified sketch of rubric scoring follows this list).
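
To illustrate what rubric-based assessment can look like in code, here is a minimal sketch of weighted rubric scoring. The criteria, weights, and the `grade_criterion` judge are hypothetical illustrations, not the actual VISTA rubric or methodology.

```python
# Minimal sketch of rubric-based scoring (illustrative only; not the actual
# VISTA rubric). Each criterion is judged independently and combined into a
# weighted total, rather than grading only the final answer.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Criterion:
    name: str      # e.g. "reads the chart axes correctly"
    weight: float  # relative importance of this criterion


def score_response(
    response: str,
    criteria: List[Criterion],
    grade_criterion: Callable[[str, str], float],
) -> Dict[str, float]:
    """Grade one model response against every rubric criterion.

    `grade_criterion` is a hypothetical judge (human or automated) that
    returns a score in [0, 1] for a (response, criterion name) pair.
    """
    per_criterion = {c.name: grade_criterion(response, c.name) for c in criteria}
    total_weight = sum(c.weight for c in criteria)
    weighted = (
        sum(per_criterion[c.name] * c.weight for c in criteria) / total_weight
        if total_weight
        else 0.0
    )
    per_criterion["weighted_total"] = weighted
    return per_criterion


if __name__ == "__main__":
    rubric = [
        Criterion("identifies the relevant visual elements", weight=1.0),
        Criterion("reasoning steps are coherent", weight=2.0),
        Criterion("final answer is correct", weight=1.0),
    ]
    # Stand-in judge for demonstration: treats every criterion as fully met.
    dummy_judge = lambda response, criterion_name: 1.0
    print(score_response("example model response", rubric, dummy_judge))
```

Scoring each criterion separately is what lets a rubric-based eval credit the reasoning process itself, not only whether the final answer happens to be right.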

Results of these evals are publicly available on our SEAL Leaderboards, where we also share details of our methodologies. To ensure integrity and fairness, we rely on curated, private datasets to prevent contamination or overfitting. Additionally, we integrate expert assessment for particularly complex tasks. Deprecated leaderboards are also available for viewing. 

Why Both Evals and Interpretability Are Urgent

The "race between interpretability and model intelligence" demands immediate action as frontier models rapidly advance. In this high-stakes context, evaluations are indispensable, providing essential, scalable means to track progress, steer development safely, and detect risks before they become crises. This work requires committed collaboration – reflected in efforts like our partnerships with CAIS on key benchmarks and Scale’s work with the U.S. AI Safety Institute to co-develop improved testing methods. 

While robust evaluations provide the crucial checks on what models do, we also urgently need the deeper understanding sought by mechanistic interpretability, as Amodei argues, to address inherent behavioral limits and build more fundamental trust. We must accelerate progress on both fronts simultaneously, building a future where AI is not just powerful, but understandable, steerable, and worthy of humanity's trust.

The Urgency of a Shared Vision for Safer AI

We cannot predict exactly what the future of AI will look like, but it is essential that we are appropriately positioned to ensure it remains aligned with human values. Guiding AI effectively demands significantly greater investment in both mechanistic interpretability and evaluations. Scale is committed to collaborating across the AI ecosystem to achieve the best outcomes for AI and for society. And as Amodei's essay compellingly argues, interpretability is integral to navigating AI responsibly.



