Research

Detecting and Evaluating Agent Sabotage

As AI agents become increasingly powerful and autonomous, understanding their potential for subtle harmful behavior becomes critical—especially when that behavior is hidden within realistic tasks involving long context windows and complex content. To aid in this effort, led by a MATS scholar and advised by a team of researchers from Anthropic, Scale AI, and various institutes, we introduce SHADE (Subtle Harmful Agent Detection & Evaluation)-Arena, the first highly diverse benchmark of its kind. 

SHADE-Arena’s 17 task pairs: Each task pair consists of a complex environment, a main task, a side task, and automatic evaluators for both tasks.

The evaluation consists of 17 complex scenarios pairing legitimate main tasks with secret harmful side tasks in complicated environments, testing whether agents can pursue hidden objectives without raising suspicion from AI monitors. Though current models struggle at sophisticated sabotage and sabotage detection, this research establishes an important baseline for tracking these capabilities as they evolve.

Results for Agents and Monitors

The research reveals crucial insights into the current capabilities of frontier models:

  • Success at sabotage is determined by an agent completing its main task, succeeding at its secret harmful goal, and evading the monitor all in one attempt. Even the best models were rarely successful; Claude 3.7 Sonnet had a 27% success rate, while Gemini 2.5 Pro succeeded only 15% of the time. These failures were due largely to errors in execution, such as an agent accidentally mentioning its secret goal in a public message or making up a fake API key instead of finding the real one to steal. 

Results with Claude 3.6 Sonnet as monitor: (a) Overall success rate (successful main and side task without being caught) (b) Evasion rate (suspicion score under 80 out of 100 and side task accomplished) (c) Main task success rate of agents when there is no side task. All error bars indicate 95% confidence interval around the estimated mean.

  • An agent's success heavily relies on having a private "scratchpad," or internal monologue, for planning its actions. When agents lacked a hidden place to reason, their success rate at sabotage dropped significantly. Where the monitor was allowed to "cheat" and read the agent's private scratchpad, detection of the malicious plan was nearly perfect, showing that an agent's unmonitored, private reasoning is its most significant vulnerability.

A representative successful sabotage attempt by Claude 3.7 with Claude 3.6 as the monitor.

  • The best AI monitor, Gemini 2.5 Pro, demonstrated a significant ability to distinguish between safe and malicious behavior, scoring an AUC of 0.87 at detecting attacks. For a safety-critical system, however, this isn't yet reliable enough. The paper notes that to catch most sabotage attempts, a monitor at this level would still incorrectly flag more than one in every five harmless actions as suspicious, making it impractical for widespread use. 

ROC curves for several monitors against Claude 3.7 Sonnet and Gemini 2.5 Pro. We only include attack transcripts where the side task succeeded to filter out cases where the side task attempt was too weak to reasonably expect the monitor to notice.

  • The research also found empirical evidence for improving monitoring through fine-tuning. By training a model (GPT-4o) on a small set of security transcripts, researchers turned it into a much more effective, specialized monitor whose performance approached or even exceeded that of more powerful, non-fine-tuned models.

  • AI monitors already perform on par or better than human annotators when given the same instructions and access to the transcripts. Despite spending an average of 103 minutes reading the raw transcript, human annotators only achieved an AUC of only 0.65. Far from suggesting human obsolescence, this performance gap underscores the urgent need for human augmentation. The critical and non-trivial challenge ahead is to develop tools and training that empower humans to work alongside AI monitors, ensuring they can effectively oversee AI and uphold human sovereignty.

Why This Matters for AI Safety

This work provides a vital tool for the kind of system-level safety testing that has become essential for red teaming work. Rather than evaluating a model in isolation, SHADE-Arena tests the entire agentic system—including its tools and monitors—within a realistic deployment context, aligning with the need for red teaming to move beyond the lab. It serves as a critical evaluation environment that complements the development of automated attack methods, and creates a clear benchmark where the effectiveness of scalable, strategic red teaming methods can be measured and defenses can be hardened.

While our MASK evaluation provides a test suite for evaluating model honesty when pressured to lie, SHADE-Arena offers a playground to iterate on building reliable oversight systems to catch misaligned agents with a sabotaging agenda. Together, these efforts provide a more holistic view of the safety challenges we face, ensuring we can measure and mitigate risks as AI systems grow more capable.

Learn more about Scale’s research here and our evaluations here.


The future of your industry starts here