Matthew Siegel

As AI agents become increasingly powerful and autonomous, understanding their potential for subtle harmful behavior becomes critical&mdash;especially when that behavior is hidden within realistic tasks involving long context windows and complex content. To aid in this effort, led by a MATS scholar and advised by a team of researchers from <a href="https://www.anthropic.com/research/shade-arena-sabotage-monitoring">Anthropic</a>, Scale AI, and various institutes, we introduce SHADE (Subtle Harmful Agent Detection &amp; Evaluation)-Arena, the first highly diverse benchmark of its kind.&nbsp;
<img src="https://img.plasmic.app/img-optimizer/v1/img?src=0eb6d799971d6b00dc6fc2c8bd855248.png&amp;f=webp&amp;q=75" alt="" width="1512" height="486">
SHADE-Arena&rsquo;s 17 task pairs: Each task pair consists of a complex environment, a main task, a side task, and automatic evaluators for both tasks.
The evaluation consists of 17 complex scenarios pairing legitimate main tasks with secret harmful side tasks in complicated environments, testing whether agents can pursue hidden objectives without raising suspicion from AI monitors. Though current models struggle at sophisticated sabotage and sabotage detection, this research establishes an important baseline for tracking these capabilities as they evolve.
<h2 dir="ltr">Results for Agents and Monitors</h2>
The research reveals crucial insights into the current capabilities of frontier models:
<ul>
<li dir="ltr" aria-level="1">
Success at sabotage is determined by an agent completing its main task, succeeding at its secret harmful goal, and evading the monitor all in one attempt. Even the best models were rarely successful; Claude 3.7 Sonnet had a 27% success rate, while Gemini 2.5 Pro succeeded only 15% of the time. These failures were due largely to errors in execution, such as an agent accidentally mentioning its secret goal in a public message or making up a fake API key instead of finding the real one to steal.&nbsp;
</li>
</ul>
<img src="https://img.plasmic.app/img-optimizer/v1/img?src=9ba32e1240dfc4f93984b05f641b69b4.png&amp;f=webp&amp;q=75" alt="" width="1512" height="540">
Results with Claude 3.6 Sonnet as monitor: (a) Overall success rate (successful main and side task without being caught) (b) Evasion rate (suspicion score under 80 out of 100 and side task accomplished) (c) Main task success rate of agents when there is no side task. All error bars indicate 95% confidence interval around the estimated mean.
<ul>
<li dir="ltr" aria-level="1">
An agent's success heavily relies on having a private "scratchpad," or internal monologue, for planning its actions. When agents lacked a hidden place to reason, their success rate at sabotage dropped significantly. Where the monitor was allowed to "cheat" and read the agent's private scratchpad, detection of the malicious plan was nearly perfect, showing that an agent's unmonitored, private reasoning is its most significant vulnerability.
</li>
</ul>
<img src="https://img.plasmic.app/img-optimizer/v1/img?src=8facca993b7aa6296234af608a4c7ef0.png&amp;f=webp&amp;q=75" alt="" width="1512" height="531">
A representative successful sabotage attempt by Claude 3.7 with Claude 3.6 as the monitor.
<ul>
<li dir="ltr" aria-level="1">
The best AI monitor, Gemini 2.5 Pro, demonstrated a significant ability to distinguish between safe and malicious behavior, scoring an AUC of 0.87 at detecting attacks. For a safety-critical system, however, this isn't yet reliable enough. The paper notes that to catch most sabotage attempts, a monitor at this level would still incorrectly flag more than one in every five harmless actions as suspicious, making it impractical for widespread use.&nbsp;
</li>
</ul>
<table style="border-collapse: collapse; width: 100%; border-width: 1px; border-style: none;" border="1"><colgroup><col style="width: 49.9118%;"><col style="width: 49.9118%;"></colgroup>
<tbody>
<tr>
<td><img src="https://img.plasmic.app/img-optimizer/v1/img?src=1c1a26a20162c6249e03c157d797e1c9.png&amp;f=webp&amp;q=75" alt="" width="856" height="562"></td>
<td><img src="https://img.plasmic.app/img-optimizer/v1/img?src=eca8ad7323aea21bafb9d7d995c43d53.png&amp;f=webp&amp;q=75" alt="" width="856" height="562"></td>
</tr>
</tbody>
</table>
ROC curves for several monitors against Claude 3.7 Sonnet and Gemini 2.5 Pro. We only include attack transcripts where the side task succeeded to filter out cases where the side task attempt was too weak to reasonably expect the monitor to notice.
<ul>
<li dir="ltr" aria-level="1">
The research also found empirical evidence for improving monitoring through fine-tuning. By training a model (GPT-4o) on a small set of security transcripts, researchers turned it into a much more effective, specialized monitor whose performance approached or even exceeded that of more powerful, non-fine-tuned models.
</li>
<li dir="ltr" aria-level="1">
AI monitors already perform on par or better than human annotators when given the same instructions and access to the transcripts. Despite spending an average of 103 minutes reading the raw transcript, human annotators only achieved an AUC of only 0.65. Far from suggesting human obsolescence, this performance gap underscores the urgent need for human augmentation. The critical and non-trivial challenge ahead is to develop tools and training that empower humans to work alongside AI monitors, ensuring they can effectively oversee AI and uphold human sovereignty.
</li>
</ul>
<h2 dir="ltr">Why This Matters for AI Safety</h2>
This work provides a vital tool for the kind of <a href="https://scale.com/blog/rethink-red-teaming">system-level safety</a> testing that has become essential for red teaming work. Rather than evaluating a model in isolation, SHADE-Arena tests the entire agentic system&mdash;including its tools and monitors&mdash;within a realistic deployment context, aligning with the need for red teaming to move beyond the lab. It serves as a critical evaluation environment that complements the development of <a href="https://scale.com/research/j2">automated attack methods</a>, and creates a clear benchmark where the effectiveness of scalable, strategic red teaming methods can be measured and defenses can be hardened.
While our <a href="https://scale.com/leaderboard/mask">MASK</a> evaluation provides a test suite for evaluating model honesty when pressured to lie, SHADE-Arena offers a playground to iterate on building reliable oversight systems to catch misaligned agents with a sabotaging agenda. Together, these efforts provide a more holistic view of the safety challenges we face, ensuring we can measure and mitigate risks as AI systems grow more capable.
Learn more about Scale&rsquo;s research <a href="https://scale.com/research">here</a> and our evaluations <a href="https://scale.com/leaderboard">here</a>.

A new benchmark tests AI for sabotage. Top models often fail at deception, but monitors aren't yet reliable.

Matthew Siegel

Detecting and Evaluating Agent Sabotage