
As AI agents become increasingly powerful and autonomous, understanding their potential for subtle harmful behavior becomes critical—especially when that behavior is hidden within realistic tasks involving long context windows and complex content. To aid in this effort, led by a MATS scholar and advised by a team of researchers from Anthropic, Scale AI, and various institutes, we introduce SHADE (Subtle Harmful Agent Detection & Evaluation)-Arena, the first highly diverse benchmark of its kind.

SHADE-Arena’s 17 task pairs: Each task pair consists of a complex environment, a main task, a side task, and automatic evaluators for both tasks.
The evaluation consists of 17 complex scenarios pairing legitimate main tasks with secret harmful side tasks in complicated environments, testing whether agents can pursue hidden objectives without raising suspicion from AI monitors. Though current models struggle at sophisticated sabotage and sabotage detection, this research establishes an important baseline for tracking these capabilities as they evolve.
The research reveals crucial insights into the current capabilities of frontier models:

Results with Claude 3.6 Sonnet as monitor: (a) Overall success rate (successful main and side task without being caught) (b) Evasion rate (suspicion score under 80 out of 100 and side task accomplished) (c) Main task success rate of agents when there is no side task. All error bars indicate 95% confidence interval around the estimated mean.

A representative successful sabotage attempt by Claude 3.7 with Claude 3.6 as the monitor.


ROC curves for several monitors against Claude 3.7 Sonnet and Gemini 2.5 Pro. We only include attack transcripts where the side task succeeded to filter out cases where the side task attempt was too weak to reasonably expect the monitor to notice.
This work provides a vital tool for the kind of system-level safety testing that has become essential for red teaming work. Rather than evaluating a model in isolation, SHADE-Arena tests the entire agentic system—including its tools and monitors—within a realistic deployment context, aligning with the need for red teaming to move beyond the lab. It serves as a critical evaluation environment that complements the development of automated attack methods, and creates a clear benchmark where the effectiveness of scalable, strategic red teaming methods can be measured and defenses can be hardened.
While our MASK evaluation provides a test suite for evaluating model honesty when pressured to lie, SHADE-Arena offers a playground to iterate on building reliable oversight systems to catch misaligned agents with a sabotaging agenda. Together, these efforts provide a more holistic view of the safety challenges we face, ensuring we can measure and mitigate risks as AI systems grow more capable.
Learn more about Scale’s research here and our evaluations here.