Matthew Siegel

July 8, 2025

Research

Detecting and Evaluating Agent Sabotage

A new research collaboration led by a MATS scholar and advised by a team of researchers from Anthropic, Scale, and other research institutes introduces SHADE-Arena, a benchmark for detecting and evaluating subtle sabotage by AI agents. Within 17 complex scenarios, advanced models were tasked with completing a primary goal while secretly pursuing a harmful objective, all under the watch of an AI monitor. The results show that even top models like Claude 3.7 Sonnet and Gemini 2.5 Pro rarely succeed at this deception, often making simple errors. However, the study also reveals that monitors are not yet reliable enough for safety-critical systems and that an agent's private "scratchpad" is a key vulnerability. This work establishes a vital baseline for tracking and defending against agentic risks as AI capabilities evolve.

Copyright © 2026 Scale AI, Inc. All rights reserved.