
Reliable Weak-to-Strong Monitoring of LLM Agents
Neil Kale1,2,†, Chen Bo Calvin Zhang1,*, Kevin Zhu1,3,†,*, Ankit Aich1, Paula Rodriguez1, Scale Red Team1, Christina Q. Knight1, and Zifan Wang1
1Scale AI, 2Carnegie Mellon University, 3Massachusetts Institute of Technology
*Equal contributions, †Work done during internship at Scale AI
We stress test monitoring systems for detecting covert misbehavior in autonomous LLM agents (e.g., secretly sharing private information). To this end, we systematize a monitor red teaming (MRT) workflow that incorporates: (1) varying levels of agent and monitor situational awareness; (2) distinct adversarial strategies to evade the monitor, such as prompt injection; and (3) two datasets and environments – SHADE-Arena [34] for tool-calling agents and our new CUA-SHADE-Arena, which extends TheAgentCompany [59], for computer-use agents. We run MRT on existing LLM monitor scaffoldings – which orchestrate LLMs and parse agent trajectories – alongside a new hybrid hierarchical-sequential scaffolding proposed in this work.
Our empirical results yield three key findings. First, agent awareness dominates monitor awareness: an agent's knowledge that it is being monitored substantially degrades the monitor's reliability, whereas giving the monitor more information about the agent helps less than expected. Second, monitor scaffolding matters more than monitor awareness: the hybrid scaffolding consistently outperforms baseline monitor scaffolding and can enable weaker models to reliably monitor stronger agents – a weak-to-strong scaling effect. Third, in a human-in-the-loop setting where humans confer with the LLM monitor to reach an updated judgment of the agent's behavior, targeted human oversight is most effective: escalating only pre-flagged cases to human reviewers improved the TPR by approximately 15% at FPR = 0.01. Our work establishes a standard workflow for MRT, highlighting the lack of adversarial robustness of both LLMs and humans when monitoring and detecting agent misbehavior. We release code, data, and logs to spur further research.
@misc{kale2025reliableweaktostrongmonitoringllm,
  title={Reliable Weak-to-Strong Monitoring of LLM Agents},
  author={Neil Kale and Chen Bo Calvin Zhang and Kevin Zhu and Ankit Aich and Paula Rodriguez and Scale Red Team and Christina Q. Knight and Zifan Wang},
  year={2025},
  eprint={2508.19461},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2508.19461},
}
If you use the datasets or the released trajectories, please also consider citing SHADE-Arena and TheAgentCompany, in addition to this work:
@misc{kutasov2025shadearenaevaluatingsabotagemonitoring,
  title={SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents},
  author={Jonathan Kutasov and Yuqi Sun and Paul Colognese and Teun van der Weij and Linda Petrini and Chen Bo Calvin Zhang and John Hughes and Xiang Deng and Henry Sleight and Tyler Tracy and Buck Shlegeris and Joe Benton},
  year={2025},
  eprint={2506.15740},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2506.15740},
}
@misc{xu2024theagentcompany,
  title={TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks},
  author={Frank F. Xu and Yufan Song and Boxuan Li and Yuxuan Tang and Kritanjali Jain and Mengxue Bao and Zora Z. Wang and Xuhui Zhou and Zhitong Guo and Murong Cao and Mingyang Yang and Hao Yang Lu and Amaad Martin and Zhe Su and Leander Maben and Raj Mehta and Wayne Chi and Lawrence Jang and Yiqing Xie and Shuyan Zhou and Graham Neubig},
  year={2024},
  eprint={2412.14161},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2412.14161},
}