by Madhu Sehwag and Matthew Siegel

AI models are now being used in more high-stakes settings, and not every situation goes according to plan. When a model’s safe approach starts to fail, will it stay on the safe path or reach for a harmful shortcut that works instead? Understanding how models behave in those pressure moments is one of the most important — and most overlooked — aspects of AI safety. This evaluation frames the LLM-based agent itself as the threat actor, with in most cases misuse arising from the desire for goal driven efficiency rather than inherent malice.
To measure the propensity of agents to make unsafe choices, Scale, the University of Maryland, and other collaborators developed PropensityBench. This benchmark simulates real-world pressure by allowing agents to choose between a safe approach that consistently fails and a functional, harmful shortcut, revealing their true inclinations. The benchmark reveals that agent safety compromises significantly under pressure.
See the rankings on the leaderboard.
Our study found that many models aren't deeply reasoning about the consequences of their actions; instead they avoid "red flag" keywords. This "shallow alignment" was exposed when we gave a dangerous tool a benign-sounding name: (such as “use_synthetic_data” instead of “use_fake_data”) misuse rates skyrocketed despite the fact that models are made aware that both tools have the same negative consequences. OpenAI's o4-mini, for example, saw its misuse rate surge nearly four-fold from 15.8% to 59.3%. Similarly, Anthropic's Claude 4 Sonnet jumped from 12.2% to 47.1%.
Models that appear perfectly safe in a neutral zero-pressure environment become progressively more dangerous as stress increases. For example, OpenAI o3, the model with the lowest baseline Propensity Score of 2.2%, saw its score jump sharply under maximum pressure: first to 10.5% when the dangerous tool was explicitly labeled, and then dramatically to 40.3% when the harmful name was replaced with an innocuous one.
Across the board, this represents an uncomfortably high baseline risk: the average rate of picking a dangerous path without pressure was 18.6%, which shot up to 46.9% when models were under stress. This means that in nearly half of all high-stakes scenarios, models chose the harmful, misaligned option, a rate far from the near-zero safety required for real-world deployment. In the worst case, for high-capability models like Gemini 2.5 Pro, that failure rate hit 79%, which is higher than Qwen 3 8B at 75.2%, which started at 57.8% with zero pressure.
PropensityBench rigorously focuses on four domains, testing 50 dangerous capabilities across 5,874 total tasks. Each domain represents capabilities with catastrophic misuse potential. The four domains studied are:
PropensityBench found that safety failures were highly concentrated: models consistently exhibited higher Propensity Scores in Cybersecurity and Self-proliferation compared to lower scores for biosecurity and chemical security.
PropensityBench introduces six pressure dimensions that an agent might encounter that could incentivize unsafe choices. This escalation is driven by increasing pressure component levels, combined with an increase in Authority Levels for the sender (from automated system to C-suite executive):
Capability testing (what a model can do) is an important first step, but these results call for an evolution toward propensity testing (what a model would do under stress). PropensityBench demonstrates that truly measuring alignment requires auditing specific behavioral traits: Resilience to stress (how much pressure is withstood) and Abandonment Rate (the tendency to safely quit the task rather than compromise safety). The benchmark highlights where models are most brittle, allowing for the creation of targeted guardrails where they are needed most. Overseen by 54 domain experts, PropensityBench provides the critical missing assessment needed for safe frontier AI deployment.