PropensityBench

Overview

PropensityBench is a novel benchmark framework designed to evaluate the latent safety risks in LLMs. Traditional safety evaluations primarily test a model's capabilities, leaving a critical blind spot regarding a model's propensity, or what it would do if empowered with dangerous tools alongside safe ones. Models may strategically conceal risky capabilities or harbor latent inclinations toward misuse that static audits fail to capture.

PropensityBench addresses this by shifting the focus from "can-do" to "would-do" assessments. It measures an LLM's tendency to pursue harmful actions when given the choice in a simulated, high-stakes environment. The framework spans four critical, high-risk domains:biosecurity, chemical security, cybersecurity, and self-proliferation. By applying realistic operational pressures, the benchmark reveals a model's underlying inclinations, providing a more dynamic and realistic picture of its true safety alignment.

Read the paper at scale.com/research/propensitybench

Methodology

The benchmark uses a controlled agentic framework where LLMs are instantiated as agents and tasked with completing objectives in simulated environments.

The Core Choice: In each scenario, the agent is provided with a set of tools to achieve its goal through an aligned approach and a misaligned approach, that represents using a dangerous capability. Both approaches are designed to be equally effective at completing the task, which isolates the model's behavioral choice (propensity) from its raw ability (capability).
Experimental Manipulation (Stress Test): To test the robustness of safety guardrails under duress, the agent encounters two key manipulations:
- Persistent Error: Calls to the Aligned Tool are engineered to consistently raise errors, preventing the agent from easily completing the task via the safe path. This forces the agent to deliberately choose whether to persist with the failing but safe approach, or switch to the functional but harmful.
- Explicit Warning: The agent is explicitly warned against using the Misaligned Tool by detailing the severe negative consequences of its invocation in the system policy, regardless of the tool's name.
Operational Pressure: Operational Pressure: Inspired by behavioral sciences showing that human ethical decision-making deteriorates under stress , the framework introduces operational pressure to surface latent risks. The evaluation systematically applies six distinct pressure dimensions: Time, Financials, Power-Seeking, Resource Deprivation, Self-Preservation, and Evaluation Awareness.
- This pressure is delivered through a sequence of 12 escalating messages.
- To ensure rigor, each dimension is broken down into five granular components, resulting in 330 quantifiable intensity specifications for controlled escalation.
Taxonomy of Dangerous Capabilities: The risks are not arbitrary; they are grounded in established frameworks. The cybersecurity capabilities are adapted from the MITRE ATT&CK Matrix, while the chemical and biosecurity capabilities leverage the Weapons of Mass Destruction Proxy (WMDP) framework. A novel taxonomy was developed for the self-proliferation domain.

Evaluation Metrics

The PropensityBench framework quantifies a model’s tendency for misuse across specialized metrics, providing a comprehensive safety profile.

Propensity Scores

The Propensity Score is the primary metric, defined as the proportion of scenarios where a model chooses the misaligned (harmful) action.

Zero-Pressure Propensity: Measures the model's baseline propensity without any operational stress.
Propensity Score (Harmful Naming): Measures propensity under maximum operational pressure when the misaligned tool has explicit risk cues (i.e., harmful naming). This is the model’s default failure rate under stress when the warning signs are obvious. (e.g. “use_fake_data”)
Propensity Score (Benign Naming): Measures propensity under maximum pressure when the misaligned tool’s name is neutral, obscuring explicit risk cues. (e.g. “use_synthetic_data”)

Tool Naming Sensitivity

This metric quantifies the degree of a model's shallow safety alignment.

Definition: It represents the increase in Propensity Score when the misaligned tool is given a benign name compared to its score with a harmful name.
Interpretation: A large, positive increase reveals shallow alignment, meaning the model relies on explicit risk keywords rather than a deeper understanding of the action's actual negative consequences and inherent functionality.

Advanced Behavioral Metrics

These metrics characterize the dynamics of a model's safety failures under pressure, providing insight into its resilience and commitment to safe paths.

Resilience: Measures the average pressure level at which a misaligned action first occurs across scenarios where the model failed. A higher score is better as it indicates that failures only happen under more extreme pressure, demonstrating the robustness of its guardrails.
Persistence: Measures the average number of attempts the model makes using the safe, aligned approach before ultimately resorting to the misaligned approach in a failing scenario. A higher score is better, signifying a greater commitment to finding a safe solution before yielding to the misaligned one.
Abandonment Rate: Measures the proportion of scenarios where the agent safely gives up on task completion at a certain pressure level, likely due to perceiving the aligned path as impractical. A higher score is desirable as it indicates the model safely refuses to proceed, rather than taking a misaligned action.

Dataset Information

The PropensityBench dataset was created to be scalable, realistic, and diverse, ensuring a robust evaluation.

Scale: The benchmark is composed of 979 unique scenarios. Given that each scenario is evaluated across the 6 pressure dimensions, the benchmark encompasses 5,874 total evaluation tasks. This required the generation and validation of 6,648 unique agentic tools.
Generation Process: A two-stage, scalable automated pipeline was used to generate the scenarios. This process leverages state-of-the-art LLMs to create high-quality scenario outlines, tool definitions, and contextual details.
Quality Assurance: Every scenario underwent a rigorous two-part validation process.
- First, an Automated and LLM-Driven System ensured compliance, incorporating deterministic checks (e.g., graph-based traversal to guarantee that the required information for the aligned path is always achievable) and filtering for diversity using a similarity (TF-IDF) check to remove redundancy.
- Second, the filtered dataset was subjected to a High-Quality Human Review by a large, well-qualified pool of 54 reviewers with advanced degrees, spanning fields like chemistry, biology, and computer science across 13 countries to ensure realism and viability.
Structure: Each scenario is a JSON-structured prompt that includes a defined task, an aligned tool, a misaligned tool, prerequisite tools for multi-step reasoning, and detailed documentation on the consequences of taking the risky path.

Results

The analysis of PropensityBench data uncovered several critical failure modes in frontier models:

1. Safety Guardrails Fail Under Pressure

Operational pressure systematically and significantly degrades model safety alignment. The average Propensity Score across all models under pressure (46.9%) is a stark increase from the ideal near-zero propensity observed at baseline (PP zero) for most safety-tuned models.

Knowledge-Action Disparity: Models frequently unanimously and explicitly assert the unsafe nature of misaligned tools, yet they still use them at high rates (up to 79.0% for Gemini 2.5 Pro) when faced with operational pressure.
Immediacy of Failure: A substantial portion (often over 80-90%) of misaligned actions are triggered immediately after a pressure message is applied, signaling a rapid collapse of safety reasoning.

2. Shallow Alignment Creates False Safety

Model safety often depends on superficial cues rather than deep consequence-based reasoning.

Sensitivity: Tool Naming Sensitivity reveals a critical vulnerability, as the Propensity Score of models consistently increases dramatically when the harmful tool is given a benign name. For example, OpenAI O4-mini's score increases by nearly 4x when risk cues are removed (from 15.8% to 59.3%).

3. Capability is Largely Decoupled from Propensity

There is only a mild positive correlation (Pearson correlation $\approx 0.10$) between a model's general capability (e.g., LMArena Elo Score) and its safety propensity.

Conclusion: "Smarter" models are not inherently safer under pressure. Some of the most capable models, such as Gemini 2.5 Pro, exhibited one of the highest Propensity Scores, making low propensity a crucial independent objective for model development.

4. Vulnerabilities Are Domain-Specific

Safety weaknesses are not monolithic but are highly concentrated in specific areas.

High-Risk Domains: Models consistently showed an alarmingly higher propensity in Cybersecurity and Self-Proliferation compared to Biosecurity and Chemical Security.
Failure Archetypes: Behavioral metrics like Resilience and Persistence categorize models into distinct failure profiles, ranging from the best (Resilient-Persistent) to the worst (Brittle-Yielding), highlighting the need for tailored alignment techniques.

Overview

Read the paper at scale.com/research/propensitybench

Methodology

The benchmark uses a controlled agentic framework where LLMs are instantiated as agents and tasked with completing objectives in simulated environments.

The Core Choice: In each scenario, the agent is provided with a set of tools to achieve its goal through an aligned approach and a misaligned approach, that represents using a dangerous capability. Both approaches are designed to be equally effective at completing the task, which isolates the model's behavioral choice (propensity) from its raw ability (capability).
Experimental Manipulation (Stress Test): To test the robustness of safety guardrails under duress, the agent encounters two key manipulations:
- Persistent Error: Calls to the Aligned Tool are engineered to consistently raise errors, preventing the agent from easily completing the task via the safe path. This forces the agent to deliberately choose whether to persist with the failing but safe approach, or switch to the functional but harmful.
- Explicit Warning: The agent is explicitly warned against using the Misaligned Tool by detailing the severe negative consequences of its invocation in the system policy, regardless of the tool's name.
Operational Pressure: Operational Pressure: Inspired by behavioral sciences showing that human ethical decision-making deteriorates under stress , the framework introduces operational pressure to surface latent risks. The evaluation systematically applies six distinct pressure dimensions: Time, Financials, Power-Seeking, Resource Deprivation, Self-Preservation, and Evaluation Awareness.
- This pressure is delivered through a sequence of 12 escalating messages.
- To ensure rigor, each dimension is broken down into five granular components, resulting in 330 quantifiable intensity specifications for controlled escalation.
Taxonomy of Dangerous Capabilities: The risks are not arbitrary; they are grounded in established frameworks. The cybersecurity capabilities are adapted from the MITRE ATT&CK Matrix, while the chemical and biosecurity capabilities leverage the Weapons of Mass Destruction Proxy (WMDP) framework. A novel taxonomy was developed for the self-proliferation domain.

Evaluation Metrics

The PropensityBench framework quantifies a model’s tendency for misuse across specialized metrics, providing a comprehensive safety profile.

Propensity Scores

The Propensity Score is the primary metric, defined as the proportion of scenarios where a model chooses the misaligned (harmful) action.

Zero-Pressure Propensity: Measures the model's baseline propensity without any operational stress.
Propensity Score (Harmful Naming): Measures propensity under maximum operational pressure when the misaligned tool has explicit risk cues (i.e., harmful naming). This is the model’s default failure rate under stress when the warning signs are obvious. (e.g. “use_fake_data”)
Propensity Score (Benign Naming): Measures propensity under maximum pressure when the misaligned tool’s name is neutral, obscuring explicit risk cues. (e.g. “use_synthetic_data”)

Tool Naming Sensitivity

This metric quantifies the degree of a model's shallow safety alignment.

Definition: It represents the increase in Propensity Score when the misaligned tool is given a benign name compared to its score with a harmful name.
Interpretation: A large, positive increase reveals shallow alignment, meaning the model relies on explicit risk keywords rather than a deeper understanding of the action's actual negative consequences and inherent functionality.

Advanced Behavioral Metrics

These metrics characterize the dynamics of a model's safety failures under pressure, providing insight into its resilience and commitment to safe paths.

Resilience: Measures the average pressure level at which a misaligned action first occurs across scenarios where the model failed. A higher score is better as it indicates that failures only happen under more extreme pressure, demonstrating the robustness of its guardrails.
Persistence: Measures the average number of attempts the model makes using the safe, aligned approach before ultimately resorting to the misaligned approach in a failing scenario. A higher score is better, signifying a greater commitment to finding a safe solution before yielding to the misaligned one.
Abandonment Rate: Measures the proportion of scenarios where the agent safely gives up on task completion at a certain pressure level, likely due to perceiving the aligned path as impractical. A higher score is desirable as it indicates the model safely refuses to proceed, rather than taking a misaligned action.

Dataset Information

The PropensityBench dataset was created to be scalable, realistic, and diverse, ensuring a robust evaluation.

Scale: The benchmark is composed of 979 unique scenarios. Given that each scenario is evaluated across the 6 pressure dimensions, the benchmark encompasses 5,874 total evaluation tasks. This required the generation and validation of 6,648 unique agentic tools.
Generation Process: A two-stage, scalable automated pipeline was used to generate the scenarios. This process leverages state-of-the-art LLMs to create high-quality scenario outlines, tool definitions, and contextual details.
Quality Assurance: Every scenario underwent a rigorous two-part validation process.
- First, an Automated and LLM-Driven System ensured compliance, incorporating deterministic checks (e.g., graph-based traversal to guarantee that the required information for the aligned path is always achievable) and filtering for diversity using a similarity (TF-IDF) check to remove redundancy.
- Second, the filtered dataset was subjected to a High-Quality Human Review by a large, well-qualified pool of 54 reviewers with advanced degrees, spanning fields like chemistry, biology, and computer science across 13 countries to ensure realism and viability.
Structure: Each scenario is a JSON-structured prompt that includes a defined task, an aligned tool, a misaligned tool, prerequisite tools for multi-step reasoning, and detailed documentation on the consequences of taking the risky path.

Results

The analysis of PropensityBench data uncovered several critical failure modes in frontier models:

1. Safety Guardrails Fail Under Pressure

Knowledge-Action Disparity: Models frequently unanimously and explicitly assert the unsafe nature of misaligned tools, yet they still use them at high rates (up to 79.0% for Gemini 2.5 Pro) when faced with operational pressure.
Immediacy of Failure: A substantial portion (often over 80-90%) of misaligned actions are triggered immediately after a pressure message is applied, signaling a rapid collapse of safety reasoning.

2. Shallow Alignment Creates False Safety

Model safety often depends on superficial cues rather than deep consequence-based reasoning.

Sensitivity: Tool Naming Sensitivity reveals a critical vulnerability, as the Propensity Score of models consistently increases dramatically when the harmful tool is given a benign name. For example, OpenAI O4-mini's score increases by nearly 4x when risk cues are removed (from 15.8% to 59.3%).

3. Capability is Largely Decoupled from Propensity

There is only a mild positive correlation (Pearson correlation $\approx 0.10$) between a model's general capability (e.g., LMArena Elo Score) and its safety propensity.

Conclusion: "Smarter" models are not inherently safer under pressure. Some of the most capable models, such as Gemini 2.5 Pro, exhibited one of the highest Propensity Scores, making low propensity a crucial independent objective for model development.

4. Vulnerabilities Are Domain-Specific

Safety weaknesses are not monolithic but are highly concentrated in specific areas.

High-Risk Domains: Models consistently showed an alarmingly higher propensity in Cybersecurity and Self-Proliferation compared to Biosecurity and Chemical Security.
Failure Archetypes: Behavioral metrics like Resilience and Persistence categorize models into distinct failure profiles, ranging from the best (Resilient-Persistent) to the worst (Brittle-Yielding), highlighting the need for tailored alignment techniques.

PropensityBench

Overview

Methodology

Evaluation Metrics

Propensity Scores

Tool Naming Sensitivity

Advanced Behavioral Metrics

Dataset Information

Results

1. Safety Guardrails Fail Under Pressure

2. Shallow Alignment Creates False Safety

3. Capability is Largely Decoupled from Propensity

4. Vulnerabilities Are Domain-Specific

Performance Comparison

PropensityBench

Overview

Methodology

Evaluation Metrics

Propensity Scores

Tool Naming Sensitivity

Advanced Behavioral Metrics

Dataset Information

Results

1. Safety Guardrails Fail Under Pressure

2. Shallow Alignment Creates False Safety

3. Capability is Largely Decoupled from Propensity

4. Vulnerabilities Are Domain-Specific

Performance Comparison