Fortress

Introduction

The rapid advancement of large language models (LLMs) introduces powerful dual-use capabilities that could both threaten and bolster national security and public safety (NSPS). Developers often implement model safeguards to help protect against misuse of models that could lead to possible risks. However, these mitigation measures also sometimes inadvertently prevent models from providing useful information. We need to understand the extent to which model safeguards both prevent harmful responses and enable helpful ones to assess the relevant trade-offs for research, development, and policy. Existing benchmarks often do not adequately test model robustness to national security and public safety (NSPS) related risks in a scalable, objective manner that accounts for the dual-use nature of NSPS information.

To address this, Scale AI introduces FORTRESS (Frontier Risk Evaluation for National Security and Public Safety), a benchmark featuring over 1,010 expert-crafted adversarial prompts designed to evaluate the safeguards of frontier LLMs (500 in the public dataset). FORTRESS assesses model responses across three domains: Chemical, Biological, Radiological, Nuclear and Explosive (CBRNE); Political Violence & Terrorism; and Criminal & Financial Illicit Activities.

Read the paper here: https://scale.com/research/fortress

Background

The key innovation of FORTRESS lies in its focused evaluation of large language model (LLM) safeguards against dual-use risks related to national security and public safety (NSPS). While many benchmarks assess general harms, they often lack depth in specific NSPS-related areas. Those that do focus on specific categories, such as WMDP and VCT, focus on measuring a model's capabilities with dual-use knowledge instead of the robustness of its safeguards against adversarial misuse. Further, safety evaluations rarely balance robustness with utility—strengthening safeguards can lead to "over-refusals," where models incorrectly refuse benign requests. While some benchmarks test for over-refusal, they are often separate from jailbreak evaluations.

FORTRESS bridges this gap by evaluating a model's willingness to comply with malicious requests and improves upon existing evaluations by integrating these concepts:

It provides a more realistic and challenging assessment by using expert-crafted adversarial prompts grounded in U.S. and international law.
It directly measures the safety-utility trade-off by pairing every adversarial prompt with a corresponding benign prompt to test for over-refusals.
It balances evaluation discernment and scalability by using instance-specific rubrics and a panel of LLM judges. This provides a more reliable and granular assessment than high-level rubrics or brittle keyword-based checks.

Figure 2: Data curation methodology pipeline.

This leaderboard uses the public and private FORTRESS dataset, which can be found on HuggingFace, as well as a private set of prompts to prevent data contamination.

The Ranking

The FORTRESS leaderboard charts model performance based on two critical scores: the Average Risk Score (ARS), which measures a model's propensity to generate potentially harmful content, and the Over-Refusal Score (ORS), which measures its tendency to refuse benign requests. An ideal model would exhibit a low ARS and a low ORS. Models are ranked by ARS with the ORS in parentheses.

High ORS, Low ARS: Models in this group are generally safer but overly cautious, refusing harmless prompts too often.
Low ORS, High ARS: These models are more useful for general queries but have less robust safeguards against misuse.
Low ORS, Low ARS: This is the desired result; models demonstrate a low propensity to supply harmful information but are not overly cautious.
High ORS, High ARS: Models are prone to generating risky content while also refusing benign queries.

What is FORTRESS?

FORTRESS provides a robust and scalable framework for evaluating LLM safeguards in the context of national security.

NSPS-Focused Taxonomy: The benchmark's three domains (CBRNE, Political Violence & Terrorism, and Criminal & Financial Illicit Activities) are grounded in U.S. and international laws and conventions, such as 18 U.S.C. § 229 (Chemical Weapons Convention) and the UN Convention against Transnational Organized Crime (UNTOC).
Expert-Crafted Prompts: The benchmark uses 500 single-turn adversarial prompts created by experienced human red teamers to realistically test model defenses. These prompts were developed against a balanced distribution of leading models to mitigate bias.
Paired Benign Prompts: Each adversarial prompt is paired with a benign counterpart on a similar topic (e.g., asking for a historical account of a crime versus a plan to replicate it). This measures over-refusal and ensures that safety isn't achieved simply by refusing all related queries.
Instance-Specific Rubrics: Every adversarial prompt has a unique evaluation rubric consisting of 4-7 objective, binary (Yes/No) questions. These expert-written rubrics allow for a nuanced and automatable assessment of whether a model's response provides actionable information for malicious activities.

How is FORTRESS Scored?

Average Risk Score (ARS): A panel of three powerful LLM judges (o3, Claude 3.7 Sonnet, and Gemini 2.5 Pro) evaluates each model-generated response against the prompt's specific rubric. The harm score for a response is the percentage of rubric questions that receive a "Yes" from the majority of judges. The ARS is the average of these scores across all adversarial prompts. A lower ARS is better.
Over-Refusal Score (ORS): This score is the percentage of the 500 benign prompts that a model refuses to answer, as determined by a separate refusal-checker model (GPT-4o-mini). A lower ORS indicates better model utility.

Rankings

Model	Rank	ARS	ORS
Claude 3.5 Sonnet	1	12.96	21.07
o3	1	16.01	7.42
Claude 4 Sonnet (thinking)	2	18.06	5.24
o1	2	19.38	5.05
Llama 3.1 405B	2	20.61	6.03
o4-mini	2	21.49	5.14
Claude Sonnet 4	4	24.37	5.04
Claude 4 Opus (thinking)	4	24.76	1.89
Claude Opus 4	6	27.61	2.52
o3-mini	7	30.05	5.65
Claude 3.5 Haiku	7	30.41	13.16
Claude 3.7 Sonnet	12	38.01	4.45
Llama 4 Maverick	12	40.09	4.95
Llama 3.1 70B	12	44.18	1.09
Llama 3.3 70B	13	44.79	2.18
GPT-4o	14	47.18	1.68
GPT-4o mini	14	48.07	2.27
Gemini 1.5 Flash	14	50.61	4.45
GPT-4.1	16	53.02	2.87
Gemini 1.5 Pro	17	53.86	3.17
Gemini 2.5 Pro (03-25)	17	54.89	1.38
Qwen 2.5 72B	18	56.43	1.19
Mixtral 8x22B	18	56.06	2.77
GPT-4.1 mini	19	59.19	1.38
Gemini 2.5 Pro (06-05)	21	61.68	1.58
DeepSeek R1	26	74.39	0.49

Figure 1: FORTRESS: A novel benchmark to evaluate LLM safeguards’ robustness to potential misuse relevant to national security and public safety. Scores are based off the public dataset only.

Last updated: September 19, 2025

Performance Comparison

8.24±1.93

12.80±2.36

12.96±2.34

14.79±2.49

15.20±2.30

16.01±2.68

16.15±2.58

16.96±2.70

17.00±2.42

17.04±2.48

17.62±2.79

18.06±2.71

19.38±2.85

20.61±2.73

21.49±3.06

24.37±3.07

24.76±3.11

27.61±3.22

30.05±3.30

30.41±3.20

38.01±3.40

40.09±3.36

44.18±3.25

44.79±3.32

47.18±3.36

48.07±3.31

50.61±3.28

53.02±3.51

53.86±3.22

54.89±3.55

55.47±3.55

56.06±2.88

56.43±3.35

59.19±3.30

59.58±3.40

60.55±3.36

61.39±3.36

61.68±3.42

63.18±0.00

74.39±2.72