
Scale AI, in collaboration with the Center for Strategic and International Studies (CSIS), is proud to introduce the Critical Foreign Policy Decision (CFPD) Benchmark, a pioneering effort to evaluate large language models (LLMs) on their national security and foreign policy decision-making tendencies. The benchmark systematically assesses how LLMs respond to complex geopolitical prompts, offering insight into geopolitical preferences that policymakers need to understand. The CFPD-Benchmark was created by international relations experts at the CSIS Futures Lab and Scale AI's evaluation team, the leading evaluation partner for state-of-the-art LLMs.
Generative AI tools can ingest and analyze massive multi-modal datasets to accelerate and assist human decision-making and wargaming. To be confident in this technology's outputs, it is critical to test and understand how an LLM performs in the context in which it will be employed.
For national security professionals, the stakes are high: does an AI system exhibit an escalatory bias toward China? A predisposition for cooperation with the United Kingdom? Asymmetric tendencies across different states? These are not academic questions; understanding model behavior is crucial for risk assessment. Policymakers need visibility into AI-driven decision-making to anticipate and mitigate potential risks before integrating AI into critical defense and diplomatic operations.
The Challenge: Assessing AI in a Field Without “Right” Answers
Foreign policy decisions exist in a gray zone where multiple factors (historical, cultural, economic, and military) influence outcomes. Traditional AI benchmarks typically rely on quantitative scoring against established ground-truth answers, so a response to a given prompt is either “right” or “wrong.” Foreign policy scenarios, by contrast, often lack a single “correct” answer, rendering standard model test and evaluation (T&E) approaches insufficient.
To overcome this challenge, Scale AI and CSIS developed the CFPD-Benchmark as a structured method for evaluating LLMs based on their decision-making tendencies rather than on definitive correctness. This approach allows us to measure how models balance competing priorities, identify risk factors, and assess whether they internalize biases that could impact geopolitical strategy.
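To make tendency-based evaluation concrete, the sketch below shows one way such scoring could be wired up. It is a minimal illustration, not the benchmark's actual implementation: the scenario schema, the 1-to-5 escalation coding, and the names (SCENARIOS, query_model, tendency_profile) are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical scenario records. Each response option is pre-coded on an
# escalation scale (1 = de-escalatory ... 5 = highly escalatory), so the
# evaluation measures tendencies rather than right/wrong answers.
SCENARIOS = [
    {
        "actor": "China",
        "prompt": "You advise China's leadership during a maritime standoff. "
                  "Choose one option: A, B, or C.",
        "options": {"A": 1, "B": 3, "C": 5},  # option letter -> escalation code
    },
    # ... more expert-crafted scenarios
]

def query_model(model_name: str, prompt: str) -> str:
    """Placeholder for a real LLM API call; returns one option letter."""
    raise NotImplementedError

def tendency_profile(model_name: str) -> dict:
    """Mean escalation score per actor for one model: a profile, not a grade."""
    scores = defaultdict(list)
    for scenario in SCENARIOS:
        choice = query_model(model_name, scenario["prompt"]).strip()
        if choice in scenario["options"]:
            scores[scenario["actor"]].append(scenario["options"][choice])
    return {actor: mean(values) for actor, values in scores.items()}
```

The key design point is that no option is marked correct or incorrect; aggregating a model's choices across many scenarios yields a behavioral profile per model and per actor, which is exactly the kind of tendency the benchmark is built to surface.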
Methodology: How We Built CFPD-Benchmark
CFPD-Benchmark tests LLMs on realistic, high-stakes geopolitical scenarios to evaluate their decision-making across key areas of statecraft.
The dataset includes 400 expert-crafted scenarios covering conflicts, diplomacy, and security challenges, expanded into 60,000 unique prompts focused on five key nations: the U.S., U.K., China, Russia, and India. These structured tests reveal how AI systems approach foreign policy decisions, helping national security leaders assess risks and refine AI deployment strategies.
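One way to picture how 400 scenarios become 60,000 prompts is a templating pass over the base scenarios. The sketch below is hypothetical: the assumption that variants come from substituting the five focal nations and repeating samples (400 × 5 × 30 = 60,000) is ours for arithmetic convenience, and the actual expansion procedure is described in the full report.

```python
import itertools

# The five focal nations named in the benchmark.
ACTORS = ["United States", "United Kingdom", "China", "Russia", "India"]

# Hypothetical base templates; the real dataset has 400 expert-crafted scenarios.
TEMPLATES = [
    "You advise {actor}'s national security council. A rival state has "
    "jammed {actor}'s satellite communications. Recommend a response.",
    # ... more templates
]

def expand_prompts(templates, actors, samples_per_variant=30):
    """Cross each scenario template with each focal nation, then repeat
    sampling so model choices can be measured as distributions."""
    prompts = []
    for template, actor in itertools.product(templates, actors):
        for sample_id in range(samples_per_variant):
            prompts.append({
                "actor": actor,
                "sample_id": sample_id,
                "prompt": template.format(actor=actor),
            })
    return prompts
```

Collecting repeated samples per scenario-nation pair is what lets evaluators treat a model's answers as a distribution and detect systematic tendencies rather than one-off responses.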
Key Takeaways: What We Learned
Our analysis revealed critical differences in how LLMs approach foreign policy decision-making, including the kinds of escalatory, cooperative, and asymmetric tendencies described above; the full set of takeaways is detailed in the report linked below.
Why This Matters
These findings show why careful, context-aware evaluation is key to deploying AI confidently in national security settings.
Scale AI’s Role in National Security AI Development
Scale AI is committed to helping AI users build domain-specific evaluation frameworks, mitigate AI risks, and fine-tune models to ensure alignment with national security goals. Our work with CFPD-Benchmark is just one example of how we are advancing AI’s role in national security, making AI safer, more transparent, and better suited to assist humans who are navigating complex geopolitical challenges.
For a deeper dive into our findings, check out the full report here, or learn more about our research initiatives at https://scale.com/research.