Why the U.S. Needs an Independent AI Evaluation Framework for National Security

Last month, Anthropic released Claude Mythos Preview, a model the company claims can write functional hacking tools capable of breaking into well-protected systems. OpenAI quickly followed with GPT-5.5-Cyber. In both cases, early access was controlled entirely by the model developer. A small group of privileged organizations got time to prepare. That may be responsible corporate behavior, but it is no substitute for the government knowing — independently and in time to act — whether a new model creates risks for cyber defense, biosecurity, critical infrastructure, or military readiness.

However well intentioned they may be, private companies should not get to decide who is prepared.

These models, and even more powerful ones to follow from U.S. model makers and Chinese ones like DeepSeek and Alibaba, pose new and serious risks to national security. The pace of innovation is exposing a critical weakness: Washington does not yet have a standing, independent capability to assess what these models can do, how dangerous those capabilities are, and how quickly the government and industry need to respond. At present, the U.S. government depends too heavily on frontier model builders to assess and disclose the risks in their own systems.

The United States urgently needs a new AI Evaluation Framework. Not to stop innovation, but to gain independent insight into the risks that matter most, before those risks arrive. The challenge is building the framework to assess and mitigate risks at the pace of innovation.

We've Seen This Work

Since 2023, Scale has worked with the world's leading AI model builders, enterprises, and a growing network of government stakeholders to build and run the evaluations that matter most. In February 2025, Scale partnered with the U.S. Center for AI Standards and Innovation (CAISI), jointly developing new evaluation methods for frontier models. Scale Labs builds the benchmarks and conducts the technical testing. The government provides subject matter expertise and interprets the results for policy and standards. We’ve run similar partnerships with Korea, the U.K., and other governments.

This work has taught us that general capability benchmarks tell you if a model is powerful, but they do not tell you whether it can help an adversary synthesize a pathogen or compromise a power grid. That requires purpose-built evaluations designed around specific national security risk domains, areas where the government can define the points of failure. But expertise alone isn’t enough. An effective framework needs the government at both ends, setting the terms and acting on the findings.

The current system has three structural problems:

The government may never get the information it needs to make informed decisions.
What gets tested, how it is tested, and what gets disclosed are controlled by the model builder.
Lab conditions are not real-world conditions. Adversarial users find weaknesses builders never anticipated, and novel use cases emerge that no benchmark was designed to capture.

A Better Framework For Our Time

The framework to fix these gaps already exists in practice and consists of three main actors:

Third-Party Evaluators

Third-party evaluators are the backbone. Their job is to build the benchmarks, design the evaluations, conduct the technical testing, and deliver findings to the government.

But evaluators cannot wait for a crisis to begin building the tests. The government should prioritize the highest risk domains and commission evaluations before those risks become public emergencies. This framework only works if the government starts defining what it needs to know.

Government Agencies

Government agencies offer expertise and interpretive authority that no outside evaluator can replicate. The right experts for any given evaluation are determined by the domain being assessed. A cybersecurity benchmark requires agencies that understand offensive cyber operations. A biosecurity benchmark requires agencies with expertise in dangerous pathogens. After the government interprets the results, it can disseminate that intelligence to other agencies, critical infrastructure operators, and industry, so that the people responsible for defending against these capabilities can prepare before a model is widely deployed.

Model Builders

Model builders provide access to their models under structured agreements. Without that access, evaluation is reactive, leaving governments to play catch-up on new capabilities after they are already deployed. When model builders provide early access, they receive findings they can act on, while the government gets the visibility it needs.

Model builders have already demonstrated a willingness to engage with government and industry ahead of major releases — Anthropic's decision to limit access to Mythos Preview is evidence of that instinct. The problem is that these efforts are ad hoc and inconsistent, not the systematic approach that should be required.

The Path Forward

Standing this framework up can happen immediately and requires no new legislation or agencies. The government defines the risk domains it wants evaluated. Third-party evaluators are tasked to build and run the evaluations against those domains. In parallel, model builders formalize the pre-deployment access agreements that make evaluation possible, several of which already exist with CAISI.

Progress isn’t gated by bureaucracy, and risks can be impartially evaluated.

Foreign Models

The framework will work not only for domestic releases but also when a Chinese lab releases a new frontier model. The same challenge exists there, but Chinese labs will not voluntarily provide the United States access. Even so, the United States should be able to evaluate a foreign model's capabilities within days, and right now it cannot. By the time anyone outside a foreign lab understands what its model can do, the model has already moved through global supply chains and into the hands of enterprises, governments, and adversaries around the world. A standing evaluation framework would make that rapid assessment possible.

The Moment for Action

The evaluation gap that exists now will only widen as capabilities grow. The time to build this infrastructure is now, while the framework can be designed deliberately, the benchmarks can be built proactively, and the agreements between government, evaluators, and model builders can be formalized before the moment of crisis rather than during it.

AI Is Moving at the Speed of Innovation. Our Ability to Evaluate It Must Too.
