Research

The AI Risk Matrix: Evolving AI Safety and Security for Today

Ask ten AI experts how we know if our AI is safe and secure, and you’re bound to get twenty different answers. In spite of long-standing ambiguity, researchers and security experts are taking real-world action. Scale’s red-teamers, for example, recently presented a new roadmap for moving our safety testing beyond the lab and into live, deployed systems. This critical step forward for how we find flaws raises an important question: what, fundamentally, are we now looking for? Scale AI Security Manager David Campbell urges us to now look beyond the content of a response and analyze the agency of the model itself. In other words, it’s time to redraw the map of risk for a world of increasingly autonomous agents; with more agency comes more risk. 

The AI Risk Matrix

Introduced in May 2024, David’s AI Risk Matrix helped make AI safety assessment more practical and actionable. His original framework mapped User Intent (from Benevolent to Hostile) against System Responsibility (from Minor to Catastrophic). Even seemingly innocent requests (like grandma’s secret recipe) could be understood as having hostile intent when the recipe happens to be for napalm. Though a grammar correction carries minor risk, access to financial systems or code is naturally far riskier.

Today, models don’t simply respond to prompts, they can act. We are entering a space where emergent deception and strategic adaptation are plausible. These aren’t just outputs anymore, but strategies. In this era, from a security perspective, low risk no longer exists. 

This insight led to adding a third dimension–Model Agency–to create the AI Risk Matrix 2.0. This new axis evaluates the AI's own capabilities, distinguishing between:

  • Tools: Reactive and bounded systems.

  • Agents: Adaptive, goal-seeking systems that can learn.

  • Collectives: Complex systems of multiple agents with emergent behaviors.

Crucially, a model's tier can change based on its deployment: a base model may be a Tool, but give it access to web browsers and code interpreters, and it becomes an Agent. With that in mind, we can break this spectrum of agency down into three distinct tiers of risk:

Tier 1: Tools 

Bounded Systems with Agency

AI with tooling operates within predefined boundaries to accomplish tasks on your behalf. These tools often interact with sensitive internal systems: managing calendars, processing forms, even accessing personally identifiable information (PII). While they're not fully autonomous, they still exhibit constrained forms of agency. The risks show up when these tools are manipulated by adversaries through prompt injections, supply chain poisoning, or crafted inputs that cause them to leak PII or approve purchases without consent. Just because an AI tool is bounded, doesn't mean it’s safe. The scope of what it can do is often wide enough to cause real harm when trust is misplaced.

For instance, consider an internal AI banking tool responsible for reviewing transactions, flagging fraud, or drafting customer messages. It’s plugged into secure systems and used only by authorized employees. In theory, it’s a tightly scoped tool. But what happens when a rogue employee or an AI-assisted attacker sneaks a malicious input through a support ticket or form field? That agent could end up summarizing sensitive account data, triggering a funds transfer, or revealing internal processes. The boundaries weren’t enforced by hard constraints, they were based on assumed intent. Tools reflect the vulnerabilities of the systems they touch and are subject to the intent of whoever’s directing them.

Tier 2: Agents 

Adaptive, Goal-Seeking Systems

Agents are systems that don’t just execute predefined tasks, but instead actively pursue goals. These agents still operate within a more loosely bounded domain. Agents make decisions, adjust strategies, and interact with other systems to figure out how to achieve an outcome. Instead of being handed a task list, they’re given a mission and left to decide how best to get there. This autonomy unlocks powerful capabilities, but it also opens the door to complex failure modes when goals are pursued too rigidly, signals are misinterpreted, or other agents are involved in the loop.

Consider the risks of a health insurance system using autonomous agents to optimize costs. Imagine a three-agent system: one monitors biometrics, another analyzes behavior, and a third adjusts premiums. If one agent misinterprets an anomaly like a heart rate spike or a mistagged photo, it might hike a patient’s premium or deny coverage, even without ill intent. Worse, attackers could exploit this by spoofing signals to trigger chain reactions between the agents. The true danger isn't any single model, but the emergent complexity of multiple systems optimizing against subtly misaligned goals.

Tier 3: Collectives

Emergent Systems of Interacting Agents

Collectives are systems of systems where multiple agents work together or against each other with minimal or absent human oversight. These agents can coordinate, compete, or self-organize in ways that create entirely new forms of behavior. What makes this tier especially risky is both the autonomy of the parts and the emergent properties of the whole. Intent becomes distributed; power shifts from users to the system itself. And once collectives are operating across domains, they start to resemble something very close to AGI, if not in capability, then in unpredictability and scale.

Picture a global supply chain managed by a network of AI agents: one forecasts demand, another optimizes shipping routes, and a third handles inventory. Individually, each system is bounded and goal-driven. But when interconnected, they begin adapting to each other. If the shipping agent reroutes cargo to save fuel, the inventory agent might overcompensate by hoarding stock, triggering emergency purchases and a price spike.

Soon, the entire collective is caught in a feedback loop. The collapse doesn't come from a single agent breaking protocol, but from the way they all collaborated, each optimizing locally while destabilizing the system globally. And because each part appears to be working as intended, the failure hides in plain sight until it's too late.

Redrawing the Risk Map 

This three-dimensional view is essential because the greatest threats are not in the model alone, but in the system around it. We must confront the “Unlocked Back Door” Problem: an AI’s sophisticated safeguards are irrelevant if its surrounding infrastructure is vulnerable to a traditional cybersecurity exploit. A simple authentication flaw in an API, a classic AppSec failure, can become the gateway for a catastrophic attack that the model’s behavioral safeguards would have otherwise stopped. Foundational and behavioral risks are not separate; they amplify each other.

To understand the challenge ahead, we can look to recent engineering history. The shift to Continuous Integration and Continuous Deployment (CI/CD) revolutionized software but opened up a whole new world of pipeline security headaches. We face the same challenge today, with risks emerging at every stage, from training data to final deployment. This means we must be smart about our solutions. Simply using one AI to police another is like using a funhouse mirror to check your appearance; it only amplifies and distorts existing issues.

This is why a principled, structured framework is so vital. A tool like the new AI Risk Matrix is not an academic exercise; it is a necessary response to a world where our creations are evolving from passive tools into adaptive agents. It forces us to think critically about the entire system, rather than reactively patching individual outputs. The work of ensuring AI is safe will never be truly finished, but our frameworks must keep pace with the technology we are unleashing.

 

David Campbell frequently speaks on the evolution of AI safety. You can watch his recent presentation on the AI Risk Matrix here, and find information on speaking engagements here.


The future of your industry starts here