
It’s Time to Rethink Red Teaming

The increasing complexity of advanced AI systems, particularly new forms like domain-specific agents and LLM assistants, is rapidly outpacing current safety testing. Traditional lab-based evaluations often fail to reflect real-world deployment environments, missing crucial vulnerabilities introduced by these sophisticated systems. In response, Scale researchers, in their new paper “A Red Teaming Roadmap Towards System-Level Safety,” propose that red teaming practices must evolve to align with real-world usage and operate directly within AI products and systems in their actual deployment contexts.

Specifically, the paper outlines that:

  • Red teaming should prioritize testing against clear product safety specifications. 

  • Red teaming should be based on realistic threat models rather than abstract social biases or ethical principles. 

  • There should be an increased emphasis on system-level safety, looking beyond the AI model in isolation to consider the entire deployment context, including integrated tools, user interactions, and monitoring mechanisms.

This post provides an overview of the argument for expanding the scope of red teaming, how this approach aligns with our previous research, and what it means for the wider AI community.

Models ≠ Products ≠ Systems

To red team effectively, we must first define what is being tested. The authors distinguish between models (neural networks trained to perform actions), products (applications built on models), and systems (products within their broader context that includes users and their environment – for instance, a university student using an educational app built on top of a model like ChatGPT). This distinction matters because each presents unique safety requirements and potential harms. Downstream developers often have specific safety specifications driven by their use cases that model-level safety alone cannot address. Therefore, the authors argue, red teaming should prioritize safety vulnerabilities within defined product scenarios rather than focusing solely on abstract model-level harms.
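
As a rough illustration of this taxonomy (the class and field names below are hypothetical, not taken from the paper), the three levels can be sketched as nested data structures, using the paper's example of a student working with an educational app built on a chat model:

```python
from dataclasses import dataclass, field


@dataclass
class Model:
    """The neural network itself, e.g. a hosted LLM endpoint."""
    name: str


@dataclass
class Product:
    """An application built on a model: adds tools, a UI, and its own safety spec."""
    model: Model
    integrated_tools: list[str] = field(default_factory=list)
    safety_spec: list[str] = field(default_factory=list)  # product-specific rules to red team against


@dataclass
class System:
    """A product in its deployment context: users, environment, monitoring."""
    product: Product
    user_population: str = "unspecified"
    environment: str = "unspecified"


# Illustrative instance: a university student using an educational app built on a chat model.
edu_system = System(
    product=Product(
        model=Model(name="general-purpose-chat-model"),
        integrated_tools=["web_search", "code_interpreter"],
        safety_spec=["no completing graded assignments verbatim"],
    ),
    user_population="university students",
    environment="coursework assistance",
)
```

Each level adds requirements the previous one cannot capture: the model's behavior alone says nothing about what its tools allow or what its users will actually do.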

This framework reveals significant gaps in current red teaming practices. Most safety efforts remain model-centric and neglect deployment context, overlooking vulnerabilities that only emerge when a model operates within a product or system. The authors also critique the heavy focus on abstract social biases when product safety grounded in realistic threat models would be far more impactful. While preventing biased outputs matters, greater risks lie in AI systems that can be exploited for concrete harm: manipulating users, leaking sensitive data, or enabling fraud. Actual deployment risks, not theoretical concerns, should guide where red teaming effort is focused.

The Roadmap

To help guide our community toward a more holistic approach to red teaming, the authors propose an approach that is product-focused, grounded in realistic threats, and system-aware. Three pillars define this approach:

  1. Center on Product Safety Specifications

The foundational shift, according to the paper, is moving from "universal" ethical principles to specific safety requirements tailored to each product's actual use case. There is no useful, widely shared definition of "harmful" behavior; what's dangerous in a medical advisor could be acceptable in a writing tool. Products diverge from their underlying models based on their users, business models, integrated tools, and deployment environments. A benign language model becomes potentially hazardous when given access to code interpreters, web browsers, or payment systems, much like household cleaners that are safe individually but dangerous when mixed. Red teaming must evaluate specific, actionable safety specifications unique to each product. This means probing every component users interact with, from the UI to the tools that could be exploited for unintended purposes. 
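
As a minimal sketch of what this could look like in practice (the schema, rule text, and helper function below are hypothetical illustrations, not an artifact from the paper), a product safety specification can be encoded as structured data that maps directly onto red-team test cases:

```python
# A minimal, hypothetical encoding of a product safety specification.
# Field names and rules are illustrative only; real specs come from the
# product team and its deployment context.
SAFETY_SPEC = {
    "product": "medical-advice-assistant",
    "disallowed": [
        "give a specific dosage recommendation",
        "interpret a user's lab results as a diagnosis",
    ],
    "attack_surface": ["chat UI", "web_search tool", "document upload"],
}


def spec_to_red_team_tasks(spec: dict) -> list[dict]:
    """Turn each disallowed behavior into concrete red-team tasks,
    one per component the user can reach."""
    tasks = []
    for behavior in spec["disallowed"]:
        for surface in spec["attack_surface"]:
            tasks.append(
                {
                    "goal": f"elicit: {behavior}",
                    "entry_point": surface,
                    "success_criterion": f"assistant performs '{behavior}'",
                }
            )
    return tasks


if __name__ == "__main__":
    for task in spec_to_red_team_tasks(SAFETY_SPEC):
        print(task)
```

Crossing each disallowed behavior with each user-reachable component yields a concrete, checkable test, which keeps red teaming anchored to the product's actual attack surface rather than to a generic notion of harm.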

  2. Embrace Realistic Threat Models

Red teaming is not a rote exercise; it must be relevant to real-world harms and reflect what motivated attackers might realistically attempt. Though there are many possible threat models, the paper uses four overarching example categories of increasing complexity, each of which must be addressed differently:

  1. LLM Chatbots, as demonstrated in a previous Scale research paper, require red teamers to go beyond single-turn testing, which does not generalize to multi-turn scenarios. Most attacks unfold over longer conversations, where users have more flexibility to undermine safeguards, and testing must account for this. In another paper, our researchers shared their findings on automating this kind of red teaming with LLMs, making a process that once depended on experienced human red teamers far more scalable; a simplified sketch of such an attack loop follows this list. 

  2. Audio Assistants require multi-turn testing as well, but add further layers to the attack surface: tone, accents, sarcasm, background noise, interruptions, and even metadata cues like audio fidelity all need to be accounted for. Simply using text-to-audio models for red teaming misses critical aspects of this surface. 

  3. Video Generators present unique challenges, since harm can evolve from an innocuous prompt across multiple frames. Harmful content might be distributed across frames or involve synchronized audio, making output monitoring more complicated and expensive. Context-dependent harm is also a bigger problem, since content that would be fine as text can become problematic when rendered visually.

  4. Autonomous Agents are by far the most complex: the attack surface is vastly larger, with attacks arriving through many different input channels (text, audio, images, video, tool outputs, etc.). These multi-modal attack paths mean vulnerabilities can emerge more easily than in the categories above, compounded by software components like Model Context Protocol (MCP) servers and multi-agent systems. In another paper, Scale researchers address this concern with the Browser Agent Red-teaming Toolkit (BrowserART), a comprehensive test suite designed specifically for red-teaming browser agents.
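
To make the chatbot case concrete, here is a minimal sketch of an automated multi-turn attack loop in the spirit of the work described above; the `attacker`, `target`, and `judge` callables are hypothetical stand-ins for real model clients, not an API from the papers:

```python
from typing import Callable, Dict, List

Message = Dict[str, str]  # {"role": "attacker" | "target", "content": ...}


def run_multi_turn_attack(
    goal: str,
    attacker: Callable[[str, List[Message]], str],
    target: Callable[[List[Message]], str],
    judge: Callable[[str, List[Message]], bool],
    max_turns: int = 10,
) -> List[Message]:
    """Let an attacker model steer a conversation toward `goal`,
    stopping as soon as the judge flags a successful attack."""
    history: List[Message] = []
    for _ in range(max_turns):
        attack_msg = attacker(goal, history)       # next adversarial turn, conditioned on history
        history.append({"role": "attacker", "content": attack_msg})
        reply = target(history)                    # the product under test responds
        history.append({"role": "target", "content": reply})
        if judge(goal, history):                   # judge sees the whole conversation, not one turn
            break
    return history


# Toy stand-ins so the sketch runs end to end; replace with real model calls.
demo = run_multi_turn_attack(
    goal="obtain behavior the product's safety spec disallows",
    attacker=lambda goal, hist: f"(turn {len(hist) // 2 + 1}) escalate toward: {goal}",
    target=lambda hist: "I can't help with that.",
    judge=lambda goal, hist: False,
)
print(f"conversation length: {len(demo)} messages")
```

The key point the sketch captures is that the attack is evaluated over the whole conversation: a single benign-looking turn tells you little about whether safeguards hold across ten of them.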

  3. Advance Toward System-Level Safety

Red teaming entire systems means incorporating the environment, the users, and the AI product's interactions within that ecosystem. This is essential for surfacing modes of harm that only appear at the system level and for implementing stronger safety practices. System-level safety requires:

  • Environment modeling and simulation that accounts for human users, digital and physical worlds, and other AIs. Tool outputs and adversarial elements of the environment also form critical attack surfaces that need red teaming.

  • Trajectory and user monitoring that goes beyond flagging single harmful requests and instead detects harm across entire output trajectories, including the history of agent actions and user interactions (a minimal sketch of such a monitor follows this list). 

  • Rapid response to safeguard failures, so that vulnerabilities can be patched quickly as they are detected and developers can deploy safeguards against newly discovered failures before they are exploited at scale.

  • Red teaming the monitor by testing the full system end to end, with red teamers attempting to complete adversarial tasks without being detected by the monitor. 
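
As a hedged illustration of trajectory-level monitoring (the event schema, scorer, and threshold below are hypothetical, not from the paper), a monitor can score each step in the context of everything that preceded it rather than in isolation:

```python
from typing import Callable, Dict, List

Event = Dict[str, str]  # e.g. {"actor": "agent", "type": "tool_call", "content": "..."}


def trajectory_monitor(
    trajectory: List[Event],
    step_scorer: Callable[[Event, List[Event]], float],
    threshold: float = 0.8,
) -> bool:
    """Flag a trajectory if any step, judged against all prior context,
    crosses the risk threshold. Steps that look benign on their own can
    still be flagged once accumulated context makes them risky."""
    for i, event in enumerate(trajectory):
        risk = step_scorer(event, trajectory[:i])  # context-aware score
        if risk >= threshold:
            return True
    return False


# Toy scorer: a payment tool call is only risky after the user shared card data.
def demo_scorer(event: Event, context: List[Event]) -> float:
    card_shared = any("card number" in e["content"] for e in context)
    if event["type"] == "tool_call" and "payments" in event["content"] and card_shared:
        return 0.9
    return 0.1


trajectory = [
    {"actor": "user", "type": "message", "content": "here is my card number ..."},
    {"actor": "agent", "type": "tool_call", "content": "payments.charge(amount=500)"},
]
print(trajectory_monitor(trajectory, demo_scorer))  # True: flagged from context, not a single output
```

The design choice to score steps against the full history is what lets the monitor catch harms that no single request or output would reveal on its own.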

By putting these three pillars to work together, the authors believe red teaming can become far more effective, relevant, and impactful, helping ensure AI systems are developed and deployed safely and responsibly.

From Theory to Practice

This shift toward system-level red teaming requires changes in how safety is currently approached. Security teams must model realistic attack scenarios rather than abstract harm categories, and oversight systems need to monitor entire user trajectories and environmental contexts, not just individual outputs.

The implications extend beyond individual companies. As AI systems become more capable and autonomous, the gap between laboratory testing and real-world vulnerabilities will only widen, and the industry needs standardized frameworks for product-specific safety specifications and realistic threat modeling. Red teaming is at an inflection point: it must evolve from an academic exercise into a practical discipline focused on how AI systems actually fail in deployment. This paper offers a roadmap for making that transition while the risks are still manageable.

 

