
Welcome to Human in the Loop.
We share our learnings from working with the largest enterprises, foundation model builders, and governments to equip you with the insights you need to build practical, powerful AI systems in your enterprise.
About the Episode
In this episode, Scale’s Angela Kheir (Product Acceleration Lead for Safety and Red Teaming), Steve Kelling (Director of Red Teaming and Safety), and Ben Scharfstein (Head of Product, Enterprise Solutions) break down a critical component of AI governance: red teaming. They explore how traditional safety approaches fall short in enterprise contexts, why agentic systems raise the stakes, and what it takes to build a red teaming program that scales with your AI maturity.
Key recommendations from the episode:
- Define your own safety bar: Foundation models weren’t built with your industry, data, or risks in mind. Enterprises must define custom safety policies and test against specific failure modes.
- Avoid “safe but useless” systems: Overly cautious guardrails can cripple usability. Instead, focus on calibrated safety, ensuring models are both helpful and aligned with enterprise risk tolerance.
- Watch for silent regressions: Model upgrades often boost performance while quietly loosening safety constraints. Make continuous red teaming part of your release cycle to catch new vulnerabilities early.
- Agentic AI = higher stakes: When models can take actions like interfacing with systems, APIs, or data, the risk shifts from informational to operational. These use cases demand rigorous, ongoing oversight.
- Build a red teaming flywheel: Red teaming isn’t one-and-done. Use it pre-launch to gate new features, implement targeted guardrails, and re-test in production to adapt to new threats and behaviors.
Watch the full episode or read the transcript of their conversation below.
Episode 7 - Enterprise Red Teaming
What is Red Teaming?
Ben: I've brought Steve and Angela here today to talk about red teaming in the enterprise. Steve, could you give us a quick intro for those who aren't familiar? What is red teaming?
Steve: Red Teaming is our advanced diagnostic product for AI models. It is essentially a process where we try to get the model to break its own policy. In many cases, these are the safety policies that all the big labs have in place to control the content they serve to their users. We try to generate that prohibited content to help them make decisions about how to better control the behaviors of their models.
Examples and Importance of Red Teaming
Ben: Could you talk about some examples of where red teaming could have been helpful for some of those big model labs in the past? There have been tons of incidents. Maybe you could give some examples for consumer products where red teaming would have been helpful.
Steve: There are a ton of examples, and this is one of the reasons we are engaged to help our customers navigate their policy decisions and their corrective data to train the right behaviors into their models. Things like giving away cars for free on a dealership website or diversity topics in the generation of images of people are some of the ones that have had a lot of visibility.
Ben: I remember Google had a lot of press about some of their early Gemini models and image generation models.
Enterprise vs. Consumer Red Teaming
Ben: What makes enterprise red teaming different than consumer red teaming for a general product like ChatGPT or Gemini?
Steve: That's a great question. If you look at foundation models like Gemini and GPT, as you mentioned, they're trained to be the average of all things. They have to be capable of handling anything that a huge user base is going to throw at them, so they're essentially pre-trained on what is effectively an extremely large sample of the internet. Enterprise problems and capabilities are a bit different. Enterprises have a specific thing that they're trying to accomplish; it's why they exist. As such, the types of harms they're going to be worried about are also far more specific.
For example, if you're operating in financial services and you want to put a generative model on your website, you are probably going to be worried about that model incorrectly dispensing regulated financial advice. You might not be as worried about hate speech and diversity topics, but that doesn't mean you don't need to worry about those things at all. That's one of the things we help with. Of course, we'll do really deep, concentrated red teaming attacks on the topics our customers are most worried about, but we also help them find the things that they aren't worried about but should be.
Ben: What do you mean by that? When you say you help them find the things they're not worried about?
Steve: Again, if you're coming from a financial services perspective, you're probably very worried about the model providing regulated advice. You'll want to make sure that it's appropriately wrapping things in disclaimer language and not giving advice that is very personalized to the user. You might not have hate speech as a top-of-mind concern, but it's absolutely something you want to make sure your model isn't doing. These are the kinds of things that get a lot of unwanted media attention and ultimately can create real reputational risk.
Balancing Helpfulness and Harmlessness
Ben: One of the things we've seen is that sometimes when you go too far with guardrails and get "over-lawyered," you move from being a useful product to something that can't even answer simple questions.
Angela: What we've seen recently from many enterprise clients is this tension. When deploying an AI model, you want it to be helpful for the user and what they're intending to do. At the same time, you want it to be harmless and not break content policies, as Steve was saying. We are seeing a lot of models and applications veer towards being overly lawyered, overly strict, and overly restrictive. We see that through many facets, for example, the over-refusal of many questions that should be safe and should be answered.
One thing we really try to gear product teams and enterprises towards is understanding their use cases and users. We then simulate those behaviors so that we can help test for all of the harmful scenarios. At the same time, we also help calibrate between the helpful and harmless scenarios.
Guardrails and Red Teaming
Ben: What's the relationship between guardrails and red teaming? It seems like they are two sides of the same coin. Steve, maybe you can talk about that.
Steve: The important thing to think about is that public acceptance of outputs from base models is evolving quite rapidly. In practice, we're seeing all the big labs release significantly more permissive models over time. Models today are less guarded than they were a year ago. The significance of that, when you think about your guardrails as an enterprise, is that you might be relying on these models downstream in the applications you're putting on your own website or using with your employees. Your policies may not have changed, but public acceptance of the general technology has. It's really important that you have a sense of where the models are best suited to be adapted to your enterprise. Then, you need to determine how to implement controls around that generative capability that still respects the experience you're trying to create and adheres to your policies. This includes regulatory requirements, as in financial services, where the content itself is regulated. It's important that you understand the capabilities of the model providing the generation so you can wrap the right controls or guardrails around it in your application.
Ben: There's also an interesting aspect here. Part of red teaming may be about brand voice and not saying something that is offensive. Another part can be about not saying things that might be malicious or, to your point about working in a regulated industry, giving prohibited financial advice.
Attack Vectors and User Personas
Ben: Angela, can you talk through both the personas of the types of people trying to attack models and applications, as well as where the attack vectors are?
Angela: First, it's important to understand that not every user is malicious, but not every user is safe. We evaluate a spectrum of users across intent and skill. We simulate these types of users in our testing because if a benign user gets a harmful output through regular prompting—single or multi-turn—without intending to, that's a very big deal. That's a big risk for enterprises and can lead to real-world harm.
Then you have two types of malicious users, which you can think of as potentially less than 10% of all users. First, you have the unskilled, low-resourced users who want to poke at an enterprise application to maybe create a PR nightmare for a company they don't like. The second category, which is the most harmful, is the highly skilled and highly resourced users. This is what our expert red teamers at Scale simulate. They find out what a pro red teamer can get the model to do.
In our red teaming methodology, we use what we call a threat matrix, where we combine those attack vectors with the types of harms. You can think of attack vectors as the tactics used to manipulate the model. Some are technical, like injecting code or manipulating the system prompts of the models. Others are non-technical and rely on prompting—whether it's text prompting or including images and files. You'd be surprised at the types of harms you can get through prompting by using techniques like obfuscation or fictionalization. There can be up to 50 types of tactics that can be employed.
When we think of the types of harms, we separate them into content harm and enablement harm. Content harm is getting an output from the model that is veering towards misinformation, hate speech, or other content that is harmful to the company's policies. Enablement harm is more extreme. This is where the harmful content from the model can enable a user to cause real-world harm.
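To make that threat matrix concrete, here is a minimal sketch of how the cross of attack vectors, harm types, and simulated personas might be represented in code. The enum members and the `ThreatCase` structure are illustrative assumptions rather than Scale's actual schema; the point is simply that every test case pairs a tactic with a harm category and a simulated user.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class AttackVector(Enum):
    """Tactics used to manipulate the model (illustrative subset)."""
    DIRECT_PROMPTING = "direct_prompting"
    OBFUSCATION = "obfuscation"
    FICTIONALIZATION = "fictionalization"
    PROMPT_INJECTION = "prompt_injection"
    SYSTEM_PROMPT_MANIPULATION = "system_prompt_manipulation"


class HarmType(Enum):
    """Harm categories discussed in the episode."""
    CONTENT_HARM = "content_harm"        # e.g. misinformation, hate speech
    ENABLEMENT_HARM = "enablement_harm"  # output enables real-world harm


class Persona(Enum):
    """Simulated users across intent and skill."""
    BENIGN = "benign"
    UNSKILLED_MALICIOUS = "unskilled_malicious"
    EXPERT_RED_TEAMER = "expert_red_teamer"


@dataclass
class ThreatCase:
    vector: AttackVector
    harm: HarmType
    persona: Persona


def build_threat_matrix() -> list[ThreatCase]:
    """Cross every attack vector with every harm type and persona."""
    return [
        ThreatCase(v, h, p)
        for v, h, p in product(AttackVector, HarmType, Persona)
    ]


if __name__ == "__main__":
    matrix = build_threat_matrix()
    print(f"{len(matrix)} threat cases to cover in testing")
```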
Red Teaming in the Age of AI Agents
Ben: On a previous episode of Human in the Loop, I made this analogy: Web 1.0 was about reading, and that's mostly what LLM chatbots have been. Now, we're moving towards the Web 2.0 equivalent in GenAI, where you have not just the ability to read or have a conversation, but also to take action in the real world. In that paradigm shift from chatbots to agents, red teaming is going to become vastly more important because the harm isn't just brand risk, exposing a system prompt, or something relatively benign. Now, these agents are taking action. Booking a flight is the simplest example, but in an enterprise, they're reading customer data. They may be able to change the permissions on your account. If you don't really understand what's possible and what your attack vectors are, and you're not testing for it, a lot more harm can be done as GenAI moves from chatbots into agents that have agency and are doing things.
Building Effective Red Teaming Strategies
Ben: What can enterprises do? What is the best way to protect yourself against all of these threat vectors, especially as we move into a more agentic world where LLMs are also taking action, tool calling, and having access to more and more data and resources?
Steve: The most important thing is to build assessment capabilities yourselves, or partner with somebody like us who can implement very quick turnarounds of agentic testing for any of the modalities we've talked about. This is something we need to see with all of our customers: the implementation of strong, responsible AI programs that consider not only where the technology is today, but also the capabilities that you're building into your applications that will be deployed tomorrow. A lot of the enterprise risks you're talking about, Ben, are already here, and the rate at which they're evolving is speeding up. That's the first thing. The second thing is making sure that you've got strong governance around releases. Demonstrating that you've put that sort of scrutiny on your own products is a critical step for all of our partners.
Ben: We talked earlier about evals, red teaming, and guardrails. One way I like to think about it is that evals raise the bar for what your agent or workflow can do. You're trying to increase the accuracy of what's going on. Red teaming is about raising the floor, making sure bad things don't happen. It's not just about doing a great job in the success case, but also preventing the failure case—whether it's accidental, as you mentioned, Angela, or malicious. Malicious attacks can come from someone who just wants to make the product look bad, or someone who actually wants to take action and do something harmful.
Regardless of the attack vector, we want to make sure those things aren't possible without harming our accuracy or quality on the top line. In practice, you need to set up guardrails. Those can be on the inputs, where you look at the prompt being injected and ask, "Is this trying to do something malicious? Is this trying to trick me across the attack vectors you talked about, Angela?" You can also look at the outputs and ask, "Was the output of this thing malicious, brand unsafe, et cetera?" Of course, there are various levels of sophistication, with cost and latency concerns. But these are the types of things our customers on the forefront of deploying AI in production are thinking about. This is the level of complexity that we're helping them drive towards, and it's really important to them from the first conversation.
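As a rough sketch of the input-side and output-side checks described above, the snippet below wraps a model call with guardrails on both ends. The `flag_malicious_prompt` and `flag_unsafe_output` hooks are hypothetical placeholders, not a specific product's API; in practice they might be a moderation model, a fine-tuned classifier, or rule-based filters, with the cost and latency trade-offs mentioned above.

```python
from typing import Callable

# Hypothetical classifier hooks; in practice these could be a moderation
# model, a fine-tuned classifier, or rule-based filters.
PromptCheck = Callable[[str], bool]
OutputCheck = Callable[[str], bool]


def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    flag_malicious_prompt: PromptCheck,
    flag_unsafe_output: OutputCheck,
    refusal: str = "I can't help with that request.",
) -> str:
    """Wrap a model call with input and output guardrails."""
    # Input guardrail: is the prompt trying to trick or attack the system?
    if flag_malicious_prompt(prompt):
        return refusal

    response = generate(prompt)

    # Output guardrail: is the response malicious, brand-unsafe, etc.?
    if flag_unsafe_output(response):
        return refusal

    return response


# Toy usage with stand-in checks (keyword rules only for illustration).
if __name__ == "__main__":
    answer = guarded_generate(
        prompt="What are your branch opening hours?",
        generate=lambda p: "Our branches are open 9am to 5pm on weekdays.",
        flag_malicious_prompt=lambda p: "ignore previous instructions" in p.lower(),
        flag_unsafe_output=lambda r: "guaranteed returns" in r.lower(),
    )
    print(answer)
```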
Angela: Absolutely. I'll add that it's very important to establish a feedback loop between everything you've talked about. Steve mentioned governance, and part of that is thinking about all the AI applications you want to put out in the world and creating a rigorous red teaming safety schedule for them. For example, pre-product launch, many of our clients do red teaming with us. Based on those results, they decide on a go or no-go for the product launch. Even after you launch, after understanding the risks—because no product will be 100% safe, but it can be safe enough to launch—you continue doing tests periodically. As we've discussed, the base models are changing, user interactions are changing, so you continuously tune the guardrails and the safety data.
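One way to picture that pre-launch gate is as a go/no-go check over red teaming findings. The fields and severity thresholds below are arbitrary assumptions for illustration, not a recommended policy; the idea is only that the results of a red teaming round feed a concrete release decision.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """A single red teaming result (illustrative fields)."""
    harm_type: str   # e.g. "content_harm" or "enablement_harm"
    severity: int    # 1 (minor) to 5 (critical)
    mitigated: bool  # has a guardrail or fix been applied?


def release_gate(findings: list[Finding]) -> bool:
    """Hypothetical go/no-go rule: block launch on any unmitigated
    critical finding or any unmitigated enablement harm."""
    for f in findings:
        if not f.mitigated and (f.severity >= 4 or f.harm_type == "enablement_harm"):
            return False
    return True


if __name__ == "__main__":
    results = [
        Finding("content_harm", severity=2, mitigated=False),
        Finding("enablement_harm", severity=5, mitigated=True),
    ]
    print("GO" if release_gate(results) else "NO-GO")
```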
Ben: How is red teaming changing in a world of agents? As the scope and complexity increase and the types of things these agents can do expand, how does that change the red teaming world and what you are all thinking about day to day?
Steve: It's certainly increased the complexity of the operating environment. With some of our earliest red teaming, the biggest complexity was setting up meaningful test environments that were not in the real world. We had to help our customers understand if an agent would facilitate a financial transaction in an environment where we weren't actually implementing the harm we were testing for. This adds a need for not only the user persona or expertise in our red teamers, but also for a strong intersection with our engineering team so we can quickly spin up tests in environments that represent the real world.
Ben: That's one of the biggest complexities in general with evaluations, red teaming, or guardrails: we're trying to model uncommon things and edge cases. We don't want to actually take these actions, so you need to think about how to get representative data and environments that model the real world closely enough, including all the complexity of the edge cases. Then, you have to figure out how to integrate this into systems that can run those simulations.
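A common pattern for the kind of simulated environment described here, sketched below under assumed names rather than Scale's actual harness, is to hand the agent mock tools that record what it attempted instead of executing anything real. Red teamers can then probe for harmful actions, such as an unauthorized transfer, without any real-world side effects.

```python
from dataclasses import dataclass, field


@dataclass
class MockTransactionTool:
    """Stand-in for a real payments API: records attempted calls
    instead of moving any money."""
    attempted_calls: list[dict] = field(default_factory=list)

    def transfer_funds(self, from_account: str, to_account: str, amount: float) -> str:
        # Log the attempt for later review by the red team; nothing is executed.
        self.attempted_calls.append(
            {"from": from_account, "to": to_account, "amount": amount}
        )
        return "OK"  # The agent sees a realistic success response.


if __name__ == "__main__":
    tool = MockTransactionTool()
    # An agent under test would call this during a simulated conversation.
    tool.transfer_funds("victim-001", "attacker-999", 10_000.0)
    # After the run, inspect what the agent tried to do.
    print(tool.attempted_calls)
```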
Angela: I want to emphasize this point. There's a lot of excitement and, at the same time, a lot of fear around deploying AI applications. This is exactly the point of red teaming. You'd be surprised at how creative red teamers can get and what scenarios they can simulate. It doesn't mean this will happen in the real world when you put the application out, but it will show you the limits. You need to have really strong product and safety policies and a plan for what to do about each of these scenarios. You also need to develop a familiarity with your users. Some things might be quite benign. For example, if the AI displays content harm but not enablement harm, you can position it in the right way: "Hey, I'm an AI application and I'm still learning." Your users might be able to forgive that. We're trying to simulate the edge cases to avoid the really harmful consequences that can happen.
Ben: That's a really important point. You don't necessarily need to stop everything; that's not the goal. The first goal of red teaming is to understand what is possible. The second goal is to make a product decision about what you want to allow and what attack vectors you're okay with. The way you solve these things might not be to prevent an answer, but as you said, to give context on the answer: "I'm an AI, I'm still learning. You can't necessarily trust these results." Once you have that visibility, you can make a product decision. Going back to what we said at the beginning, a lot of companies historically didn't have the visibility, so they were very scared. They put in far too many content policies, and then we got unhelpful AIs. Now we're going in the other direction, where more things are becoming permissible. There are models with no policy restrictions at all, and now you need to triangulate on what is right for you.
Conclusion and Career Opportunities
Ben: That's all we have for the discussion today on red teaming for Enterprises. If this type of work sounds interesting and you want to join this team of red teamers that help our customers stay safe, we're hiring. Check out the careers page for openings. We're also hiring across a bunch of different roles at Scale. Stay tuned for more discussions on enterprise AI topics. Let us know in the comments what you want to hear about, and subscribe so you don't miss the next episode.