General

Human in the Loop: Episode 5 | Enterprise Guardrails

byon May 22, 2025

Welcome to Human in the Loop. 

We share our learnings from working with the largest enterprises, foundation model builders, and governments to equip you with the insights you need to build practical, powerful AI systems in your enterprise.

About the Episode

In this episode, Scale's Field CTO, Vijay Karunamurthy, Head of Enterprise ML, Sam Denton and Head of Enterprise Engineering, Felix Su discuss a key pillar of AI governance: guardrails. They talk about what works and doesn’t work when it comes to implementing guardrails in enterprises and go over how to best wield guardrails as a tool in your governance toolbox. 

Key recommendations from the episode:

  1. Integrate guardrails early: Guardrails must be built into AI from the start, with input from legal, AI, and infrastructure teams, not tacked on as a compliance afterthought.

  2. Tailor guardrails to your business: Off-the-shelf solutions fall short. Enterprises need guardrails customized to their domain, data, risks, and business logic.

  3. Use a layered approach: No one-size-fits-all guardrail exists. Mix prompting, fine-tuning, and model customization based on trade-offs in latency, cost, accuracy, and user experience.

Watch the full episode or read the transcript of their conversation below.



Episode 5 - Enterprise Guardrails

Vijay: Great. This month, we're diving into the foundations of AI governance for enterprises. To kick us off, Felix, when do we see enterprises implementing guardrails, and what are the types of guardrails that you see?

Types of Guardrails

Felix: Yeah, when we talk about guardrails, it's a pretty broad term, but I think it's fairly easy to break it down into different segments. One type of guardrail is a behavioral guardrail. You have an agent, and you have some business logic that you want it to follow. For example, let's say you're an AI tutor. Normal models are conditioned to give you the answer, but you don't want to give away the answer if you're a tutor because you want the student to learn. So, how do you put a guardrail on the behavior of the model to encourage it and prevent it from giving away something that doesn't follow the business logic rules? I don't know the proper term, but we can call it behavioral guidance.

There's another type of guardrail, like a safety guardrail. Then there are other guardrails I think are a little bit more nuanced, in similar subcategories, like compliance guardrails. For example, there's legal terminology and contract terms that you absolutely have to get right. If you don't copy and paste this from a contract and repeat it verbatim, and you have any AI synthesis over the top, then it's against the conditions and rules of the business. These are just a few examples that come to mind. So, these are the types of things that we're definitely experiencing. Maybe Vijay, you can help us go through the different categories that people actually experience in real enterprises and the real experiences that we've worked through.

Customer Examples

Vijay: That's a great overview of the different categories of guardrails we've seen. One of the things we've started to see in the real world is that different guardrails have become very important for different types of customers and for different use cases where we've seen adoption.

A great example is our partnership with Cisco. The guardrails we've implemented for Cisco enable them to detect specific categories of harms and think about ways an IT administrator, or anyone in a role where security is paramount, can understand how those patterns are being observed in real traffic and get ahead of which roles within the organization need to be aware of those patterns or threats.

Another example of guardrail work is what we do with enterprise customers like Cengage. Cengage is one of the largest publishers in the world, a very large textbook publisher, and they've built an AI student assistant to help students in K-through-12, graduate school, or any environment ask deeper questions about what they're learning from a textbook. There, the guardrails can be more specific than just security interests. It can be ensuring that the student assistant doesn't leap to an answer or doesn't ruin an answer that's going to be included in an assessment. There can be a range of things that a particular teacher or educator might want to avoid in a classroom setting, and we've built guardrails to classify that sort of behavior and ensure educators stay in control of that student assistant experience.

Other work we do in the enterprise includes work with pharma companies. They are now talking to patients for the first time, especially where there's a new direct-to-consumer business, and they need really quick, verbatim answers to certain categorical information. But in other situations, they also need guardrails to help avoid conversations steering into regulatory areas.

Another interesting engagement we did was with the major publisher, Time. Sam, I'll actually hand it off to you to talk about our work on guardrails for the Time implementation.

Sam: Yeah, for sure. It's really interesting, as you've talked about, there are a bunch of different guardrails and focuses of guardrails across enterprises.

To talk about Time in particular, we launched a Person of the Year experience with Time back in, I think, November. It's a really challenging situation because you want to fully protect your AI chatbot from saying things that change the perspective Time is trying to create, but you also want it to engage in challenging conversations. Our approach was to have a separate model specifically focused on guardrails. This was run in parallel to our chatbot, which was answering questions about the Person of the Year and the article that was written. This lends itself to discussing at a high level the different sophistications of guardrails and how you actually build guardrails across the stack and across the different types of guardrails you need. So, I'll talk through the different levels of sophistication there.

At the fundamental level, we do a lot of red teaming with these foundation model providers. They actually are pretty good out of the box; they don't say the things you ask them not to say. So, you can get away with a lot just by prompting the model not to say certain things or to avoid certain topics of conversation. Also, you can rely on the post-training that these foundation model providers do to ensure it doesn't talk about something that might be harmful. I think the most basic form of guardrails is simply adding to the prompt and the system prompt about certain rules and regulations you want it to follow.

Separately, we've also found it can be really powerful to dedicate an entire LLM task to being a guardrail. Specifically, taking some of these foundation models and saying, "Your task is to determine if this is relevant to X, Y, Z enterprise or not." This ensures it stays on topic, and you can run this in parallel to your main LLM that's answering the user question. This allows you to have the value you want out of a guardrail without increasing latency or anything like that. So, those are the most basic forms of guardrails, really just around prompting and changing the different behavior of these foundation models.

Evolving Guardrails with an Evolving Threat Landscape

Vijay: It's interesting when you bring up prompt engineering and our work on red teaming with some of the largest models released out there. For example, we were mentioned in the OpenAI GPT-3 model card as one of the red teaming partners they worked with. A lot of attacks that maybe worked two years ago now actually don't work because we've done research into things like a hierarchy of system instructions. So, a previous attack that would say, "Just ignore previous instructions," now the model actually knows, "I have a higher-level system instruction that tells me I should respect certain behaviors or should adhere to the model spec." Those sorts of attacks don't work anymore. But today, when you use out-of-the-box guardrails from OpenAI and Meta, those are actually more complex. So, I'd be interested in hearing from you, Sam, what do you get with these more complex guardrail models today?

Sam: Yeah, for sure. Some of these providers have done some really great work of specifically training guardrail-specific models. As you mentioned, Llama Guard; I believe Microsoft has its own protection guardrail, and OpenAI has its own guardrail models. These models are really specifically trained for detecting these attacks that Scale is so famous for helping create. I think those are very great, low-latency, hyper-specific models to really help you have confidence that you are actually protecting against the basic threats, no matter how complex these prompt injections and attacks are.

So, I think that's the next level of guardrails. Rather than just spending time working with the prompt and relying on these pre-trained models that have this base functionality, you go to a provider or use some kind of hosted guardrail model whose task is specifically trained for being really good at guardrails.

Finally, the last level of sophistication is actually to train your own guardrail. Llama Guard did a really good job of explaining the cookbook for fine-tuning a version of Llama Guard that's specific to your guardrails. We did some work where we essentially took Llama Guard and built on top of that for a very specific taxonomy of attacks that we were trying to protect against. We also created some trained models that used BERT-style models as more classifiers rather than the Llama Guard-based model. Having this hybrid system allowed us to achieve really low latency on 90% of attacks and then leave those really difficult ones up to the smarter LLM guardrail. So, that's the most complex form of guardrails: something where you have this custom-trained model, you have this hybrid system, and you're really optimizing for latency in a very specific taxonomy.

Guardrails in Practice

Felix, now that we've covered the fundamentals of guardrails, what are some of the things you've seen in practice?

Felix: Yeah, that's a great question. You just touched upon something pretty important. The metapoint of what you're saying is that there are many different options to do this, and it's not a one-size-fits-all kind of problem. I think it's really important to take a look at the space and think, "Why aren't there many guardrail-as-a-service options out there that I've worked with?" Part of our business, we've explored a ton of different offerings and even explored that ourselves. Our finding is, there are two reasons in my opinion.

One is that if you look at guardrails, it's pretty far down the funnel. You're talking about building something, maybe observing it, iterating on that, using evaluations, and then checking on that. Then you get to a point where your AI is stable enough, and your legal person comes in and says, "Okay, you want to go to production? Here are some things that we want to avoid." And you think, "Okay, what do I have to do now? I have to put a guardrail on them, then I have to build an eval head." So you're four or five layers deep at this point.

Vijay: Is that a good product development lifecycle for AI, to wait until the very end?

Felix: No, I don't. I think you bring up a good point. We recommend bringing this up early. A lot of times when we go work with customers right now, this is the first thing we bring up. It's like, "Hey, who's involved here? This is the stack. You have to do these five different things. Who's your owner for legal and compliance? Who's your owner for evaluations? Who's your owner for AI? Who's your owner for infrastructure? Let's get everyone in a room, let's talk about the strategy, and then we'll execute against the strategy." So, I think it's important to look at things that way.

Another thing, going back to what I was saying about guardrails not being productized, is that everyone has their own custom use case. And I think, exactly as Sam and Vijay were saying about the different customers we had, we had to do different things for everyone. So, even if Llama Guard performs at, say, 96% accuracy at something, it doesn't mean it's going to do well on your domain. A lot of times that's trained on public data. You're an enterprise, you have petabytes of your own data. Who's to say that's the same? So, I want to ask Sam how he thinks about these different approaches in practice. Like, what should enterprises do? We saw that maybe guardrails as a catchall is not the best way to do things. How do you think they should approach these problems?

Guardrail Trade-Offs

Sam: Yeah, definitely. Fundamentally, the best way to think about it is just as a tool in your toolbox. Once you define this as a tool in your toolbox, it's about saying, "What are the trade-offs that I'm willing to make?" They always say there's no free lunch in this space. So, you have to think about your propensity for accuracy, precision, risk tolerance, etc., what your latency concerns are, and then what your cost profile is that you're willing to accept.

So, if an enterprise comes to us and says, "Well, we just want to make sure we're protecting it against saying something really bad," then you might want to use a cheaper, in-parallel, out-of-the-box guardrail model. Because these are really good at doing this, they're specialized in this, and a lot of them are really low latency and pretty small models. But then we also come to enterprises who say, "We have this 99% guarantee from our legal team that we don't do this." And then we think, "Okay, well maybe we should look at this really carefully. We should think about what our options are. We should even consider training our own model." What we've found is sometimes just generating thousands of examples as training data and then training a more simple classifier model can do this task really well. That also decreases things like latency and cost in the future, but it requires having very specialized enterprise data and enterprise domain knowledge.

So, I think fundamentally it really comes down to what is your tolerance for risk, what are the costs you're trying to preserve, and then what kind of latency options do you have when you're thinking about deploying this thing?

Vijay: Yeah, I think you referenced a really interesting trade-off here between what used to be called helpfulness versus harmlessness. In the early days of our work with reinforcement learning from human feedback, the helpfulness-harmlessness scale was a way for us to understand where a model provider wanted to go in terms of having a model that would respond to most questions but would maybe refuse to answer certain questions that go into a sensitive area or an area that the model provider didn't feel comfortable giving answers to.

As you fast forward three years later, that scale can be useful in certain situations, but in other situations, you want a model to be creative, to be thoughtful in the answers that it gives. The answers it gives to just one step of a multi-turn conversation might actually be pretty complex if you look at the context of that full conversation. So, rather than just using that helpfulness-harmlessness scale, having these categorical guardrails, having the ability to understand with nuance what's being said in conversation, can be incredibly important. That can be really valuable to have agents and models that can interact in a creative way with humans rather than just shutting down and refusing those sorts of questions as they come up.

Sam: Yeah, that's a really good point. I actually would like to add a fourth axis to the trade-offs, which is, what are the product requirements? Fundamentally, you have PMs, directors, and execs spending a lot of money trying to create really exciting products. When you find that 50% of the time, it gives you this out-of-the-box, really lame answer, then from a product standpoint, it's a really terrible experience. So, I think you also really have to think deeply about how much you care about that product experience and how much you want to make sure that your precision is also really high so that you're delighting your customer at the end of the day.

It's a Product Problem

Felix: Yeah, a lot of this does leak into product. It does leak into just saying, "Okay, let's break down the guardrail problem into multiple layers. There are behavioral layers, there are absolute layers like 'don't create or delete anything,' there are destructive events, there are legal concerns." Tying back to all the things that we're saying, you really have to take the tools that you have, all the different options that Sam was mentioning—the simple, the complex, whatever—do some pattern matching, and see where they fit best in your stack.

Also, not all of these are AI problems. Like the calendar one, probably not. I can just say, "Hey, before you try to do these destructive actions, you probably shouldn't do that," or "Ask for approval." That sort of thing. These are all tools, and it's a dynamic situation. I think it's really important to understand that the way Scale operates when we approach different customers is we try to educate, we try to explain this, we try to do some design and work with them to figure out what the best path is, and make sure that we're executing something that fits the problem and doesn't try to just force something on top because we think, "This is a 'put a guardrail on it'" problem. It's way more complex than that. But yeah, Vijay.

Vijay: Yeah, I love that you referenced the educational role that we play here at Scale with all of our customers and the work that we publish publicly. As one of the companies at the frontier of how data is used for training these models, we've always been interested in how, as these models get more complex and are able to handle multimodal data or more sophisticated multi-turn conversations, what that really means in terms of safety and the guarantee that you need.

Conclusion

When RAG pipelines were first set up, we suddenly started realizing that you can enter those prompt injection attacks in images that were passed in along with documents, and that would introduce a risk area that people hadn't considered before. Now, as we do more with browser usage and computer agent models, like Operator or other models that are being released, we're starting to think through the different attack vectors that can surface there that didn't exist before. So, we're constantly thinking about how guardrails can keep up with these new frontiers and new surface areas where enterprises really need to stay ahead as they deploy at the edge.

Great. So that's all for today's discussion on enterprise AI guardrails. If this type of work sounds exciting to you, we are definitely hiring on the enterprise team, come check out our careers page for openings.


The future of your industry starts here.