
Today on Human in the Loop, Angela Kheir, Dani Gorman, and Emily Xue are talking about what happens when enterprise GenAI goes wrong. The team digs into recent public AI failures, reviewing the impact of each, whether they could have been prevented, and if so, how.
The group also discusses how organizations can protect their end-state applications through proactive red teaming to identify and fix risks before they go live.
About Human in the Loop
We share our learnings from working with the largest enterprises, foundation model builders, and governments to equip you with the insights you need to build practical, powerful AI systems in your enterprise.
Watch the video or read the transcript below to get their insights from working with leading enterprises and frontier model labs.
Angela: Hi, welcome to Human in the Loop. Today, we're going to be talking about some public GenAI failures that made it to the headlines. But first, before we talk about that, I would love to start by discussing with you two the difference between those public headlines and what we see privately with enterprise examples.
Dani, how do these stack up to private examples?
Dani: I would say that what we've seen internally is often a lot more extreme, I think because of the breadth and depth of testing that we do when we're testing an AI application. Yeah, but those don't make it to the headlines because we test it before it actually gets launched.
Emily: Yeah. That's actually common in a lot of security and privacy research, right? You do a lot more work to build a much more thorough understanding of your system, of your application, but what eventually surfaces to the public is just the tip of the iceberg.
Angela: I love that analogy.
Monica: That's great. Do you have a perspective on what happens when model capability increases? Does anything change with the risks that you see?
Emily: Yeah, absolutely. This is a great question, actually. So think about how much effort all the big labs spend evaluating their models, and from a variety of perspectives: basic capability, reasoning, math, general applicability, but more importantly, safety and RAI, the responsible AI properties. A lot of time and a lot of talent goes into that. But still, when it comes to application development, especially enterprise real-world workflows and application environments, we do see a lot of incidents.
This is not the first piece of news like this; there have been quite a few. So I think one of the problems really is that there is a big difference between studying the model's behavior versus the application's behavior. The model gives you the ability to build something much richer that doesn't just produce text as output. With agent systems, you're using tools, you have memories, you have access to your private data. All these enhanced capabilities enlarge the surface area for security and safety analysis.
So I think about kids, right? So if you think about a baby, when they're first born, you put them in the crib. They don't move.
Dani: That's great.
Emily: Yes. The guardrail is simple. You get your house baby-proofed, and then they suddenly get to crawl and walk. Then the surface area that you need to protect in your house is much larger. And that's not even mentioning if they're able to get to their teenage years, they get their phone, they get on the internet.
Dani: That's interesting. We were just talking about this before we started filming. My tiny child toppled a small piece of furniture yesterday and I was like, you know, I really should have brought in those professionals to test our apartment for all of the vulnerabilities, and then look at my child and his age and go, "Okay, these are the ones you need to prioritize."
So red teaming is a lot like that. It's assessing what the landscape of risks is and then making thoughtful choices about where it makes sense to implement mitigations and what type of mitigations.
Emily: I really love this point. If you think about it, there's always a trade-off between how many measures, how many guardrails and layers of protection you put around your system, versus how much value you get and the performance cost you pay. Taking that pragmatic view, asking what the application scenario is, what the most vulnerable areas are, and what practical surface we need to protect, is a very effective way to solve the problem.
Dani: Yeah. And it's a tough needle to thread: identifying the vulnerabilities without becoming overly restrictive. So in some of our engagements, it's a multiple-iteration feedback loop where they actually over-mitigate, and then we have to help them pull back and identify what the right guardrails are, or what performance trade-offs are happening as a result of putting in too much.
Monica: So that they can actually deploy something that works. Yeah, that makes sense.
Dani: Nice. I love that.
Monica: Cool. Well, thank you for that initial disclaimer. Let's go ahead and get into the headlines. So, the very first one we're gonna talk about is from Air Canada. Its customer service chatbot told a passenger that he could apply for a bereavement fare refund retroactively, which the airline's actual policy didn't allow. So he took Air Canada to Canada's Civil Resolution Tribunal, which ruled in early 2024 that the airline was legally responsible for what its chatbot said. Honestly, this was probably one of the first major AI breaks, I think, that we saw. So I want to throw it over to you guys. What did you think of this? Could this have been prevented? What do you think?
Angela: Yeah, that's an interesting one. I want to point out first that the reason it happened is hallucination, something we see so much in the industry with models and enterprise applications, and which is really hard to fully prevent today. But one thing to note is that the model hallucinated so confidently, without acknowledging any uncertainty, without saying, "You might need to check this additional policy," or anything like that. And what's funny also is that the airline then claimed that the chatbot is a separate entity and that they're not responsible for what it confidently claimed.
So let's talk about this a little bit. Like how did this actually happen in terms of hallucinations and could it have been prevented from a technical perspective?
Dani: Emily, I'd love your thoughts on this one.
Emily: Yeah, so I think model hallucination is something that all the model developers look into very closely. There are also a lot of public benchmarks for model hallucination, measuring how grounded an answer is in factual information and in the information provided in the context. So that's not new, and all the model developers look into it.
But this is not easily preventable at the model level, in the sense that the model is deployed in a scenario where you have company-specific policy, right? That information is hard to verify as factual: it's private to the company, it's not part of the model's training data, so it's hard to tell what is real and what is not. At that point, on the deployment side, on the customer side, they really need to cover the hallucination part themselves and check whether the information provided by the application, the chatbot in this case, is actually grounded in the internal policy.
So that part is technically feasible. But in a lot of cases, it's more about awareness and liability. As a company, when you develop something around the model, you now have a richer capability, and you need to cover the surface that your new application actually touches. So that's the situation I'm thinking about.
Dani: Interesting. So it sounds like you can partially cover it, but…
Emily: Yes, the model can give you the basics, and the developers are actually very careful in doing that. But the situation is that now you're applying the model, right? You want to...
Dani: Ground it in the actual policies that your company has. And maybe you're doing RAG, but...
Emily: Retrieve the information.
Dani: If you stick too closely to RAG, then you lose some of the broader, richer capabilities.
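To make the grounding point concrete, here is a minimal sketch of a post-RAG groundedness check: a second model call judges whether a drafted answer is supported by the retrieved policy text before it reaches the customer. The `call_llm` helper and the judge prompt are placeholders for illustration, not any specific vendor's API.

```python
# Sketch of a post-RAG groundedness check. `call_llm` is a placeholder
# for whatever model client the application uses.

def call_llm(prompt: str) -> str:
    """Placeholder: send a prompt to a judge model and return its reply."""
    raise NotImplementedError

def is_grounded(draft_answer: str, retrieved_policy: str) -> bool:
    """Ask the judge whether every claim in the draft is supported by the policy."""
    judge_prompt = (
        "You are checking a customer-support answer against company policy.\n"
        f"POLICY EXCERPTS:\n{retrieved_policy}\n\n"
        f"DRAFT ANSWER:\n{draft_answer}\n\n"
        "Reply GROUNDED if every factual claim in the draft is supported by the "
        "policy excerpts above, otherwise reply UNGROUNDED."
    )
    return call_llm(judge_prompt).strip().upper().startswith("GROUNDED")

def respond(draft_answer: str, retrieved_policy: str) -> str:
    if is_grounded(draft_answer, retrieved_policy):
        return draft_answer
    # Fall back rather than letting an unsupported claim reach the customer.
    return ("I'm not certain about this one. Please check the official policy "
            "page, or I can connect you with a human agent.")
```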
Angela: Exactly right. Yeah. And what about embedding humans in the loop as well, for when the model notices uncertainty? I think what's so confusing to users is that it hallucinates very confidently, and I guess, as you mentioned, in training it's rewarded for confident guesses rather than penalized for wrong answers or for admitting it doesn't know. So in that scenario, could it have detected the uncertainty and routed the user to a human?
Emily: Yeah, that's a really good question. I think producing self-awareness, which is basically asking, "Do I know whether what I'm talking about is true or false?", is actually a research topic. There are different answers to that, and different ways you can look into the model output. There are logits, there are probabilities, and you can see how certain the model is. So there is a lot of research that covers this topic. But again, I feel like it's not a mature enough area. If you look at a single token of output, the distribution can tell you how certain the model is. But now you're actually generating a long sequence.
And in this particular case, it's not just generating information based on the internal knowledge of the model, it is based on information retrieved from the company-specific policy. Then answering the question of whether the model can be confident or not is actually a lot more open-ended.
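As a rough illustration of the single-token intuition Emily describes, this sketch turns per-token log-probabilities into a sequence-level confidence score; `token_logprobs` is assumed to come from whatever inference API the application already uses.

```python
import math

def sequence_confidence(token_logprobs: list[float]) -> float:
    """Rough confidence proxy: geometric mean of per-token probabilities.
    Assumes the inference API can return the log-probability of each
    generated token (many can)."""
    if not token_logprobs:
        return 0.0
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_logprob)  # in (0, 1]; lower means less certain

# A few low-probability tokens drag the score down, which could trigger
# a fallback answer or a handoff to a human.
print(sequence_confidence([-0.05, -0.10, -2.30, -1.70]))  # ~0.35
```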
Dani: You could also think about where you want to maybe automatically route things or introduce another layer. So if it's touching on refunds and it's one of the main types of refunds, maybe you're very confident that the model can handle that well. But if there are certain edge cases, you might want to route those automatically to a human.
Emily: I totally agree, and this is a very pragmatic way to actually unlock the value of the LLM-based agent without touching the areas that are not mature enough to give you a confident answer yet.
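A sketch of the routing Dani suggests: simple topic rules plus a confidence threshold decide which requests the bot answers on its own and which go straight to a human. The topic names and the threshold are invented for illustration.

```python
# Illustrative routing rule: common cases stay with the bot, high-risk topics
# or low-confidence answers go to a human. Topics and threshold are invented.
HIGH_RISK_TOPICS = {"bereavement_fare", "legal_claim", "chargeback_dispute"}
CONFIDENCE_THRESHOLD = 0.7  # from whatever confidence signal the system produces

def route(topic: str, confidence: float) -> str:
    if topic in HIGH_RISK_TOPICS or confidence < CONFIDENCE_THRESHOLD:
        return "human_agent"
    return "chatbot"

print(route("standard_refund", 0.92))   # chatbot
print(route("bereavement_fare", 0.95))  # human_agent
```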
Angela: Nice. And Dani, this is something we've seen in testing, right? Where we do targeted, domain-specific policy testing and red teaming beforehand. I feel this could have been quite easily tested in a creative red teaming way, right?
Dani: If you were asking me to red team an airline's customer support chatbot, for sure, one of the big things I'm going to be touching on is discounts, refunds, getting points—anything that could have some sort of a financial impact to the company. And then, you know, it varies what the target harms are that a particular company cares about. But yeah, those are big ones for any big enterprise.
Monica: What did you guys think of the liability part? The fact that the company was held liable?
Dani: I mean, this didn't have a big financial impact for them, but it did set a legal precedent. And I think it sets the tone, right? You should be responsible to some degree for the outputs of the AI that you're launching. And I think responsible AI is a big topic and a lot of enterprises care about it. It's kind of, "Okay, well how can we embrace the capability, but understand the risk and make sure that we're not taking on too much of it?"
Angela: Yeah, and I think that's one area where enterprises don't have the luxury to distance themselves from the AI products they put out. Versus the model developers, for example, they're putting out all the chatbots and they're like, "You know, this is a separate chat entity." But as an enterprise, this is replacing a function you already have or augmenting a function you already have. So this is kind of a continuation of the website or of the policies. So it is, as you said, a legal precedent.
Emily: They're directly customer-facing. They need to be responsible for the entire customer experience and all the consequences that come from their application.
Dani: And frankly, there's a higher expectation. Like these brands already have a reputation. There's a reason that you go with the airline that you go with. And if you have this bad experience, it's sort of severing or harming that relationship. And I have a very different expectation if I go to ChatGPT and ask it for advice about my health or my son's health versus if I go to a medical company's website and they're offering a solution that's supposed to advise me.
Emily: Totally agree with that.
Monica: A hundred percent, yeah. All right. The next one had even higher potential for harm. So, in mid-2025, researchers discovered that Lenovo's Lena chatbot, powered by GPT-4, could be tricked into writing malicious code. A simple prompt got it to generate HTML that would steal cookies and expose internal systems. Was this a big deal? Could this have been prevented? What are your thoughts on this one?
Angela: Yeah, the intersection of cyber and AI is very interesting.
Dani: I think the intersection of cybersecurity and red teaming is really interesting, in part because when I tell people that I do red teaming, they often think I'm talking about cyber pen testing. And I'm like, "Well, that's a relevant angle or approach you can take." But understanding model behavior, combining that with traditional security vulnerabilities, is what allowed them to achieve this break. And then, at the end of the day, how you think about your system guardrails in a world where you're using a GenAI capability is really important.
Angela: Yeah, totally. And what's interesting also is that the application was put out without really being tested the way software is tested for cybersecurity. So when the break happened, there was a prompt injection from the user, who thankfully was a white-hat hacker and researcher who then alerted the company. And by manipulating the model's helpfulness, he was able to get it to output code that wasn't sanitized or escaped the way it needed to be, so the interface could end up rendering and running that code.
So this is interesting because usually you would think they would test systems like that, but they're not. Is it because they're thinking it's a chatbot rather than a piece of software?
Dani: Right. It was supposed to be like a customer service chatbot, and then they used that as an angle to kind of get access into the systems.
Emily: Yeah. It's kind of like the SQL injection of the past, right? So, that's the traditional attack vector. Here, because again, when the model is deployed in a much more capable environment, you have a larger surface to be attacked. This is actually a pretty hard problem in the sense that traditionally, if you test software, you're facing a static piece of code. Here, you're dealing with something dynamic and probabilistic. You dynamically generate code and then you're generating tests on the fly.
But then again, back to the discussion that every check you have at your deployment's inference time adds latency. And that's a true trade-off between security and usability. That's a hard trade-off here. A lot of the research we see at this moment, especially on code generation, is still really looking at the capability part—what it can do. Can it actually beat SWE-bench? People are asking, can it actually be a developer? Not a lot of research is being put into the domain where we look at whether the code generated is actually safe and how easily we're able to identify vulnerabilities and attack vectors within the code.
Angela: It's interesting, you mentioned it's a hard problem. It definitely is. But I would also add it's an asymmetric problem, because the researcher was able to break it with a 400-character prompt and potentially get access to the entire system, the internal database, the cookies, and a lot of internal data. So it's a really asymmetric situation: a very small effort to break something, with very bad consequences.
Emily: Exactly. Yeah. So that's actually the hard part of doing all the red teaming work. Because if you do evaluation, you look into the normal usage pattern. You're always "in-domain," you have something at hand to measure against. But to do attack red teaming, you really have to think outside of the box. And the space outside of the box is really large. You need to cover a lot more. But for the attacker, as long as you find a single effective attack vector, then you're able to get into the desired—or undesired—outcome.
Monica: So if you guys had to recommend some sort of countermeasure for this, if a company's afraid of this exact kind of attack, would your recommendation be to get red-teamed before? What are some strategies that you would recommend?
Dani: My recommendation is, well, for something that's public-facing and is going to potentially enable access into your systems, I would definitely recommend red teaming. If it's related to protecting your core IP, absolutely recommend red teaming. There are some low-hanging fruit use cases that maybe don't merit it.
But I think from a guardrails point of view, you need to do testing. Like how does this influence your latency, your inference time, your costs? But you could set up almost a bouncer at the door, like input filters that just have some very hard logic or rules around, "Hey, if it has these phrases or these characters, it's likely to be an attack." You could also have something that's a little more intensive in terms of inference, which is to assess the prompt for adversarial intent. So that would be like sending it to a smaller model to evaluate the chance that this is actually an attack in some way. And then finally, output filters too.
So this attack, I believe, involved an image URL that didn't actually load, and when it failed, that's what triggered the code. If the system had just rendered everything as text instead of allowing the output to run as code, it would have been preventable. It's easy to dissect it after it's occurred, but we do always think about what the guardrails are and the trade-offs in how you architect them.
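A rough sketch of the layered guardrails Dani outlines for this kind of attack: a cheap input filter at the door, a spot for a small classifier model, and an output stage that escapes HTML so generated markup renders as text instead of executing. The patterns here are illustrative, not a complete defense.

```python
import html
import re

# Layer 1: cheap input filter, the "bouncer at the door". Patterns are examples only.
SUSPICIOUS_PATTERNS = [
    r"<script\b",
    r"onerror\s*=",
    r"ignore (all|previous) instructions",
]

def looks_like_attack(user_input: str) -> bool:
    return any(re.search(p, user_input, re.IGNORECASE) for p in SUSPICIOUS_PATTERNS)

# Layer 2 (optional, costs an extra inference call): a small classifier model
# could score the prompt for adversarial intent before it reaches the main model.

# Layer 3: output filter. Escaping means any HTML the model emits is displayed
# as text in the support UI rather than rendered and executed.
def render_safely(model_output: str) -> str:
    return html.escape(model_output)

def handle(user_input: str, generate) -> str:
    if looks_like_attack(user_input):
        return "Sorry, I can't help with that request."
    return render_safely(generate(user_input))
```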
Monica: Yeah. Yeah, that checks out. All right. The next break that we got comes from New York City. In late 2023, New York City rolled out the MyCity chatbot to help small business owners with local regulations. But after it was deployed, it started to go off the rails: it gave false legal advice, suggesting, for example, that employers could fire workers illegally or that landlords could lock tenants out, which is obviously a huge legal issue. Have you heard about this one? What are some thoughts that you have on this particular break?
Angela: Yeah, I've heard of it because I live in New York, and it was definitely a big deal that eroded public trust. And Dani, I know you and I were speaking about this before, about whether enterprises should deploy and iterate in public. That's definitely one way to improve and increase the rate of innovation, but how do you do that alongside responsible AI, especially in the realm of a civic, public application? And the faults that we saw in this application were quite flagrant: locking a tenant out, refusing tenants on rental assistance, taking a portion of your employees' tips...
Dani: That was wild.
Angela: That was wild. And it's supposed to help small business owners in navigating the many policies of the city. So if you're a business owner and you check that, how are you then going to have the trust and the confidence to use it again after seeing those failures?
Dani: I think one other interesting thing when I was reading up on it is that other agencies in New York City, or at least one of them, weighed in and gave public statements about areas in which the chatbot had been wrong. And that, in my mind, is a clear thing they could have done before rollout, right? I actually really love this test-and-iterate methodology, and we embrace it ourselves. But you can do that in a controlled environment first.
You already know it's going to cover policies relevant to New York City. Why not have stakeholders from all of the relevant agencies play around with it and see if they find issues before you launch? Because I was playing devil's advocate and saying, "I think it's great that they've improved it." They have improved it since then, and they decided not to take it down. And Angela made the point, "Well, it could erode public trust, and then who's going to rely on it?" I think that is a really good point too. If I were told to do something illegal, I don't know that I'd be going back and relying on it.
Emily: Totally. Yeah. One thing I was thinking aloud here is about awareness of the importance of testing before deployment. With large language models, we see a lot of technical barriers getting lower, so people can easily develop applications. But what comes with that is that awareness of reliability, trustworthiness, and all the policies around them hasn't been sufficiently introduced into the public domain.
And sometimes people think, "Hey, you know, the model has been tested. I'm just building the application around it. Should I still do testing?" There's also a lack of technical expertise in typical application development. Because the technical barrier is lower, many people can develop applications. But while they can develop them, they don't have the systematic approach, like Dani was talking about, for thinking through test coverage of the system. So that's something that's challenging, but it's also very much addressable if we raise awareness from the public perspective.
Dani: And I know there's an approach that's been taken, especially because of all the vibe coding, where people will go find a really old, highly used repo on GitHub, publish a typosquatted version of it, and put malicious code inside it. So if someone makes a typo, or if a library name is prone to a certain typo, maybe it routes them there.
Emily: That's exactly right. Yeah. And with vibe coding, a lot of people are starting to code and develop their own systems, but the technical awareness around safety isn't there yet. So I think that's the aspect: it comes with more capabilities, more functionality, but it also comes with a higher risk that needs broader guardrails around it, basically.
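As a small illustration of the typosquatting pattern Dani mentions, a check like this compares a requested dependency name against well-known package names and flags near-misses; the package list here is a tiny invented sample.

```python
from difflib import SequenceMatcher

# Tiny invented sample; a real check would use a registry's most-downloaded list.
WELL_KNOWN_PACKAGES = {"requests", "numpy", "pandas", "django", "flask"}

def possible_typosquat(name: str, threshold: float = 0.85) -> str | None:
    """Return the well-known package this name is suspiciously close to, if any."""
    if name in WELL_KNOWN_PACKAGES:
        return None  # exact match: the real package
    for known in WELL_KNOWN_PACKAGES:
        if SequenceMatcher(None, name, known).ratio() >= threshold:
            return known
    return None

print(possible_typosquat("reqeusts"))  # "requests" -> flag for review
print(possible_typosquat("requests"))  # None
```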
Monica: Emily, with software testing, and I think similar to this, sometimes there's a perception that it slows down development. Do you have some thoughts on that, especially around the increased risk with AI?
Emily: That's a really good question. Yes. This is a well-recognized tension. It really requires an assessment of the impact of a loss of trust, the impact of a security incident. If people have a good assessment of that, then they're able to make an informed trade-off about how much time and budget to spend on safety and security testing before a system is rolled out.
So one thing I really want to call out is: don't be overly cautious either, right? That's actually part of what happened with this instance. If you don't roll it out, you don't see how many use cases and how many prompts your system actually gets. You really need to test the waters. You have to know what's going on there. So, have a reasonable, systematic way. When we talk with Dani about red teaming, there's a pretty systematic approach that starts from understanding the scenarios. There's a methodology that provides basic coverage, gives you a baseline level of confidence to roll out, and then you try and iterate from there. Yeah.
Angela: And from a product perspective, I personally believe this red teaming should become a standard product activity before rolling something out and after rolling something out as post-launch monitoring. Similarly to how you do QA to test if the application is behaving in the way it's intended to, red teaming is creatively checking, is it actually behaving in ways it's not intended to? And that's the challenging part. We can implement it systematically, a hundred percent.
Emily: Yeah. Every time I think about talking with Dani about red teaming, I think it's really hard because you are trying the things people would find hard to imagine, right? You need to be creative. Meanwhile, we also need to be reasonable.
Dani: My favorite is, people think about red teaming as adversarial, malicious attackers, which is true, and that's a lot of what we mimic. But we also put a big emphasis on the fact that, with our understanding of model behavior, we can pretty effectively and easily simulate an innocent user who stumbles into harm, or just stumbles into a response that is out of policy. So you may not call it harm, you may call it "out of policy." But those are really, really insightful too.
Emily: Yeah. Another aspect I really like is when we have a human red team, you actually understand the usage scenarios, the business logic, like the ones we talked about, right? There's a discount, whether you're going to charge your credit card—and those scenarios are actually specific to a use case. And then you would actually bring human expertise to provide a systematic way to cover those dimensions.
Monica: It all comes back to the human in the loop.
Emily: Human in the loop!
Monica: I couldn't resist. Alright, our final headline comes all the way back from 2023, from a New Zealand supermarket chain called Pak'nSave. They launched a recipe bot to help people use up their leftovers, and what people found is that if you entered cleaning products, it would produce a recipe mixing them into what was essentially chlorine gas and call it a refreshing beverage. And so this went viral. Obviously, it's similar to the glue pizza that Gemini had come out with back in the day. Obviously unintentionally putting out this harmful product. What do you guys think? Is this a big deal? Is this something that could be prevented? Is this just a remnant of 2023? Those long-gone days.
Dani: It's a little harder to get certain dangerous recommendations out of the base models today. So a little bit, when I was thinking about this example, it felt like a remnant of 2023. But I do still think red teaming would've addressed this, right? They would've thought about what are realistic things a person would put in, and then what are kind of edge cases, or someone who just wants to play around and see what happens. And for sure, you shouldn't have your chatbot do that. I mean, God forbid you hand your phone to your kid and you're like, "Here honey, go make a recipe with what we have in the kitchen," and they just start taking photos of things or typing them in there and then it tells them to make a toxic compound.
Angela: Yeah. I mean, it's interesting because it's such a simple failure, and they obviously didn't test whether the output itself made sense as a recipe. I believe it was mixing ammonia and bleach, which is crazy, and calling it an "aromatic water mix" or something like that.
But I also like to think of this in the context of how the technology is developing and how we'll interact with it over the next decade. It's interesting, at the beginning we brought up the examples of children in the family. I also have nieces and nephews, and they're quite young, and I see them using the technology very natively, speaking to it in voice mode: "Hey, chat, can you help me with this? Can you help me with this?" And once we reach a point, which in some ways we already have, where we trust its outputs as factual, that's going to be really challenging. And this is where proactive red teaming is going to be even more needed.
Dani: Yeah, having a deep understanding of model behavior, or just having it be my day job, I prompt differently. Sometimes ChatGPT or Gemini tells me something and I'm like, "Ah, but really..." and I know what kind of buttons to push and how to rephrase things to get it to give a more balanced take. But it still has some of those sycophantic tendencies where it'll just be like, "Great idea, queen! Continue on." And it's like, well, maybe that's a bad idea, right?
Emily: Yeah. And I really like your point, because now, on one side, model capability is becoming really strong, and with very thorough testing from model developers, hallucination is becoming less frequent and the model is more factual and more grounded. But meanwhile, it is also reaching a much larger population. Previously, when you used a model, a lot of people would know, "This is an AI product. I should not trust the output." But now, while the model is becoming smarter and feels more trustworthy, it is reaching a much larger population. Many are young, many could be older, and they may just blindly trust the output. That actually creates a hidden and increasing risk, where people aren't aware there is still a model underneath, with risks that may not be covered.
Dani: As people's capability increases and as perhaps people's trust increases, the responsibility of the companies that are deploying this technology also increases.
Emily: Exactly. Yes. Absolutely. That's such a good point.
Monica: A model that performs well is the same as a model that behaves well. Three, two, one. [Vote] Ooh, Dani in the middle. What are you thinking?
Dani: I'm probably focusing a little too much on the words, but "performs" and "behaves." There are so many things that feed into performance, like the speed, et cetera, but for "behaves," you may not know how your model actually behaves under a lot of different scenarios. So I think I'm just having trouble equating "perform" and "behave." Angela, what did you guys think? You were thumbs up.
Angela: Yeah. No, I agree. I think for me, "performs" led me directly to the benchmarks that we usually see for models. There are a lot of accuracy benchmarks that models are measured against, and that's part of why we see so many hallucinations, because of how we grade them and how we reward them in training. So it reads as "mathematical evaluations" to me, whereas "behaves" really gives me the nuance of the values surrounding it, the different steps it takes per the model spec, and how it answers.
Emily: Yeah, I totally agree with you. First off, performance and behavior are loosely defined concepts, and they both map onto benchmarks in a similar way. If you look at a model after it's trained and run it against the benchmarks, one immediate thing you see is that two different models are not captured by a single number: on different benchmarks, they get different numbers. And in a lot of cases, a model with a higher number on, say, reasoning or coding benchmarks may actually have a lower number on some of the benchmarks related to safety and RAI testing. So there's real evidence that, in a lot of cases, a more capable model does not translate into a safer or better-behaved model.
Dani: So are you saying 'perform' is the benchmarks, and 'behaves' is the one that contains the safety element?
Emily: Exactly. If you look at the benchmark leaderboard, a lot of cases they have different columns, right? Each column covers something. And these two numbers don't always align.
Angela: In a lot of cases they actually conflict with each other. You're making me think of trying to measure a human's nuanced wisdom with standardized test scores, for example. Yes.
Monica: You cannot outsource accountability for your AI's behavior. Three, two, one. [Vote] Ah, thumbs up all around.
Angela: Expected. What do you think? I mean, especially in the enterprise, as we were talking about in the episode, you really cannot distance yourself from it. Air Canada, right? Legally you cannot, from the customer's angle, you cannot. It's hard.
Dani: I think there are a lot of instances where people will consult experts for guidance, right? So they can make a more informed decision. But ultimately, you're accountable for setting up your governance process for determining how you are going to manage the capability and the risks of AI, and then where you need it, bringing in experts who can give you that information to make informed decisions. But yeah, if you don't do any of those things, you're certainly accountable for that choice too.
Emily: Absolutely. You need to be. People talk about agents a lot, right? Think about an agent as a virtual employee. From that perspective, you're responsible for your employee's behavior. It's kind of like when we join a company, there's always a handbook and a manual we need to follow. So you absolutely need to be responsible for that. But technically, you can outsource the work of designing the policies and checking the policies to another company. So that's kind of a different dimension here.
Monica: All right, your final hot take. You only need red teaming for public projects. Three, two, one. [Vote]
Monica: Ooh, thumbs down. Emily, what do you think?
Emily: Yeah. So, actually, recently there was a story about this kind of internal testing, where, in a simulated environment, an agent tried to write an email to blackmail the CEO, or something like that. So there's always external hacking, and there's also insider threat. The insider threat is real, even with large model agents at this moment. So, yeah, definitely no.
Dani: Yeah, you always need to think about what and where are the crown jewels? What are the risks, right? And so if it's internal-facing only and there's a very small group of users, that may be lower risk. But is this where your highest value IP is? And if it gets leaked externally to a foreign entity, that's going to really undermine your company's success. Then it's pretty important to red team that.
Angela: I agree.
Monica: When you said crown jewels, I was like…unless they're in the Louvre.