
How Morgan Stanley deploys AI that actually works (hint: it's evals) | Human in the Loop: Episode 13

September 16, 2025

Today on Human in the Loop, Kaitlin Elliott, who leads firmwide Generative AI Solutions at Morgan Stanley, joined Clemens Viernickel and Sam Denton in the studio to unpack how AI evaluations powered the firm’s successful adoption of production GenAI. We also react to some AI Hot Takes, of course.

Morgan Stanley was among the first major financial institutions to experiment with GenAI, all the way back in 2022 before it hit the headlines. Their journey offers a valuable real-world case study in moving from early pilots to enterprise-wide deployment.

The team talks through the evaluation frameworks and setup they used, as well as how Kaitlin drove adoption of and confidence in the application across the organization. If you’re working on rolling out a GenAI application at your enterprise, you don’t want to miss this one.

 

About Kaitlin Elliott

Kaitlin is an Executive Director in the Firmwide Artificial Intelligence division based in New York City. She leads the Firmwide Generative AI Solutions Team, which focuses on implementing cutting-edge Generative Artificial Intelligence (GenAI) models at Morgan Stanley. In this capacity, she is accountable for establishing rigorous protocols to assess generative AI model outputs and ensure alignment with organizational goals. She graduated from Providence College with a Bachelor of Arts in History and has been with Morgan Stanley since 2015.

About Human in the Loop

We share our learnings from working with the largest enterprises, foundation model builders, and governments to equip you with the insights you need to build practical, powerful AI systems in your enterprise.



Watch the video or read the transcript below to get their insights from working with leading enterprises and frontier model labs. 

Episode 13 - AI Evaluations in Practice

Clemens: My first question is, in the wealth management and banking space more generally, it seems that Morgan Stanley is a first mover and was very early in adopting this technology. Why would you say it's so essential for the firm to be at the forefront?

Kaitlin: I would say it's been part of our DNA for a long time to be investing in technology, and for the last couple of years the firm has made that a real focus. So in early 2022, we came across this new generative AI technology, which, as you all probably know, when it first came out, was the magic of the demo. Back then, our executives saw that it could write a poem, and they were just blown away by that. As a firm, we’re always thinking about how we can invest in technology to ensure our employees have the best tools possible, but it's also important that our clients have the best technology available to them.

For generative AI, leadership took a gamble in those early days. It was well before ChatGPT and before much of the world knew about the technology, but they decided to lean in. Our first use case was ready to go because we had been dealing with the knowledge management problem for a long time, which many enterprises have. We had already started to invest a lot in that problem. We had gone on a data journey to curate our content, to make sure it was up to date, and also to ensure it was tagged appropriately.

We had been working in the conversational AI space for a while, where we had virtual assistants servicing both our employees and our clients. The gap was that it took us several years to put 10,000 FAQs together. When we first started the generative AI journey, those assistants could only answer about 10 to 20% of the questions that were asked of them. All those years, all that effort, and 10,000 FAQs, and we still weren't covering what our employees really needed from a knowledge management perspective.

When we saw generative AI technology, we decided to see if we could take all of our internal knowledge in our wealth management space. We specifically focused on procedures, process documents, and our research reports. When we first saw GPT—I think we were using GPT-3—we thought we could quite literally just give it the documents and it would get it right. We were totally in the dark when we set up our first use case. After a while of experimenting, we realized that we had built what is now RAG, but we didn't know what RAG was. Almost immediately, we were able to discover that the coverage we had was significant. That's when we had the "aha" moment that this might work. It's when we decided we had to go all-in and invest here to make sure the firm would be on the forefront.
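
The pattern Kaitlin describes, retrieving the relevant internal documents and having the model answer only from them, is what the industry now calls retrieval-augmented generation (RAG). As a rough illustration (not Morgan Stanley's implementation), a minimal RAG loop might look like the sketch below; the tiny corpus, model names, and top-k cutoff are placeholders, and it assumes the OpenAI Python SDK.

```python
"""Minimal retrieval-augmented generation (RAG) sketch.

Illustrative only: the corpus, model names, and top-k cutoff are placeholders,
not details of Morgan Stanley's system. Assumes the OpenAI Python SDK (v1.x)
and an OPENAI_API_KEY in the environment.
"""
import numpy as np
from openai import OpenAI

client = OpenAI()

# Stand-in corpus of internal procedure documents.
DOCS = [
    {"id": "proc-001", "text": "To open a brokerage account for a client, first verify identity, then ..."},
    {"id": "proc-002", "text": "Wire transfer requests must be confirmed verbally with the client before ..."},
]

def embed(texts):
    """Embed a list of strings into vectors."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

DOC_VECS = embed([d["text"] for d in DOCS])

def retrieve(question, k=2):
    """Rank documents by cosine similarity to the question embedding."""
    q = embed([question])[0]
    sims = DOC_VECS @ q / (np.linalg.norm(DOC_VECS, axis=1) * np.linalg.norm(q))
    return [DOCS[i] for i in np.argsort(sims)[::-1][:k]]

def answer(question):
    """Answer strictly from retrieved documents and cite the document id."""
    context = "\n\n".join(f"[{d['id']}]\n{d['text']}" for d in retrieve(question))
    chat = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Answer using ONLY the documents provided. Cite the document "
                        "id for every claim. If the documents do not contain the "
                        "answer, say that you don't know."},
            {"role": "user", "content": f"Documents:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return chat.choices[0].message.content

print(answer("How do I open an account for my client?"))
```

The design choice that matters later in the conversation is forcing the model to cite the document id it used, so a reviewer can tell whether a bad answer was a search problem or a language model problem.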

Sam: Do you think the infrastructure and data you had set up for the knowledge problem was a red herring as an excuse to lean into it? Or do you think it allowed you to move quickly in the early days and make something useful in a short amount of time?

Kaitlin: It was probably a combination. I think that for many corporations at the time generative AI started to become big, there was a lot of hesitation. We were well-positioned from an appetite perspective, but we also had the problem that we could solve and experiment with. And we were well-positioned with our data. The foresight is pretty incredible when you think about the journey we had already been on.

Sam: That makes a lot of sense.

Clemens: You mentioned an interesting point. You were an early adopter when the models first came out. On one hand, I vividly remember it was a truly magical moment of, "Oh, technology can do that. It's amazing." At the same time, it didn't take long for people testing it out to say the models were bad at many things. That was when the term "hallucinating" first came up, meaning the models were not good enough for use in production. I'm fascinated by what led the firm to lean in and adopt this for a crucial problem, even when it seemed more like a toy than a productivity tool to many people. How did that dynamic play out?

Kaitlin: Great question. What I love about the generative AI space, as you guys can appreciate, is that the demos are fantastic. Everybody has an amazing product in a demo, but when you try to apply it and have it consistently perform, it's just not there a lot of the time.

Clemens: We have an episode on reacting to demos.

Kaitlin: There you go. Next time I come on, we can do that. When we first set it up and started to ask a few questions, we were a little surprised that it worked. In those first instances, we asked procedure questions like, "How do I open up an account for my client?" It was able to give us a response, and we were like, "Wow, that's pretty good." But then we recognized fairly early that my team didn't have the subject matter expertise. The answer looked good to me, but I wasn't a financial advisor, so I didn't actually know if it was useful. Our first assistant was a wealth management knowledge management system, and since I wasn't a subject matter expert (SME), I didn't know if the steps it provided to open an account were all correct, or if a missing step was a critical one.

So what we did very early on was create an experimental lab environment. The first thing we did was grab a bunch of SMEs and end-users—financial advisors and their support staff—and we asked them to go in and play around with this new tool. Using an assistant wasn't new to them because they already had one, but this generative AI was. We had them go in, ask a bunch of questions, and then rate for us what was good and bad about it. If something was bad, we thematically started to bucket it. Was it because it had an inaccurate answer? Was it because it was incomplete and missed a few steps? Was it because it completely hallucinated?

When we first did this, asking the AI to cite its source wasn't yet the common practice it is today; we were building RAG before it was a thing, so we had to learn these techniques as we went. If it's hallucinating, how do we know where it got its answer from? We started to add in the source and then ask the SME, "Was that the right source?" Some problems were from the language model going off the rails, and many others were a search problem. It got very complex very quickly. But after a few iterations of working with SMEs and getting feedback from our end-users, we realized there was value there. While it had shortcomings, it was still able to answer enough of the questions that they would get value.

Ultimately, we ended up taking about 25 questions, and we had the AI answer them while an SME did the same thing. We gave them both an hour. Without a doubt, the AI answers were much better. The AI was able to answer all 25 questions in an hour, and the humans weren't. Doing that test early on solidified for all of our control partners and leadership that it was better than what we were getting from humans, so we continued on that path.
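
One way to picture the feedback loop described above, with SMEs rating answers, checking the cited source, and bucketing failures so retrieval issues can be triaged separately from generation issues, is a small review record with an explicit issue taxonomy. This is a hedged sketch; the field names and buckets are assumptions drawn from the conversation, not the firm's actual schema.

```python
"""Sketch of an SME feedback record with an issue taxonomy.

Illustrative only: the buckets mirror the categories mentioned in the
conversation (inaccurate, incomplete, hallucination, wrong source), but the
schema itself is an assumption.
"""
from collections import Counter
from dataclasses import dataclass
from enum import Enum

class Issue(Enum):
    NONE = "none"                    # SME judged the answer good
    INACCURATE = "inaccurate"        # a step in the answer was wrong
    INCOMPLETE = "incomplete"        # steps were missing
    HALLUCINATION = "hallucination"  # content not grounded in any document
    WRONG_SOURCE = "wrong_source"    # retrieval surfaced the wrong document

@dataclass
class SmeReview:
    question: str
    answer: str
    cited_source: str
    good: bool
    issue: Issue
    notes: str = ""

def bucket_report(reviews):
    """Count bad reviews per issue bucket so search problems and language
    model problems can be triaged separately."""
    return Counter(r.issue for r in reviews if not r.good)

reviews = [
    SmeReview("How do I open an account?", "...", "proc-001", True, Issue.NONE),
    SmeReview("How do I submit a wire?", "...", "proc-007", False, Issue.WRONG_SOURCE,
              "Pulled an outdated procedure."),
]
print(bucket_report(reviews))
```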

Sam: You're touching on what sounds like the early days of an evaluation framework. It sounds like it was both a place where you could understand the user journey and also use it to prove to your leadership that this was something worth doing. I'm curious, why is evaluation important to you? Is it the former, the latter, or a combination? And as a follow-up, how do you not get stuck in an evaluation loop? You're never going to get to 100% on everything, so how do you pick and choose your battles?

Kaitlin: That last point is really critical. For us, as a financial institution, it's important that what we're doing is accurate. We don't have a lot of room for error. That’s the main difference. I'm sometimes envious of startups when they say, "Oh, we're building all these cool things." That's good for them because they can put that into the world with a disclaimer saying, "This works only sometimes, and we need your feedback to make it better." We couldn't really do that.

Sam: You can add disclaimers, but…

Kaitlin: Yes, which we did. We piloted the application for nine months. In today's world, taking nine months to test and validate would be impossible; we'd always be behind. But at that time, we had the breathing room because we had our first proof of concept built in September of 2022. ChatGPT didn't come out for another month or two. As a result, we were able to take that time to think about the framework.

A lot of human intervention was needed. We couldn't rely on AI-assisted evaluations or traditional machine learning evaluations. We tried to throw a thousand questions at it and run a cosine similarity, and we realized it literally told us nothing. So we had to strike a balance, since we couldn't review every single input and output. What we started to do was create a regression suite of testing, where we picked 500 questions and used that as a base. We leveraged that every time we wanted to make a change to our solutions.

As I said, many issues in the beginning were search problems. We would try to add business rules to our search, run the 500 questions against it, and then assess if it was good, bad, or how much it broke. We did a lot of prompt engineering. In those early days, as we were doing it, we uncovered things like few-shot examples, and we kept adding that stuff in.

Because of that, the most important thing for us was to create a taxonomy of testing. It's not only a framework, but if we say we're hitting 80% accuracy, what does that mean? Everyone can describe accuracy differently. When we got through those nine months, we were able to take not only the framework but the taxonomy, the different methods, and the approach we learned through the assistant. That became the framework that we used when we built out our governance process with our control partners, like our model risk management, legal risk, and compliance teams. We were able to say, "This is the foundational framework that every use case can use. This is what our wealth management assistant was able to achieve." The expectation is that if another use case creates an assistant, for example, we now have a baseline. That was definitely worth the nine months of setup because it helps every other use case move faster. They don't have to think about creating a data set, what it should look like, how much data should be in there, or what methods to use for testing. We were able to say, "This is how you do it."
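
A regression suite like the one Kaitlin describes, a fixed set of benchmark questions re-run whenever the search rules, prompt, or model changes, can start as a file of questions plus a grading step and a pass-rate comparison. The sketch below is illustrative only; `ask_assistant` and `judge` are hypothetical stubs for whatever answering pipeline and grading method (SME review or an AI-assisted grader) is actually used, and the file format and tolerance are assumptions.

```python
"""Sketch of a regression-suite run over a fixed benchmark of questions.

Illustrative only: `ask_assistant` and `judge` are hypothetical stubs, and the
benchmark file, its format, and the 2% tolerance are assumptions.
"""
import json

def ask_assistant(question):
    """Stub for the assistant under test; expected to return a dict with the
    answer text and the id of the document it cited."""
    raise NotImplementedError

def judge(item, response):
    """Stub grading step. A cheap automatic check is whether the cited
    document matches the expected one; accuracy and completeness grading
    would come from SMEs or an AI-assisted grader."""
    return response["cited_source"] == item["expected_source"]

def run_suite(path="regression_questions.jsonl"):
    """Run every benchmark question through the assistant and return the
    fraction that passed the grading step."""
    with open(path) as f:
        items = [json.loads(line) for line in f]
    passed = sum(judge(item, ask_assistant(item["question"])) for item in items)
    return passed / len(items)

# Typical use: run the suite before and after a change (a new search business
# rule, a prompt tweak, a model upgrade) and compare the pass rates.
#
#   baseline = run_suite()
#   ...apply the change...
#   candidate = run_suite()
#   assert candidate >= baseline - 0.02, "change regressed the benchmark"
```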

Clemens: How do you define that decision point of, "Okay, when are we actually putting this out into the world beyond a few testers?" You said something interesting earlier when you did the test of having an AI do it and asking if it was actually better. It will obviously be faster, so the question is whether it makes catastrophic mistakes. I think that's a nice approach to mitigate fears about how good it needs to be to be valuable.

I have a great anecdote from Jason, our CEO at Scale, who told me about his previous work at Axon. When they first told people they were going to put video into the cloud instead of on-premise or local storage, people asked about the data loss rate. They did some investigation and found the loss rate was 5%, and management said it was unacceptable. Then they asked the follow-up question, "What is the loss rate of the hard drive data we have today?" It turned out the loss rate was 25%. That’s along the same lines of what you described—doing a head-to-head comparison. I'm curious how you came to the decision of rolling the solution out at Morgan Stanley.

Kaitlin: It’s a common thing we do as humans—we overestimate capabilities. In those early days, that was a way for us to realize that you're never going to get 100%. We came to the understanding, "Wait, we're trying to get perfection, but we don't expect that of humans." The comfort of being able to prove that when a human does this task they do it at 40% and the AI does it at 80%, even though the risk is still there that a response could be wrong, was key.

One of our biggest core principles at the firm is the change management of how people work and getting them to understand that when you use AI as a co-pilot or a tool, you have to think of it the same way you did when you first learned to use Google for a research report. We were taught that you can't believe everything you read on the internet and you have to cite your sources.

Sam: They make it harder.

Kaitlin: They definitely make it harder. But it's about pushing that adoption and education training with people. We're going to pilot this, and what I always say to people is, "When we're piloting, if you get bad responses, that's great." We're trying to create this experimental, innovation-type mindset where it's like, "You're going to pilot this, it's going to get things wrong, and that's the best thing ever because that's the purpose of our pilot." We want you to tell us that. Then we can go behind the scenes and update the prompt or fix the parameters to get it to a place where 80% of the time it's consistently giving good answers. Then we'll feel comfortable putting that into the world.

It always depends on the use case too. There are obviously some use cases that you want to be consistently right all the time, for example, if you were going direct to a client, which we haven't done yet. But when you're internal and doing something like a summary or an internal virtual assistant, I think you have more wiggle room to rely on the human to understand what was a good or bad response, the same way they would if they were Googling something today.

Sam: I'm definitely going to remember that one: getting things wrong is good for a proof of concept. That's a fantastic way to think about it.

Clemens: Another question: you mentioned having a set of questions to benchmark against. How do you think about going from having that set of questions to benchmark on and figuring out what's going wrong, to actually controlling what's happening once you have it deployed and people are using it? How do you think about evaluations during live inference?

Kaitlin: For our assistant, what we do is take a percentage of the everyday interactions that employees have and review them. We might review, say, a hundred questions a day. We have a whole team of annotators with a rubric of criteria for how to grade things. They'll go in and grade: Was it a good response? Was it a bad response? Was it complete? Was it accurate? Did it get the right article? Was there a better article it could have pulled? They annotate all of that. So every day, we're looking at the trend of the assistant to see if it is consistently staying within range or if it went widely off. If we see that it deviates, that's a huge indicator that something is broken, which could be a range of things.

One of the things we learned very early on about human annotation is that we try to bucket things into core issues. That was critical because when we go to triage, we can see things like, "Consistently, it's not pulling timely articles." So something must have happened with our business rules where it's no longer filtering on date. As we think about moving into a world where AI does more of that review at the time of inference, those business rules become critical. I do think it's important that humans do the first pass because humans can critically think about the decision points that caused them to bucket something into a certain issue. Then, leveraging that SME-type knowledge and giving it to the AI will help the AI be much more effective at solving the problem. Today, we do a combination of both AI-assisted and human annotation. They are comparable to each other in terms of how good they are, and they also help us to scale when we're able to have AI look at 5,000 annotations versus just a hundred in a day.
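
The monitoring loop described here, sampling a slice of each day's real interactions, grading them against a fixed rubric, and watching the trend for sudden deviations, might look something like the sketch below. The sample size, rubric fields, and drift threshold are illustrative assumptions, and the grading step could be a human annotator or an AI grader prompted with the same rubric.

```python
"""Sketch of production monitoring by sampling and grading daily interactions.

Illustrative only: the sample size, rubric fields, and drift threshold are
assumptions based on the conversation, not the firm's tooling.
"""
import random
import statistics
from dataclasses import dataclass

@dataclass
class Grade:
    accurate: bool
    complete: bool
    right_article: bool

def sample_interactions(todays_logs, n=100):
    """Pull a random subset of today's question/answer pairs for review."""
    return random.sample(todays_logs, min(n, len(todays_logs)))

def grade(interaction):
    """Stub for the review step: a human annotator working from the rubric,
    or an AI judge prompted with the same rubric to scale coverage."""
    raise NotImplementedError

def daily_score(todays_logs):
    """Fraction of sampled interactions that pass every rubric criterion."""
    grades = [grade(x) for x in sample_interactions(todays_logs)]
    return sum(g.accurate and g.complete and g.right_article for g in grades) / len(grades)

def is_drifting(history, today, tolerance=0.05):
    """Flag the day if the score falls well below the recent trend, which
    usually points at something broken upstream (e.g. a search business rule
    no longer filtering on date)."""
    baseline = statistics.mean(history[-14:])
    return today < baseline - tolerance
```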

Sam: You definitely make it sound like you've figured out evaluation. So now we're going to ask you about something you haven't figured out: we have this whole world of agentic AI coming. You have LLMs taking actions for people, calling tools. How do you think your evaluation framework will scale to LLMs taking actions? It's a little different than having human annotators saying, "This is good or bad."

Kaitlin: Definitely. The way I've been thinking about it personally—and at the firm, we're being very cautious because we know how critical it is to start with evaluations—is that the problem in this agentic world is that everyone has agents. Every product, tool, and workflow is automatically an agent now. But for us, when we think of agents, we think of autonomous agents—everything you were just talking about. An agent that can self-select the tool it wants to use and the action it wants to take.

As we start to think more about that, I think of it more like an employee. If you're going to have it start executing these different workflows and taking action, then you need to understand what that supervision looks like. What are the decision points where maybe it asks the human for permission? You're going to see a lot more of that in the early days of this being implemented at a place like Morgan Stanley, because it's going to be so critical for the human to sign off. You'll see a combination of potentially having other AI be the guardian.

Sam: A senior AI.

Kaitlin: Exactly. But it does become very complex. When it is taking actions, I think of it like an employee in the sense that we also have to have some type of registry of these agents. What is this agent's purpose? What does it have credentials to do? That's going to be so critical for us to be able to triage, bucket the issues, and understand where it goes wrong. It's definitely going to get complex, and I don't think we have a great answer for that yet because it's still so early, but it's something we're thinking a lot about. It's the same guiding principles of how we create that core foundational framework that everyone can follow. As this grows across the firm, we don't want all these deviations in the ways that people are evaluating these agents. We want one way that we do it across the firm so we can have a really good view on what all the agents are doing, how they're doing it, and if something goes wrong, we can triage effectively and immediately.
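
A registry along the lines Kaitlin sketches, where each agent is recorded with its purpose, its owner, what it is credentialed to do, and where a human must sign off, could start as something very small. The field names and the allow/escalate/deny scheme below are assumptions for illustration, not a description of Morgan Stanley's design.

```python
"""Sketch of an agent registry with explicit purpose, credentials, and
human sign-off points. Fields and policy values are illustrative assumptions."""
from dataclasses import dataclass, field

@dataclass
class AgentRecord:
    agent_id: str
    purpose: str
    owner_team: str
    allowed_tools: set = field(default_factory=set)
    requires_human_signoff: set = field(default_factory=set)

REGISTRY = {}

def register(record):
    """Add an agent to the firmwide registry."""
    REGISTRY[record.agent_id] = record

def authorize(agent_id, tool):
    """Check every tool call against the agent's registered scope and return
    'allow', 'escalate' (human sign-off required), or 'deny'."""
    record = REGISTRY.get(agent_id)
    if record is None or tool not in record.allowed_tools:
        return "deny"
    return "escalate" if tool in record.requires_human_signoff else "allow"

register(AgentRecord(
    agent_id="kb-research-agent",                      # hypothetical example
    purpose="Summarize internal research for advisors",
    owner_team="Firmwide GenAI Solutions",
    allowed_tools={"search_research", "summarize", "send_email"},
    requires_human_signoff={"send_email"},
))

print(authorize("kb-research-agent", "search_research"))  # allow
print(authorize("kb-research-agent", "send_email"))       # escalate: human signs off
print(authorize("kb-research-agent", "execute_trade"))    # deny: outside registered scope
```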

Clemens: One of my favorite topics from working across our engagements is how you deal with change management. You mentioned introducing these tools, and it sounds easy. What's great for us at Scale to see is that there's a lot more readiness among enterprises to adopt new technology fast. But that always requires people actually using it, getting used to new workflows, and working with new types of coworkers. How do you approach change management at Morgan Stanley so that you are able to move fast and get people to use the tools?

Kaitlin: It's a multifaceted approach. In one sense, we have a lot of change management that has to happen with our existing governance. Many of our governance and control reviews weren't built for generative AI; they were built for traditional machine learning processes and procedures. So there's been a lot of change management that we as a firm have gone through over the last several years. Huge credit to our governance and control partners who were able to say, "Okay, this is a new world. How do we think differently about evolving how we review this stuff?" Because to the point made earlier, you could be stuck in a governance review forever and never make it out.

That's one aspect. Then there's the change management on the employee side. For as many employees as we have raising their hands saying, "We're so excited, when can I get the next tool?" we also have employees who have been at the firm for 40-plus years and barely use their computer. For us, it's been a combination of a lot of media engagement, like putting together best-practice videos and sharing resources internally that people can self-service at their own time. What's been really cool over this last year is that in the beginning, we were putting out videos that were much more like "Prompt Engineering 101," and now we're talking about how to build an agentic solution. It's becoming much more advanced, which has been fun.

Then we do a lot of hands-on work. At Morgan Stanley, we have a very collaborative culture, so we'll literally sit with employees at their desks and say, "When you come onto your computer every day and go to do a task..." I'll use the wealth management assistant as an example. I always say to a financial advisor, "I know that you look at the market every single day because that's your job. And I know you probably have a particular client that calls you every day asking, 'What is Morgan Stanley's view?' That's why you work at Morgan Stanley—so you can present the Morgan Stanley view." Instead of coming in every day and looking something up on Google, why don't you come in every day and type into the assistant, "Tell me the things that I should know today based off of what has happened in the market, and tell me Morgan Stanley's view of it."

Just creating that habit. I've gotten so much feedback from people saying, "Wow, that was really impactful. Once I created that habit, I started to use the tool, then I started to understand what it unlocked, and now I do things with it that I didn't even know it could do." It's very complex; we're an organization of 80,000-plus people across the globe, so there are also intricacies between what employees have in the US versus Europe, given the different rules. But we're still very much in the education and adoption phase. It's also about proving to leadership the progress we're making so that they continue to be invested as well.

 


 

AI Hot Takes

Monica: You should have a full eval suite ready to go before you start building your app. Agree or disagree? 

Sam: [Disagree] I just think you learn so much on the job. You see what happens once you start using it, and you may have to add some things. But I also work in a different industry than you.

Kaitlin: [Agree] I can understand. We're definitely on different sides of that coin. For us, I think it's important that you have an understanding of what you want it to do so that you can actually assess if it was good or bad. Because if you get too far down building a solution and then you test it with real questions that users ask and recognize that it's useless to an actual end user, you find yourself in a not-so-great spot.

Sam: What about when that exec asks for that poem again, though?

Kaitlin: True. For creative use cases, I get it.

Monica: For large-scale AI transformations, enterprises need to slow down and value accuracy over speed. 

Clemens: [Disagree] We have a pretty good split here. My personal perspective is that accuracy is obviously super important, but I hesitate to prioritize accuracy over everything else because on its own it doesn't tell you anything about the stakes. Oftentimes, you can have a much greater impact without reaching a certain level of accuracy by keeping a human in the loop in the proper way and having the right guardrails around things. Right now, that probably matters more, so you don't get outcompeted. Of course, the accuracy you need depends on the use case.

Kaitlin: [Agree] Absolutely. From where I sit, I think some company, somewhere, will make a mistake sooner rather than later, and I can almost guarantee it'll be because they didn't focus on accuracy. So for us, it's just a core principle that we are not budging on our standards.

Sam: [Disagree] Maybe if we replaced "accuracy" with "usefulness," then I would agree with the statement.

Monica: The biggest barrier to enterprise AI isn't technical, it's people.

[All Agree]

Sam: I think it's just about taking what we already have. We already have things that are objectively really cool, useful, and valuable. If we can figure out how to frame the product and the people around what already exists, I think there's a lot of value to be had.

Clemens: I feel like change is always hard, so adopting anything new is difficult. Especially with these tools, using them is a very new kind of thing, so it requires creativity. Just showing people how to get something valuable out of it is oftentimes the solution. Sometimes I see a very creative way to use the models that I've never thought about, and then I can do it every day.

Kaitlin: It goes back to my first statement, which is it's easy to build proofs of concept with this technology; it's hard to get it to a place where it's useful. I think our employees are the ones who are going to unlock the coolest, newest use case for an investment bank, much more than I will be able to. The more they use it, the more they're used to it, and the more that they bring it into their workflows and their everyday lives, the more that I'm going to be in a position with my team to be able to capitalize on something really cool that we didn't even know we could solve with AI. That's where the magic starts to happen, and I do think people are still the barrier to that at this point.

