
Welcome to Human in the Loop.
We share our learnings from working with the largest enterprises, foundation model builders, and governments to equip you with the insights you need to build practical, powerful AI systems in your enterprise.
About the Episode
In this episode, Scale’s Head of Enterprise ML, Sam Denton, Head of Enterprise Engineering, Felix Su, and Head of Product, Scale GenAI Platform, Clemens Viernickel, dig into a core pillar of AI governance: Evaluations. They cover why traditional ML metrics fall short when evaluating AI applications, how to assess agents and end-to-end workflows, and how evaluations are critical to building trustworthy, production-ready enterprise AI.
Key recommendations from the episode:
- Rethink evaluation from the ground up: Traditional ML metrics don’t cut it for GenAI. Enterprises must evaluate entire AI agents and systems through qualitative, judgment-based methods focused on safety, trust, and reliability.
- Bring engineering rigor to GenAI evals: Treat evaluations like software development. Use CI/CD pipelines, versioned tests, and regression checks to catch failures early, ensure consistency, and continuously improve systems in production.
- Define success for your business, not the benchmark: Standard eval suites are helpful, but real value comes from creating custom metrics and criteria tailored to your enterprise use case—an investment that pays off in clearer insights and better model performance.
Watch the full episode or read the transcript of their conversation below.
Episode 6 - Enterprise Evaluations
Sam: Today we're going to be talking about enterprise evaluations. To kick us off, Clemens, can you start by talking about the different types of enterprise evaluations and the difference from traditional model evaluations?
Clemens: Yeah, sure. First, when we think about traditional model evals, most of the time, people think about precise quantitative metrics like precision and recall. There's a very clear way of measuring the quality of these models. That's very number-driven, very metrics-driven. With GenAI models, which are non-deterministic and generate new content, we have moved to a more qualitative framework. The idea is you need to assess metrics that are more about judgment of the quality and safety of that output. For example, imagine a classification problem in traditional ML, where it's pretty clear how to determine whether the result is correct. In contrast, if a GenAI model is producing a report, assessing its quality, even its accuracy, might be significantly more qualitative in nature.
On top of that, we of course have the element that in the GenAI world, we often embed these GenAI models into agents or larger systems. That means we don't want to assess only the quality or the output of the model itself, but of the end-to-end system or the agent. So, say there's an agent doing extraction of certain signals from a phone conversation. That might involve several steps: taking in the audio input, implementing some additional business logic, and eventually producing an output. We want to assess the overall quality and results of that agent as a system, versus only the individual model and its quality.
What we generally see on the enterprise side is a lot more focus on trust, safety, and responsibility. That also leads to this element of there being two dimensions to evaluation when we think about the enterprise. On the one hand, there's the traditional, perhaps more machine learning-driven angle of wanting to figure out the actual quality and what is going wrong—where can we make the system better? On the other hand, there's the element of the actual quality of the overall solution, in the sense of: is it ready to be rolled out to prime time and to production users? Or, after it's been rolled out, what are still the elements where the system is failing when it's encountering real-world interactions? So, all of these things: being non-deterministic, more qualitative, being more focused on trust and safety, being focused on the entire system, and then also this dichotomy between quality assessment to figure out what's wrong versus determining if it's ready to be rolled out; those are the things that set enterprise evaluation apart.
Implementing Evaluation in Practice
Clemens: But with that as a setup, Felix, maybe to go one level deeper, you can talk a little bit about how we implement this in practice. How do you set up the system?
Felix: Yeah. We tend to think of agent evals as somehow different from the normal software development lifecycle that has been around for decades now. So, what do you do when you set up a code project and want to deploy it to production? You have to pass some unit tests, you have to pass some integration tests. And that's all set up through your CI/CD pipelines and things like that. Somehow, because AI and agents are here, we get confused and try to reinvent the wheel.
What do you need to do? Before you deploy your agent to production, it has to pass some eval criteria. It has to meet a certain bar. Now, the thing that I think freaks everyone out and is confusing is the stochastic behavior. I think Sam in a second is going to go into some different techniques we can use to somewhat control that behavior. But the setup is the same from an engineering perspective. How I would set up the project is: you have a CI/CD pipeline, you have eval tests that the agent needs to pass. Those tests are versioned, they're auditable, they have specific things that people can point to if there are failures in the environment. Whenever there's a failure in a production system, you add that as a regression test to your evaluation test suite.
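To make that concrete, here is a minimal sketch of what an eval suite wired into a CI/CD pipeline could look like, written as a pytest-style regression test. The agent entry point (`run_agent`), the judge helper (`llm_judge_score`), the case file, and the score threshold are hypothetical placeholders rather than a description of any specific platform.

```python
# eval_regression_suite.py
# Minimal sketch of running evals as versioned regression tests in CI.
# `run_agent` and `llm_judge_score` are hypothetical stand-ins for your own
# agent entry point and LLM-as-a-judge helper; swap in whatever you actually use.
import json

import pytest

from my_agent import run_agent          # hypothetical
from my_evals import llm_judge_score    # hypothetical

with open("eval_cases/regressions.json") as f:
    # Each case was added after a real failure and lives in version control.
    CASES = json.load(f)

@pytest.mark.parametrize("case", CASES, ids=lambda c: c["id"])
def test_regression_case(case):
    output = run_agent(case["input"])

    # Deterministic checks: the signals you fully trust.
    for required in case.get("must_contain", []):
        assert required in output, f"missing required content: {required!r}"

    # Noisier check: an LLM judge scoring the output against a rubric.
    score = llm_judge_score(case["input"], output, rubric=case["rubric"])
    assert score >= case.get("min_score", 4), f"judge score too low: {score}"
```

The point is only that the workflow is ordinary software engineering: the cases are versioned and auditable, the run is automated, and every production failure becomes a new case.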
So, I think it's really important for us to not overthink, but also to think carefully about the differences. The "not overthink" part is the setup. The "think carefully" part is how you do the setup, which Sam is going to get into. And for all of our customers, a really important aspect of our jobs is education.
But Sam, I talked about all this high-level stuff. Why don't we get into the details? How do you set up an eval? What does unit testing with evaluations look like in practice?
Sam: Yeah, thanks Felix. I think Clemens touched a little bit at the beginning on how eval has changed a lot from traditional ML systems; it's a little bit more qualitative now. You talked a little bit about the infrastructure required for setting up these tests. But then there's this open question of what you fill that infrastructure with. What are the actual unit metrics that you populate it with?
There are obviously a lot of frameworks out there. There are standardized evaluation suites for things like RAG (Retrieval Augmented Generation) that cover truthfulness, groundedness, and things like that. But I think what we're finding in the enterprise setting is that those are helpful as signals, but really, every enterprise problem is different, and the way you define success is different for each one. So, I think it does require a little bit of effort at the beginning to figure out: What is the business outcome I'm trying to solve? What are the metrics I can use as guideposts to see that I'm on the right track? You need to be comfortable with both noisy signals that you somewhat trust and also signals that you really trust.
When you think about these unit tests that you've brought up, Felix, you can think of them as the signals you really, really trust. But you can also use things like LLM-as-a-judge and LLM preference ranking as noisy signals that you're okay with being wrong in their evaluations 5% of the time, to make sure that your system is directionally improving.
So I talked a little bit about LLM-as-a-judge; obviously, that's a huge lever to unblock evaluation in some of these more ambiguous situations. Another thing we've seen is that there are certain business outcomes where, say you're thinking about a customer support situation, you want to make sure that you're evaluating what percentage of the time you are routing to the right agent in that environment. That is a very clear quantitative metric that you can track over time and make sure you're improving on. But in order to do that, you have to have a system and a platform in place that allow you to capture all the logs, capture when you routed to the right system, and then populate these metrics into a platform where you can track them over time.
I think people really underestimate how much of a pain that is. This is not a trivial plug-and-play situation. Every enterprise comes with their new metric, their new success criteria, and their new definition of what's important—P90, P99. You need to be flexible enough to be able to look at these logs, look at these traces, figure out where this decision is happening, figure out what your ground truth value is for that situation, and then actually evaluate it and populate it into some platform. That's what we use SGP (Scale GenAI Platform) for when we go into these enterprises for delivery: this flexible evaluation place where we can take these logs and then populate metrics. But it is really hard on a case-by-case basis.
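As a rough sketch of the kind of per-use-case metric computation Sam is describing, the snippet below derives routing accuracy and latency percentiles from a log of traces. The trace schema (fields like `predicted_agent`, `expected_agent`, `latency_ms`) is an assumption for illustration; real logs will look different and usually need far more cleanup.

```python
# routing_metrics.py
# Minimal sketch: derive "did we route to the right support agent?" from logged
# traces. The field names below are hypothetical; adapt them to your own logging.
import json

def routing_metrics(trace_path: str) -> dict:
    with open(trace_path) as f:
        traces = [json.loads(line) for line in f if line.strip()]

    labeled = [t for t in traces if "expected_agent" in t]  # cases with ground truth
    correct = sum(t["predicted_agent"] == t["expected_agent"] for t in labeled)
    latencies = sorted(t["latency_ms"] for t in traces)

    def pct(p: float) -> float:
        # Simple percentile over the sorted latencies.
        return latencies[int(p * (len(latencies) - 1))]

    return {
        "routing_accuracy": correct / max(len(labeled), 1),
        "n_labeled": len(labeled),
        "latency_p90_ms": pct(0.90),  # every enterprise brings its own criteria
        "latency_p99_ms": pct(0.99),
    }

if __name__ == "__main__":
    print(routing_metrics("traces.jsonl"))
```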
Felix: Yeah.
Challenges in Auto Evaluation
Felix: I had a quick question, actually, Sam. I remember a while ago, we were chatting about this, I think sometime last year, where we were talking about auto-eval and you were saying, "This is hard. The fact that everyone thinks you can just throw an LLM on here and ask an LLM to grade another LLM—this is ludicrous." So, I want to know what your biggest learnings were from that experience. What are some of the takeaways that make this a really hard problem, and what are some pitfalls that people can avoid?
Sam: Yeah, that is a good question. I think the best story I have around this is—and I think this really scares people—we had a project where we probably spent two or three months having one ML engineer full-time trying to improve our auto-evaluation system for one of our biggest projects. That is a non-trivial amount of ML cycles going into just figuring out what is the best possible signal we can gather on evaluation executed by an LLM. When I say that, I think people assume, "Well, you can just prompt an LLM and see what happens." But if you really want to trust it, you have to iterate on this evaluation flow.
Some of the things that we found to be most successful in that space of using LLMs for evaluation are:
- Start with a very small eval set that you really trust, use that to measure your auto-evaluator, and then scale it up from there.
- Another thing that we found really helpful is having high-quality in-context examples paired with equally high-quality explanations (a minimal prompt sketch follows below). This gives the LLM some alignment to what you mean as a human. Rubrics are great, but at a certain point, we're talking about human subjectiveness between what's a two and what's a three, what's a four and what's a five. So it really requires some thoughtful in-context examples to help the evaluator scale up what you have in mind for your rubric for a specific auto-evaluation.
- Finally, we have seen some examples where you can actually train models for auto-evaluation and have a separate judge. But I think those are pretty specific cases where you're looking for something with a really specific quality setup.
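Here is the sketch referenced above: one way to assemble an LLM-as-a-judge prompt that pairs a rubric with a couple of graded, explained examples. The rubric wording, placeholder examples, and output format are illustrative assumptions, not a real evaluation system.

```python
# judge_prompt.py
# Minimal sketch: build an LLM-as-a-judge prompt that pairs a rubric with
# high-quality in-context examples and explanations. Everything here
# (rubric wording, examples) is illustrative.

RUBRIC = """Score the answer from 1 (unusable) to 5 (production quality)
for factual accuracy and completeness with respect to the source document."""

# Each in-context example shows the score AND the reasoning, so the judge
# learns what separates a 3 from a 4 for this specific use case.
EXAMPLES = [
    {"question": "...", "answer": "...", "score": 5,
     "explanation": "Every claim is grounded in the source and nothing is missing."},
    {"question": "...", "answer": "...", "score": 2,
     "explanation": "Two of the three figures are not in the source document."},
]

def build_judge_prompt(question: str, answer: str) -> str:
    shots = "\n\n".join(
        f"Question: {e['question']}\nAnswer: {e['answer']}\n"
        f"Score: {e['score']}\nExplanation: {e['explanation']}"
        for e in EXAMPLES
    )
    return (
        f"{RUBRIC}\n\nGraded examples:\n\n{shots}\n\n"
        f"Now grade this case.\nQuestion: {question}\nAnswer: {answer}\n"
        'Respond with a JSON object: {"score": <1-5>, "explanation": "..."}'
    )
```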
Clemens: Yeah, I mean, it's fascinating what you're describing, Sam, and also how you're using agents themselves to evaluate other agents. It sounds very meta, but in that sense, it's really great how the technology can almost leverage itself to improve, even though it's not magic. It's not a magic button, and it's also a pretty iterative process. But that's pretty fascinating. I think one aspect that you haven't really touched upon yet is: what about the data? When we talk to enterprises, the typical question that comes up is, "Oh, what do we actually evaluate on? Do we need to come in with a huge golden dataset that we need to have ready?" What do we typically see in terms of how we get to a great evaluation dataset? I think it's a good topic.
Sam: Yeah, definitely is. It is such a hard problem, actually having a dataset that you're really excited by. Generally, our guidance is that it is really, really powerful and helpful to spend a little bit of time having a thoughtful human dataset that has a certain breadth of prompts you're expecting, a certain breadth of seed starting tasks, and things that you think cover the whole range of what you're hoping for. But after that, I think we've found a lot of power in scaling up synthetic datasets.
So, what we do is we work with these enterprises to ask them to give us the starting point and then do everything we can to scale up synthetic datasets from there. Sometimes we'll come and revisit those human datasets and ask for a few more examples because we're finding some things in production maybe that weren't necessarily captured there. We can take some of these traces that are happening in production and then put them back into the human evaluation set and have human SMEs go in and say, "What is the correct answer? What should have happened here?"
But to make a long answer short, I think actually creating this dataset should just be recognized as a separate challenge. And although we do ask for that initial ramp from enterprises to help us get that human dataset to really understand the problem—because otherwise, I think you're flying a little bit blind without that human evaluation set at the beginning—once it's defined (and it can be pretty limited in scope), then there are a lot of tools that we can use to get this noisy signal that we're hoping for and that we can trust.
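As a rough illustration of that "small human seed set, then scale synthetically" pattern, here is a sketch of expanding seed tasks with an LLM. The `call_llm` helper, file names, and prompt wording are hypothetical, and a real pipeline would add deduplication, filtering, and human spot-checks.

```python
# synth_eval_data.py
# Minimal sketch: expand a small, trusted human seed set into a larger
# synthetic eval set. `call_llm` is a hypothetical chat-completion wrapper.
import json

from my_llm_client import call_llm  # hypothetical helper

def expand_seed(seed_case: dict, n_variants: int = 5) -> list[dict]:
    """Ask an LLM for paraphrased / edge-case variants of one seed task."""
    prompt = (
        "You write evaluation cases for an enterprise AI assistant.\n"
        f"Seed task: {seed_case['prompt']}\n"
        f"Expected behavior: {seed_case['expected']}\n"
        f"Write {n_variants} new variants as a JSON list of objects with "
        "'prompt' and 'expected' fields. Vary phrasing, difficulty, and edge cases."
    )
    variants = json.loads(call_llm(prompt))
    # Tag provenance so synthetic cases are never confused with human ones.
    return [{**v, "source": "synthetic", "seed_id": seed_case["id"]} for v in variants]

if __name__ == "__main__":
    with open("human_seed_set.json") as f:   # small, human-written, trusted
        seeds = json.load(f)
    synthetic = [v for s in seeds for v in expand_seed(s)]
    with open("synthetic_eval_set.json", "w") as f:
        json.dump(synthetic, f, indent=2)
```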
Future of AI Evaluation
Sam: I think from here, one thing that I was interested in pushing forward as a conversation topic was really around what happens to evaluation from here. I think we've talked a little bit about RAGs and agents with tool calling, but there's this next evolution of LLM agents coming, where we think about long-running agents that take two days to do a task, where you have an agent call out to a bunch of different other agents that all have their own tasks. You want to make sure that you're evaluating this to the best of your ability, but it's also a really complex and long-running situation. So, I'm curious how you think about the future of evaluation as we get to these next-generation agents.
Felix: One thing is, if you run an agent for a short session, let's say it's an OpenAI chat completion or even an AI workflow that does several calls in a row, you can kind of say, "Oh, these are all the decision points that I have to make sure are right." The retrieval should be accurate; the AI part should be accurate. If you can decompose your agent into a deterministic graph of nodes, then you can probably come up with a convincing dataset.
Now, let's explode the problem. Let's say the graph can roll out non-deterministically to any part of the nodes infinitely. You have compounding error rates everywhere. So how can you really be sure that you've controlled the problem? Okay, sure. I controlled this problem to 75% accuracy. So in a single one-session run, it's 75% accurate. But what about 0.75 times 0.75? That's already down to about 56%, and the accuracy keeps dropping with every additional step.
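To spell out the arithmetic behind that compounding, under the simplifying assumption that each step in a rollout succeeds independently with the same probability p:

```latex
P(\text{rollout succeeds}) \approx p^{n}
\qquad\text{e.g. } 0.75^{2} \approx 0.56,\quad
0.75^{5} \approx 0.24,\quad
0.75^{10} \approx 0.06
```

Per-step accuracy that looks acceptable in isolation can therefore still produce long rollouts that fail most of the time.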
Human in the Loop and Long-Running Agents
Felix: And so, I think my opinion of this is that human in the loop is so critical in these stages. I think there's going to be a separation between what eval is and what functional, running agents are. Eval should be saying, "Alright, I know this is my graph. With a single rollout or a subset of rollouts, I'm going to control each of these individual things. I am going to make sure retrieval is 75% accurate. I'm going to hit the accuracy numbers there." But that's not to say in production, it's not going to fail.
So then, switch to production. You're going to say, "Let me build my system in a way where when I need to reset my accuracy points, I'm going to offload to a human."
I gave the example in a previous episode where my calendar agent went crazy and started—for those of you who didn't watch that episode—deleting my events and scheduling stuff with the C-suite. Probably not a good idea.
So, putting in injections where you need to say, "Okay, let me reset. Let me get 100% accurate up to this point. Reset all my probabilities. Let me vet with a human that these are all okay to do," then gives you confidence to move forward. So if my retrieval accuracy is at, say, 75% (which is probably too low, but let's go with it), and then I confirm with a human, I reset my accuracy to 100% because the human said yes, these are all the right documents. So it's not compounding error rates of 75% all the way through; you're confirming, resetting, and getting back to checkpoints.
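A minimal sketch of that confirm-and-reset pattern is below. The console prompt stands in for a real review queue or approval UI, and `retrieve_documents` / `continue_rollout` are hypothetical stand-ins for steps of an actual agent.

```python
# checkpoint_gate.py
# Minimal sketch of a human-in-the-loop checkpoint: pause the rollout,
# have a human vet the intermediate result, and only then continue.
from my_agent import retrieve_documents, continue_rollout  # hypothetical

def human_approves(docs: list[str]) -> bool:
    """Console stand-in for a real review queue or approval UI."""
    print("Agent wants to proceed with these documents:")
    for d in docs:
        print(f"  - {d}")
    return input("Approve? [y/N] ").strip().lower() == "y"

def run_with_checkpoint(task: str) -> str:
    docs = retrieve_documents(task)  # maybe ~75% reliable on its own
    if not human_approves(docs):
        # Escalate instead of letting a shaky step compound downstream errors.
        raise RuntimeError("Human rejected retrieval; escalate to a person.")
    # Past this gate, the retrieval step is treated as verified ("reset to 100%"),
    # so later errors no longer compound on top of an unchecked foundation.
    return continue_rollout(task, docs)
```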
To me, this is how humans operate. If I hire an engineer on my team, maybe a new hire, I would say, "Hey, why don't you go and take care of this thing?" And periodically along the way, they're going to say, "Hey, is this right? Is this right? Is there something that I can do?" If I just let them run and they're completely new, it's probably not going to be exactly what we agreed on if they're still learning.
So that's the way I see this: what you do with eval is, at independent checkpoints, can you make sure that those things are good? But do they guardrail against everything? No. That's how I see things playing out in the future: there's going to be evals, and then there's going to be human in the loop. Great name for our podcast. But yeah, maybe you two have your own ideas about it.
Sam: One thing that's tricky there, and I like the example of a new hire, is then if you have this human in the loop resetting and saying, "Yep, everything up until that point was correct," what happens with progress when it goes wrong? And you say, "Hey, actually that step was not correct." Do you then have to go and manually do the task for the LLM? Do you then let the LLM try again? Do you let it try a few times maybe so it can learn on its own? Do you just give it the answer? It actually is pretty similar to this concept of mentorship and guidance as well. So I'm curious how you think about that and when do you decide, "Okay, I really need progress to happen at this point."
Felix: Yeah, that's a great question. I think it obviously depends on the business use case. There are going to be certain situations where it has to be 100% accurate up to certain points. So maybe like finance, for example, you have to get it right. So you would need to exit out and be like, "Alright, humans, please help complete the task."
But there are other things that maybe can roll out for longer. Like, let's say, "Help me write a document." I don't care if you try 10 times today; I need the document by tomorrow. So if you try 10 times or five times, as long as it's done by tomorrow, you could roll out.
But this is where I think there's so much opportunity to be had. Think about what we just talked about: data collection is super important for evaluation. And then, we also didn't really even talk about how you improve after the evaluation: taking the data and giving it to Sam and saying, "Great, you've got a bunch of labeled data now. Fine-tune, improve, do all this sort of stuff." Imagine you have an agent that goes off and does all these rollouts, and then you just get a binary yes or no from a human: did this lead to something good or not?
Essentially, the AIs that will improve really quickly are the ones that get a lot of shots on goal. You get a lot of shots on goal; tomorrow, Sam wakes up. Not only is his document done, but he has like a hundred examples where 50 of them were good and 50 of them were bad. I'm sure you could train a model with a pretty low sample size if you just let that agent run and you basically only clicked a few buttons and said "Approve" or "Deny."
To me, the vision of the future is: humans are like executives. You sign checks, you say yes or no. And agents are your executors. Obviously, I'm oversimplifying a ton of this stuff, but in reality, as we say in all of our podcasts, you have to actually go and investigate and do the dirty work of figuring out which paths you have to guardrail and not.
But yeah, to answer your original question, I do think it's important for you to try to let the agent attempt a few times, as long as you are willing to take on that extra burden of it taking a little bit longer. It'll make the system improve a lot faster.
Clemens: Yeah, I really love your analogy of hiring an engineer and telling them to do a task, because we want to think about agents solving problems much more the way humans solve them. There's, however, this whole new dimension of evaluation that's coming up, which you alluded to. One of the critical questions is, of course, using your metaphor here: how do you instruct your agent, or your engineer, on when they should come to you with a question? You mentioned there should be a couple of retries, but there's this question of escalation: when does the engineer ask for help, or in this case, when does the AI agent actually escalate to a human? And how do we define the interface for that handoff? I know that at Scale, of course, we're doing a lot of research in that dimension as well, but I'd love to hear both your thoughts on that particular interface between human and machine.
Felix: I found something interesting: these models tend to crutch on, "Let me ask for confirmation." I'm sure even with a decent amount of prompting, they often still say, "Oh, I need to ask somebody," or "I need to do something." And when I think about it, it's very reasonable. Why? Because most of the applications we see today are chat-based, short, single-session things. So a lot of the data that the model builders have today is conversational data. Conversational data natively is going to be biased towards, "Let me confirm," or "Let me ask you." So they're not as conditioned for these super, super long-running things.
My observation was, Clemens, as you said, ironically, they ask a lot. Too much, in fact. "Please take care of it," is what I would like to say. And I think you can come up with interesting ways to continue the process: multi-agent systems, actor-critic systems, reward modeling. There's a whole bunch of ways, and this was a huge research thing back at Amazon when I was doing robotics. We had some RL techniques that I would love to try in this space as well. I don't have the answers in terms of what the best way to do it is, but I do know the problems. So I think it's just a really, really exciting space. You two know that Scale is working on this. We have huge things coming up in terms of planning for this future. I think we're going to get a ton of information by the end of this year about how you make things run for a long period of time with certain guardrails in place. So the podcast in, whatever it is, December, that's going to be a really good one, so you should tune in to that one. But definitely a lot of exciting work to be done.
Conclusion
Sam: Yeah, I think it's really interesting when you think about the question that was just posed of making sure that the agent raises things to you the right amount or at the right moments. What type of problem is that fundamentally? Well, it's an evaluation problem.
Clemens: Yeah, exactly.
Sam: It's like starting the podcast over again. I think there's a lot of really exciting work for us to do in this space around evaluation.
That's all for today's discussion about AI evaluation in enterprises. If this type of work sounds exciting, we're hiring. Check out our careers page for openings. And then, next week we're doing a Red Teaming 101 for enterprises. So make sure to subscribe so you don't miss it.