Human in the Loop: Episode 3 | What Data Do I Need for Effective Enterprise Agents?

Welcome to Human in the Loop.
Scale is the humanity-first AI company. We believe the most powerful and responsible AI systems are built with humans in the loop. And now, it’s also the inspiration behind our new video podcast series Human in the Loop: your guide to understanding what it takes to build and deploy real-world enterprise AI systems.
We share our learnings from working with the largest enterprises, foundation model builders, and governments to equip you with the insights you need to build practical, powerful AI systems in your organization.
About the Episode
In this episode, Scale's Head of Product for Scale GenAI Platform, Clemens Viernickel; Head of Enterprise ML, Sam Denton; and Head of Product, Enterprise Solutions, Ben Scharfstein, break down how enterprises can capture agent-ready data from real expert workflows, and how to use that data to build adaptive AI agents that are actually effective in your enterprise.
Watch the full episode or read the transcript of their conversation below.
Episode 3 - What Data Do I Need for Effective Enterprise Agents?
Clemens: Today we are going to talk about enterprise data specifically. When we discuss data in the enterprise, people often think about context data: all the documents and information an enterprise has stored. But there is a second, much larger body of data that typically lives in people's heads, and it is often the most valuable kind.
Unlocking Institutional Knowledge
Clemens: To kick us off, Ben, why don't you dive a bit deeper and tell us how we're thinking about enterprise data, specifically the data locked inside experts' heads?
Ben: Certainly. Clemens, as you said, there are really two types of data in the enterprise. You have context data, which you can think of as the information someone would look at on a screen to do their job. But that's just a small subset of what makes an enterprise work, and of the data that makes it valuable. As you said, we think of the rest as the institutional knowledge of the enterprise, locked inside the heads of subject matter experts.
Most employees at a company can't perform at one hundred percent on day one, even with years of industry experience. A lot of institutional knowledge comes from learning on the job, gaining experience, and building a history of performed tasks. It takes a lot of time reviewing contracts, learning how to work with customers, absorbing the nuance of all these things, and then putting that into practice to serve customers.
Training AI Agents with Expert Data
Ben: When we're thinking about training agents, it's important to unlock this latent data—the things you observe while doing the job that aren't necessarily written down. It's the type of knowledge you gain through apprenticeship. The reason this is so important as we're thinking about the next generation of agents is that when we interrogate what we want agents to do, it's not just the things software does today. We're trying to get agents to do the things humans do today, whether through augmenting them with co-pilots or, potentially long-term, replacing at least aspects of what they do.
Clemens: How would you contrast this kind of data with what models are already trained on?
Ben: That's a great question and something we think about a lot on the enterprise team. In general, models are trained on all available public data. If you just take public data, you'll get the average of what everyone thinks. But we also know that companies have advantages, like process power and institutional knowledge.
So, that's one of the biggest differences. What really makes it special for enterprises is that this data is specific to them and, in many ways, is their advantage. We need to capture it, turn it into AI-ready or agent-ready data, and then leverage it for those companies to make agents specific to them, their processes, and their expertise.
Agent Assistance vs. Full Autonomy
Sam: How do you think about the difference between agent assistance versus agents with full autonomy? This is something we've come across a lot in our different projects and with clients. There's a demand for both, and maybe we're reaching a point where they will be the same thing, but for now, it seems somewhat separated. I'm curious how you think about the data captured in both scenarios.
Ben: You can think about the levels of autonomy much like the levels of self-driving. In self-driving, first we had cruise control, then lane assist, and then automatically following the car ahead. That's roughly where we are now with agents. We're trying to get to full autonomy, but to get there, you need a human in the loop, building co-pilots that allow you to capture this data.
We don't just jump to full autonomy because this data isn't necessarily out there in the world; it's not easy to specify reward models that clearly upfront. So, we need to observe what humans do to get there. To your point about the difference between a fully autonomous dataset and an augmented dataset, it's partly in the goal. When you're augmenting a human decision, you're not necessarily trying to make the decision, but you are trying to capture the data relevant to making it. That data then moves us to further levels of autonomy, where agent data lets us actually make the decision. We have to build up this pyramid over time.

It's difficult in many cases to jump straight to the end, because the human-computer interaction may not be ready, the industry may not be ready, or the company may not be ready. There's a lot of process and change management involved. Also, there are times when we really do want humans in the loop, and it's important for humans to make the final decision. But if we can capture agent data that helps augment those decisions, that's really valuable for the company.
Sam: That's a great point. As we continue down this path from assistance to autonomy, it's important that we continue to leverage our data capture and agent data protocol within SGP to set ourselves up for success, not just now but in the future as well.
Ben: Totally. And this agent data capture isn't just about what people are doing at the company; it's also about how their customers are interacting, how agents interact, and how humans interact. The lines are blurring over time, and having a very expressive way to capture actions that either human agents or AI agents take is core to what we're doing at Scale.
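To make that idea concrete, here is a minimal sketch of what a uniform action record could look like, one that captures human and AI actions in the same shape. Everything here (ActionRecord, actor_type, the example fields) is an illustrative assumption, not Scale's actual protocol or schema.

```python
# A minimal sketch of recording actions uniformly, whether a human or an
# AI agent performed them. All names here are illustrative assumptions,
# not Scale's actual schema.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

@dataclass
class ActionRecord:
    actor_type: str          # "human" or "ai_agent"
    actor_id: str            # who took the action
    action: str              # e.g. "draft_extraction", "approve", "edit"
    inputs: dict[str, Any]   # context visible when the action was taken
    output: Any              # what the actor produced or decided
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# The same record type captures an AI draft and the human correction,
# so the two can later be paired into a training signal.
draft = ActionRecord("ai_agent", "extractor-v1", "draft_extraction",
                     inputs={"document_id": "contract-42"},
                     output={"term_months": 24})
fix = ActionRecord("human", "lawyer-7", "correct_extraction",
                   inputs={"document_id": "contract-42", "draft": draft.output},
                   output={"term_months": 36})
```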
Building and Integrating AI Agents
Ben: Clemens, do you want to walk us through how we're doing this for our enterprise customers inside the solutions we're building for them?
Clemens: Definitely. We've talked a lot about how important it is to capture the data inside experts' heads. That data capture principle, the Agent Monitoring Protocol, is at the center of how we've built our product, SGP, and it drives a cycle we think of in four steps.
First, implement an initial version of the agent. Second, integrate it directly into the workflow of a subject matter expert, like a lawyer or someone working at a life sciences company. Third, capture the human feedback and the interactions that run through the agent. And lastly, take that data and train the agent on that human feedback and interaction data. So, to recap our four-step process: implement, integrate, capture feedback, and train on that data.
What's beautiful about this workflow is that it optimizes for speed. We've built all the instrumentation and tooling to do this as fast as possible to get to the first version.
Then, when we say we integrate it into the expert's workflow, that's crucial, because we don't want to add extra work for the subject matter expert. We don't want them doing data work, like creating training data themselves or running evaluations outside of their core job. Instead, we want them to "always just do their job." So the lawyer, instead of doing the extraction manually, gets a first draft from the AI and just has to correct it. They're already getting value while simply doing their regular job. And that is exactly the data we need to capture and harness to make the agent better.

Because it's a cycle, it also improves over time: the more the agents are used, the more data we produce, and the better we can make the agents. This is the adaptive learning system we at Scale believe in very strongly: we capture data so that agents get better as they're being used.
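To make the cycle concrete, here is a minimal sketch of the implement, integrate, capture feedback, and train loop, under assumed interfaces. Agent, expert_review, and train_on are hypothetical stand-ins, not SGP's actual API.

```python
# A minimal sketch of the four-step cycle. All names are hypothetical
# stand-ins, not SGP's actual API.

class Agent:
    """Step 1: an initial agent version (here, a trivial stub)."""
    def run(self, task: str) -> str:
        return f"{task}: term is 24 months"

def expert_review(draft: str) -> str:
    # Steps 2-3: the expert just does their job, accepting or correcting
    # the draft inside their normal workflow; the correction itself is
    # the captured feedback.
    return draft.replace("24 months", "36 months")

def train_on(agent: Agent, corrections: list) -> Agent:
    # Step 4: fold (task, draft, correction) triples back into the agent,
    # e.g. as few-shot examples or fine-tuning data.
    return agent

def adaptive_loop(agent: Agent, tasks: list) -> Agent:
    corrections = []
    for task in tasks:
        draft = agent.run(task)          # the agent produces a first draft
        final = expert_review(draft)     # the expert reviews it as part of their job
        if final != draft:
            corrections.append((task, draft, final))
    return train_on(agent, corrections)  # the cycle repeats as usage grows

agent = adaptive_loop(Agent(), ["extract term from contract-42"])
```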
Ben: Sam, maybe you can talk through how that encoding works and the various techniques we use there.
Sam: For sure. From an ML standpoint, there are two separate levers for where this agent data comes in and how the codified latent knowledge of the SME gets used. One is taking that information and influencing the way these LLMs generate content themselves. The second is the context and the plumbing around what makes the LLM useful: Clemens, you talked about agent configurations; we talk about retrieval functions and the different tools available. That's the second lever. The first is the LLMs themselves.
So, I'll talk a little about how we take the data from these SMEs and influence both these pieces. When it comes to model capabilities, I like to think of this as a massive domain adaptation space. As we've talked about, the data used to train these LLMs is quite different from the enterprise SME knowledge space. Our job is to help push these models into the domain most helpful to these enterprise SMEs. This comes in many different forms. What we want to do is take this data we capture with SGP and use it within LLM generation.
This helps the agent and the LLM understand how these assistants can be most helpful to SMEs. It can also be expanded. Often, we'll look at the traces SGP captures and find conversations that serve as important domain seeds: the types of conversations SMEs like to have with these assistants. We'll take these domain seeds, or typical questions, and use them to scale up a really large dataset, potentially synthetically, to learn the different types of conversations we want these agents and LLMs to have. For these synthetic conversations, we can use verifiable rewards or checks to ensure they stay within certain parameters, creating a large dataset of high-quality example conversations.
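As a rough illustration of that seed-and-scale step, the sketch below expands a couple of domain seeds into a larger, checked synthetic dataset. generate_variant and passes_checks are placeholders where a real system would plug in an LLM call and domain-specific validators.

```python
# A hedged sketch of scaling domain seeds into a synthetic dataset behind a
# verification gate. The seed questions, variant generator, and checks are
# all placeholder assumptions.
import random

SEED_QUESTIONS = [
    "Which clauses in this contract limit our liability?",
    "What are the renewal terms in this contract?",
]

def generate_variant(seed: str) -> str:
    # Placeholder for an LLM call that paraphrases or extends the seed.
    return seed.replace("this contract", f"contract #{random.randint(1, 999)}")

def passes_checks(example: str) -> bool:
    # Verifiable check: keep only examples within known parameters,
    # e.g. on-topic, well-formed, answerable from available context.
    return "contract" in example and len(example) < 200

synthetic_dataset = [
    variant
    for seed in SEED_QUESTIONS
    for variant in (generate_variant(seed) for _ in range(100))
    if passes_checks(variant)
]
print(len(synthetic_dataset), synthetic_dataset[0])
```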
Once we have the synthetic dataset, there are many things we can do with it. We can fine-tune the LLMs or use teacher-student distillation techniques. We can also use it as knowledge the agent pulls from at runtime, in a knowledge base or something similar.
The last point I'll make on LLM adaptation is some of the work we've been doing recently around verifiable rewards. We believe that in the enterprise setting, many of these problems can be codified as verifiable rewards, and the traces SGP captures give us concrete outcomes to verify against.
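For intuition, here is a minimal sketch of what a verifiable reward could look like in this setting, assuming a captured trace holds the expert's final approved answer to check against. The trace format is an illustrative assumption, not SGP's actual one.

```python
# A minimal sketch of a verifiable reward: compare model output against the
# expert-approved outcome stored in a captured trace. The trace layout is
# an assumption for illustration.
import json

def verifiable_reward(model_output: str, trace: dict) -> float:
    """Return 1.0 if the model's extraction matches what the expert approved."""
    try:
        predicted = json.loads(model_output)
    except json.JSONDecodeError:
        return 0.0                     # malformed output is verifiably wrong
    return 1.0 if predicted == trace["expert_final_answer"] else 0.0

trace = {"expert_final_answer": {"term_months": 36}}
print(verifiable_reward('{"term_months": 36}', trace))  # 1.0
print(verifiable_reward('{"term_months": 24}', trace))  # 0.0
```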
Clemens: One thing I'm super excited about (and Sam, thanks for explaining this in great detail) is how this lands with customers. Maybe two years ago, many enterprises thought they could just train models on all the data sitting in their data warehouse. The reality, of course, is that that type of data isn't something you can directly train on. So what we told them, or what people assumed, was that they would need to translate all their stored data into training data. That sounds horrible to enterprises: suddenly, all their experts need to do a ton of data work, which is a non-starter.
So, Ben, perhaps you could talk a little about what we're seeing with enterprises, how they get excited when we tell them we've developed a way to passively capture that data from their experts, so they don't have to do any of this manual data generation.
Ben: Absolutely. This is the product we've all been waiting for. The promise of machine learning for the past decade has been: the more you use it, the better it gets. For LLMs, that hasn't really been the case; improvements have largely come from foundation models, which in turn improved agent effectiveness. I think for the first time, we're going to be able to super-specialize for different enterprise applications by capturing subject matter expert data and bringing it back into agents.
Maybe a year or a year and a half ago, everyone thought reinforcement learning, or specifically fine-tuning, was how people would customize agents for the enterprise. What we actually saw was that giving few-shot examples, in-context learning, was more effective. The intuition maps to how you might train someone new at their job. In your first three months, if you hit a hard problem, you'd go to your boss and ask for examples of how it was handled in the past. They'd give you some examples, and you'd copy those, maybe adapting them to the new problem. But over the course of your career, say the next five years, you'd internalize the underlying knowledge that leads to those decisions. So instead of learning by copying, you're learning through intuition and experience.
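As a rough sketch of that first, copy-the-examples stage, the snippet below retrieves a few past (problem, resolution) pairs from captured expert data and prepends them to a prompt. retrieve_similar and its naive word-overlap scoring are stand-ins for real embedding-based retrieval.

```python
# A hedged sketch of few-shot prompting from captured expert examples.
# The retrieval here is deliberately naive; a real system would use
# embeddings or another similarity measure.
def retrieve_similar(query: str, cases: list, k: int) -> list:
    def overlap(problem: str) -> int:
        return len(set(query.split()) & set(problem.split()))
    return sorted(cases, key=lambda c: overlap(c[0]), reverse=True)[:k]

def build_few_shot_prompt(new_problem: str, past_cases: list, k: int = 3) -> str:
    shots = "\n\n".join(
        f"Problem: {p}\nHow it was handled: {r}"
        for p, r in retrieve_similar(new_problem, past_cases, k)
    )
    return ("Here are examples of how similar problems were handled:\n\n"
            f"{shots}\n\nNow handle this problem:\n{new_problem}")

past_cases = [
    ("Customer disputes a late fee", "Waived the fee once; documented in CRM"),
    ("Contract renewal at a discount", "Escalated to legal for an amended term"),
]
print(build_few_shot_prompt("Customer disputes a renewal fee", past_cases, k=1))
```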
That second stage, learning through experience, is what we're getting to, and it comes through fine-tuning and reinforcement learning with verifiable rewards. In the same way that, when you do a good job at work, your customer, client, or user tells you so and you get a positive feedback signal, that's the loop we're building with these agents. It's just taken time to build the infrastructure: to create these applications in a way that lets us develop co-pilots, move through the different layers of autonomy, and capture this data in the right format. We built SGP specifically around not just delivering an agent, but building an adaptive learning agent that gets better over time. That's what we've been spending our time on, and we're really starting to see it play out and provide a lot of benefit to our customers. They're seeing it through better agents.
It's table stakes that you can build an AI application that can do a workflow. But now, if you want to get to the level of quality most companies expect, you really need that agent to improve over time, learn from your subject matter experts, and codify the expertise of your enterprise.
Clemens: Another big topic, of course, is how these agents use all the other software available to the enterprise. Enterprises use hundreds of different pieces of software, APIs, and legacy systems. So: how do these agents use tools in their day-to-day operations, how do tools play into this data capture dynamic, and why is this implicit data capture so powerful? Sam?
Sam: It's really important to acknowledge that for agents to be effective, tools have to be a first-class citizen. They have to be something we're constantly thinking about and integrating as easily as possible. In building SGP, we've spent a lot of time thinking about how tools integrate into the agents we're building, and I think we're starting to reap the rewards of that now.
In particular, I also want to call out that our training paradigm makes tools a first-class citizen as well. We can look at the traces, see examples of how tools were called, and then learn from those tool calls directly, having tools called during rollouts for reinforcement learning algorithms. This is a massive unlock for making agents useful in the enterprise space. Again, all of it comes down to having the traces and rewards we want, captured by SGP from these examples, and then using tools not just during inference but during RL rollouts and training, so we can teach smaller models how to use tools effectively and how to choose between them. That's what I'm most excited about for us in this space.
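To illustrate that, here is a minimal sketch of replaying tool calls captured in an interaction trace and scoring a policy's tool choices against them during a rollout. The trace format, the tools, and the reward are all assumptions for illustration, not SGP's actual interfaces.

```python
# A minimal sketch: tool calls recorded in a trace can be replayed, and a
# policy's tool choices can be scored against the expert's during RL-style
# rollouts. Everything here is a placeholder assumption.
TOOLS = {
    "search_contracts": lambda arg: f"results for {arg!r}",
    "calculate_renewal": lambda arg: "renewal date: 2026-01-01",
}

# Captured expert trace: which tools were called, in what order, with what args.
expert_trace = [
    {"tool": "search_contracts", "arg": "acme renewal"},
    {"tool": "calculate_renewal", "arg": "acme"},
]

def replay(trace: list) -> list:
    # Re-execute the captured tool calls, e.g. to reproduce intermediate state.
    return [TOOLS[step["tool"]](step["arg"]) for step in trace]

def rollout_reward(policy_tool_choices: list) -> float:
    # Verifiable reward: did the policy pick the same tools, in the same
    # order, as the expert did in the captured trace?
    expected = [step["tool"] for step in expert_trace]
    hits = sum(a == b for a, b in zip(policy_tool_choices, expected))
    return hits / len(expected)

print(replay(expert_trace))
print(rollout_reward(["search_contracts", "calculate_renewal"]))  # 1.0
```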
Clemens: Super exciting. Because the tool calls live directly in these interaction traces, we can implicitly include them in training. Okay.
The Future of AI Agents
Clemens: I think that's all for today's discussion of agent-ready data: capturing the hidden knowledge inside the heads of subject matter experts to unlock high-performing agents for the enterprise and to augment those experts with AI. Next week, we'll dive deeper into the future of agents, which is, of course, a rapidly evolving field, and how we're thinking about it at Scale.