
Human in the Loop: Episode 2 | Technical Considerations for Enterprise Agents

May 1, 2025

Welcome to Human in the Loop. 


Scale is the humanity-first AI company. We believe the most powerful and responsible AI systems are built with humans in the loop. And now, it’s also the inspiration behind our new video podcast series Human in the Loop: your guide to understanding what it takes to build and deploy real-world enterprise AI systems.

We share our learnings from working with the largest enterprises, foundation model builders, and governments to equip you with the insights you need to build practical, powerful AI systems in your organization.

About the Episode

In this episode, Scale's Head of Product, Scale GenAI Platform, Clemens Viernickel; Head of Engineering, Enterprise AI, Felix Su; and Head of Enterprise ML, Sam Denton dive into the core technical challenges of making agentic systems actually work in enterprise environments—across real data, real constraints, and real complexity.

Watch the full episode or read the transcript of their conversation below. 



Episode 2 - Technical Considerations for Enterprise Agents

Sam: Today we're going to be talking about some of the technical challenges and considerations for building enterprise agents. To start, Clemens, can you walk us through how Scale sets up agents for enterprises?

Clemens: Definitely, my pleasure. When we think about building agents for enterprises to solve custom problems, we focus on three core elements around which we've built our approach and our product.

The first is capturing the business logic for the problem we're trying to solve in the agent. Then there's a data flywheel component; we're trying to build agents that actually get better as they're used more. And the third is, we need a systematic way to measure how good these agents are to ensure they're ready for prime time.

Step 1: Capturing Business Logic

Clemens: Let me talk through each of these individually, starting with capturing the business logic. We've built an agent orchestration framework here at Scale and an agent execution environment that enable us to capture arbitrary business logic in the agent workflow.

Take very long documents as an example: there's a lot of logic that goes into producing them, and you can't just prompt a language model to generate all of it in one shot because these documents are way too long and involved.

The orchestration framework enables us to capture all of that intricate logic in a workflow that can be nested and have multiple levels. It also dynamically integrates with all kinds of data sources and tools to capture that logic.
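To make that concrete, here's a minimal sketch of what a nested agent workflow could look like. The names (`Workflow`, `Step`) are illustrative assumptions, not the actual SGP API:

```python
from dataclasses import dataclass, field
from typing import Callable, Union

@dataclass
class Step:
    """A single unit of business logic, e.g. one LLM call or tool call."""
    name: str
    run: Callable[[dict], dict]

@dataclass
class Workflow:
    """A named sequence of steps; a step may itself be a nested workflow."""
    name: str
    steps: list[Union[Step, "Workflow"]] = field(default_factory=list)

    def run(self, context: dict) -> dict:
        for step in self.steps:
            context = step.run(context)  # each step reads and extends shared context
        return context

# A long document decomposed into nested sub-workflows, so no single
# prompt has to produce the whole document in one shot.
outline = Workflow("outline", [Step("draft_outline", lambda ctx: ctx)])
sections = Workflow("sections", [Step("draft_sections", lambda ctx: ctx)])
document = Workflow("document", [outline, sections, Step("assemble", lambda ctx: ctx)])
```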

Step 2: The Data Flywheel Component

Clemens: The second part is the data flywheel component. We are strong believers at Scale that agents can only really perform well if they're actually trained on the feedback that humans give while using them.

We want to have an agent that produces an initial draft of a document and then use the approval or the edits that experts make to this initial draft output as training data to make the agent better. We often call this process of capturing expert feedback to build training and evaluation data for agents the "Agent Monitoring Protocol." It's a key thing we've built into the platform: instrumentation to trace all the executions of the agents, associate human feedback with these executions, and have that data be the centerpiece of making the agent better.
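As a rough sketch of what that instrumentation involves: record every execution, attach expert feedback to it, and mine the pairs later. All names here are hypothetical stand-ins, not the platform's real interfaces:

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class AgentTrace:
    """One recorded agent execution plus any human feedback attached to it."""
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    input: dict | None = None
    output: dict | None = None
    feedback: list[dict] = field(default_factory=list)

traces: dict[str, AgentTrace] = {}

def record_execution(input: dict, output: dict) -> str:
    trace = AgentTrace(input=input, output=output)
    traces[trace.trace_id] = trace
    return trace.trace_id

def record_feedback(trace_id: str, approved: bool, edits: str | None = None) -> None:
    # Expert approvals or edits become labeled data tied to the execution.
    traces[trace_id].feedback.append({"approved": approved, "edits": edits})

# A draft goes out, an expert edits it, and the pair becomes training data.
tid = record_execution({"task": "draft report"}, {"text": "Initial draft..."})
record_feedback(tid, approved=False, edits="Rewrote the intro paragraph...")
training_pairs = [(t.input, t.feedback) for t in traces.values() if t.feedback]
```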

I also refer to it as a flywheel because the beautiful part is, the more the agents are used, the more of that feedback and direction data we produce. The more of that data we have, the better we can make the agents, which in turn leads to more adoption.

Step 3: Measuring Agent Quality

Clemens: The third part is a systematic way of measuring quality. Of course, you can improve the agent using the data, but you need a good way outside of gut feeling to actually assess if the agent is improving.

Is it actually able to capture and cover all of the critical quality bars that our business logic requires, for example, for generating these complex documents? That's why our product, the Scale GenAI Platform (SGP), also has extensive instrumentation to run evaluations – meaning the ability to very dynamically set up evaluation test sets and evaluation criteria.

Then, on the platform, we can dynamically run evaluations, either using humans in the loop or fully automated using LLMs as a judge. This enables us to measure not just whether the system is up to the quality bar, but also whether it's improving over time.
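A minimal sketch of such an evaluation loop might look like the following, with human scores and an LLM judge as interchangeable scorers; `call_judge_llm` is a stand-in for whatever judge model you wire up, not a real SGP function:

```python
def call_judge_llm(prompt: str) -> float:
    """Placeholder: ask a judge model to score an output from 0 to 1."""
    raise NotImplementedError  # wire up your LLM client here

def evaluate(agent, test_set: list[dict], criteria: str, use_llm_judge: bool = True) -> float:
    """Score an agent against a test set, via LLM judge or human scores."""
    scores = []
    for case in test_set:
        output = agent(case["input"])
        if use_llm_judge:
            prompt = (f"Criteria: {criteria}\nInput: {case['input']}\n"
                      f"Output: {output}\nScore from 0 to 1:")
            scores.append(call_judge_llm(prompt))
        else:
            scores.append(case["human_score"])  # collected from expert reviewers
    return sum(scores) / len(scores)

# Running the same test set against each agent version shows whether quality
# is actually improving over time, not just passing a one-off gut check.
```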

Sam: Awesome.

Challenges in Building Secure Systems

Sam: Now that we've set the stage conceptually, Felix, do you want to talk about some of the challenges that you face in actually doing this?

Felix: For me, a lot of my daily work goes into making the system secure. One of the most important parts about this agent framework is that it needs to connect to enterprise data to be useful. We're not building toy applications here where we can connect to some folder and have it work for your personal use.

We're talking about a large enterprise with a lot of access controls. You have to deal with compliance and security. These are very complex situations.

One of the most important things we offer is the ability to deploy into customer tenants. We can deploy single-tenant for isolated data resources, and we also offer multi-tenant options.

This level of flexibility allows us to offer different levels of speed and security for customers. Over the past couple of years, we've filled out thousands of vendor security questionnaires dealing with compliance, legal concerns, and that sort of stuff, to ensure that our system and our platform keep all this enterprise data secure while bringing AI to these customers.

In terms of why all this work is really important and nuanced: There's data in all different sorts of places. To make successful AI agents, you have to connect these data sources. Something that executives can definitely empathize with is the frustration they have sitting on piles and piles of information but not being able to pull it together in a reasonable way.

That's where Scale comes in. As a baseline, as a foundation, we offer this security and these best practices. Then we have engineers go into the systems, work with the customer engineers, work with the security teams, and do the hard work of pulling all this data together so that Sam's team can actually build the AI on top of it.

Handling Backend Permissions of Data Sources - Permission Boundaries

Sam: I wanted to pause for a second and ask you about how you handle permissioning on the backend side for all of these data sources.

You talked about how you have data in all these different places and you want to make sure it's all secure. But one of the things we see all enterprises struggling with is that these data sources all have different permissions.

Felix: Yeah, great question. Clemens, I'll hand it off to you to talk a little bit about the platform in a second.

But for us, the way we do it is we natively ship with an identity service, which handles all these access controls for us. Every single user is essentially a certain persona.

That persona has a certain set of permissions granted to them by admin controls. Admins have the ability to assign different permission boundaries to different groups of users and also segment each of these users into different groups, which we call accounts. We guarantee data isolation between these accounts and across permission boundaries.

You have the flexibility as a platform user to control these different segmentations. If I wanted to build an HR AI agent, for example, I can put it in a separate account. I can have different users with different permission boundaries at different levels within that account to restrict what levels of access they have within that AI agent.

That entire system will be completely isolated from a separate account, maybe for customer support. These two systems don't want to see each other at all, don't even want to know about each other's existence. That's something that's really important to enterprises and something that ships natively in the platform.
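As a rough illustration of that model, access checks compose two layers: accounts as hard isolation boundaries, and permission boundaries within an account. This is a hypothetical sketch, not the identity service's real implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class User:
    user_id: str
    account: str                 # e.g. "hr" or "customer-support"
    permissions: frozenset[str]  # the user's permission boundary

@dataclass(frozen=True)
class Resource:
    resource_id: str
    account: str
    required_permission: str

def can_access(user: User, resource: Resource) -> bool:
    # Accounts are hard isolation boundaries: a user in the HR account never
    # sees customer-support resources, regardless of their permissions.
    if user.account != resource.account:
        return False
    return resource.required_permission in user.permissions

hr_analyst = User("u1", account="hr", permissions=frozenset({"read:docs"}))
support_kb = Resource("r1", account="customer-support", required_permission="read:docs")
assert not can_access(hr_analyst, support_kb)  # isolated despite matching permission
```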

Maybe Clemens, do you want to talk a little bit about how this works?

Implementing Permission Boundaries in SGP

Clemens: Sure. I think it's a super important point that's often talked about much less because these are the less glamorous parts of enterprise-grade platforms. I talk a lot about agent orchestration, having a data flywheel, having evaluation, but a very big part of why enterprises love to work with us is that we have done a lot of the heavy lifting to deploy these agents, or full end-to-end applications with the agents behind them, in enterprise-grade environments.

One of the big elements, as you correctly say, Sam, is having proper identity and access management. We have a homegrown identity service that essentially manages all of these tasks that Felix just alluded to, including having the ability to grant access permissions to each individual resource managed throughout the Scale GenAI Platform.

Customer Example of Permission Segmentation

Felix: Yes. And just to quickly touch on a few things: we have a ton of customers actively using this feature. One of the largest banks in the world is building hundreds of gen AI applications.

That's their charter. To do that, they have to have proper segmentation. When you think of a financial firm, you think about security, data privacy, and different teams working on different things. It's really important to them that we get this right and that they have the ability and the flexibility to manage it on their side because they have compliance rules that they have to follow.

Sam: Yeah.

Handling Evolving Data

Sam: On the topic of things that sort of "just work," I wanted to talk about how LLMs and gen AI have transitioned over the last year and a half, and will continue to over the next year and a half. One of the things that's really challenging about making things work in the enterprise setting is that when you change the design of how you use LLMs and gen AI applications, you're also changing the data behind the scenes.

Working with this evolving data is really challenging. I'll talk a little bit about some of the experiences we've had with that. If we look back a year and a half ago, everything was really focused on RAG (Retrieval-Augmented Generation), and you were actually forcing the LLM to do RAG every single time. Now, I think we're at the point where we're able to leave the decision up to the LLM to get to an enterprise-quality decision. But we're constantly thinking about the future of agents and where they're going to go – where you have an LLM that has access to hundreds and hundreds of tools and is choosing between these hundreds of tools.

The Importance of Building in Structure

Sam: But the reality is, this isn't there yet at the enterprise level that we need. If every decision is 95% accurate and you make 10 decisions in a row, you're only going to be about 60% accurate overall.
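The arithmetic behind that claim is simply per-step accuracy compounding over independent decisions:

```python
# Per-step accuracy p over n independent decisions compounds to roughly p**n.
p, n = 0.95, 10
print(f"End-to-end accuracy: {p**n:.2f}")  # ~0.60
```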

Along the way, we kind of have to give ourselves a little bit of a skeleton or structure for working with these agents to ensure that the decisions are easy enough that we're making enterprise-grade solutions with enterprise-grade accuracy and quality.

If we think about this big transition from RAG being a hard decision to agents that have hundreds of tools, we want to be capturing data the whole way. We want to make sure that as these agents evolve, that data is still useful for future agents.

I think that's one of the things we've really started to focus on in how we build agents into the Scale GenAI Platform: making sure that, whatever structure we're using – whether it's a rigid system where LLMs are only making one or two decisions at a time, or this future where you have an LLM with 10+ tools deciding which to call and in what order – we're capturing the right data along the way, and that we're producing structures and traces we can learn from no matter what stage of building agents we're in.

How Do You Capture Decisions for Agents?

Clemens: Yes, I think this is a really great point. One thing I would love to dive a little bit deeper into there, Sam, is when you talk about all of these hundreds of decisions that these models, mostly reasoning models now, need to make. They have so many tools, and we don't really know which tool they are going to use. How does that translate into what we talked about earlier regarding the Agent Monitoring Protocol and how we think about tracing? How do these decisions make it into that? How are they reflected?

Sam: When we talk about capturing these decisions, there are a lot of different protocols. You'll hear things like MCP, chat completion, tool calling, tool completions. I think the key is that we standardize these things as best we can, such that we can access this data in a standardized way for the future.

This will allow us to essentially take these decisions and gather information about how many tools were called. Standardizing the way tools are called and the way tool responses are fed back into the LLM around a single execution framework is how we actually make the monitoring and traces useful for the future.
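One way to picture that standardization is a single normalized record shape that every tool call emits, whatever protocol produced it. The schema below is illustrative, not MCP or any specific standard:

```python
import json
import time

def make_tool_event(trace_id: str, tool_name: str, arguments: dict, response: dict) -> dict:
    """Normalize one tool-calling step into a common trace record."""
    return {
        "trace_id": trace_id,    # ties the event to one agent execution
        "timestamp": time.time(),
        "type": "tool_call",
        "tool": tool_name,
        "arguments": arguments,  # what the LLM asked the tool to do
        "response": response,    # what was fed back into the LLM
    }

event = make_tool_event("exec-123", "search_contracts",
                        {"query": "termination clause"}, {"hits": 3})
print(json.dumps(event, indent=2))
```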

Felix: Sam, I did want to ask one question about something I think is very important for people: What is the difference between building an example – like, something comes out, I'm going to try it, I'm wowed by it, MCP comes out, I connect it to my Slack, it's really cool, and I build an agent – what's the difference between that and an enterprise-grade agent?

As an ML engineer, how do you feel you've evolved, and what do you think about those two things: toy problems and enterprise problems? What's the biggest difference?

Sam: The toy problems are really great starting points for how we want to think about the vision of these enterprise problems.

How to Bridge the Gap Between Toy and Enterprise-Grade Problems

Sam: The way I think about bridging that gap really comes down to two things. One is leveraging enterprises to help us drive home the goal and the expected outcomes, and to give us feedback that we can then use to simulate these tools in an environment and find the correct reasoning traces that actually get us to the solution we're looking for.

The other is building that environment itself. Once you create this environment where you can understand the enterprise problem, you can let these reasoning agents explore the right path to get there on their own. Then you can pull out the traces that actually got there, the ones aligned with the expectation. All of a sudden, you have enough examples, as you said, of traces and reasoning paths that reached the right end state that you can learn from them. You can train models on them, and you can seed different parts of your system with few-shot examples or retrieve certain things that might be helpful.
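A minimal sketch of that collection loop, assuming hypothetical `run_agent_in_sim` and `meets_expectation` helpers:

```python
def run_agent_in_sim(task: dict) -> dict:
    """Placeholder: execute the agent against simulated tools, return its trace."""
    raise NotImplementedError

def meets_expectation(trace: dict, expected: dict) -> bool:
    """Placeholder: compare the trace's end state to the enterprise-defined outcome."""
    return trace.get("final_state") == expected

def collect_good_traces(task: dict, expected: dict, attempts: int = 50) -> list[dict]:
    """Keep only the reasoning traces that actually reached the right end state."""
    good = []
    for _ in range(attempts):
        trace = run_agent_in_sim(task)
        if meets_expectation(trace, expected):
            good.append(trace)
    return good

# Good traces can seed fine-tuning datasets or be retrieved as few-shot examples.
```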

Conclusion and Next Steps

Sam: Alright, well, I think that's all for today's discussion about some of the more technical things that enterprises should keep in mind as they work on building out these agents.

It's truly a really hard problem, and I think it's okay to accept that, embrace it, and work through it. Next week, we'll get into how we use agents to capture the institutional knowledge trapped in enterprise subject matter experts' heads. This is where agents get truly powerful, and it's why it's worth navigating all the challenges we laid out today. Make sure to subscribe so you don't miss it.

