November 7, 2025
Beyond "Out-of-the-Box": Why Enterprises Need Specialized RL Agents
While general-purpose AI models are powerful, they often fail to deliver on complex, specialized enterprise workflows that use private data. We share results from our real-world work in the insurance and legal industries, highlighting how our RL-tuned agents outperformed leading LLMs and diving into how we achieved these performance gains.
Read more
September 15, 2025
Smoothing Out LLM Variance for Reliable Enterprise Evals
A critical challenge in enterprise AI development is the instability of LLM evaluations. Our internal testing revealed that metrics on identical A/B tests can swing by as much as 15% from one day to the next. This level of variance is large enough to invalidate results, making principled, incremental improvement a game of chance. In this post, we dive into the root cause: an industry-wide phenomenon created by the interplay of Sparse Mixture-of-Experts (MoE) architectures and the batched inference common to provider APIs. By implementing a "cohort of judges," a small panel of LLMs given semantically similar but varied prompts, we reduce this variance by at least 50%. This creates the stable, trustworthy measurement foundation needed to confidently build and improve AI agents.
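To make the idea concrete, here is a minimal sketch of a judge cohort. It assumes a hypothetical `call_judge` helper that wraps whatever provider API you use and returns a bare numeric score; the prompt wording and the median aggregation are illustrative choices, not the exact setup from the post.

```python
from statistics import median

# Semantically similar but differently worded judge prompts: the run-to-run
# noise of any single judge tends to wash out across the cohort.
JUDGE_PROMPTS = [
    "Rate the response from 1 to 10 for factual accuracy and completeness.",
    "On a scale of 1-10, how accurate and complete is this response?",
    "Score the answer between 1 and 10, considering correctness and coverage.",
]


def call_judge(judge_prompt: str, question: str, answer: str) -> float:
    """Placeholder: send the judge prompt plus the Q/A pair to your provider's
    API and return the numeric score it gives (assumed to be a bare number)."""
    raise NotImplementedError


def cohort_score(question: str, answer: str) -> float:
    """Score one eval example with every judge in the cohort and aggregate."""
    scores = [call_judge(p, question, answer) for p in JUDGE_PROMPTS]
    # The median is robust if one judge drifts on a given day; a mean also works.
    return median(scores)
```

Scoring every eval example through the whole cohort and aggregating, rather than trusting any single judge run, is what damps the day-to-day swings.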
Read more
September 3, 2025
Toolchaining: The Problem No One is Talking About
We found the standard approach to toolchaining insufficient. Simply giving the LLM access to multiple tools and asking it to clean the data led to extremely poor results, if it could complete the task at all. When we instead gave the LLM access to a Python sandbox pre-loaded with those same tools and asked it to first develop a cleaning plan, the output improved significantly. In the rest of this post, we dive into why this happened, how we set up our experiment, and what the findings mean for you.
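As a rough illustration of the second setup, the sketch below pre-loads a few stand-in cleaning tools into a Python namespace and asks the model to write one script that chains them. `llm_generate`, the tool functions, and the in-process `exec` are hypothetical placeholders, not the harness used in the experiment.

```python
import contextlib
import io


def load_table(path: str):           # stand-in for a real data-loading tool
    raise NotImplementedError


def dedupe_rows(rows):               # stand-in for a real cleaning tool
    raise NotImplementedError


def normalize_dates(rows, column):   # stand-in for another cleaning tool
    raise NotImplementedError


# Tools exposed to the model's generated script, instead of being wired into
# the provider's function-calling API one call at a time.
SANDBOX_TOOLS = {
    "load_table": load_table,
    "dedupe_rows": dedupe_rows,
    "normalize_dates": normalize_dates,
}


def llm_generate(prompt: str) -> str:
    """Placeholder: call your model and return the Python code it writes."""
    raise NotImplementedError


def run_plan(task: str) -> str:
    """Ask the model for a plan as code, then execute it with the tools in scope."""
    prompt = (
        "You have these functions available: "
        + ", ".join(SANDBOX_TOOLS)
        + f".\nWrite a Python script that {task}. Print the result."
    )
    code = llm_generate(prompt)
    stdout = io.StringIO()
    # Run the generated script with the tools available in its namespace.
    with contextlib.redirect_stdout(stdout):
        exec(code, {"__builtins__": __builtins__, **SANDBOX_TOOLS})
    return stdout.getvalue()
```

A real deployment would execute the generated script in an isolated sandbox (a separate process or container) rather than `exec` in the host process; the sketch only shows the shape of the interaction.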
Read more
August 13, 2025
Training the Next Generation of Enterprise Agents: Scale's Research in Reinforcement Learning
Language models provide impressive general capabilities off the shelf, but because they were not trained on private enterprise data, they fall short of the specialized performance enterprises need for their unique workflows, internal systems, and proprietary data. At Scale, our Safety, Evaluations and Alignment Lab (SEAL) recently published foundational reinforcement learning research, and our enterprise team is pioneering new approaches to this challenge, applying cutting-edge reinforcement learning to train AI agents specifically for enterprise environments.
Read more
December 5, 2024
Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for Fine-Tuning LLMs
The Scale AI team investigates various synthetic data generation strategies for fine-tuning LLMs under different constraints, providing a robust framework for enterprises to identify the most effective and cost-efficient data strategies for fine-tuning.
Read more