by Jerry Chan, Vijay Kalmath, George Pu and Sam Denton

TL;DR:
Enterprises are hitting a wall with general-purpose models. While readily available AI models offer impressive general capabilities, they fail at the most critical step: delivering specialized performance on unique workflows that require the use of employee skills, internal systems, and proprietary data. Simply put, this is the barrier separating AI potential from proven ROI.
Solving this high-stakes challenge is why our Enterprise team has spent the past year laser-focused on building out our agentic training stack, data generation pipelines, and reinforcement learning (RL) capabilities, all tailored for our enterprise customers. We've developed expertise in crafting precise rubrics and rewards that define enterprise success, and spinning up realistic datasets and environments. This work is solving the complete end-to-end challenge.
In this post we’ll walk through two concrete examples that tell the story of our standout results.
At the core of this work is a simple, powerful thesis: specialized enterprise agents are best built through a data flywheel, where agents are embedded in workflows, interacting with enterprise tools and receiving feedback from human employees, which generates high-quality data that fuels better training, which leads to better agents. We are ensuring the Scale Generative Platform (SGP) seamlessly supports every part of this loop, from data generation with traces and evaluation (e.g. rubrics) to stable RL training and deployment all in your VPC!
As we discussed in our previous blog on Training the Next Generation of Enterprise Agents, RL for enterprises is about training agents inside realistic enterprise environments, complete with domain-specific tools, structured tasks, and operational feedback loops. This is precisely how enterprise agents learn, through trial-and-error, to decide how to use tools, understand production data and user intent, and produce outputs that meet customer-defined success criteria.
Through our experiments, we’ve consistently found that four factors are critical for RL:
When these ingredients are in place, RL becomes a data-efficient way to shape model behavior and unlock measurable gains on challenging enterprise tasks.
Across our enterprise tasks, we fine-tune open-source models using RL with tools and achieve state of the art performance, outperforming leading LLMs like GPT-5, Gemini Pro 2.5, and Claude 4.5 Sonnet on accuracy and cost. In practice, evaluating RL methods in enterprise settings requires more than just beating a weak baseline.
As foundation models have grown more capable, establishing strong representative baselines has become essential for fair and meaningful comparison. These baselines involve equipping agents with the right tools, scaffolds, and prompting strategies. This ensures that improvements from RL truly reflect better reasoning and decision-making, not just better prompting.
Configuring RL environments for enterprise agents introduces multiple layers of complexity – from interaction structure (single-turn vs multi-turn) to tool integration (none vs. multi-tool) to reward design. We share results across two representative settings capturing this diversity:
We implemented a single-turn text-to-SQL (T2SQL) task for a global insurance company. For this task, models generated SQL queries in a structured output for given natural language questions about financial data across several domains. Using Scale Data Engine, we curated a dataset of 506 hand-crafted T2SQL examples from domain experts that captures the natural distribution of real-world query patterns. Alongside the company, we developed a knowledge base capturing proprietary information from schemas, example queries, and heuristics. Each training example retrieved three key pieces of context from this database:
Because the customer could not send sensitive financial database information to 3rd party APIs, we exclusively used self-hosted open source models for baselining and training. After experimenting with several SoTA open source models, we went forward with the Qwen3 models and the Qwen3-Coder-480B-A35B model for an upper bound performance. Notably, even this large Coder model only achieved 21.9% accuracy, highlighting the difficulty of the task.

Across variants, RL-tuned models achieve a ~2x improvement (from 18.8% to 40.6%) over base models in execution accuracy, surpassing even large-scale competitive baselines. Reinforcement learning proves especially effective at improving schema grounding and semantically-correct SQL generation, showing that even a single-turn setup without tools can yield measurable gains through reward-based training for both reasoning and non-reasoning models.
Building on this, we are extending this project into a multi-turn, tool-augmented environment where SQL execution becomes an interactive tool. This will enable agents to iteratively debug, replan, and verify queries, critical capabilities for closing the final gap to true enterprise-grade reliability and performance.
For our work with a leading global law firm, we evaluated a multi-turn data-retrieval and legal-reasoning task designed to evaluate how language models combine document search and reasoning over legal agreements. Each model must iteratively retrieve relevant clauses using a set of search tools and produce grounded answers. The model has access to 3 retrieval tools: page selection, text search, and semantic search. Our dataset includes 4k QA pairs across 30 real contracts that range from 20 to 100 pages long, with both questions and answers annotated and verified by lawyers.
We compare our RL fine-tuned models against the following baselines:

RLVR training yields 17-29 % absolute accuracy gains across Qwen 3 variants, outperforming both GPT-5 baselines. Similarly, RLVR training consistently outperforms these SFT baselines by ~10% absolute accuracy, underscoring the benefit of reinforcement learning in improving robustness and generalization beyond supervised fine-tuning.
Notably, we only needed a 4B parameter model to surpass GPT-5 baselines with a 256k context window (gpt-5 full-context baseline). This demonstrates with enterprise data and proper training techniques, even a small model can deliver state of the art accuracy and robustness.

For enterprises, these gains translate into more reliable AI systems. Our tuned models show a significant lower hallucination rate. When tested on unanswerable questions, the difference was stark:
Model Correctly Identifies as Unanswerable Hallucinates (Provides a Made-up Answer) Our RL-tuned Model 79% 21% GPT-5 (full context) 51% 49% SFT-Tuned Model 5% 95%
The SFT-tuned model's poor performance might be due to data imbalance, as only 5% of its training data was unanswerable. However, despite the same skewed data, our RL-Tuned model learned to reliably identify 79% of them.
Along the way of building out our training stack for agentic training and making sure the RL curves keep going up, we picked up several learnings of how to ensure RL works for our problems. We will highlight several key takeaways:
Data quality is one of the most important dimensions in whether your training runs are successful. We spend a large amount of time understanding the dataset, manually reviewing samples, and running automated filters to ensure data quality.
To evaluate the importance of data quality. We train Qwen 3 4B with the same configurations on unfiltered legal data. Removing the filters adds 500 unverifiable samples to our 2,750 samples training set, accounting for 15% of the contaminated dataset. The resulting model achieves 5% lower accuracy at 1k steps.

Environment and tools design. The environments we spin up should be representative to the model in production. In addition, we should ensure the tools the model uses are robust and performant, such as proper error handling and retry logic. For the legal task, our model’s tools were optimized by humans and our proprietary agent-building optimizer (more to come soon!) to improve the model’s score on a fixed benchmark.
For our global law firm partner, we found that performance improvements from better tooling in the base model translate to the RL-trained model, but doesn’t necessarily translate across model families. For instance, the performance of GPT-5 declined from 73% to 68% with the introduction of the Text search while GPT-4.1 saw an improvement from 64% to 67%. We recommend optimizing the environment design on the base model before initiating RL training.

Robust training infrastructure is critical for ensuring our agents can be trained stably across various environments. This goes beyond just compute, requiring support for the longer context lengths needed in agentic tasks and advanced sharding techniques for larger models. We also found that monitoring key metrics and applying algorithmic best practices such as removing length normalization, clip-higher, and removing KL are essential for ensuring rewards consistently trend upwards.
Finally, we found that RL works most efficiently when starting from a strong model prior. We are actively experimenting with the best techniques to build this prior, from dataset filtering and iterative fine-tuning to cold-starting with a SFT dataset before the RL run.
Mitigating reward hacking. We found instances during RL training where the model would perform reward hacking to circumvent a bad response getting a low reward. For our global legal partner, the model performed judge hacking and prompt injected the judge model to believe the question was unanswerable and give a high reward. We mitigated this by using a new judge prompt with explicit guardrails and length constraints to prevent prompt injections. We are continually monitoring reward hacking and improving our reward/rubric design.
Example of reward hacking with our legal partner: In our first iteration of the RL environment, we discovered model doing judge hacking via prompt injection

Model Original Judge New Judge Qwen 4B @ step 100 71% 56% Qwen 4B @ step 300 90% 32% GPT-4.1 65% 64% GPT-5 76% 73%
Our case studies prove that specialized, RL-tuned models can outperform massive, general-purpose baselines. But how we achieved these results reveals the most important insight of all: Data quality and training stability are the most decisive factors in whether RL succeeds. This is Scale's unique advantage. The race to build the best specialized agents will be won by whoever can build the best, most realistic datasets and training environments. Our combination of a world-class data engine and a stable, integrated platform is our (and our customers’) sharpest edge.
We highlighted two enterprise use cases where RL fine-tuning open-source models in these environments greatly improves model performance and are excited to bring this technology to customers through SGP. Going forward, we will continue investing in these efforts to improve our agent training infrastructure, post-training recipes, and scaling RL across domains.
If you're an enterprise looking to build specialized agents on your proprietary data, reach out to our team to get started. Or, if you're a researcher or engineer passionate about solving these challenges and are interested in joining our Enterprise Research Lab, we are hiring! Explore our open roles here.