November 7, 2025
Beyond "Out-of-the-Box": Why Enterprises Need Specialized RL Agents
While general-purpose AI models are powerful, they often fail to deliver on complex, specialized enterprise workflows that use private data. We share results from our real-world work in the insurance and legal industries, highlighting how our RL-tuned agents outperformed leading LLMs and diving into how we achieved these performance gains.
Read more
September 15, 2025
Smoothing Out LLM Variance for Reliable Enterprise Evals
A critical challenge in enterprise AI development is the instability of LLM evaluations. Our internal testing revealed that metrics on identical A/B tests can swing by as much as 15% from one day to the next. This level of variance is large enough to invalidate results, making principled, incremental improvement a game of chance. In this post, we dive into the root cause: an industry-wide phenomenon created by the interplay of Sparse Mixture-of-Experts (MoE) architectures and the batched inference common to provider APIs. By implementing a "cohort of judges," a small panel of LLMs given semantically similar but varied prompts, we reduce this variance by at least 50%. This creates the stable, trustworthy measurement foundation needed to confidently build and improve AI agents.
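To make the idea concrete, here is a minimal sketch of a judge cohort. It assumes a hypothetical `call_judge` helper that wraps whatever provider API you use and returns a bare numeric score; the prompt wording and the median aggregation are illustrative choices, not the exact setup from the post.

```python
from statistics import median

# Semantically similar but differently worded judge prompts: the run-to-run
# noise of any single judge tends to wash out across the cohort.
JUDGE_PROMPTS = [
    "Rate the response from 1 to 10 for factual accuracy and completeness.",
    "On a scale of 1-10, how accurate and complete is this response?",
    "Score the answer between 1 and 10, considering correctness and coverage.",
]


def call_judge(judge_prompt: str, question: str, answer: str) -> float:
    """Placeholder: send the judge prompt plus the Q/A pair to your provider's
    API and return the numeric score it gives (assumed to be a bare number)."""
    raise NotImplementedError


def cohort_score(question: str, answer: str) -> float:
    """Score one eval example with every judge in the cohort and aggregate."""
    scores = [call_judge(p, question, answer) for p in JUDGE_PROMPTS]
    # The median is robust if one judge drifts on a given day; a mean also works.
    return median(scores)
```

Scoring every eval example through the whole cohort and aggregating, rather than trusting any single judge run, is what damps the day-to-day swings.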
Read more
September 3, 2025
Toolchaining: The Problem No One is Talking About
We found the standard approach to toolchaining insufficient. Simply giving the LLM access to multiple tools and asking it to clean the data led to extremely poor results, if it could complete the task at all. When we instead gave the LLM access to a Python sandbox pre-loaded with those same tools and asked it to first develop a cleaning plan, the output improved significantly. In the rest of this post, we dive into why this happened, how we set up our experiment, and what the findings mean for you.
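As a rough illustration of the second setup, the sketch below pre-loads a few stand-in cleaning tools into a Python namespace and asks the model to write one script that chains them. `llm_generate`, the tool functions, and the in-process `exec` are hypothetical placeholders, not the harness used in the experiment.

```python
import contextlib
import io


def load_table(path: str):           # stand-in for a real data-loading tool
    raise NotImplementedError


def dedupe_rows(rows):               # stand-in for a real cleaning tool
    raise NotImplementedError


def normalize_dates(rows, column):   # stand-in for another cleaning tool
    raise NotImplementedError


# Tools exposed to the model's generated script, instead of being wired into
# the provider's function-calling API one call at a time.
SANDBOX_TOOLS = {
    "load_table": load_table,
    "dedupe_rows": dedupe_rows,
    "normalize_dates": normalize_dates,
}


def llm_generate(prompt: str) -> str:
    """Placeholder: call your model and return the Python code it writes."""
    raise NotImplementedError


def run_plan(task: str) -> str:
    """Ask the model for a plan as code, then execute it with the tools in scope."""
    prompt = (
        "You have these functions available: "
        + ", ".join(SANDBOX_TOOLS)
        + f".\nWrite a Python script that {task}. Print the result."
    )
    code = llm_generate(prompt)
    stdout = io.StringIO()
    # Run the generated script with the tools available in its namespace.
    with contextlib.redirect_stdout(stdout):
        exec(code, {"__builtins__": __builtins__, **SANDBOX_TOOLS})
    return stdout.getvalue()
```

A real deployment would execute the generated script in an isolated sandbox (a separate process or container) rather than `exec` in the host process; the sketch only shows the shape of the interaction.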
Read more
August 13, 2025
Training the Next Generation of Enterprise Agents: Scale's Research in Reinforcement Learning
Language models provide impressive general capabilities off the shelf, but because they were not trained on private enterprise data, they fall short of the specialized performance enterprises need for their unique workflows, internal systems, and proprietary data. At Scale, our Safety, Evaluations and Alignment Lab (SEAL) recently published foundational reinforcement learning research, and our enterprise team is pioneering new approaches to this challenge, applying cutting-edge reinforcement learning to train AI agents specifically for enterprise environments.
Read more
December 5, 2024
Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for Fine-Tuning LLMs
The Scale AI team investigates various synthetic data generation strategies for fine-tuning LLMs under different constraints, providing a robust framework for enterprises to identify the most effective and cost-efficient data strategies for fine-tuning.
Read more