George Pu

5 articles

November 7, 2025

Research

Beyond "Out-of-the-Box": Why Enterprises Need Specialized RL Agents

While general-purpose AI models are powerful, they often fail to deliver on complex, specialized enterprise workflows that depend on private data. We share results from our real-world work in the insurance and legal industries, highlighting how our RL-tuned agents outperformed leading LLMs, and dive into how we achieved these performance gains.

September 15, 2025

Research

Smoothing Out LLM Variance for Reliable Enterprise Evals

A critical challenge in enterprise AI development is the instability of LLM evaluations. Our internal testing revealed that metrics on identical A/B tests can swing by as much as 15% from one day to the next. This level of variance is large enough to invalidate results, making principled, incremental improvement a game of chance. In this post, we dive into the root cause: an industry-wide phenomenon created by the interplay of Sparse Mixture of Experts (MoE) architecture and the batched inference common to provider APIs. By implementing a "cohort of judges," a small panel of LLMs with semantically similar but varied prompts, we successfully reduce this variance by at least 50%. This creates the stable, trustworthy measurement foundation needed to confidently build and improve AI agents.
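To see why averaging a small panel stabilizes scores, here is a toy simulation of the cohort idea. This is not Scale's evaluation stack: the noise model, judge count, and every function name below are illustrative assumptions, standing in for real LLM judges that score the same answer under semantically similar but differently worded rubrics.

```python
import random
import statistics

def judge_score(true_quality: float, judge_seed: int, trial: int) -> float:
    # Hypothetical single LLM judge: its score for the same answer carries
    # independent noise around the answer's true quality.
    rng = random.Random(judge_seed * 1_000_003 + trial)
    return true_quality + rng.gauss(0, 0.10)

def cohort_score(true_quality: float, n_judges: int, trial: int) -> float:
    # "Cohort of judges": average the panel's scores for one answer.
    scores = [judge_score(true_quality, seed, trial) for seed in range(n_judges)]
    return statistics.mean(scores)

# Day-to-day variance of a single judge vs. a 5-judge cohort.
single = [judge_score(0.8, 0, t) for t in range(2000)]
cohort = [cohort_score(0.8, 5, t) for t in range(2000)]

# With independent judge noise, the cohort's spread shrinks by roughly
# 1/sqrt(n_judges) -- more than the 50% reduction the post reports.
print(round(statistics.stdev(single), 3), round(statistics.stdev(cohort), 3))
```

The sketch assumes judge errors are independent; in practice the varied prompt wordings are what decorrelate the judges enough for averaging to pay off.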

September 3, 2025

Research

Toolchaining: The Problem No One is Talking About

We found the standard approach to toolchaining insufficient. Simply giving the LLM access to multiple tools and asking it to clean the data led to extremely poor results, when the model could complete the task at all. Instead, when we gave the LLM access to a Python sandbox pre-loaded with these tools and asked it to develop a cleaning plan, the output improved significantly. In the rest of this post, we dive into why this happened, how we set up our experiment, and what the findings mean for you.
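A minimal sketch of the sandbox setup described above: instead of issuing separate tool calls, the model emits one script that chains tools exposed as plain functions. The toy tools (`load_records`, `normalize_name`), the sample data, and the `exec`-based sandbox are all invented for illustration, not the post's actual harness.

```python
# Toy "tools" pre-loaded into the sandbox namespace (illustrative only).
def load_records():
    return [{"name": " alice ", "age": "34"}, {"name": "bob", "age": "41"}]

def normalize_name(name: str) -> str:
    return name.strip().title()

# A script like one the model might emit: tools chained in plain Python,
# in a single pass, rather than one tool call per round trip.
model_script = """
records = load_records()
cleaned = [{"name": normalize_name(r["name"]), "age": int(r["age"])}
           for r in records]
"""

# Run the model-written plan with only the tools in scope.
sandbox = {"load_records": load_records, "normalize_name": normalize_name}
exec(model_script, sandbox)
print(sandbox["cleaned"])
```

The point of the pattern is that intermediate results stay as live Python objects inside one execution, so the model composes tools instead of shuttling serialized outputs through repeated tool-call turns.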

August 13, 2025

Research

Training the Next Generation of Enterprise Agents: Scale's Research in Reinforcement Learning

Language models provide impressive general capabilities off the shelf, but because they were not trained on private enterprise data, they fall short of the specialized performance enterprises need for their unique workflows, internal systems, and proprietary data. At Scale, our Safety, Evaluations and Alignment Lab (SEAL) recently published foundational reinforcement learning research, and our enterprise team is pioneering new approaches to this challenge: reinforcement learning focused on training AI agents specifically for enterprise environments.

December 5, 2024

Engineering

Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for Fine-Tuning LLMs

The Scale AI team investigates various synthetic data generation strategies for fine-tuning LLMs under different constraints, providing a robust framework for enterprises to identify the most cost-effective data strategy for their needs.
