
Smoothing Out LLM Variance for Reliable Enterprise Evals

To build and maintain the best AI agents available, enterprises need to evaluate and improve them continuously. Doing this manually is slow and expensive, so at Scale we solve the problem with LLMs as judges. The approach is simple: an LLM judge compares two agents that are identical except for one small variable. This lets us reliably determine what actually causes a shift in behavior, which is the foundation of principled, incremental improvement.
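
As a rough illustration, here is a minimal sketch of what a single pairwise judgment might look like. The prompt template, the `call_llm` helper, and the output parsing are hypothetical simplifications for this post, not our production setup.

```python
# Minimal sketch of a pairwise LLM-as-judge comparison (illustrative only).
# `call_llm` is any function that sends a prompt to a judge model and returns text;
# the prompt template and parsing here are simplified assumptions.
from typing import Callable

JUDGE_PROMPT = """You are evaluating two AI agents on the same task.

Task: {task}
Agent A response: {response_a}
Agent B response: {response_b}

Which response better satisfies the task requirements? Answer with exactly "A", "B", or "TIE"."""

def judge_pair(call_llm: Callable[[str], str], task: str, response_a: str, response_b: str) -> str:
    """Ask a judge model to compare two agent responses that differ by one variable."""
    verdict = call_llm(JUDGE_PROMPT.format(task=task, response_a=response_a, response_b=response_b))
    verdict = verdict.strip().upper()
    return verdict if verdict in {"A", "B", "TIE"} else "INVALID"
```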

These evaluations, however, have a critical flaw: they are not repeatable over time. Our internal experiments revealed that the exact same test can produce metrics that shift by as much as 10–15% from one day to the next. This level of instability makes the results impossible to trust, turning a principled process into a game of chance. In this post we’ll share data showing that this is an industry-wide problem, explain its root cause, and detail our practical “cohort of judges” method for solving it.

This foundation of reliable measurement enables all of our advanced agent improvement work, including our research in reinforcement learning.

An Industry-Wide Problem

Our investigation began with a project to evaluate a student assistant chatbot, assessing its performance across key dimensions such as avoiding direct answers, resisting jailbreak attempts, and staying on topic. When we used LLM judges, the results quickly showed that the exact same evaluations were not repeatable.

  • On jailbreak resistance, the same model scored 77% one day and 63% the next.

  • On refusal fidelity, it swung from 71% to 81% across consecutive runs.

We observed this variance across models from OpenAI, Google, and Anthropic, with the following ranges, suggesting a systemic challenge for current LLM provider APIs:

  • OpenAI (GPT-4 variants): ±10–12%

  • Anthropic (Claude variants): ±8–11%

  • Google (Gemini variants): ±9–14%

This margin of error is large enough to invalidate any A/B test, making principled, incremental improvement nearly impossible. To achieve a 50% improvement in an agent's performance, a team might make ten small changes, each expected to contribute a ~5% gain. But an A/B test cannot reliably detect a 5% signal when the day-to-day noise of the measurement tool is 10–15%. Positive changes can appear negative, and vice versa, making the development process inefficient and unreliable.
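
To make the arithmetic concrete, the toy simulation below assumes a true 5-point gain and uniform ±12-point measurement noise (both illustrative numbers drawn from the ranges above). Under those assumptions, a genuinely positive change reads as flat or negative in roughly a third of runs.

```python
# Toy simulation of the signal-vs-noise problem described above.
# Assumptions (illustrative): a true +5-point improvement, and per-run judge noise
# drawn uniformly from ±12 points, consistent with the 10-15% swings observed.
import random

random.seed(0)
TRUE_GAIN = 5.0   # the real improvement we hope to detect
NOISE = 12.0      # day-to-day measurement noise

wrong_sign = 0
trials = 10_000
for _ in range(trials):
    baseline = 70.0 + random.uniform(-NOISE, NOISE)
    improved = 70.0 + TRUE_GAIN + random.uniform(-NOISE, NOISE)
    if improved <= baseline:   # the A/B test reports "no gain" or a regression
        wrong_sign += 1

print(f"A genuinely positive change looks flat or negative in "
      f"{100 * wrong_sign / trials:.0f}% of runs")
```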

Why Might This Happen?

There are a few potential reasons, some simple, others more complicated. The simpler reason is that provider APIs have constantly shifting components: builders are always tweaking their models, so you are likely using a model with slightly different components today than you were yesterday. This matters for next-generation frontier models as well, since users often lack the fine-grained control needed to pin a specific, static model version.

The deeper technical reason is a consequence of the architecture used by most frontier LLMs and involves two key concepts: Sparse Mixture of Experts (MoE) and batched inference.

  1. Sparse Mixture of Experts (MoE): Instead of being one massive, monolithic network, models tend to be composed of many smaller, specialized "expert" sub-networks. For any given input, the model dynamically routes the request through only a fraction of these sub-networks.

  2. Batched Inference: To operate efficiently, providers process many user requests simultaneously in a "batch."

When these two techniques are combined, the results become unpredictable because the composition of a batch can influence which expert your query gets routed to, and the mix of queries in a batch is not deterministic. For example, your query might be a math question, but if most of the other queries in the same batch are related to psychology, your math query could be routed to a psychology expert. The specific path your request takes through the model is therefore influenced by the other requests being processed alongside it, making the output consistent only at the batch level, not for your individual query.
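
The sketch below is not a real MoE implementation; it only illustrates the routing intuition under an assumed capacity-limited router, where experts fill up per batch and overflowing queries are rerouted to their next choice.

```python
# Toy illustration (not a real MoE) of how batch composition can change which
# expert a query reaches. Assumption: each expert has a per-batch capacity, and a
# query whose preferred expert is full gets rerouted to another expert.
from collections import Counter

EXPERTS = ["math", "psychology", "coding"]

def route(query_topic: str, batch_topics: list[str], capacity: int = 2) -> str:
    """Greedy capacity-limited routing: earlier queries in the batch fill experts first."""
    load = Counter()
    assignment = None
    for topic in batch_topics + [query_topic]:
        # Preferred expert matches the topic; fall back to any expert with room.
        choices = [topic] + [e for e in EXPERTS if e != topic]
        chosen = next(e for e in choices if load[e] < capacity)
        load[chosen] += 1
        assignment = chosen   # the last routed query is ours
    return assignment

# The same math query lands on different experts depending on who shares its batch.
print(route("math", ["coding", "psychology"]))   # -> math
print(route("math", ["math", "math"]))           # -> psychology (math expert is full)
```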

For a more detailed analysis of this phenomenon, we recommend this technical breakdown.

The Solution: A Cohort of Judges for Improved Reliability

Instead of relying on a single, noisy judge, we use a panel of three, which we call the “cohort of judges.” Each judge is given a slightly different prompt for the same task; the wording varies, but the prompts are semantically equivalent. By aggregating the outputs from this cohort, we smooth out the provider-side variance and produce a stable, repeatable evaluation metric. With this method, the variance in our evaluation results dropped by at least 50% across the dimensions we tested. This allows us to distinguish true performance changes from statistical noise, so we can trust our A/B test results and make principled, incremental improvements with confidence.
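
A rough sketch of the idea follows, under simplified assumptions: three paraphrased pass/fail prompts for a single dimension and a simple majority vote. The prompt wording and aggregation are illustrative, not our exact recipe.

```python
# Minimal sketch of the cohort-of-judges idea. Assumptions: three semantically
# equivalent judge prompts, each judge returning PASS or FAIL, aggregated by
# majority vote. `call_llm` is any function that queries a judge model.
from typing import Callable

JUDGE_PROMPTS = [
    "Did the assistant refuse to give the student a direct answer? Reply PASS or FAIL.\n\n{transcript}",
    "Evaluate whether the assistant avoided handing over the answer outright. Reply PASS or FAIL.\n\n{transcript}",
    "Check that the assistant guided the student without directly answering. Reply PASS or FAIL.\n\n{transcript}",
]

def cohort_verdict(call_llm: Callable[[str], str], transcript: str) -> bool:
    """Run all three judges on the same transcript and take the majority vote."""
    votes = [call_llm(p.format(transcript=transcript)).strip().upper().startswith("PASS")
             for p in JUDGE_PROMPTS]
    return sum(votes) >= 2
```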

The Requirement for Reliable Evaluation

The underlying models we build on are not as fundamentally stable as one might imagine. This variance is a structural, industry-wide challenge that introduces uncertainty into the evaluation process, creating risks for product development and roadmap planning. By adopting a cohort-of-judges approach, we can produce the reliable, repeatable measurements needed for sound decision-making. This practical step moves organizations toward a more robust MLOps practice, ensuring that the hard work of improving agents is measured on a solid foundation of trustworthy data.

