Smoothing Out LLM Variance for Reliable Enterprise Evals

A critical challenge in enterprise AI development is the instability of LLM evaluations. Our internal testing revealed that metrics on identical A/B tests can swing by as much as 15% from one day to the next, a level of variance large enough to invalidate results and turn principled, incremental improvement into a game of chance. In this post, we dive into the root cause: an industry-wide phenomenon created by the interplay of Sparse Mixture-of-Experts (MoE) architectures and the batched inference common to provider APIs. By implementing a "cohort of judges," a small panel of LLM judges with semantically similar but varied prompts, we reduce this variance by at least 50%. This creates the stable, trustworthy measurement foundation needed to confidently build and improve AI agents.
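As a rough illustration of the cohort-of-judges idea, the minimal sketch below scores one answer with several paraphrased judge prompts and aggregates the verdicts by averaging. The prompt paraphrases, the `call_judge` callable, and the choice of a simple mean are all illustrative assumptions, not the setup described in the full post.

```python
import statistics
from typing import Callable, List

# Semantically similar but varied judge prompts.
# These are illustrative paraphrases, not the prompts from the post.
JUDGE_PROMPTS: List[str] = [
    "Rate the assistant's answer from 1 to 5 for factual accuracy: {answer}",
    "On a scale of 1-5, how factually accurate is this response? {answer}",
    "Score the response's factual correctness (1 = poor, 5 = excellent): {answer}",
]


def cohort_score(answer: str, call_judge: Callable[[str], float]) -> float:
    """Average the verdicts of a small cohort of judge prompts.

    `call_judge` is a hypothetical placeholder for whatever LLM call
    returns a numeric score for a prompt. Averaging across prompt
    variants smooths out per-call variance in any single judge's output.
    """
    scores = [call_judge(prompt.format(answer=answer)) for prompt in JUDGE_PROMPTS]
    return statistics.mean(scores)


if __name__ == "__main__":
    # Stub judge for demonstration only; a real setup would call an LLM API.
    def fake_judge(prompt: str) -> float:
        return 4.0

    print(cohort_score("The capital of France is Paris.", fake_judge))
```

In practice each prompt variant could also be sent to a different judge model or sampled several times, with the same averaging step applied to the pooled scores.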