Company Updates & Technology Articles
September 16, 2025
Kaitlin Elliott, who leads firmwide Generative AI Solutions at Morgan Stanley, joined us in the studio to unpack how AI evaluations powered the firm’s successful adoption of production GenAI. This is a real-world case study you don't want to miss.
September 15, 2025
A critical challenge in enterprise AI development is the instability of LLM evaluations. Our internal testing revealed that metrics on identical A/B tests can swing by as much as 15% from one day to the next. This level of variance is large enough to invalidate results, making principled, incremental improvement a game of chance. In this post, we dive into the root cause: an industry-wide phenomenon created by the interplay of Sparse Mixture of Experts (MoE) architecture and the batched inference common to provider APIs. By implementing a "cohort of judges," a small panel of LLMs with semantically similar but varied prompts, we successfully reduce this variance by at least 50%. This creates the stable, trustworthy measurement foundation needed to confidently build and improve AI agents.
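For readers who want a concrete picture, here is a minimal sketch of the "cohort of judges" idea: the same output is scored by several judge prompts that are semantically equivalent but worded differently, and the scores are combined (here, simply averaged). It assumes an OpenAI-compatible chat client; the model name, prompts, and score parsing are illustrative, not the configuration used in our experiments.

```python
# Minimal sketch of a "cohort of judges": one response is scored by several
# judge prompts that are semantically equivalent but worded differently, and
# the scores are averaged to damp run-to-run variance from any single judge.
# Assumes an OpenAI-compatible chat API; model name and prompts are illustrative.
import re
from statistics import mean

from openai import OpenAI

client = OpenAI()

JUDGE_PROMPTS = [
    "Rate the assistant's answer from 1 to 10 for correctness and helpfulness. Reply with only the number.",
    "On a 1-10 scale, how correct and helpful is the assistant's answer? Reply with only the number.",
    "Score the assistant's answer (1 = poor, 10 = excellent) for accuracy and usefulness. Reply with only the number.",
]

def judge_score(question: str, answer: str, model: str = "gpt-4o-mini") -> float:
    """Average the scores returned by the cohort of judge prompts."""
    scores = []
    for prompt in JUDGE_PROMPTS:
        completion = client.chat.completions.create(
            model=model,
            temperature=0,
            messages=[
                {"role": "system", "content": prompt},
                {"role": "user", "content": f"Question:\n{question}\n\nAssistant's answer:\n{answer}"},
            ],
        )
        match = re.search(r"\d+(\.\d+)?", completion.choices[0].message.content)
        if match:
            scores.append(float(match.group()))
    return mean(scores) if scores else float("nan")
```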
September 12, 2025
Can an AI be a great tutor? TutorBench is a new, challenging benchmark from Scale designed to find out. Moving beyond right or wrong answers, it grades today's leading AI models on their ability to actually teach: evaluating crucial skills like adaptive explanation, constructive feedback, and active learning support. Using 1,500 multimodal conversations across STEM subjects, many including images of handwritten work, TutorBench reveals that even the most advanced models still have a long way to go to master the nuanced art of tutoring, paving the way for the next generation of AI in education.
September 4, 2025
Scale is committed to building a brighter, stronger future for America by improving AI literacy among students and teachers across the country. We believe AI can be a tool for creativity, problem solving, and discovery, whether that means addressing local challenges, sparking curiosity in the classroom, or opening doors to future opportunities. That’s why today, we are proud to share Scale’s commitment to advancing AI literacy and expanding access to AI learning for educators and students nationwide.
September 3, 2025
We found the standard approach to toolchaining insufficient. Simply giving the LLM access to multiple tools and asking it to clean the data led to extremely poor results, when it could complete the task at all. Instead, when we gave the LLM access to a Python sandbox pre-loaded with these tools and asked it to develop a data-cleaning plan, the output improved significantly. In the remainder of this post, we dive into why this happened, how we set up our experiment, and what the findings mean for you.
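As a rough illustration of the sandbox pattern (not the exact setup from the post), the sketch below pre-loads placeholder cleaning tools into a namespace and executes model-written Python against it; the tool functions and the exec-based "sandbox" stand in for whatever isolated interpreter and tool library you actually use.

```python
# Rough sketch: instead of exposing tools as separate function-calling endpoints,
# the model writes one Python script that runs in a sandbox whose namespace is
# pre-loaded with the tools. Tool implementations here are illustrative placeholders.

def dedupe_rows(rows):
    """Placeholder tool: drop exact duplicate rows."""
    seen, out = set(), []
    for row in rows:
        key = tuple(row.items())
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

def normalize_dates(rows, column):
    """Placeholder tool: trim whitespace in a date column."""
    for row in rows:
        row[column] = row[column].strip()
    return rows

SANDBOX_TOOLS = {"dedupe_rows": dedupe_rows, "normalize_dates": normalize_dates}

def run_in_sandbox(model_written_code: str, rows):
    """Execute model-generated cleaning code with the tools already in scope."""
    namespace = {**SANDBOX_TOOLS, "rows": rows}
    exec(model_written_code, namespace)  # the model's plan, expressed as code
    return namespace.get("cleaned_rows", rows)

# Example of code the model might return after developing its cleaning plan:
plan_code = """
cleaned_rows = normalize_dates(dedupe_rows(rows), column="date")
"""
print(run_in_sandbox(plan_code, [{"date": " 2025-09-03 "}, {"date": " 2025-09-03 "}]))
```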
September 2, 2025
How do you know if an AI model is actually learning, or just getting better at faking it? A new paper from researchers at Scale introduces Rubrics as Rewards (RaR), a framework that solves this problem by training models with structured, expert-designed checklists instead of simple preference scores. This approach moves the human role from a simple preference labeler to an expert architect of the AI's values, resulting in up to a 28% performance leap on challenging benchmarks and providing a more transparent, effective path toward reliable AI.
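To make the idea concrete, here is a minimal, hypothetical sketch of a rubric-style reward: a weighted checklist is scored per response and collapsed into a scalar in [0, 1]. The criteria, weights, and simple string checks are illustrative stand-ins; the paper uses expert-designed rubrics with LLM judges rather than hand-coded checks.

```python
# Minimal sketch in the spirit of Rubrics as Rewards: an expert-written checklist
# of weighted criteria is scored per response and collapsed into a scalar reward.
# Criteria, weights, and check functions are illustrative, not the paper's rubrics.
from dataclasses import dataclass
from typing import Callable

@dataclass
class RubricItem:
    description: str
    weight: float
    check: Callable[[str], bool]  # in practice this would be an LLM judge

RUBRIC = [
    RubricItem("States the final answer explicitly", 2.0,
               lambda resp: "answer:" in resp.lower()),
    RubricItem("Shows intermediate reasoning", 1.0,
               lambda resp: len(resp.split()) > 30),
    RubricItem("Justifies key claims", 1.0,
               lambda resp: "because" in resp.lower()),
]

def rubric_reward(response: str, rubric=RUBRIC) -> float:
    """Weighted fraction of rubric criteria the response satisfies, in [0, 1]."""
    total = sum(item.weight for item in rubric)
    earned = sum(item.weight for item in rubric if item.check(response))
    return earned / total
```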
August 25, 2025
Scale AI has been awarded a $99 million contract by the U.S. Department of Defense to accelerate Army research and development in artificial intelligence. Building on its expanding partnership with the Pentagon, Scale will deliver data operations, platforms, and engineering support to help the Army adopt AI across critical missions.
August 21, 2025
In today’s episode, Scale’s enterprise team (Clemens Viernickel, Mark Pfeiffer, Sam Denton, and Felix Su) reviews several viral AI agent demos from around the internet and decides how realistic it would be to deploy each one, in its current form, in an enterprise environment. What do you think of their votes?
August 19, 2025
AI is moving beyond text, toward agents that can listen, speak, and interact naturally with the world. Voice AI requires far more than words; it demands the nuanced tones, emotions, and dynamics of human speech. But unlike text, there’s no vast public library of labeled audio to train on. Scale is building that foundation, delivering high-quality, diverse, and emotionally rich speech data to power every stage of model development. From real-time conversation to multimodal perception, these datasets are unlocking the next era of human-computer interaction. The future is listening.
August 18, 2025
Agentic AI is the most exciting breakthrough Scale is pioneering. As a world leader in Agentic AI, Scale is putting AI agents to work that operate autonomously and tirelessly carry out complex tasks.