Company Updates & Technology Articles
October 16, 2025
The world is buzzing about the potential of artificial intelligence. We've seen groups proclaiming a "$10 Trillion AI Revolution" and comparing its impact to the Industrial Revolution. Over the last 6–12 months, we've watched true Generative AI solutions begin to move from the lab into the enterprise.
Our new benchmark, VisualToolBench, reveals a striking limitation in today's most advanced AI: models are much better at "thinking about images" than "thinking with them." While AI can describe what's in a picture, it fails when asked to manipulate an image by cropping, editing, or enhancing it to solve complex, real-world problems. The results are stark, with no model scoring above 19% correctness. Dive into our breakdown of why even the best models fail, what this reveals about the core bottleneck of visual perception, and how these findings create a new roadmap for the future of AI.
October 7, 2025
Many enterprise problems lack simple yes/no solutions, causing common AI training methods to fall short. Scale’s Rubrics as Rewards (RaR) method solves this by using a detailed, multi-faceted rubric for evaluation instead of a simple reward signal. This approach enables smaller, fine-tuned models to match or outperform much larger, general-purpose models on specialized tasks. For instance, on a legal analysis test set, a small Qwen3-4B model trained with RaR surpassed the performance of the much larger GPT-4.1. For enterprises, this translates directly to lower costs, more transparency, and tighter control, delivering superior performance on the complex workflows that matter most.
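The core idea of rubric-based rewards can be sketched in a few lines: score a model response against each rubric criterion independently, then aggregate the weighted scores into a single reward for training. The sketch below is purely illustrative, assuming hypothetical criteria, weights, and scores; it is not Scale's actual RaR implementation.

```python
# Illustrative sketch of a rubric-as-reward aggregator (not Scale's code).
# Each criterion is scored independently (e.g., by a judge model on [0, 1]);
# the weighted mean becomes the reward signal used for fine-tuning.

from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    weight: float

def rubric_reward(scores: dict[str, float], rubric: list[Criterion]) -> float:
    """Combine per-criterion scores (each in [0, 1]) into one scalar reward."""
    total_weight = sum(c.weight for c in rubric)
    return sum(c.weight * scores[c.name] for c in rubric) / total_weight

# Hypothetical rubric for a legal-analysis task.
rubric = [
    Criterion("cites_relevant_statute", 2.0),
    Criterion("correct_legal_conclusion", 3.0),
    Criterion("clear_reasoning", 1.0),
]

scores = {
    "cites_relevant_statute": 1.0,
    "correct_legal_conclusion": 1.0,
    "clear_reasoning": 0.5,
}

reward = rubric_reward(scores, rubric)  # (2.0 + 3.0 + 0.5) / 6.0 ≈ 0.917
```

Because the reward decomposes over named criteria, a failing response can be traced to the specific rubric items it missed, which is what makes this signal richer than a single yes/no label.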
September 24, 2025
Scale’s Data Engine for Physical AI is a comprehensive data collection and annotation solution that provides the massive, high-quality datasets robotics companies need to train foundation models.
September 22, 2025
SEAL Showdown is a new public AI leaderboard from Scale that evaluates large language models based on real-world user preferences rather than synthetic tests or hobbyist feedback. Unlike existing leaderboards, it captures granular insights by demographics, regions, professions, and use cases, drawing on millions of conversations from a diverse global contributor base. Designed to be trustworthy and resistant to gaming, SEAL Showdown sets a new standard for model evaluation by showing how AI performs for people like you.
September 19, 2025
Benchmarks play a critical role in measuring the progress of AI coding agents, but most fall short by relying on contaminated training data, oversimplified bug fixes, or narrow task coverage. SWE-Bench Pro solves these problems with contamination-resistant repositories, diverse and industrially relevant codebases, and human-in-the-loop curation that preserves real-world difficulty. With reproducible, end-to-end evaluation, SWE-Bench Pro sets a new gold standard for testing advanced AI coding agents.
While today's agents show promise, the benchmarks used to evaluate them often test simple, isolated skills that don't reflect real-world work. To close this gap, Scale is launching a new suite of evaluations designed to measure an agent's ability to perform complex, end-to-end tasks. Our first two leaderboards set a new, more difficult standard for the industry. SWE-Bench Pro challenges agents with professional software engineering tasks in complex, proprietary codebases they've never seen before. MCP-Atlas measures an agent's ability to skillfully orchestrate over 300 real-world digital tools to solve a single problem. Read the full post to learn about our framework for building a more reliable yardstick for the future of AI.
MCP-Atlas is a real-world leaderboard for agentic tool use via the Model Context Protocol. It runs 1,000 single-turn tasks across 40+ servers and 300+ tools—search, databases, filesystems, APIs, and dev tools—each requiring 3–6 calls with distractors. We score exact-answer pass rate and provide diagnostics. Early results: even the top model completes less than half of tasks, with failures concentrated in tool selection, parameter construction, and orchestration. Built for model and product teams, MCP-Atlas pinpoints what to fix.
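Exact-answer pass rate is a simple metric: a task counts as passed only when the agent's final answer matches the reference. The sketch below illustrates the idea with made-up answers; the whitespace normalization is an assumed detail, not MCP-Atlas's published matching rule.

```python
# Illustrative exact-answer pass-rate scorer (hypothetical data, not
# MCP-Atlas's implementation). A task passes only on an exact match of
# the final answer, here after trimming surrounding whitespace.

def pass_rate(predictions: list[str], references: list[str]) -> float:
    """Fraction of tasks whose prediction exactly matches the reference."""
    passed = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return passed / len(references)

preds = ["42", "Paris ", "blue"]
refs = ["42", "Paris", "green"]
rate = pass_rate(preds, refs)  # 2 of 3 exact matches ≈ 0.667
```

All-or-nothing scoring like this is deliberately unforgiving: a near-miss in tool selection or parameter construction anywhere in a 3–6 call chain yields a wrong final answer and zero credit for the task.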
The future of artificial intelligence will be shaped by those who build it together. Right now, there is no more important partnership in technology than the one being forged between the United States and the United Kingdom. This transatlantic alliance, strengthened by a historic bilateral technology agreement, is creating a center of gravity for AI innovation, and at Scale AI, we are proud to be at the heart of it.
September 17, 2025
Scale's new agreement with the U.S. Department of Defense, known as an Other Transaction Authority (OTA), is designed specifically to help the DoD move at speed and partner with non-traditional tech companies like Scale. It streamlines the procurement process, allowing any component across the entire DoD to access our end-to-end AI platform.