Today, we’re announcing the expansion of our research mandate with the launch of Scale Labs. Scale’s Safety, Evaluation, and Alignment Lab (SEAL), launched in 2023, established Scale as an authority in benchmarking frontier AI systems, with leading labs citing SEAL benchmarks across major model releases and system cards. Extending SEAL’s work, Scale Labs is our new hub for research initiatives spanning model capability, agentic and multimodal systems, post-training and evaluation methods, enterprise deployment, and collaboration with global governments and national research institutes to further their adoption of AI.
Scale is uniquely positioned to expand its research efforts. Through collaborations across research, enterprise, and government, we see how advanced models are built, evaluated, and used in practice. That track record, along with operating the industry’s largest human evaluation infrastructure, gives us rare visibility into how AI systems behave both under rigorous evaluation and in real production environments. As AI advances, so does our perspective.
Scale Labs studies how advanced AI systems actually behave as they become more powerful and are deployed in real-world settings. We study how to test, improve, and evaluate these systems so we understand how they perform in complex workflows, under pressure, and in high-stakes environments. We build better ways of measuring capability, reliability, and risk so that companies, governments, and national research institutes can use these systems with clearer expectations and stronger oversight. Our primary areas of focus include:
Together, these areas center on understanding how advanced systems behave outside the lab, helping Scale and our partners build the infrastructure needed to measure and oversee them responsibly.
Recent launches include:
Previous benchmarks developed under SEAL were designed to remain informative as frontier performance rises. These include Humanity’s Last Exam (HLE), MCP-Atlas, and SWE-Bench Pro, now recommended by OpenAI in place of SWE-Bench Verified. We are also extending benchmarks such as FORTRESS for use in national and public-sector contexts, expanding their role in AI safety evaluation.
The next phase of AI will be defined by how AI systems behave in the real world: in actual workflows, institutions, and decision-making processes. Scale Labs exists to help the field and national governments understand and shape that behavior. Our work is already appearing across new research releases and academic venues, including recent papers accepted to ICLR exploring rubric-based rewards, evaluation-driven reinforcement learning, and agentic safety measurement.
Explore our research page and ongoing commentary on our blog and on X.