Leaderboards
Expert-Driven Private Evaluations
Discover the SEAL LLM Leaderboards, where leading large language models (LLMs) are ranked using a rigorous, precise, and reliable evaluation methodology.
Developed by Scale’s Safety, Evaluations, and Alignment Lab (SEAL), these leaderboards use private datasets to guarantee fair and uncontaminated results. Regular updates keep the leaderboards current with the latest AI advancements, making them an essential resource for understanding the performance and safety of top LLMs.
Private Datasets
Scale’s proprietary, private evaluation datasets can’t be gamed, ensuring unbiased and uncontaminated results.
Evolving Competition
We periodically update leaderboards with new datasets and models, fostering a dynamic, contest-like environment.
Expert Evaluations
Our evaluations are performed by thoroughly vetted experts using domain-specific methodologies, ensuring the highest quality and credibility.
Learn more about our LLM evaluation methodology
If you’d like to add your model to this leaderboard or a future version, please contact seal@scale.com. To ensure leaderboard integrity, a model can only be featured the FIRST time its organization encounters the evaluation prompts.