Scale AI Research
Scale AI’s mission is to accelerate the development of AI applications. By advancing research, we aim to create AI systems capable of solving complex, human-level problems.


Critical Foreign Policy Decisions (CFPD)-Benchmark: Measuring Diplomatic Preferences in Large Language Models
March 8, 2025
Safety, Evaluation and Alignment

The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems
March 5, 2025
Safety, Evaluation and Alignment

ENIGMAEVAL: A Benchmark of Long Multimodal Reasoning Challenges
February 13, 2025
Reasoning
Safety, Evaluation and Alignment

J2: Jailbreaking to Jailbreak
February 11, 2025
Safety, Evaluation and Alignment

ProjectTest: A Project-level LLM Unit Test Generation Benchmark and Impact of Error Fixing Mechanisms
February 10, 2025
Safety, Evaluation and Alignment

MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs
January 29, 2025
Safety, Evaluation and Alignment
Reasoning

Humanity's Last Exam
January 23, 2025
Safety, Evaluation and Alignment
Reasoning

ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark
January 2, 2025
Safety, Evaluation and Alignment
Reasoning
Oversight

Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents
October 11, 2024
Safety, Evaluation and Alignment