Scale AI Research
Scale AI’s mission is to accelerate the development of AI applications. By advancing research, we aim to create AI systems capable of solving complex, human-level problems.


Assessing Robustness to Spurious Correlations in Post-Training Language Models
May 9, 2025
Post-Training
Science of Data

Relevance Isn't All You Need: Scaling RAG Systems With Inference-Time Compute Via Multi-Criteria Reranking
March 14, 2025
Reasoning

Critical Foreign Policy Decisions (CFPD)-Benchmark: Measuring Diplomatic Preferences in Large Language Models
March 8, 2025
Safety, Evaluation and Alignment

The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems
March 5, 2025
Safety, Evaluation and Alignment

ENIGMAEVAL: A Benchmark of Long Multimodal Reasoning Challenges
February 13, 2025
Reasoning
Safety, Evaluation and Alignment

J2: Jailbreaking to Jailbreak
February 11, 2025
Safety, Evaluation and Alignment

ProjectTest: A Project-level LLM Unit Test Generation Benchmark and Impact of Error Fixing Mechanisms
February 10, 2025
Safety, Evaluation and Alignment

MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs
January 29, 2025
Safety, Evaluation and Alignment
Reasoning

Humanity's Last Exam
January 23, 2025
Safety, Evaluation and Alignment
Reasoning