[Leaderboard score table: per-model scores reported as mean ± confidence interval (e.g. 52.13±3.01), with entries including o3-pro-2025-06-10-high, o3-2025-04-16-high, and Claude Sonnet 4 (Thinking).]
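Each leaderboard entry is reported as a mean score together with a confidence interval (for example, 52.13±3.01). As a minimal sketch of how such an interval could be derived from per-prompt scores, assuming a normal-approximation interval (the leaderboard's exact interval construction is not stated here):

```python
import math

def mean_with_ci(scores, z=1.96):
    """Return (mean, half_width) for a list of per-prompt scores in [0, 100].

    Uses a normal-approximation confidence interval; the interval method
    actually used by the leaderboard is an assumption here.
    """
    n = len(scores)
    mean = sum(scores) / n
    # Sample variance with Bessel's correction.
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    half_width = z * math.sqrt(var / n)
    return mean, half_width

# A score displayed as "52.13±3.01" corresponds to mean=52.13 and
# half_width=3.01 over that model's evaluated prompts.
scores = [100 if correct else 0 for correct in [True, False, True, True, False]]
m, hw = mean_with_ci(scores)
print(f"{m:.2f}±{hw:.2f}")
```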
We conduct high-complexity evaluations to expose model failures, prevent benchmark saturation, and push model capabilities, while continuously evaluating the latest frontier models.
Humans design complex evaluations and define precise criteria to assess models, while LLMs scale those evaluations, ensuring efficiency and alignment with human judgment.
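Concretely, this hybrid setup can be pictured as an LLM judge applying human-authored criteria to each model response. The sketch below is only an assumption about the general shape of such a pipeline, not the evaluator actually used; `Criterion`, `grade_response`, and `call_llm` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    """A human-authored grading criterion with a weight."""
    description: str
    weight: float

def grade_response(prompt: str, response: str, criteria: list[Criterion],
                   call_llm: Callable[[str], str]) -> float:
    """Score one response against human-defined criteria using an LLM judge.

    `call_llm` is a stand-in for whatever model API the evaluator uses; it is
    assumed to answer "YES" or "NO" for each criterion check.
    """
    total, earned = 0.0, 0.0
    for c in criteria:
        judge_prompt = (
            "You are grading a model response.\n"
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Does the response satisfy this criterion: {c.description}?\n"
            "Answer YES or NO."
        )
        verdict = call_llm(judge_prompt).strip().upper()
        total += c.weight
        if verdict.startswith("YES"):
            earned += c.weight
    return 100.0 * earned / total if total else 0.0

# Example with a stub judge that always answers YES; a real evaluator would
# call an actual LLM here.
criteria = [Criterion("States the correct final answer", 2.0),
            Criterion("Shows the key reasoning step", 1.0)]
print(grade_response("What is 2+2?", "4, because 2+2=4.", criteria, lambda _: "YES"))
```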
Our leaderboards are built on carefully curated evaluation sets, combining private datasets, which prevent overfitting, with open-source datasets for broad benchmarking and comparability.
If you'd like to add your model to this leaderboard or a future version, please contact leaderboards@scale.com. To ensure leaderboard integrity, we require that a model can only be featured the FIRST TIME its organization encounters the prompts.