2025 Scale AI. All rights reserved.
Humanity's Last Exam (Text Only)
Models evaluated on text-only HLE questions
Last updated: April 16, 2025
Performance Comparison
Rank  Model        Score           Calib Err
1     -            20.57 ± 1.71    36
1     -            19.78 ± 1.68    37
1     -            18.90 ± 1.65    58
1     -            18.38 ± 1.64    71
5     -            14.53 ± 1.49    58
5     -            13.37 ± 1.44    80
5     -            12.58 ± 1.40    81
7     -            10.31 ± 1.28    81
8     DeepSeek R1   8.54 ± 1.18    73
8     -             7.89 ± 1.14    81
9     -             7.75 ± 1.13    84
9     -             7.71 ± 1.13    82
9     -             6.55 ± 1.04    82
10    -             5.80 ± 0.99    83
13    -             5.34 ± 0.95    84
13    -             4.97 ± 0.92    88
14    -             4.55 ± 0.88    80
14    -             4.55 ± 0.88    87
14    -             4.41 ± 0.87    84
14    -             4.32 ± 0.86    83
14    -             4.32 ± 0.86    80
15    -             3.76 ± 0.80    83
22    -             2.32 ± 0.64    88
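
The repeated and skipped rank values (four entries at rank 1, then rank 5, and so on) look unusual at first. They appear consistent with one rule: a model's rank is 1 plus the number of models whose lower bound (score minus the ± value) is strictly greater than that model's upper bound (score plus the ± value). This is inferred from the numbers listed above and is not stated on the page itself. The sketch below checks that reading against the listed scores; the names `ROWS` and `rank_upper_bound` are hypothetical, and `Decimal` is used so that exact ties such as 9.03 versus 9.03 are not disturbed by floating-point rounding.

```python
from decimal import Decimal

# (score, ± half-width) pairs copied from the leaderboard, in listed order.
ROWS = [
    ("20.57", "1.71"), ("19.78", "1.68"), ("18.90", "1.65"), ("18.38", "1.64"),
    ("14.53", "1.49"), ("13.37", "1.44"), ("12.58", "1.40"), ("10.31", "1.28"),
    ("8.54", "1.18"), ("7.89", "1.14"), ("7.75", "1.13"), ("7.71", "1.13"),
    ("6.55", "1.04"), ("5.80", "0.99"), ("5.34", "0.95"), ("4.97", "0.92"),
    ("4.55", "0.88"), ("4.55", "0.88"), ("4.41", "0.87"), ("4.32", "0.86"),
    ("4.32", "0.86"), ("3.76", "0.80"), ("2.32", "0.64"),
]


def rank_upper_bound(rows):
    """Assumed rule: rank = 1 + count of models whose lower bound
    strictly exceeds this model's upper bound."""
    parsed = [(Decimal(score), Decimal(half)) for score, half in rows]
    ranks = []
    for score, half in parsed:
        upper = score + half
        # Count models that are clearly ahead: their whole interval lies above ours.
        clearly_better = sum(1 for s, h in parsed if s - h > upper)
        ranks.append(1 + clearly_better)
    return ranks


if __name__ == "__main__":
    for (score, half), rank in zip(ROWS, rank_upper_bound(ROWS)):
        print(f"{rank:>2}  {score} ± {half}")
```

Under this reading, entries that share a rank are ones whose intervals overlap enough that neither is clearly ahead of the other, which is why the top four scores all sit at rank 1.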