Humanity's Last Exam (Text Only)
Models evaluated on text-only HLE questions
Performance Comparison
22.06 ± 1.75 (Calib Err: 72)
20.57 ± 1.71 (Calib Err: 36)
19.78 ± 1.68 (Calib Err: 37)
18.90 ± 1.65 (Calib Err: 58)
18.38 ± 1.64 (Calib Err: 71)
18.38 ± 1.64 (Calib Err: 70)
14.53 ± 1.49 (Calib Err: 58)
14.04 ± 1.47 (Calib Err: 78)
13.37 ± 1.44 (Calib Err: 80)
12.58 ± 1.40 (Calib Err: 81)
11.75 ± 1.36 (Calib Err: 74)
10.80 ± 1.31 (Calib Err: 73)
10.72 ± 1.31 (Calib Err: 83)
10.31 ± 1.28 (Calib Err: 81)
DeepSeek R1: 8.54 ± 1.18 (Calib Err: 73)
7.89 ± 1.14 (Calib Err: 81)
7.75 ± 1.13 (Calib Err: 84)
7.71 ± 1.13 (Calib Err: 82)
7.60 ± 1.12 (Calib Err: 76)
6.55 ± 1.04 (Calib Err: 82)
6.26 ± 1.02 (Calib Err: 75)
5.80 ± 0.99 (Calib Err: 83)
5.42 ± 0.96 (Calib Err: 76)
5.34 ± 0.95 (Calib Err: 84)
4.97 ± 0.92 (Calib Err: 88)
4.55 ± 0.88 (Calib Err: 80)
4.55 ± 0.88 (Calib Err: 87)
4.41 ± 0.87 (Calib Err: 84)
4.36 ± 0.86 (Calib Err: 78)
4.32 ± 0.86 (Calib Err: 83)
4.32 ± 0.86 (Calib Err: 80)
3.76 ± 0.80 (Calib Err: 83)
2.32 ± 0.64 (Calib Err: 88)
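
The ± values listed above are numerically consistent with a 95% normal-approximation (Wald) interval for a binomial proportion. The sketch below illustrates that reading. It is an inference from the numbers, not something the page states: the interval method, the helper name `margin_95`, and the question count `N_TEXT_ONLY = 2158` are all assumptions.

```python
import math


def margin_95(accuracy_pct: float, n_questions: int) -> float:
    """Half-width, in percentage points, of a 95% Wald interval.

    Assumes each question is an independent pass/fail trial, so accuracy
    behaves like a binomial proportion. This is an assumed method, not one
    confirmed by the leaderboard.
    """
    p = accuracy_pct / 100.0
    standard_error = math.sqrt(p * (1.0 - p) / n_questions)
    return 100.0 * 1.96 * standard_error


# Assumed size of the text-only question set (hypothetical; chosen because
# it is consistent with the listed margins).
N_TEXT_ONLY = 2158

for accuracy in (22.06, 8.54, 2.32):
    print(f"{accuracy:.2f} ± {margin_95(accuracy, N_TEXT_ONLY):.2f}")
```

Under these assumptions the loop reproduces the listed margins for the top entry, the DeepSeek R1 entry, and the bottom entry (1.75, 1.18, and 0.64). Because the margin depends only on the score and the question count, entries with equal scores share the same margin, as in the two 18.38 rows.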