Humanity's Last Exam (Text Only)

Models evaluated on text-only HLE questions

Last updated: April 16, 2025

Performance Comparison

Rank  Score        Calib Err  Model
1     22.06±1.75   72
1     20.57±1.71   36
1     19.78±1.68   37
1     18.90±1.65   58
2     18.38±1.64   71
2     18.38±1.64   70
7     14.53±1.49   58
7     14.04±1.47   78
7     13.37±1.44   80
7     12.58±1.40   81
7     11.75±1.36   74
9     10.80±1.31   73
9     10.72±1.31   83
10    10.31±1.28   81
12     8.54±1.18   73         DeepSeek R1
14     7.89±1.14   81
15     7.75±1.13   84
15     7.71±1.13   82
15     7.60±1.12   76
15     6.55±1.04   82
16     6.26±1.02   75
16     5.80±0.99   83
20     5.42±0.96   76
20     5.34±0.95   84
20     4.97±0.92   88
21     4.55±0.88   80
21     4.55±0.88   87
21     4.41±0.87   84
22     4.36±0.86   78
22     4.32±0.86   83
22     4.32±0.86   80
23     3.76±0.80   83
32     2.32±0.64   88