| Rank (UB) | Model | Score | 95% CI |
|---|---|---|---|
| 1 | Gemini 2.5 Pro Experimental (March 2025) | 51.91 | +0.99 / -0.99 |
| 1 | Claude 3.7 Sonnet Thinking (February 2025) | 51.58 | +1.98 / -1.98 |
| 3 | o1 (December 2024) | 44.93 | +3.29 / -3.29 |
| 3 | GPT-4.5 Preview (February 2025) | 43.77 | +1.60 / -1.60 |
| 3 | Claude 3.5 Sonnet (October 2024) | 43.20 | +3.07 / -3.07 |
| 3 | Claude 3.7 Sonnet (February 2025) | 42.89 | +2.25 / -2.25 |
| 3 | o3-mini (medium) | 40.09 | +2.89 / -2.89 |
| 3 | o3-mini (high) | 39.89 | +2.64 / -2.64 |
| 4 | Gemini 2.0 Pro Experimental (February 2025) | 40.67 | +1.32 / -1.32 |
| 5 | Gemini 2.0 Flash Thinking Experimental (January 2025) | 37.78 | +3.67 / -3.67 |
| 5 | Gemini 2.0 Flash (February 2025) | 36.88 | +4.25 / -4.25 |
| 8 | o1-preview | 37.28 | +0.69 / -0.69 |
| 11 | o1-mini | 34.49 | +1.43 / -1.43 |
| 11 | Gemini 2.0 Flash Experimental (December 2024) | 33.51 | +2.84 / -2.84 |
| 12 | DeepSeek R1 | 32.01 | +1.40 / -1.40 |
| 16 | GPT-4o (November 2024) | 27.81 | +1.44 / -1.44 |
| 16 | GPT-4 (November 2024) | 25.22 | +2.29 / -2.29 |
| 17 | Llama 3.3 70B Instruct | 24.84 | +0.55 / -0.55 |
| 17 | Nova Pro | 20.73 | +3.64 / -3.64 |
| 18 | Gemini 1.5 Pro Experimental (August 2024) | 21.59 | +2.60 / -2.60 |
| 19 | Qwen 2 72B Instruct | 19.99 | +2.84 / -2.84 |
| 19 | Qwen 2.5 14B Instruct | 18.34 | +1.06 / -1.06 |
| 20 | Qwen 2.5 72B Instruct | 17.34 | +0.74 / -0.74 |
| 20 | Llama 3.2 3B Instruct | 17.00 | +1.87 / -1.87 |
| 24 | Llama 3.1 405B Instruct | 16.22 | +0.34 / -0.34 |
| 24 | Mistral Large 2 | 15.23 | +1.04 / -1.04 |
| 25 | GPT-4o (August 2024) | 12.16 | +3.52 / -3.52 |
| 27 | Mixtral 8x7B Instruct v0.1 | 11.92 | +1.67 / -1.67 |

Rank (UB): 1 + the number of models whose lower CI bound exceeds this model’s upper CI bound.
o1-pro coming soon.
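
The Rank (UB) definition above can be sketched in a few lines of Python. This is a minimal illustration, not the leaderboard's actual code: the `Entry` type and the `rank_ub` function are hypothetical names, and the sketch assumes each interval is symmetric, so a model's bounds are its score plus or minus the listed half-width.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """One leaderboard row (hypothetical helper type)."""
    model: str
    score: float
    ci_half_width: float  # assumed symmetric 95% CI half-width


def rank_ub(entries: list[Entry]) -> dict[str, int]:
    """Rank (UB) = 1 + number of models whose lower CI bound
    exceeds this model's upper CI bound."""
    ranks = {}
    for e in entries:
        upper = e.score + e.ci_half_width
        # Count models that are better even at the edge of both intervals.
        clearly_better = sum(
            1 for other in entries
            if other.score - other.ci_half_width > upper
        )
        ranks[e.model] = 1 + clearly_better
    return ranks


# The first four rows of the table, used as sample input.
rows = [
    Entry("Gemini 2.5 Pro Experimental (March 2025)", 51.91, 0.99),
    Entry("Claude 3.7 Sonnet Thinking (February 2025)", 51.58, 1.98),
    Entry("o1 (December 2024)", 44.93, 3.29),
    Entry("GPT-4.5 Preview (February 2025)", 43.77, 1.60),
]
ranks = rank_ub(rows)
for row in rows:
    print(ranks[row.model], row.model)
```

Because the rank counts only models whose intervals lie entirely above a given model's interval, models with overlapping intervals can share a rank, and a row with a lower score but a wider interval can carry a better rank than a row with a higher score and a narrow interval.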