| Rank (UB) | Model | Score | 95% CI |
|---|---|---|---|
| 1 | Claude 3.7 Sonnet Thinking (February 2025) | 48.23 | +0.70 / -0.70 |
| 2 | Gemini 2.0 Flash Thinking Experimental (January 2025) | 45.50 | +1.20 / -1.20 |
| 2 | o1 (December 2024) | 45.25 | +0.40 / -0.40 |
| 3 | Gemini 2.0 Pro Experimental (February 2025) | 43.25 | +1.26 / -1.26 |
| 4 | Claude 3.7 Sonnet (February 2025) | 43.02 | +1.14 / -1.14 |
| 4 | GPT-4.5 Preview (February 2025) | 42.11 | +1.39 / -1.39 |
| 6 | Gemini 2.0 Flash Experimental (December 2024) | 39.95 | +0.80 / -0.80 |
| 7 | Gemini 2.0 Flash (February 2025) | 39.85 | +0.71 / -0.71 |
| 7 | Claude 3.5 Sonnet (October 2024) | 38.72 | +0.51 / -0.51 |
| 9 | Claude 3.5 Sonnet (June 2024) | 38.37 | +0.70 / -0.70 |
| 9 | ChatGPT-4o-latest (November 2024) | 37.99 | +0.48 / -0.48 |
| 9 | Gemini 1.5 Pro | 37.07 | +1.34 / -1.34 |
| 13 | GPT-4o (August 2024) | 34.94 | +0.23 / -0.23 |
| 13 | Gemini 1.5 Flash 002 | 34.03 | +1.41 / -1.41 |
| 14 | Pixtral Large (November 2024) | 33.89 | +0.69 / -0.69 |
| 14 | Gemini 2.0 Flash Lite Preview (February 2025) | 32.69 | +1.40 / -1.40 |
| 17 | Qwen2-VL-72B-Instruct | 28.56 | +1.37 / -1.37 |
| 17 | Claude 3 Opus | 27.82 | +0.55 / -0.55 |
| 19 | Nova Pro | 26.27 | +0.61 / -0.61 |
| 19 | Pixtral 12B (September 2024) | 25.97 | +0.74 / -0.74 |
| 19 | Nova Lite | 25.50 | +0.77 / -0.77 |
| 20 | Llama 3.2 90B Vision Instruct | 24.61 | +0.80 / -0.80 |
| 23 | Llama 3.2 11B Vision-Instruct | 20.47 | +0.15 / -0.15 |
| 24 | Phi 3.5 Vision-Instruct | 15.18 | +0.81 / -0.81 |
Rank (UB): 1 + the number of models whose lower CI bound exceeds this model’s upper CI bound.
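
The Rank (UB) rule above can be sketched in a few lines of Python. This is a minimal illustration, not the leaderboard's own code: the names `ROWS` and `rank_ub` are hypothetical, and it assumes each interval is symmetric, so a row is stored as (model, score, CI half-width) with the values copied from the leaderboard.

```python
# Each row: (model, score, half-width of the 95% CI), copied from the leaderboard.
# ROWS and rank_ub are illustrative names, not part of the original leaderboard.
ROWS = [
    ("Claude 3.7 Sonnet Thinking (February 2025)", 48.23, 0.70),
    ("Gemini 2.0 Flash Thinking Experimental (January 2025)", 45.50, 1.20),
    ("o1 (December 2024)", 45.25, 0.40),
    ("Gemini 2.0 Pro Experimental (February 2025)", 43.25, 1.26),
    ("Claude 3.7 Sonnet (February 2025)", 43.02, 1.14),
    ("GPT-4.5 Preview (February 2025)", 42.11, 1.39),
    ("Gemini 2.0 Flash Experimental (December 2024)", 39.95, 0.80),
    ("Gemini 2.0 Flash (February 2025)", 39.85, 0.71),
    ("Claude 3.5 Sonnet (October 2024)", 38.72, 0.51),
    ("Claude 3.5 Sonnet (June 2024)", 38.37, 0.70),
    ("ChatGPT-4o-latest (November 2024)", 37.99, 0.48),
    ("Gemini 1.5 Pro", 37.07, 1.34),
    ("GPT-4o (August 2024)", 34.94, 0.23),
    ("Gemini 1.5 Flash 002", 34.03, 1.41),
    ("Pixtral Large (November 2024)", 33.89, 0.69),
    ("Gemini 2.0 Flash Lite Preview (February 2025)", 32.69, 1.40),
    ("Qwen2-VL-72B-Instruct", 28.56, 1.37),
    ("Claude 3 Opus", 27.82, 0.55),
    ("Nova Pro", 26.27, 0.61),
    ("Pixtral 12B (September 2024)", 25.97, 0.74),
    ("Nova Lite", 25.50, 0.77),
    ("Llama 3.2 90B Vision Instruct", 24.61, 0.80),
    ("Llama 3.2 11B Vision-Instruct", 20.47, 0.15),
    ("Phi 3.5 Vision-Instruct", 15.18, 0.81),
]


def rank_ub(rows):
    """Return {model: rank}, where rank is 1 + the number of models
    whose lower CI bound exceeds this model's upper CI bound."""
    ranks = {}
    for name, score, half_width in rows:
        upper = score + half_width
        # Count models that are better even after allowing for both intervals.
        clearly_better = sum(1 for _, s, h in rows if s - h > upper)
        ranks[name] = 1 + clearly_better
    return ranks


RANKS = rank_ub(ROWS)

if __name__ == "__main__":
    for name, score, _ in ROWS:
        print(f"{RANKS[name]:>2}  {score:5.2f}  {name}")
```

Run against the rows listed here, the sketch reproduces the published ranks. It also shows why ranks can tie or skip: Gemini 2.0 Flash Thinking Experimental and o1 both get rank 2 because only the top model's lower bound (47.53) clears their upper bounds (46.70 and 45.65), and no model receives rank 5 because no model has exactly four models clearly ahead of it.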