| Rank (UB) | Model | Score | 95% CI |
|---|---|---|---|
| 1 | Gemini 2.5 Pro Experimental (March 2025) | 54.65 | +1.46 / -1.46 |
| 2 | Claude 3.7 Sonnet Thinking (February 2025) | 48.23 | +0.70 / -0.70 |
| 3 | Gemini 2.0 Flash Thinking Experimental (January 2025) | 45.50 | +1.20 / -1.20 |
| 3 | o1 (December 2024) | 45.25 | +0.40 / -0.40 |
| 4 | Gemini 2.0 Pro Experimental (February 2025) | 43.25 | +1.26 / -1.26 |
| 5 | Claude 3.7 Sonnet (February 2025) | 43.02 | +1.14 / -1.14 |
| 5 | GPT-4.5 Preview (February 2025) | 42.11 | +1.39 / -1.39 |
| 7 | Gemini 2.0 Flash Experimental (December 2024) | 39.95 | +0.80 / -0.80 |
| 8 | Gemini 2.0 Flash (February 2025) | 39.85 | +0.71 / -0.71 |
| 8 | Claude 3.5 Sonnet (October 2024) | 38.72 | +0.51 / -0.51 |
| 10 | Claude 3.5 Sonnet (June 2024) | 38.37 | +0.70 / -0.70 |
| 10 | ChatGPT-4o-latest (November 2024) | 37.99 | +0.48 / -0.48 |
| 10 | Gemini 1.5 Pro | 37.07 | +1.34 / -1.34 |
| 14 | GPT-4o (August 2024) | 34.94 | +0.23 / -0.23 |
| 14 | Gemini 1.5 Flash 002 | 34.03 | +1.41 / -1.41 |
| 15 | Pixtral Large (November 2024) | 33.89 | +0.69 / -0.69 |
| 15 | Gemini 2.0 Flash Lite Preview (February 2025) | 32.69 | +1.40 / -1.40 |
| 18 | Qwen2-VL-72B-Instruct | 28.56 | +1.37 / -1.37 |
| 18 | Claude 3 Opus | 27.82 | +0.55 / -0.55 |
| 20 | Nova Pro | 26.27 | +0.61 / -0.61 |
| 20 | Pixtral 12B (September 2024) | 25.97 | +0.74 / -0.74 |
| 20 | Nova Lite | 25.50 | +0.77 / -0.77 |
| 21 | Llama 3.2 90B Vision Instruct | 24.61 | +0.80 / -0.80 |
| 24 | Llama 3.2 11B Vision-Instruct | 20.47 | +0.15 / -0.15 |
| 25 | Phi 3.5 Vision-Instruct | 15.18 | +0.81 / -0.81 |
Rank (UB): 1 + the number of models whose lower CI bound exceeds this model’s upper CI bound.
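The Rank (UB) rule above can be sketched in a few lines of Python. This is only an illustration, not the leaderboard's own code: the `Entry` type and `rank_ub` function are hypothetical names, and the sketch assumes each model's CI bounds are simply its score plus or minus the listed margins. The sample data is the first seven rows of the table.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    # Hypothetical record type for one leaderboard row.
    model: str
    score: float
    ci_plus: float   # margin added to the score for the upper CI bound
    ci_minus: float  # margin subtracted from the score for the lower CI bound


def rank_ub(entries: list[Entry]) -> dict[str, int]:
    """Rank (UB) = 1 + number of models whose lower CI bound
    exceeds this model's upper CI bound."""
    ranks = {}
    for e in entries:
        upper = e.score + e.ci_plus
        # A model never outranks itself, since its lower bound <= its upper bound.
        clearly_better = sum(1 for o in entries if o.score - o.ci_minus > upper)
        ranks[e.model] = 1 + clearly_better
    return ranks


# First seven rows of the table (scores and margins copied as listed).
entries = [
    Entry("Gemini 2.5 Pro Experimental", 54.65, 1.46, 1.46),
    Entry("Claude 3.7 Sonnet Thinking", 48.23, 0.70, 0.70),
    Entry("Gemini 2.0 Flash Thinking Experimental", 45.50, 1.20, 1.20),
    Entry("o1", 45.25, 0.40, 0.40),
    Entry("Gemini 2.0 Pro Experimental", 43.25, 1.26, 1.26),
    Entry("Claude 3.7 Sonnet", 43.02, 1.14, 1.14),
    Entry("GPT-4.5 Preview", 42.11, 1.39, 1.39),
]

ranks = rank_ub(entries)
for name, rank in ranks.items():
    print(rank, name)
```

Because the rule counts only models whose intervals are entirely above this one, models with overlapping intervals can share a rank, which is why the table contains ties and skipped rank values.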
o1-pro coming soon.