| Rank | Model | Score | 95% Confidence Interval |
|------|-------|-------|-------------------------|
| 1 | o1 (December 2024) | 91.96 | +1.60 / -1.61 |
| 2 | DeepSeek R1 | 87.75 | +1.91 / -1.91 |
| 3 | o1-preview | 86.58 | +1.58 / -1.57 |
| 4 | Gemini 2.0 Flash Experimental (December 2024) | 86.58 | +1.83 / -1.83 |
| 5 | Claude 3.5 Sonnet (June 2024) | 85.96 | +1.39 / -1.39 |
| 6 | GPT-4o (May 2024) | 85.29 | +1.42 / -1.41 |
| 7 | Llama 3.1 405B Instruct | 84.85 | +1.40 / -1.40 |
| 8 | Gemini 1.5 Pro (August 27, 2024) | 84.17 | +1.65 / -1.65 |
| 9 | GPT-4 Turbo Preview | 83.19 | +1.31 / -1.31 |
| 10 | Mistral Large 2 | 82.81 | +1.66 / -1.66 |
| 11 | GPT-4o (November 2024) | 82.52 | +2.10 / -2.09 |
| 12 | DeepSeek V3 | 82.34 | +2.08 / -2.09 |
| 13 | Llama 3.2 90B Vision Instruct | 82.07 | +1.74 / -1.75 |
| 14 | Llama 3 70B Instruct | 81.17 | +1.77 / -1.77 |
| 15 | GPT-4o (August 2024) | 80.17 | +1.70 / -1.70 |
| 16 | Claude 3 Opus | 80.12 | +1.54 / -1.55 |
| 17 | Mistral Large | 79.89 | +1.67 / -1.66 |
| 18 | GPT-4 (November 2024) | 79.50 | +1.92 / -1.93 |
| 19 | Gemini 1.5 Pro (May 2024) | 79.37 | +1.70 / -1.69 |
| 20 | Gemini 1.5 Pro (April 2024) | 78.52 | +2.33 / -2.32 |
| 21 | Claude 3 Sonnet | 78.24 | +2.19 / -2.19 |
| 22 | Gemini 1.5 Flash | 77.25 | +1.96 / -1.97 |
| 23 | Gemini 1.0 Pro | 67.97 | +2.61 / -2.62 |
| 24 | CodeLlama 34B Instruct | 57.69 | +2.58 / -2.57 |