| Rank\* | Model | Score | 95% CI |
|---|---|---|---|
| 1 | o1-mini | 1237 | +31 / -28 |
| 2 | o3-mini | 1137 | +34 / -34 |
| 2 | GPT-4o (November 2024) | 1132 | +31 / -32 |
| 2 | o1-preview | 1123 | +35 / -32 |
| 2 | Gemini 2.0 Flash Experimental (December 2024) | 1111 | +26 / -26 |
| 2 | Gemini 2.0 Pro (December 2024) | 1109 | +33 / -35 |
| 2 | Gemini 2.0 Flash Thinking (January 2025) | 1108 | +37 / -31 |
| 2 | DeepSeek R1 | 1100 | +31 / -31 |
| 2 | o1 (December 2024) | 1083 | +30 / -30 |
| 4 | Claude 3.5 Sonnet (June 2024) | 1079 | +22 / -22 |
| 8 | GPT-4o (August 2024) | 1045 | +25 / -24 |
| 9 | GPT-4o (May 2024) | 1036 | +24 / -25 |
| 10 | GPT-4 Turbo Preview | 1034 | +22 / -22 |
| 11 | Mistral Large 2 | 1029 | +23 / -24 |
| 11 | Llama 3.1 405B Instruct | 1022 | +24 / -23 |
| 11 | Gemini 1.5 Pro (August 27, 2024) | 1007 | +26 / -27 |
| 12 | Gemini 1.5 Pro (May 2024) | 994 | +25 / -25 |
| 12 | GPT-4 (November 2024) | 992 | +28 / -30 |
| 12 | Llama 3.2 90B Vision Instruct | 984 | +30 / -32 |
| 14 | DeepSeek V3 | 985 | +25 / -24 |
| 16 | Claude 3 Opus | 959 | +26 / -23 |
| 17 | Gemini 1.5 Flash | 943 | +26 / -26 |
| 22 | Gemini 1.5 Pro (April 2024) | 891 | +32 / -28 |
| 23 | Claude 3 Sonnet | 879 | +31 / -29 |
| 23 | Llama 3 70B Instruct | 871 | +26 / -26 |
| 26 | Mistral Large | 811 | +25 / -27 |
| 27 | Gemini 1.0 Pro | 685 | +34 / -34 |
| 28 | CodeLlama 34B Instruct | 598 | +38 / -38 |

\* Ranking is based on Rank(UB).
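The Rank(UB) footnote explains why so many models share a rank while a few are skipped: ranks come from confidence-interval bounds rather than raw scores. A minimal sketch of one common upper-bound ranking rule, assuming (this is an assumption, not stated in the table) that a model's rank is 1 plus the number of models whose CI lower bound lies strictly above its CI upper bound, so models with overlapping intervals tie:

```python
# Hypothetical sketch of an upper-bound (Rank(UB)-style) ranking rule.
# Assumption: rank = 1 + count of models whose CI lower bound is strictly
# above this model's CI upper bound. This illustrates the tie behavior;
# the leaderboard's exact procedure may differ in detail.

def rank_ub(entries):
    """entries: list of (name, score, ci_up, ci_down) tuples,
    where the 95% CI is [score - ci_down, score + ci_up]."""
    bounds = [(name, score + up, score - down)
              for name, score, up, down in entries]
    ranks = {}
    for name, ub, _ in bounds:
        # A model is strictly better only if its entire CI sits above ours.
        strictly_better = sum(1 for _, _, other_lb in bounds if other_lb > ub)
        ranks[name] = 1 + strictly_better
    return ranks

# Synthetic example (not from the table): B and C have overlapping CIs
# with each other but not with A, so they tie while A ranks alone.
models = [("A", 1200, 30, 30), ("B", 1100, 30, 30), ("C", 1090, 30, 30)]
print(rank_ub(models))  # {'A': 1, 'B': 2, 'C': 2}
```

Under this rule a cluster of mutually overlapping intervals shares one rank, and the next model's rank can jump past the cluster, matching the pattern of repeated and skipped ranks in the table above.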