| Rank | Model | Score | 95% CI |
|------|-------|-------|--------|
| 1 | o1 (December 2024) | 1165 | +30 / -30 |
| 2 | o3-mini | 1156 | +32 / -32 |
| 3 | o1-preview | 1120 | +44 / -40 |
| 4 | Gemini 1.5 Pro (August 27, 2024) | 1120 | +35 / -35 |
| 5 | Gemini 2.0 Pro (December 2024) | 1117 | +33 / -32 |
| 6 | Gemini Pro Flash 2 | 1115 | +28 / -28 |
| 7 | Gemini 1.5 Pro (November 2024) | 1077 | +28 / -28 |
| 8 | Gemini 2.0 Flash Thinking (January 2025) | 1060 | +33 / -30 |
| 9 | DeepSeek R1 | 1052 | +32 / -28 |
| 10 | DeepSeek V3 | 1031 | +26 / -25 |
| 11 | GPT-4o (August 2024) | 1029 | +30 / -30 |
| 12 | Gemini 1.5 Flash | 1015 | +53 / -57 |
| 12 | Aya Expanse 32B | 967 | +29 / -30 |
| 13 | Mistral Large 2 | 1006 | +34 / -34 |
| 14 | DeepSeek V2 Chat | 996 | +24 / -25 |
| 15 | GPT-4 (November 2024) | 985 | +28 / -31 |
| 17 | Gemma 2 27B | 966 | +29 / -26 |
| 18 | Claude 3.5 Sonnet (June 2024) | 930 | +42 / -50 |
| 19 | Qwen 2 72B Instruct | 902 | +37 / -39 |
| 20 | Llama 3.3 70B Instruct | 883 | +33 / -32 |
| 21 | Yi 1.5 34B Chat | 780 | +41 / -44 |
| 22 | Llama 3.1 405B Instruct | 768 | +54 / -58 |
| 23 | Aya 23 35B* | 761 | +30 / -30 |