MultiChallenge
| Rank | Model | Score | 95% CI |
|------|-------|-------|--------|
| 1 | o1 (December 2024) | 44.93 | ±3.29 |
| 2 | Claude 3.5 Sonnet (October 2024) | 43.20 | ±3.07 |
| 3 | Gemini 2.0 Pro Experimental (February 2025) | 40.67 | ±1.32 |
| 4 | o3-mini (medium) | 40.09 | ±2.89 |
| 5 | Gemini 2.0 Flash Thinking Experimental (January 2025) | 37.78 | ±3.67 |
| 6 | o1-preview | 37.28 | ±0.69 |
| 7 | Gemini 2.0 Flash (February 2025) | 36.88 | ±4.25 |
| 8 | o1-mini | 34.49 | ±1.43 |
| 9 | Gemini 2.0 Flash Experimental (December 2024) | 33.51 | ±2.84 |
| 10 | DeepSeek R1 | 32.01 | ±1.40 |
| 11 | Llama 3.3 70B Instruct | 24.84 | ±0.55 |
| 12 | Gemini 1.5 Pro (August 27, 2024) | 21.59 | ±2.60 |
| 13 | Qwen 2 72B Instruct | 19.99 | ±2.84 |
| 14 | Qwen 2.5 14B Instruct | 18.34 | ±1.06 |
| 15 | Qwen 2.5 72B Instruct | 17.34 | ±0.74 |
| 16 | Llama 3.2 3B Instruct | 17.00 | ±1.87 |
| 17 | Llama 3.1 405B Instruct | 16.22 | ±0.34 |
| 18 | Mistral Large 2 | 15.23 | ±1.04 |
| 19 | GPT-4o (August 2024) | 12.16 | ±3.52 |
| 20 | Mixtral 8x7B Instruct v0.1 | 11.92 | ±1.67 |
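
The leaderboard does not state how its 95% confidence intervals are computed; symmetric intervals like these are commonly obtained by resampling per-example scores. Below is a minimal sketch, assuming each model receives a 0/1 score per conversation and using a nonparametric bootstrap; the function name, the `n_resamples` parameter, and the demo data are all hypothetical, not part of MultiChallenge.

```python
import numpy as np

def bootstrap_ci(per_example_scores, n_resamples=10_000, seed=0):
    """Nonparametric bootstrap 95% CI for mean accuracy.

    per_example_scores: array of per-conversation scores (e.g. 0/1).
    Returns (mean, minus, plus) so the interval can be reported in the
    leaderboard's "+x / -y" style around the point estimate.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_example_scores, dtype=float)
    # Resample with replacement and record the mean of each resample.
    means = np.array([
        rng.choice(scores, size=scores.size, replace=True).mean()
        for _ in range(n_resamples)
    ])
    point = scores.mean()
    lo, hi = np.percentile(means, [2.5, 97.5])
    return point, point - lo, hi - point

# Hypothetical usage: 500 simulated pass/fail judgments near 45% accuracy.
demo = (np.random.default_rng(1).random(500) < 0.45).astype(int)
mean, minus, plus = bootstrap_ci(demo)
print(f"{100 * mean:.2f} +{100 * plus:.2f} / -{100 * minus:.2f}")
```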