MultiChallenge
A benchmark for realistic multi-turn conversation.
Performance Comparison
| Model | Accuracy (%) |
|---|---|
|  | 51.91 ±0.99 |
|  | 51.58 ±1.98 |
|  | 49.82 ±1.36 |
| o1 (December 2024) | 44.93 ±3.29 |
|  | 43.77 ±1.60 |
| Claude 3.5 Sonnet (October 2024) | 43.20 ±3.07 |
|  | 42.89 ±2.25 |
|  | 40.09 ±2.89 |
|  | 39.89 ±2.64 |
|  | 38.26 ±3.97 |
|  | 40.67 ±1.32 |
| Gemini 2.0 Flash Thinking Experimental (January 2025) | 37.78 ±3.67 |
|  | 36.88 ±4.25 |
| o1-preview | 37.28 ±0.69 |
|  | 35.81 ±2.50 |
| o1-mini | 34.49 ±1.43 |
| Gemini 2.0 Flash Experimental (December 2024) | 33.51 ±2.84 |
|  | 32.19 ±3.18 |
|  | 32.06 ±0.70 |
|  | 32.01 ±1.40 |
| GPT-4o (November 2024) | 27.81 ±1.44 |
| GPT-4 (November 2024) | 25.22 ±2.29 |
| Llama 3.3 70B Instruct | 24.84 ±0.55 |
|  | 20.73 ±3.64 |
| Gemini 1.5 Pro Experimental (August 2024) | 21.59 ±2.60 |
|  | 20.30 ±1.40 |
| Qwen 2 72B Instruct | 19.99 ±2.84 |
| Qwen 2.5 14B Instruct | 18.34 ±1.06 |
| Qwen 2.5 72B Instruct | 17.34 ±0.74 |
| Llama 3.2 3B Instruct | 17.00 ±1.87 |
|  | 15.04 ±2.20 |
| Llama 3.1 405B Instruct | 16.22 ±0.34 |
| Mistral Large 2 | 15.23 ±1.04 |
| GPT-4o (August 2024) | 12.16 ±3.52 |
| Mixtral 8x7B Instruct v0.1 | 11.92 ±1.67 |
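The ± values appear to be uncertainty ranges around each reported score, and many neighboring intervals overlap (for example, 44.93 ±3.29 vs. 43.20 ±3.07), so adjacent rankings are often not clearly separated. Below is a minimal sketch for checking this, assuming the ± value is a symmetric confidence half-width; the exact interval definition is not stated here.

```python
# Minimal sketch: compare two leaderboard entries reported as
# "mean ±half_width" strings. Assumes (not confirmed by the table)
# that the ± value is a symmetric confidence half-width.

def parse_score(entry: str) -> tuple[float, float]:
    """Split a string like '44.93 ±3.29' into (44.93, 3.29)."""
    mean, half_width = entry.replace("±", " ").split()
    return float(mean), float(half_width)

def intervals_overlap(a: str, b: str) -> bool:
    """True if the two scores' uncertainty intervals overlap,
    i.e., the ranking of a vs. b is not clearly separated."""
    mean_a, hw_a = parse_score(a)
    mean_b, hw_b = parse_score(b)
    return (mean_a - hw_a) <= (mean_b + hw_b) and (mean_b - hw_b) <= (mean_a + hw_a)

# Example: o1 (December 2024) vs. Claude 3.5 Sonnet (October 2024)
print(intervals_overlap("44.93 ±3.29", "43.20 ±3.07"))  # True: ranking not separated
```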