MultiChallenge
Realistic multi-turn conversation
Performance Comparison
| Model | Score (%) |
|---|---|
| (unknown) | 59.09 ± 1.08 |
| (unknown) | 56.51 ± 1.82 |
| (unknown) | 51.91 ± 0.99 |
| (unknown) | 51.58 ± 1.98 |
| (unknown) | 49.91 ± 1.89 |
| (unknown) | 49.82 ± 1.36 |
| (unknown) | 47.65 ± 2.41 |
| o1 (December 2024) | 44.93 ± 3.29 |
| (unknown) | 43.83 ± 4.71 |
| (unknown) | 43.77 ± 1.60 |
| Claude 3.5 Sonnet (October 2024) | 43.20 ± 3.07 |
| (unknown) | 42.99 ± 3.17 |
| (unknown) | 42.89 ± 2.25 |
| (unknown) | 40.67 ± 1.32 |
| (unknown) | 40.53 ± 1.72 |
| (unknown) | 40.09 ± 2.89 |
| (unknown) | 39.89 ± 2.64 |
| (unknown) | 38.26 ± 3.97 |
| Gemini 2.0 Flash Thinking Experimental (January 2025) | 37.78 ± 3.67 |
| o1-preview | 37.28 ± 0.69 |
| (unknown) | 36.88 ± 4.25 |
| (unknown) | 35.81 ± 2.50 |
| o1-mini | 34.49 ± 1.43 |
| Gemini 2.0 Flash Experimental (December 2024) | 33.51 ± 2.84 |
| (unknown) | 32.19 ± 3.18 |
| (unknown) | 32.06 ± 0.70 |
| (unknown) | 32.01 ± 1.40 |
| GPT-4o (November 2024) | 27.81 ± 1.44 |
| GPT-4 (November 2024) | 25.22 ± 2.29 |
| Llama 3.3 70B Instruct | 24.84 ± 0.55 |
| Gemini 1.5 Pro Experimental (August 2024) | 21.59 ± 2.60 |
| (unknown) | 20.73 ± 3.64 |
| (unknown) | 20.30 ± 1.40 |
| Qwen 2 72B Instruct | 19.99 ± 2.84 |
| Qwen 2.5 14B Instruct | 18.34 ± 1.06 |
| Qwen 2.5 72B Instruct | 17.34 ± 0.74 |
| Llama 3.2 3B Instruct | 17.00 ± 1.87 |
| Llama 3.1 405B Instruct | 16.22 ± 0.34 |
| (unknown) | 15.04 ± 2.20 |
| Mistral Large 2 | 15.23 ± 1.04 |
| GPT-4o (August 2024) | 12.16 ± 3.52 |
| Mixtral 8x7B Instruct v0.1 | 11.92 ± 1.67 |