
MultiChallenge

Realistic multi-turn conversation

Last updated: March 29, 2025

Performance Comparison

| Rank | Model | Score |
|------|-------|-------|
| 1 | | 51.91 ±0.99 |
| 1 | | 51.58 ±1.98 |
| 1 | | 49.82 ±1.36 |
| 3 | o1 (December 2024) | 44.93 ±3.29 |
| 4 | | 43.77 ±1.60 |
| 4 | Claude 3.5 Sonnet (October 2024) | 43.20 ±3.07 |
| 4 | | 42.89 ±2.25 |
| 4 | | 40.09 ±2.89 |
| 4 | | 39.89 ±2.64 |
| 4 | | 38.26 ±3.97 |
| 5 | | 40.67 ±1.32 |
| 6 | Gemini 2.0 Flash Thinking Experimental (January 2025) | 37.78 ±3.67 |
| 6 | | 36.88 ±4.25 |
| 9 | o1-preview | 37.28 ±0.69 |
| 9 | | 35.81 ±2.50 |
| 12 | o1-mini | 34.49 ±1.43 |
| 12 | Gemini 2.0 Flash Experimental (December 2024) | 33.51 ±2.84 |
| 12 | | 32.19 ±3.18 |
| 14 | | 32.06 ±0.70 |
| 16 | | 32.01 ±1.40 |
| 20 | GPT-4o (November 2024) | 27.81 ±1.44 |
| 21 | GPT-4 (November 2024) | 25.22 ±2.29 |
| 22 | Llama 3.3 70B Instruct | 24.84 ±0.55 |
| 22 | | 20.73 ±3.64 |
| 23 | Gemini 1.5 Pro Experimental (August 2024) | 21.59 ±2.60 |
| 24 | | 20.30 ±1.40 |
| 24 | Qwen 2 72B Instruct | 19.99 ±2.84 |
| 24 | Qwen 2.5 14B Instruct | 18.34 ±1.06 |
| 26 | Qwen 2.5 72B Instruct | 17.34 ±0.74 |
| 26 | Llama 3.2 3B Instruct | 17.00 ±1.87 |
| 27 | | 15.04 ±2.20 |
| 30 | Llama 3.1 405B Instruct | 16.22 ±0.34 |
| 30 | Mistral Large 2 | 15.23 ±1.04 |
| 31 | GPT-4o (August 2024) | 12.16 ±3.52 |
| 33 | Mixtral 8x7B Instruct v0.1 | 11.92 ±1.67 |

Blank model cells reflect entries whose names were not captured in this extract.
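The shared ranks above (three entries at rank 1, several at rank 4) suggest the leaderboard groups models whose score intervals are statistically indistinguishable. The exact ranking rule is not stated on this page; the sketch below is an assumption that treats the ± values as symmetric confidence half-widths and two scores as tied when their intervals overlap. The function name `intervals_overlap` is illustrative, not part of any SEAL tooling.

```python
def intervals_overlap(mean_a: float, half_width_a: float,
                      mean_b: float, half_width_b: float) -> bool:
    """Treat two scores as indistinguishable when their ± intervals overlap.

    Assumes the leaderboard's ± values are symmetric confidence
    half-widths; the actual SEAL ranking method is not stated here.
    """
    return abs(mean_a - mean_b) <= half_width_a + half_width_b

# Two of the rank-1 entries, 51.91 ±0.99 and 49.82 ±1.36, overlap
# (gap 2.09 vs. combined half-widths 2.35), so they can share rank 1.
print(intervals_overlap(51.91, 0.99, 49.82, 1.36))   # True

# o1 (December 2024) at 44.93 ±3.29 does not overlap the leader
# (gap 6.98 vs. 4.28), so it lands at a strictly lower rank.
print(intervals_overlap(51.91, 0.99, 44.93, 3.29))   # False
```

Note that simple pairwise overlap does not reproduce every rank in the table (for example, a rank-5 score of 40.67 exceeds a rank-4 score of 38.26), so the real method presumably involves more than this heuristic.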