
MultiChallenge

Realistic multi-turn conversation

Last updated: March 29, 2025

Performance Comparison

| Rank | Model | Score |
|------|-------|-------|
| 1 | | 59.09 ± 1.08 |
| 2 | | 56.51 ± 1.82 |
| 3 | | 51.91 ± 0.99 |
| 3 | | 51.58 ± 1.98 |
| 3 | | 49.91 ± 1.89 |
| 3 | | 49.82 ± 1.36 |
| 4 | | 47.65 ± 2.41 |
| 5 | o1 (December 2024) | 44.93 ± 3.29 |
| 5 | | 43.83 ± 4.71 |
| 7 | | 43.77 ± 1.60 |
| 7 | Claude 3.5 Sonnet (October 2024) | 43.20 ± 3.07 |
| 7 | | 42.99 ± 3.17 |
| 8 | | 42.89 ± 2.25 |
| 9 | | 40.53 ± 1.72 |
| 9 | | 40.09 ± 2.89 |
| 9 | | 39.89 ± 2.64 |
| 9 | | 38.26 ± 3.97 |
| 10 | | 40.67 ± 1.32 |
| 11 | Gemini 2.0 Flash Thinking Experimental (January 2025) | 37.78 ± 3.67 |
| 11 | | 36.88 ± 4.25 |
| 16 | o1-preview | 37.28 ± 0.69 |
| 16 | | 35.81 ± 2.50 |
| 19 | o1-mini | 34.49 ± 1.43 |
| 19 | Gemini 2.0 Flash Experimental (December 2024) | 33.51 ± 2.84 |
| 19 | | 32.19 ± 3.18 |
| 21 | | 32.06 ± 0.70 |
| 23 | | 32.01 ± 1.40 |
| 27 | GPT-4o (November 2024) | 27.81 ± 1.44 |
| 28 | GPT-4 (November 2024) | 25.22 ± 2.29 |
| 29 | Llama 3.3 70B Instruct | 24.84 ± 0.55 |
| 29 | | 20.73 ± 3.64 |
| 30 | Gemini 1.5 Pro Experimental (August 2024) | 21.59 ± 2.60 |
| 31 | | 20.30 ± 1.40 |
| 31 | Qwen 2 72B Instruct | 19.99 ± 2.84 |
| 31 | Qwen 2.5 14B Instruct | 18.34 ± 1.06 |
| 33 | Qwen 2.5 72B Instruct | 17.34 ± 0.74 |
| 33 | Llama 3.2 3B Instruct | 17.00 ± 1.87 |
| 34 | | 15.04 ± 2.20 |
| 37 | Llama 3.1 405B Instruct | 16.22 ± 0.34 |
| 37 | Mistral Large 2 | 15.23 ± 1.04 |
| 38 | GPT-4o (August 2024) | 12.16 ± 3.52 |
| 40 | Mixtral 8x7B Instruct v0.1 | 11.92 ± 1.67 |

Some model names were lost in extraction; those cells are left blank rather than guessed.