
MultiChallenge

Realistic multi-turn conversation

Last updated: March 29, 2025

Performance Comparison

| Rank | Model | Score |
|------|-------|-------|
| 1 | | 59.09±1.08 |
| 2 | | 56.51±1.82 |
| 2 | | 53.90±0.84 |
| 2 | | 53.12±4.40 |
| 3 | | 52.62±1.53 |
| 3 | | 51.58±1.98 |
| 4 | | 51.91±0.99 |
| 4 | | 49.91±1.89 |
| 4 | | 49.82±1.36 |
| 4 | | 49.63±2.56 |
| 4 | | 49.37±2.54 |
| 6 | | 47.65±2.41 |
| 8 | o1 (December 2024) | 44.93±3.29 |
| 8 | | 43.83±4.71 |
| 12 | | 43.77±1.60 |
| 12 | Claude 3.5 Sonnet (October 2024) | 43.20±3.07 |
| 12 | | 42.99±3.17 |
| 13 | | 42.89±2.25 |
| 14 | | 40.53±1.72 |
| 14 | | 40.09±2.89 |
| 14 | | 39.89±2.64 |
| 14 | | 38.26±3.97 |
| 15 | | 40.67±1.32 |
| 16 | Gemini 2.0 Flash Thinking Experimental (January 2025) | 37.78±3.67 |
| 16 | | 36.88±4.25 |
| 21 | o1-preview | 37.28±0.69 |
| 21 | | 35.81±2.50 |
| 24 | o1-mini | 34.49±1.43 |
| 24 | Gemini 2.0 Flash Experimental (December 2024) | 33.51±2.84 |
| 24 | | 32.19±3.18 |
| 26 | | 32.29±1.61 |
| 26 | | 32.01±1.40 |
| 28 | | 32.06±0.70 |
| 33 | GPT-4o (November 2024) | 27.81±1.44 |
| 34 | GPT-4 (November 2024) | 25.22±2.29 |
| 35 | Llama 3.3 70B Instruct | 24.84±0.55 |
| 35 | | 20.73±3.64 |
| 36 | Gemini 1.5 Pro Experimental (August 2024) | 21.59±2.60 |
| 37 | | 20.30±1.40 |
| 37 | Qwen 2 72B Instruct | 19.99±2.84 |
| 37 | Qwen 2.5 14B Instruct | 18.34±1.06 |
| 39 | Qwen 2.5 72B Instruct | 17.34±0.74 |
| 39 | Llama 3.2 3B Instruct | 17.00±1.87 |
| 40 | | 15.04±2.20 |
| 43 | Llama 3.1 405B Instruct | 16.22±0.34 |
| 43 | Mistral Large 2 | 15.23±1.04 |
| 44 | GPT-4o (August 2024) | 12.16±3.52 |
| 46 | Mixtral 8x7B Instruct v0.1 | 11.92±1.67 |