
MultiChallenge

Realistic multi-turn conversation

Last updated: March 29, 2025

Performance Comparison

| Rank | Model | Score |
|------|-------|-------|
| 1 | | 59.09 ± 1.08 |
| 2 | | 56.51 ± 1.82 |
| 2 | | 53.90 ± 0.84 |
| 2 | | 53.12 ± 4.40 |
| 3 | | 52.62 ± 1.53 |
| 3 | | 51.58 ± 1.98 |
| 4 | | 51.91 ± 0.99 |
| 4 | | 51.62 ± 1.35 |
| 4 | | 49.91 ± 1.89 |
| 4 | | 49.82 ± 1.36 |
| 4 | | 49.63 ± 2.56 |
| 4 | | 49.37 ± 2.54 |
| 7 | | 47.65 ± 2.41 |
| 9 | o1 (December 2024) | 44.93 ± 3.29 |
| 9 | | 43.83 ± 4.71 |
| 13 | | 44.55 ± 0.86 |
| 13 | | 43.77 ± 1.60 |
| 13 | Claude 3.5 Sonnet (October 2024) | 43.20 ± 3.07 |
| 13 | | 42.99 ± 3.17 |
| 14 | | 42.89 ± 2.25 |
| 16 | | 40.53 ± 1.72 |
| 16 | | 40.09 ± 2.89 |
| 16 | | 39.89 ± 2.64 |
| 16 | | 38.26 ± 3.97 |
| 17 | | 40.67 ± 1.32 |
| 18 | Gemini 2.0 Flash Thinking Experimental (January 2025) | 37.78 ± 3.67 |
| 18 | | 36.88 ± 4.25 |
| 23 | o1-preview | 37.28 ± 0.69 |
| 23 | | 35.81 ± 2.50 |
| 26 | o1-mini | 34.49 ± 1.43 |
| 26 | Gemini 2.0 Flash Experimental (December 2024) | 33.51 ± 2.84 |
| 26 | | 32.19 ± 3.18 |
| 28 | | 32.29 ± 1.61 |
| 28 | | 32.01 ± 1.40 |
| 30 | | 32.06 ± 0.70 |
| 35 | GPT-4o (November 2024) | 27.81 ± 1.44 |
| 36 | GPT-4 (November 2024) | 25.22 ± 2.29 |
| 37 | Llama 3.3 70B Instruct | 24.84 ± 0.55 |
| 37 | | 20.73 ± 3.64 |
| 38 | Gemini 1.5 Pro Experimental (August 2024) | 21.59 ± 2.60 |
| 39 | | 20.30 ± 1.40 |
| 39 | Qwen 2 72B Instruct | 19.99 ± 2.84 |
| 39 | Qwen 2.5 14B Instruct | 18.34 ± 1.06 |
| 41 | Qwen 2.5 72B Instruct | 17.34 ± 0.74 |
| 41 | Llama 3.2 3B Instruct | 17.00 ± 1.87 |
| 42 | | 15.04 ± 2.20 |
| 45 | Llama 3.1 405B Instruct | 16.22 ± 0.34 |
| 45 | Mistral Large 2 | 15.23 ± 1.04 |
| 46 | GPT-4o (August 2024) | 12.16 ± 3.52 |
| 48 | Mixtral 8x7B Instruct v0.1 | 11.92 ± 1.67 |