
MultiChallenge

Realistic multi-turn conversation

Last updated: March 29, 2025

Performance Comparison

| Rank | Model | Score |
|------|-------|-------|
| 1 | | 63.77 ± 1.53 |
| 3 | | 59.09 ± 1.08 |
| 3 | | 56.51 ± 1.82 |
| 3 | | 53.90 ± 0.84 |
| 3 | | 53.12 ± 4.40 |
| 4 | | 52.62 ± 1.53 |
| 4 | | 51.58 ± 1.98 |
| 5 | | 51.91 ± 0.99 |
| 5 | | 51.62 ± 1.35 |
| 5 | | 49.91 ± 1.89 |
| 5 | | 49.82 ± 1.36 |
| 5 | | 49.63 ± 2.56 |
| 5 | | 49.37 ± 2.54 |
| 8 | | 47.65 ± 2.41 |
| 10 | o1 (December 2024) | 44.93 ± 3.29 |
| 10 | | 43.83 ± 4.71 |
| 14 | | 44.55 ± 0.86 |
| 14 | | 43.77 ± 1.60 |
| 14 | Claude 3.5 Sonnet (October 2024) | 43.20 ± 3.07 |
| 14 | | 42.99 ± 3.17 |
| 15 | | 42.89 ± 2.25 |
| 17 | | 40.53 ± 1.72 |
| 17 | | 40.09 ± 2.89 |
| 17 | | 39.89 ± 2.64 |
| 17 | | 38.26 ± 3.97 |
| 18 | | 40.67 ± 1.32 |
| 19 | Gemini 2.0 Flash Thinking Experimental (January 2025) | 37.78 ± 3.67 |
| 19 | | 36.88 ± 4.25 |
| 24 | o1-preview | 37.28 ± 0.69 |
| 24 | | 35.81 ± 2.50 |
| 27 | o1-mini | 34.49 ± 1.43 |
| 27 | Gemini 2.0 Flash Experimental (December 2024) | 33.51 ± 2.84 |
| 27 | | 32.19 ± 3.18 |
| 29 | | 32.29 ± 1.61 |
| 29 | | 32.01 ± 1.40 |
| 31 | | 32.06 ± 0.70 |
| 36 | GPT-4o (November 2024) | 27.81 ± 1.44 |
| 37 | | 26.65 ± 0.42 |
| 37 | GPT-4 (November 2024) | 25.22 ± 2.29 |
| 39 | Llama 3.3 70B Instruct | 24.84 ± 0.55 |
| 39 | | 20.73 ± 3.64 |
| 40 | Gemini 1.5 Pro Experimental (August 2024) | 21.59 ± 2.60 |
| 41 | | 20.30 ± 1.40 |
| 41 | Qwen 2 72B Instruct | 19.99 ± 2.84 |
| 41 | Qwen 2.5 14B Instruct | 18.34 ± 1.06 |
| 43 | Qwen 2.5 72B Instruct | 17.34 ± 0.74 |
| 43 | Llama 3.2 3B Instruct | 17.00 ± 1.87 |
| 44 | | 15.04 ± 2.20 |
| 47 | Llama 3.1 405B Instruct | 16.22 ± 0.34 |
| 47 | Mistral Large 2 | 15.23 ± 1.04 |
| 48 | GPT-4o (August 2024) | 12.16 ± 3.52 |
| 50 | Mixtral 8x7B Instruct v0.1 | 11.92 ± 1.67 |