
MultiChallenge

Dataset Introduction

Conducting multi-turn conversations with human users is a common yet challenging task for LLMs. It is challenging because it requires not only accurate instruction following, but also careful allocation of attention over the conversation context and, at the same time, in-context reasoning at or beyond human level. Despite the growing demand for and complexity of multi-turn LLM conversations with humans, few comprehensive evaluation frameworks are designed for multi-turn scenarios. Some, such as the widely adopted MT-Bench [1], are saturated by frontier LLMs with near-perfect results. Others [2] focus on multi-turn explicit instruction following, missing the opportunity to assess the mix of capabilities actually required to conduct natural multi-turn conversations with human users.

To bridge this gap, we present MultiChallenge, a pioneering benchmark that evaluates large language models (LLMs) on conducting multi-turn conversations with human users, a crucial yet underexamined capability for their applications. MultiChallenge identifies four categories of challenges in multi-turn conversations that are not only common and realistic in current human-LLM interactions, but also challenging for all current frontier LLMs. The 4 challenges are instruction retention, inference memory of user information, reliable versioned editing, and self-coherence. All 4 challenges require accurate instruction following, context allocation, and in-context reasoning capabilities at the same time.

Instruction retention evaluates whether LLMs can follow instructions specified in the first user turn throughout the entire multi-turn conversation. Inference memory of user information evaluates whether LLMs can recall and connect relevant details scattered across previous user turns when those details are implicitly required to respond to the final user turn. Reliable versioned editing evaluates whether LLMs can properly help humans revise existing materials through back-and-forth iterations. Finally, self-coherence evaluates whether LLMs can remain reasonably consistent with their own responses earlier in the conversation and avoid sycophancy (unconditionally agreeing with human users). For more detailed definitions of the 4 challenges, please see the full paper.

See the full paper here.

See the open-source benchmark and auto-eval tool here.

Data Curation Methodology

Producing realistic, diverse, and challenging test examples for MultiChallenge that make most frontier LLMs fail is a difficult and time-consuming task, even for human experts.

Therefore, to support human experts and reduce cost while still maintaining data quality, we construct MultiChallenge with a hybrid approach: we first generate data synthetically and then have human experts review and edit the synthetic data.

More specifically, we first adopt a multi-agent synthetic data generation system to generate synthetic multi-turn conversations that follow the definition of the 4 challenges and elicit model failures. We then recruit and train human annotators to review and edit the synthetic data into high-quality final test examples for MultiChallenge. The human review process assesses 3 aspects of data quality:

a) whether the synthetic multi-turn conversation is aligned with its challenge category definition;

b) whether the conversation is natural and realistic;

c) whether the 6 frontier LLMs fail for legitimate reasons. We only accept test examples that cause at least 3 of the 6 frontier models (o1-preview, GPT-4o (August 2024), Gemini 1.5 Pro (August 27, 2024), Claude 3.5 Sonnet (October 2024), Mistral Large 2, and Llama 3.1 405B Instruct) to fail, ensuring that every sample is representative of its challenge.

After review, if a synthetic conversation falls short on any of the 3 criteria above, human annotators either edit it or discard it.
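To make criterion (c) concrete, here is a minimal sketch of the acceptance filter, assuming a hypothetical `model_fails_example` harness function that runs a candidate conversation against one model and grades its final response; the model identifier strings are illustrative placeholders rather than the exact endpoints used.

```python
# Sketch of the acceptance rule in criterion (c): keep a synthetic example only if
# at least 3 of the 6 frontier models fail on it.
# `model_fails_example(model, example)` is a hypothetical helper that runs the
# conversation against `model` and returns True if the final response fails.

FRONTIER_MODELS = [  # illustrative identifiers for the 6 frontier models
    "o1-preview",
    "gpt-4o-2024-08",
    "gemini-1.5-pro-2024-08-27",
    "claude-3.5-sonnet-2024-10",
    "mistral-large-2",
    "llama-3.1-405b-instruct",
]
MIN_FAILURES = 3  # acceptance threshold described above


def accept_example(example: dict, model_fails_example) -> bool:
    """Return True if enough frontier models fail, so the example is kept for human review."""
    failures = sum(
        bool(model_fails_example(model, example)) for model in FRONTIER_MODELS
    )
    return failures >= MIN_FAILURES
```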

Automatic Evaluation with Instance-level Rubrics

Rule-based automatic evaluation methods do not apply to MultiChallenge because most test conversations in MultiChallenge have no single ground-truth answer.

Moreover, we found that directly applying frontier LLMs as judges, by providing the full multi-turn conversation history and prompting them to evaluate model responses, leads to very low alignment with human raters. This is likely because all frontier LLMs, including GPT-4o, achieve less than 50% accuracy on MultiChallenge, which limits our trust in their ability to correctly judge other models' performance on it. Pure human evaluation on MultiChallenge is also expensive and time-consuming.

Therefore, we propose LLM-as-judge with instance-level rubrics, an automatic evaluation method that achieves high agreement with experienced human raters.

Specifically, at the final step of producing each test example, we instruct human raters to provide a binary rubric question that allows only a "yes" or "no" answer.

This binary question requires only the final model response as context to answer. A "yes" answer indicates that the model response has passed the test example, and vice versa.

We also ensure that answering the binary question is within the capability of current LLMs.

For example, the binary rubric question for one inference memory example is, "Does any of the dessert recipes suggested in this response contain any nuts?"

Through this method, we make automatic evaluation on MultiChallenge both possible and reliable. Experiments show that adopting frontier models as judges with our instance-level rubrics reaches 93% alignment with experienced human raters, compared to 36% alignment when LLMs are prompted directly as judges on the raw conversation context. Details of this experiment are given in Section 5.2 of the paper.
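As an illustration of this judging setup, the sketch below grades a single test example with its rubric question, assuming the OpenAI Python client as the judge backend; the prompt wording, judge model name, and `RubricExample` fields are assumptions made for the sketch, not the exact configuration behind the 93% alignment figure.

```python
from dataclasses import dataclass

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


@dataclass
class RubricExample:
    final_response: str          # the model response under evaluation
    rubric_question: str         # binary question answerable from the final response alone
    passing_answer: str = "YES"  # which verdict counts as a pass for this instance


JUDGE_PROMPT = (
    "You are grading a model response against a yes/no rubric question.\n\n"
    "Model response:\n{response}\n\n"
    "Rubric question: {question}\n\n"
    "Answer with exactly YES or NO."
)


def judge(example: RubricExample, judge_model: str = "gpt-4o") -> bool:
    """Ask the judge model the instance-level rubric question; True means the example passes."""
    completion = client.chat.completions.create(
        model=judge_model,
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                response=example.final_response,
                question=example.rubric_question,
            ),
        }],
    )
    verdict = completion.choices[0].message.content.strip().upper()
    return verdict.startswith(example.passing_answer)
```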

Methodology

Leaderboard rankings are determined using Rank (Upper Bound), which reflects a model’s statistical position based on confidence intervals. The ranking process follows these steps:

  1. Count the number of models that are statistically significantly better than the target model.

  2. Add 1 to this count to determine the model’s rank.

A model is considered statistically significantly better than another if its lower-bound score (95% confidence interval) is higher than the other model’s upper-bound score. Models receive the same rank when the same number of models are statistically better than each of them. This approach groups models based on statistical significance rather than raw scores, ensuring rankings reflect meaningful performance differences.
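A minimal sketch of this ranking rule, assuming each leaderboard entry is summarized by a mean score and a symmetric 95% confidence half-width (the ± value shown in the table below):

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    score: float  # mean MultiChallenge score
    ci: float     # 95% confidence half-width (the "±" value)


def rank_upper_bound(entries: list[Entry]) -> dict[str, int]:
    """Rank (UB): 1 + number of models whose CI lower bound exceeds this model's CI upper bound."""
    ranks = {}
    for m in entries:
        better = sum(
            1
            for other in entries
            if other is not m and (other.score - other.ci) > (m.score + m.ci)
        )
        ranks[m.name] = better + 1
    return ranks


# Example with two entries from the leaderboard below: their confidence intervals
# overlap, so neither is statistically better and both share rank 1.
entries = [
    Entry("o1 (December 2024)", 44.93, 3.29),
    Entry("Claude 3.5 Sonnet (October 2024)", 43.20, 3.07),
]
print(rank_upper_bound(entries))
# {'o1 (December 2024)': 1, 'Claude 3.5 Sonnet (October 2024)': 1}
```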

Limitations

Our research marks a significant step forward in assessing the multi-turn conversation capabilities of large language models (LLMs). However, it is not without limitations.

One constraint is that, to keep the LLM judge trustworthy, we cannot include examples whose instance-level rubric question is itself beyond current frontier LLM capability. This potentially caps the difficulty of the overall benchmark. We retain such difficult test examples and continue to monitor the evolution of LLM capability; we will release this more difficult set to the public once automatic evaluation of it with LLMs becomes possible.

Another constraint is that this benchmark is inevitably biased against the 6 frontier models listed above, because the test examples are selected based on those models' common failures. It is therefore not strictly fair to directly compare other models' performance with that of these 6 LLMs.

However, even given this potential bias against the frontier models, we still observe that all open-source models we subsequently tested fall behind top-performing closed-source models such as Claude 3.5 Sonnet and o1-preview.

References

[1] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623.

[2] Yun He, Di Jin, Chaoqi Wang, Chloe Bi, Karishma Mandyam, Hejia Zhang, Chen Zhu, Ning Li, Tengyu Xu, Hongjiang Lv, et al. 2024. Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following. arXiv preprint arXiv:2410.15553.

Last updated: August 7, 2025

Performance Comparison

| Rank (UB) | Model | Score |
| --- | --- | --- |
| 1 |  | 63.77±1.53 |
| 2 |  | 58.55±3.03 |
| 3 |  | 59.09±1.08 |
| 4 |  | 56.51±1.82 |
| 4 |  | 55.18±2.44 |
| 5 |  | 53.90±0.84 |
| 5 |  | 53.46±1.59 |
| 5 |  | 53.12±4.40 |
| 6 |  | 52.62±1.53 |
| 6 |  | 51.58±1.98 |
| 6 |  | 51.34±1.85 |
| 7 |  | 51.91±0.99 |
| 7 |  | 51.62±1.35 |
| 7 |  | 49.63±2.56 |
| 7 |  | 49.37±2.54 |
| 8 |  | 49.91±1.89 |
| 8 |  | 49.82±1.36 |
| 10 |  | 49.84±0.72 |
| 11 |  | 47.65±2.41 |
| 16 | o1 (December 2024) | 44.93±3.29 |
| 16 |  | 43.83±4.71 |
| 20 |  | 44.55±0.86 |
| 20 |  | 43.77±1.60 |
| 20 | Claude 3.5 Sonnet (October 2024) | 43.20±3.07 |
| 20 |  | 42.99±3.17 |
| 21 |  | 42.89±2.25 |
| 23 |  | 40.53±1.72 |
| 23 |  | 40.09±2.89 |
| 23 |  | 39.89±2.64 |
| 23 |  | 38.26±3.97 |
| 24 |  | 40.67±1.32 |
| 25 | Gemini 2.0 Flash Thinking Experimental (January 2025) | 37.78±3.67 |
| 25 |  | 36.88±4.25 |
| 30 | o1-preview | 37.28±0.69 |
| 30 |  | 35.81±2.50 |
| 33 | o1-mini | 34.49±1.43 |
| 33 | Gemini 2.0 Flash Experimental (December 2024) | 33.51±2.84 |
| 33 |  | 32.19±3.18 |
| 35 |  | 32.29±1.61 |
| 35 |  | 32.01±1.40 |
| 37 |  | 32.06±0.70 |
| 42 | GPT-4o (November 2024) | 27.81±1.44 |
| 43 |  | 26.65±0.42 |
| 43 | GPT-4 (November 2024) | 25.22±2.29 |
| 45 | Llama 3.3 70B Instruct | 24.84±0.55 |
| 45 |  | 20.73±3.64 |
| 46 | Gemini 1.5 Pro Experimental (August 2024) | 21.59±2.60 |
| 47 |  | 20.30±1.40 |
| 47 | Qwen 2 72B Instruct | 19.99±2.84 |
| 47 | Qwen 2.5 14B Instruct | 18.34±1.06 |
| 49 | Qwen 2.5 72B Instruct | 17.34±0.74 |
| 49 | Llama 3.2 3B Instruct | 17.00±1.87 |
| 50 |  | 15.04±2.20 |
| 53 | Llama 3.1 405B Instruct | 16.22±0.34 |
| 53 | Mistral Large 2 | 15.23±1.04 |
| 54 | GPT-4o (August 2024) | 12.16±3.52 |
| 56 | Mixtral 8x7B Instruct v0.1 | 11.92±1.67 |

Rank (UB): 1 + the number of models whose lower CI bound exceeds this model’s upper CI bound.