Date: 2023-08-02

Introduction

Scale AI is one of the world’s leading suppliers of high-quality foundation model training data. In recent months we’ve seen a marked increase in demand for PhD-level, multimodal reasoning data across many domains, including math, coding, science, and the humanities. Because such expert-level data keeps getting harder to verify, Scale AI invests in frontier research on quality control (QC) with LLM agents, also known as “autoraters.” In this blogpost, we present a state-of-the-art, in-house, human-in-the-loop approach to autorating reasoning data that leverages model debate [1]. When paired with our existing human review pipeline, this approach detects roughly 3.5 times as many incorrect answers to reasoning problems in our most advanced dataset as a single model does (82% versus 23%). Furthermore, when we used this approach to provide live feedback during the data-labeling process, 87% of the human experts who engaged with the copilot changed an initially incorrect answer to the correct one.

Over the past year, as LLMs have grown ever more capable, PhD-level data in reasoning-heavy domains has become paramount for training models via RL to be adept at extended thinking. At this level of difficulty, LLMs often perform poorly (the best score on Humanity’s Last Exam (HLE), a representative benchmark, is less than 30%), and even human experts may disagree on the final solution. Consequently, Scale AI needed to invest in building accurate review pipelines in which multiple LLMs and humans collaborate, so that any data delivered to customers is correct. Our existing data labeling pipelines for PhD-level data consisted of a blind-solve “attempter” round and a blind-solve “reviewer” round, both done by human experts, followed by a discussion round if needed to reach consensus, followed by a final human QC round. Every datapoint receives at least three human touches. However, humans are imperfect: the attempter or reviewer may make a mistake, one expert may be more knowledgeable than the other, the discussion may fail to reach consensus, or one side may drop the ball. Inserting a consistently precise autorating process to filter out lingering errors is critical to minimizing the chance of delivering incorrect data to customers.

As our pipeline shows, problems in our PhD-level dataset often cannot be solved by one human alone and need two or more experts to collaborate and agree on a final solution. We hypothesized that our autoraters would likewise benefit from interaction between multiple viewpoints. Thus, we developed a multi-agent LLM-as-a-judge autorater system that uses model debate to enable precise, automatic QC. Our system is designed to evaluate whether a proposed answer to a difficult reasoning question, and the reasoning trace behind it, are correct.

Our agentic framework is compatible with various underlying foundation models, allowing for interchangeability. For this blogpost, we evaluated our in-house approach with both open-source and closed-source models to demonstrate the strength of our agent. Internally, however, Scale AI uses open-source LLMs running on our own servers to protect customer trust and data confidentiality.

To test our agents, we created an expert-level autorating eval set built from both our private human-labeled data and public reasoning benchmarks. Results on this eval set show that our custom in-house QC methods are robust for even the most difficult of problems in various domains. They can also more precisely identify correct data and flag incorrect data compared to single-model-based QC methods, the default approach for many LLM-as-a-judge applications.

Evaluation Datasets

We sourced our eval set’s questions and ground-truth correct answers from the following:

Humanity’s Last Exam, text-only subset (Open HLE) [2]: A challenging reasoning dataset composed of expert-level questions in various domains, including mathematics, humanities, and natural sciences.

Closed HLE: A proprietary Scale dataset of over 500 HLE-level problems spanning mathematics, humanities, and natural sciences, annotated by contributors with PhDs in the respective fields.

GPQA Diamond [3]: A highly challenging subset of the GPQA dataset, consisting of extremely difficult multiple-choice questions in biology, physics, and chemistry.

In order to test our autoraters, we need both correct and incorrect answers. Closed HLE already had both, since the contributor-submitted answers were labeled as correct or incorrect in the final human review round. For Open HLE and GPQA Diamond, we generated a synthetic incorrect, yet still plausible, answer and reasoning trace for every question using a multi-agent pipeline with frontier reasoning models. Finally, we balanced all our datasets so that each contains an even split of correct and incorrect answers.
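The balancing step can be sketched as follows. This is a minimal illustration that assumes balancing is done by downsampling the larger class; the post does not say how the even split was achieved, and the function name `balance_by_label` and the `is_correct` field are hypothetical.

```python
import random
from collections import defaultdict


def balance_by_label(examples, label_key="is_correct", seed=0):
    """Downsample so that every label value appears equally often.

    Assumption: balancing by downsampling the larger class. The field name
    `is_correct` is illustrative, not taken from the original pipeline.
    """
    groups = defaultdict(list)
    for ex in examples:
        groups[ex[label_key]].append(ex)
    if not groups:
        return []

    # The smallest class determines how many examples each class keeps.
    target = min(len(group) for group in groups.values())
    rng = random.Random(seed)

    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


# Toy data: 7 correct answers and 3 incorrect answers.
examples = [{"id": i, "is_correct": i < 7} for i in range(10)]
balanced = balance_by_label(examples)
print(len(balanced))  # 6
```

A fixed seed keeps the sampled subset reproducible across evaluation runs.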

Our number-one priority is to ensure no bad data is sent to our customers. Thus, we optimized primarily for the percentage of incorrect answers caught. Our secondary focus is to provide a high volume of data, which means minimizing the number of correct answers falsely flagged as incorrect.
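The two quantities above can be computed from ground-truth labels and autorater flags as in the sketch below. The names `qc_metrics`, `catch_rate`, and `overflag_rate` are illustrative labels chosen here, not terms from our tooling.

```python
def qc_metrics(truth_is_correct, flagged_incorrect):
    """Compute the two QC metrics for one autorater.

    truth_is_correct[i]  -- True if answer i is actually correct
    flagged_incorrect[i] -- True if the autorater flagged answer i as incorrect
    """
    if len(truth_is_correct) != len(flagged_incorrect):
        raise ValueError("inputs must have the same length")

    pairs = list(zip(truth_is_correct, flagged_incorrect))
    incorrect_total = sum(1 for truth, _ in pairs if not truth)
    correct_total = sum(1 for truth, _ in pairs if truth)
    caught = sum(1 for truth, flag in pairs if not truth and flag)
    overflagged = sum(1 for truth, flag in pairs if truth and flag)

    return {
        # Primary metric: share of incorrect answers that were flagged.
        "catch_rate": caught / incorrect_total if incorrect_total else None,
        # Secondary metric: share of correct answers flagged by mistake.
        "overflag_rate": overflagged / correct_total if correct_total else None,
    }


# Toy example: two correct answers, two incorrect answers.
metrics = qc_metrics([True, True, False, False], [False, True, True, True])
print(metrics)  # {'catch_rate': 1.0, 'overflag_rate': 0.5}
```

On a balanced eval set both denominators are equal, so the two rates are directly comparable.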

Agents and Experiments

Our model debate setup runs a multi-round debate between two LLM debaters, with one LLM judge providing the final verdict. Each debater first attempts to solve the problem independently; the debaters then debate if they disagree with each other or with the provided solution. We call this approach “Solve-then-Debate.”
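The control flow of Solve-then-Debate can be sketched as below. This is an illustrative outline only, not our production implementation: the `Debater` class, the callable signatures, the exact-match comparison, the `max_rounds` default, and the choice to always consult the judge are all assumptions made for the example, and the model calls are replaced with toy stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Debater:
    """Hypothetical wrapper around one LLM debater."""
    name: str
    solve: Callable[[str], str]                  # question -> own answer
    argue: Callable[[str, str, List[str]], str]  # question, proposed answer, transcript -> argument


def solve_then_debate(
    question: str,
    proposed_answer: str,
    debaters: List[Debater],
    judge: Callable[[str, str, List[str]], bool],
    max_rounds: int = 2,  # assumed value; the post does not give the round count
) -> Tuple[bool, List[str]]:
    """Return (judge accepts the proposed answer, debate transcript)."""
    # Step 1: every debater solves the problem independently.
    solutions = {d.name: d.solve(question) for d in debaters}
    transcript = [f"{name} independently answered: {ans}" for name, ans in solutions.items()]

    # Step 2: debate only if some debater disagrees with the proposed answer,
    # which also covers the case where the debaters disagree with each other.
    # Assumption: answers are compared by exact string match after stripping.
    all_agree = all(ans.strip() == proposed_answer.strip() for ans in solutions.values())
    if not all_agree:
        for _ in range(max_rounds):
            for d in debaters:
                argument = d.argue(question, proposed_answer, list(transcript))
                transcript.append(f"{d.name}: {argument}")

    # Step 3: the judge reads the transcript and issues the final verdict.
    return judge(question, proposed_answer, transcript), transcript


# Toy stand-ins for real model calls, for illustration only.
def make_debater(name: str, answer: str) -> Debater:
    return Debater(
        name=name,
        solve=lambda q: answer,
        argue=lambda q, a, t: f"I computed {answer}; the proposed answer is {a}.",
    )


def toy_judge(question: str, proposed: str, transcript: List[str]) -> bool:
    # Accepts only when no debate was needed (two solve entries, no arguments).
    return len(transcript) == 2


debaters = [make_debater("debater_a", "42"), make_debater("debater_b", "42")]
verdict, transcript = solve_then_debate("What is 6 * 7?", "41", debaters, toy_judge)
print(verdict, len(transcript))  # False 6
```

In a real system the exact-match check would likely be replaced by a semantic equivalence check, since expert-level answers are rarely comparable as raw strings.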

As baselines [4] for comparison with our in-house model-debate agents, we used a single call to a reasoning LLM given its maximum allowable inference budget, so that the comparison is as close as possible to the total budget used by model debate.

Below we show the performance using both closed-source and open-source models. Our setup is as follows:

Closed-source

Model debate: Gemini 2.5 Pro and Claude Sonnet 4 debaters, o3 judge

Single model: Gemini 2.5 Pro and Claude Sonnet 4

Open-source

Model debate: Magistral Small 24B and Qwen 3 32B debaters, DeepSeek-R1-distilled Llama 8B judge

Single model: Magistral Small 24B and Qwen 3 32B

Key Results

On the enhanced GPQA Diamond and Open HLE datasets, we ran the experimental setup with closed-source models.

On our privately labeled closed HLE dataset, we ran the same experimental setup with open-source models.

Using model debate, we saw up to a 35 percentage point (3,500 basis point) increase in the ability to identify low-quality tasks for customers in the closed-source setting, and up to a 67 percentage point (6,700 basis point) increase in the open-source setting. Interestingly, the performance gap between single-model and model-debate review is higher on average for open-source models than for closed-source ones; we hypothesize that the weaker the individual models are at advanced reasoning QC, the more they gain from collaborating in review. We did see that model debate mistakenly flags correct answers as incorrect more often than single models do; however, the increase in overflagging rate was only 29 percentage points (2,900 basis points) on average, so we view this as a worthwhile tradeoff given how much our system raises the bar on data quality. Simply put, model debate makes our autoraters better at catching errors, letting very few incorrect datapoints through.
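For readers less used to basis points: one basis point is one hundredth of a percentage point, so the figures above are absolute differences between two rates, not relative gains. A short conversion sketch (the helper name is ours, for illustration):

```python
BASIS_POINTS_PER_PERCENTAGE_POINT = 100


def basis_points_to_percentage_points(basis_points: float) -> float:
    """Convert an absolute difference in basis points to percentage points."""
    return basis_points / BASIS_POINTS_PER_PERCENTAGE_POINT


# The three figures quoted in this section.
for label, bp in [
    ("closed-source catch-rate gain", 3500),
    ("open-source catch-rate gain", 6700),
    ("average overflagging increase", 2900),
]:
    print(f"{label}: {basis_points_to_percentage_points(bp)} percentage points")
# closed-source catch-rate gain: 35.0 percentage points
# open-source catch-rate gain: 67.0 percentage points
# average overflagging increase: 29.0 percentage points
```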

Live Labeling Results

Using our model debate autorater system in the LLM-based reviewer round right before the final PhD-level human review round, we saw a roughly 90% relative reduction in the share of datapoints ultimately flagged by the reviewer (from 9% to 1%), greatly reducing the number of incorrect human annotations propagating through our pipeline.
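The reduction quoted above is relative to the starting flag rate, which a quick calculation makes explicit (the helper name is illustrative):

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a rate dropped, relative to its starting value."""
    if before <= 0:
        raise ValueError("`before` must be positive")
    return (before - after) / before


# Flag rate at final human review: 9% before the autorater, 1% after.
reduction = relative_reduction(0.09, 0.01)
print(f"{reduction:.1%}")  # 88.9%
```

In absolute terms the drop is 8 percentage points; the figure of about 90% describes how much of the original 9% was eliminated.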

Previous work [5] has demonstrated that humans collaborating with AI can yield much higher performance than either party alone. Having established a successful autorating process for later in the pipeline, we extended our model debate approach to a copilot that aids human contributors during the data-labeling process. Our objective was to proactively nudge experts to course-correct as early as possible, reducing the need for a second expert to fix their work or for their attempt to be discarded down the line. We run our debate system, followed by an LLM summarizer on the system’s output, to provide live feedback on potential mistakes (if any) in the contributor’s attempted solution and suggestions (if needed) on how to arrive at the correct solution. (As on all projects, we additionally employed a suite of cheating detectors to ensure that the final submitted data does not contain LLM outputs.) We also made sure never to provide the correct answer directly, so that we don’t bias the contributor. After deploying the copilot to an internal HLE-level project, we found that 15% of contributors engaged with the copilot’s suggestions. Out of this group, 87% of contributors changed their initially incorrect answer to the correct answer because of the guidance from the copilot. Future work will focus on increasing contributors’ engagement with the copilot, whether through UI/UX changes that improve the feature’s visibility, method changes that improve the accuracy of the nudges, or both.

Conclusion

As frontier model training data becomes increasingly complex, it is important to combine the strengths of human data labelers and LLM autoraters in order to maintain a high bar for data quality. Scale AI’s in-house approach outperforms single open- and closed-source reasoning models at reasoning data QC, enabling us to deliver unparalleled value to our customers. Experiments show our multi-agent debate system can catch 82% of errors in expert-level data, as opposed to 23% using a single LLM, and, on production projects, reduce the share of human errors slipping through to the final review layer from 9% to 1%. Combining our system with another human in the loop during the labeling process, we were able to push performance even further, demonstrating a promising path to scalable oversight of advanced AI systems and, in turn, to even more capable, intelligent models.

