Date: 2019-11-12

OpenAI recently published a paper on fine-tuning GPT-2, where they used Scale AI to collect the preferences of human labelers to improve their language models. Although we were already labeling millions of text and computer vision tasks per day at the time, the unique latency requirements and subjective nature of OpenAI’s tasks posed a new challenge for us. In particular: how do you scalably maintain the quality of labels, without having labelers check each other’s work? Today we’re sharing a deep dive into our approach to the problem, the automatic benchmark mining system we built to solve it, and the things we learned along the way. In doing so, we hope to illustrate some of the many challenges that make scalable data labeling such an interesting area of work.

The Problem

First, let’s take a look at one of these tasks. Given a snippet of text and some possible continuations for it, the researchers at OpenAI wanted humans to pick the continuation that best fit a given prompt, such as “which continuation has the most positive sentiment” or “which continuation is the most descriptive”.

But instead of labeling this data offline in a big batch like we usually do, OpenAI wanted to put their model in a closed loop with our labelers, so the data was labeled online: the model would generate some text samples, humans would rate them through our API, the model would train on those human preferences, and the process would repeat over a few days.
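The closed loop described above can be sketched roughly as follows. This is only an illustration of the control flow: `generate`, `collect_labels`, and `train` are hypothetical callables supplied by the caller, not functions from OpenAI's code or from the Scale API.

```python
from typing import Callable, List, Sequence


def run_online_loop(
    generate: Callable[[], Sequence[str]],
    collect_labels: Callable[[Sequence[str]], List[int]],
    train: Callable[[Sequence[str], List[int]], None],
    rounds: int,
) -> List[int]:
    """Run `rounds` iterations of generate -> label -> train.

    All three callables are hypothetical stand-ins:
      generate()               returns a batch of text samples from the model
      collect_labels(samples)  blocks until humans have rated the batch
      train(samples, labels)   updates the model on the human preferences
    Returns the number of labels collected in each round.
    """
    labels_per_round = []
    for _ in range(rounds):
        samples = generate()              # model proposes new samples
        labels = collect_labels(samples)  # humans rate them (the slow step)
        train(samples, labels)            # model learns from the ratings
        labels_per_round.append(len(labels))
    return labels_per_round
```

Because each round waits on `collect_labels`, the labeling latency directly bounds how many rounds fit into an experiment of a given length.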

This meant that in order for an experiment to have enough epochs of training, and still finish quickly, the tasks needed to be done with very low latency (<30 mins) at high throughput (~5,000 labels/hr). As we will explain later, these constraints were unique and required us to build additional functionality into our data labeling pipeline. Furthermore, the answers to these tasks were also extremely subjective. As an example, here is a real task where one has to pick which continuation is the most descriptive:

Can you pick an answer with certainty? This ambiguity makes it hard to tell the people who are making a genuine effort apart from the people (or bots!) who are guessing randomly. However, we must make that determination correctly in order to scale this up to thousands of labelers and still deliver high-quality data to our customer.

Measuring Quality at Scale

When we first built our labeling pipeline, our customers wanted us to prioritize the highest quality data over other factors such as the fastest turnaround SLAs. Now we had to figure out a way to keep our high quality bar while cutting our turnaround time from days to minutes.

One reason for the higher turnaround time was redundancy. On most projects, we send each task through a pipeline of multiple people, either in series or in parallel. Having this redundancy greatly reduces the probability of errors. It also allows us to estimate the accuracy of each person by grading their responses against the completed version of the task.

The other reason for the higher turnaround time was lagging quality indicators. We usually hold each batch of tasks in a staging state before it is sent back to a customer, so that our confidence indicators on labelers can catch up based on new information. We can then relabel tasks we have low confidence in.
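To give a feel for why redundancy reduces errors, here is a small worked example. It rests on simplifying assumptions that are not from the pipeline described here: a binary choice, labelers who err independently, and the same accuracy `p` for everyone. Under those assumptions, the chance that a majority vote of `n` labelers is wrong follows from the binomial distribution.

```python
from math import comb


def majority_error(p: float, n: int) -> float:
    """Probability that a majority vote of n labelers is wrong.

    Assumes a binary task, independent labelers, and a shared
    per-labeler accuracy p. n must be odd so there are no ties.
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd number")
    # Sum the probability of every outcome with a correct majority.
    p_correct = sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )
    return 1 - p_correct


for n in (1, 3, 5):
    print(n, round(majority_error(0.8, n), 3))
# 1 0.2
# 3 0.104
# 5 0.058
```

With 80% accurate labelers, going from one to three labelers roughly halves the error rate in this idealized model, at the cost of three times the work and a longer wait for each task.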

However, due to OpenAI’s latency and throughput constraints, we would not have been able to use either of these features of our architecture: tasks would have to be sent back to the customer immediately after a single person had attempted them. Fortunately, having people review other people’s work is just one of the ways we get signal on labeler quality. For example, on some computer vision projects, one feature that determines labeler quality is based on localization loss from a deep learning model.

Benchmark Tasks

Another way we measure quality is a mechanism called benchmark tasks. The idea is to first collect high-confidence responses to a subset of tasks (called benchmark tasks, or a golden dataset in the literature), and then use them to estimate the quality of active labelers. The benchmark tasks are sprinkled among new tasks served to people, and disguised to be indistinguishable from them. We can then guarantee that we never return tasks to a customer from labelers who do not meet the quality threshold on benchmark tasks.

We had been successfully using benchmark tasks as another way of measuring quality, but these were typically created manually by our in-house quality team. This manual creation would not have worked for this project for a few reasons:

- We wanted to support OpenAI in creating new projects and iterating quickly on experiments without being blocked on the Scale team creating benchmarks.
- Due to the ambiguous nature of the tasks, it was harder to manually find tasks for which we could have high certainty in the answer.

To solve these problems, we augmented our pipeline with a system that automatically mines and maintains a set of benchmark tasks. Once a project came in, our system would take a small subset of tasks from it, and have each one done by multiple trusted labelers. If those people almost all reached consensus that one of the responses on a task was correct, we would make this task a benchmark task, with that response. The broader remote workforce would only be allowed on the project once we had enough benchmarks.
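A minimal sketch of the mining step and the quality check might look like the following. The function names, the data layout, and the thresholds (`min_labelers=5`, `min_agreement=0.8`) are illustrative assumptions; the text above only says that the trusted labelers had to "almost all" agree.

```python
from collections import Counter
from typing import Dict, List, Optional


def mine_benchmarks(
    trusted_responses: Dict[str, List[str]],
    min_labelers: int = 5,       # assumed value, not from the source
    min_agreement: float = 0.8,  # assumed value, not from the source
) -> Dict[str, str]:
    """Promote tasks to benchmarks when trusted labelers nearly all agree.

    trusted_responses maps a task id to the answers given by trusted labelers.
    Returns a mapping from task id to the consensus answer.
    """
    benchmarks = {}
    for task_id, answers in trusted_responses.items():
        if len(answers) < min_labelers:
            continue  # not enough trusted responses yet
        answer, votes = Counter(answers).most_common(1)[0]
        if votes / len(answers) >= min_agreement:
            benchmarks[task_id] = answer
    return benchmarks


def benchmark_accuracy(
    labeler_answers: Dict[str, str], benchmarks: Dict[str, str]
) -> Optional[float]:
    """Fraction of benchmark tasks a labeler answered with the consensus answer.

    Returns None if the labeler has not seen any benchmark task yet.
    """
    graded = [t for t in labeler_answers if t in benchmarks]
    if not graded:
        return None
    correct = sum(labeler_answers[t] == benchmarks[t] for t in graded)
    return correct / len(graded)


responses = {
    "t1": ["A", "A", "A", "A", "A"],  # unanimous: becomes a benchmark
    "t2": ["A", "B", "A", "B", "C"],  # too ambiguous: skipped
    "t3": ["B", "B", "B", "B", "A"],  # 4 of 5 agree: becomes a benchmark
}
benchmarks = mine_benchmarks(responses)
print(benchmarks)  # {'t1': 'A', 't3': 'B'}
print(benchmark_accuracy({"t1": "A", "t3": "A", "t9": "B"}, benchmarks))  # 0.5
```

Ambiguous tasks such as `t2` are simply never promoted, so labelers are only graded on tasks where a clear consensus exists.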