OpenAI recently published a paper on fine-tuning GPT-2, in which they used Scale AI to collect the preferences of human labelers to improve their language models. Although we were already labeling millions of text and computer vision tasks per day at the time, the unique latency requirements and subjective nature of OpenAI’s tasks posed a new challenge for us. In particular: how do you scalably maintain the quality of labels without having labelers check each other’s work? Today we’re sharing a deep dive into our approach to the problem, the automatic benchmark mining system we built to solve it, and the things we learned along the way. In doing so, we hope to illustrate some of the many challenges that make scalable data labeling such an interesting area of work.