Large-scale datasets are at the center of model development and evaluation. The quality of training data is often the dominant factor in model performance. A strong team can build an excellent model from great data, but even the best ML engineers can't build a good model from bad data.
If you've ever used popular public datasets such as ImageNet or MNIST, you might assume that the class and attribute labels are correct. Shockingly, ImageNet, the most widely used dataset in image recognition research, has an estimated 6% label error rate. When models are fed bad data, such as incorrect classes, incorrect features, or missing annotations, they often make severe classification errors, which can destabilize ML benchmarks and contribute to representational harms in datasets.
Examples of representational harms in datasets include under-representation of darker-skinned subjects compared with lighter-skinned subjects within prominent facial analysis datasets [2,3], and millions of images of people in ImageNet labeled with offensive categories, including racial slurs and derogatory phrases.
Even in leading applied ML research groups, approaches for characterizing and fixing labeling errors in massive datasets are limited.
One approach to combating noise in a dataset is Confident Learning (CL), which prunes noisy data, counts with probabilistic thresholds to estimate noise, and ranks examples so you can train with confidence. However, CL entirely ignores the case where one or more true labels are missing, because it relies on building a pairwise class-conditional distribution between the given and true labels. Other techniques include sorting by model loss, which surfaces the most ambiguous and difficult-to-learn examples, and sorting by model confidence where the model and ground truth disagree (quite effective!).
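To make CL's "counting with probabilistic thresholds" step concrete, here is a minimal NumPy sketch (not the reference implementation) of estimating the joint counts between given and likely-true labels. It assumes you already have out-of-sample predicted probabilities for each example:

```python
import numpy as np

def confident_joint(labels, pred_probs):
    """Sketch of Confident Learning's counting step.

    labels:     (n,) array of given (possibly noisy) class labels
    pred_probs: (n, k) array of out-of-sample predicted probabilities
    Returns a (k, k) matrix of counts: rows are given labels, columns
    are the model's confident guess at the true label.
    """
    n, k = pred_probs.shape
    # Per-class threshold: average confidence in class j among examples labeled j.
    thresholds = np.array([pred_probs[labels == j, j].mean() for j in range(k)])
    joint = np.zeros((k, k), dtype=int)
    for i in range(n):
        # Candidate true classes: those whose probability clears the threshold.
        above = np.flatnonzero(pred_probs[i] >= thresholds)
        if len(above) > 0:
            # Break ties by taking the highest-probability candidate.
            true = above[np.argmax(pred_probs[i, above])]
            joint[labels[i], true] += 1
    return joint
```

Off-diagonal entries `joint[given, true]` count examples that were labeled `given` but that the model confidently places in class `true`; these are the candidates to prune or re-review.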
What is the most principled option after amassing a collection of bad labels? In practice, it usually means going through the images and fixing the annotations one by one, which is both laborious and impractical at scale.
That's why (1) a way to visualize, triage, and organize your data and (2) a way to fix your annotations are critical to any data annotation project.
Scale provides an integrated dataset and workforce management platform that solves these problems.
Let's walk through an example of how you can use two of Scale's products, Scale Nucleus and Scale Studio, to identify and fix bad labels by leveraging model predictions. For this example, we'll be using the public Berkeley Deep Drive dataset.
Starting on the “Charts” page in Scale Nucleus, a simple click into the confusion matrix shows 2,277 cases in which models predicted “person” but the ground truth label is “car”.
After sorting the results by descending model confidence, a brief scan of the corresponding objects makes it clear that these annotations are wrong.
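The triage pattern above, finding prediction/ground-truth disagreements and reviewing them in order of descending model confidence, can be sketched in a few lines. The arrays below are illustrative toy data, not the Nucleus API:

```python
import numpy as np

# Toy data: ground truth labels, model predictions, and model confidence.
gt_labels   = np.array(["car", "car", "person", "car", "person"])
pred_labels = np.array(["person", "car", "person", "person", "car"])
confidence  = np.array([0.95, 0.80, 0.90, 0.60, 0.99])

# Indices where the model disagrees with the ground truth ...
disagree = np.flatnonzero(pred_labels != gt_labels)

# ... reviewed most-confident first: when a well-calibrated model is very
# sure it disagrees with the annotation, the annotation is often the error.
order = disagree[np.argsort(-confidence[disagree])]
print(order)  # → [4 0 3]
```

The same idea generalizes to a single cell of the confusion matrix: filter to examples where `gt_labels == "car"` and `pred_labels == "person"`, then sort by confidence.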
This may be surprising given that thousands of researchers and production models rely on the Berkeley Deep Drive dataset, but almost all datasets have these kinds of issues.
Now that you have discovered some errors with the data labels, you might want to correct them.
With one click, you can send the bad labels over to Scale Studio, our self-service annotation platform, where your workforce can review and fix the errors. The annotation results will be automatically imported back into Scale Nucleus, overwriting the previous, flawed set of labels.
At Scale, we are perpetually focused on both data and label quality, and we build tools, products, and platforms to help you get the best possible results from your ML models. Building high-performance models starts with high-quality datasets. As you explore your dataset in Nucleus, verifying label accuracy, identifying edge cases, and collecting mismatches between predictions and ground truth, you may find that you and your teammates want to collaborate on adjusting or improving your ground truth labels. If you want to make these edits in house, as efficiently as possible, sign up for Scale Studio today.
1. Crawford, Kate. "The trouble with bias." Conference on Neural Information Processing Systems, invited talk, 2017.
2. Buolamwini, Joy, and Timnit Gebru. "Gender shades: Intersectional accuracy disparities in commercial gender classification." Conference on Fairness, Accountability and Transparency. PMLR, 2018.
3. Raji, Inioluwa Deborah, and Genevieve Fried. "About face: A survey of facial recognition evaluation." arXiv preprint arXiv:2102.00813 (2021).
4. De Vries, Terrance, et al. "Does object recognition work for everyone?" Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2019.
5. Crawford, Kate, and Trevor Paglen. "Excavating AI: The politics of images in machine learning training sets." 2019.
6. Northcutt, Curtis, Lu Jiang, and Isaac Chuang. "Confident learning: Estimating uncertainty in dataset labels." Journal of Artificial Intelligence Research 70 (2021): 1373-1411.