How to Handle Long Tail Data Distributions in Machine Learning

If code quality is what sets apart good software systems from bad ones, data quality is what distinguishes great AI systems from the ones that will simply never make it to production. While most of the research in machine learning (ML) continues to focus on models, industrial players building AI systems are focusing more and more on the data. The number of companies adopting a data-centric approach to AI development (understood as systematically engineering datasets used for AI) is growing rapidly across industries and levels of company maturity. According to former Tesla Autopilot lead Andrej Karpathy, engineers at Tesla lose sleep over the dual challenges of structuring their datasets effectively and handling the edge cases that might come up.

Graphic adapted from Andrej Karpathy’s slide deck [Train AI 2018]

But why is using the right data so central for building out high-performing ML models? The answer lies in unequal data distribution and low-quality ground truth (annotations).


Not all data is created equal

The core hypothesis behind data-centric AI is that you can achieve better ML model performance by training your model on higher-quality data. In an ideal world, your model performs equally well in all kinds of situations. For example, in autonomous driving, you would want a model detecting pedestrians to work equally well, irrespective of the weather, visual conditions, how fast the pedestrian is moving, how occluded they are, et cetera. Most likely however, your model will perform much worse on cases that are more rare—for example, a baby stroller swiftly emerging from behind a parked car in inclement weather. 

The point of failure here is that the model has been trained on data that was recorded during regular traffic conditions. As a result, these rare scenarios make up a much smaller portion of the training dataset than common scenarios do. Below is another example of two highway scenarios, where lane detection will be significantly more difficult in the right-hand picture than in the left.

What can we do about this? We need to increase the prevalence of these rare cases in our labeled training data.

Accuracy improvements with data selection via active learning versus random sampling given a fixed labeling budget.

Source: Emam et al., 2021. Active Learning at the ImageNet Scale

In ML research, this approach is often referred to as active learning. Many papers (e.g. Active Learning at ImageNet Scale) have demonstrated the effectiveness of this approach, showing significant performance improvements by picking more difficult (most often underrepresented) training data instead of random sampling. At Scale AI, this is also what we observe with our customers, especially those with advanced ML teams and models deployed in production. Companies that are the best at picking the right data for labeling—and eventually training—have the highest performing models. Unfortunately, identifying the highest impact data is difficult, and most ML practitioners perform this task manually, even when wrangling very large datasets.
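As a rough sketch of the idea (not the setup from the paper), least-confidence active learning fits in a few lines with scikit-learn; the dataset and "oracle" labels below are synthetic stand-ins:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_sample(model, X_pool, k):
    """Return indices of the k pool samples the model is least confident about."""
    confidence = model.predict_proba(X_pool).max(axis=1)
    return np.argsort(confidence)[:k]  # lowest top-class probability first

# Synthetic stand-in for an unlabeled data lake.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # "oracle" labels

# Small seed set containing both classes; everything else is the pool.
labeled = list(np.where(y == 0)[0][:10]) + list(np.where(y == 1)[0][:10])
pool = sorted(set(range(1000)) - set(labeled))

model = LogisticRegression().fit(X[labeled], y[labeled])
picks = uncertainty_sample(model, X[pool], k=10)
labeled += [pool[i] for i in picks]            # send exactly these to labeling
model = LogisticRegression().fit(X[labeled], y[labeled])
```

Each round, the labeling budget is spent on the samples the current model finds hardest, rather than on a random draw dominated by common scenarios.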

Labeling errors are breaking your model

To make things worse, almost all datasets include labeling errors, and those errors are more likely to occur on edge cases. This makes sense: an image type that has been seen 1,000 times is easier to label correctly than one seen only once. Labeling errors are also most destructive to model performance when they happen on edge cases, because you have a limited number of these samples in the first place. If the labels on these difficult samples are wrong, model performance will never improve; instead, it’s more likely to quickly regress.

Example of a typical labeling error on the public Berkeley Deep Drive dataset: class confusion between car (the annotation) and bus (the prediction).

As a result, finding and cleaning bad labels (generally, but especially for rare cases) is probably one of the most impactful activities you can do to meaningfully improve model performance. Many teams do this already. Unfortunately, though, doing this consistently, especially on large datasets while keeping track of issues and measuring progress, is often a very manual and isolated process. Many times, teams will use Jupyter notebooks to find bad labels and keep track of them in spreadsheets.

Doing things manually will only work for so long, and it will not get your model into production anytime soon. To address this, we built Scale Nucleus, a data management tool that helps you decide what data to label and ensures those labels are consistently of high quality.

In the rest of this article, we’ll walk through common strategies to mitigate the problems outlined above, starting with unequal data distribution.

Use similarity search to get from one failure case to many

One of the most common ways to build good training data is to look at a single sample that is particularly difficult for the model (e.g. based on pure luck, a field report, domain knowledge or other data points) and try to find more of the same.

In Nucleus, you can do this by selecting a few images of interest in your dataset and running a similarity search. For example, you might know that perception performance is bad on blurry images. In this case, you’d select a single blurry image and run a similarity search.

However, a single similarity search might yield mixed results; not everything is equally relevant. Furthermore, you might want to codify the issue you were looking for into a “tag” and consistently find all images that have this issue. In our example, you might want to make sure every image in your dataset can be identified as blurry or not blurry going forward.

One way to address this is using a Nucleus Autotag. Autotag is a way of teaching Nucleus about a new concept, by interactively selecting more and more specific examples of a case, for example blurry images, in a few refinement steps. In just a few minutes, the platform will “learn” a concept like blurriness and enable you to search your data using this attribute going forward. In the background, we are training a binary classifier, which assigns and updates a similarity score for the concept you are searching for across your whole dataset. You can learn more about Autotag here.
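Conceptually, this amounts to fitting a lightweight binary classifier on precomputed image embeddings. A minimal sketch, with random embeddings and made-up example indices standing in for real data (this is not the Nucleus implementation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in for precomputed image embeddings from a vision backbone.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 64))

# Indices the user confirmed as positive/negative during refinement rounds
# (hypothetical values for illustration).
pos_idx, neg_idx = [3, 17, 42], [5, 8, 99, 200]
X = embeddings[pos_idx + neg_idx]
y = np.array([1] * len(pos_idx) + [0] * len(neg_idx))

# The "tag" is just a lightweight binary classifier over the embeddings.
clf = LogisticRegression().fit(X, y)
scores = clf.predict_proba(embeddings)[:, 1]   # concept score for every image
ranked = np.argsort(-scores)                   # most concept-like images first
```

Because only the small classifier head is trained, the concept score can be recomputed across the whole dataset in seconds after each refinement step.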


Find rare edge cases with natural language search

Many times, an ML engineer might know (e.g. based on domain expertise) what kind of edge case data is difficult for their model, but does not immediately have an example of it at hand. Edge cases are often environmental conditions or situations for which you have no annotations and which your model does not yet predict. Examples might be specific lighting conditions, occlusions, or novel object types.

To start, you might run a Natural Language Query to find these examples. For example, you might want to find all images containing pizza boxes in the famous ImageNet dataset. It is important to note that this style of query works on data without any annotations, model predictions, or metadata, because the samples you are trying to find are edge cases that your computer vision model does not yet detect.

One way of solving the “no labels challenge” is to use precomputed embeddings that are matched against a natural language query, e.g. using OpenAI’s CLIP model. Nucleus generates CLIP embeddings automatically for all images in order to support natural language search.
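Stripped to its core, this kind of search is a cosine-similarity ranking between one text embedding and many image embeddings. A minimal sketch, with random vectors standing in for real CLIP embeddings:

```python
import numpy as np

def nl_search(text_emb, image_embs, top_k=5):
    """Rank images by cosine similarity between a text query embedding
    and precomputed image embeddings (e.g. both produced by CLIP)."""
    t = text_emb / np.linalg.norm(text_emb)
    im = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    return np.argsort(-(im @ t))[:top_k]   # highest similarity first

# Random stand-ins; in practice these come from CLIP's image/text encoders.
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(1000, 512))
text_emb = rng.normal(size=512)            # embedding of e.g. "pizza box"

top = nl_search(text_emb, image_embs)
```

Because the image embeddings are precomputed once, each new text query only costs a single matrix-vector product over the dataset.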

Natural Language Search

Once you’ve built a more balanced dataset and enriched it with annotations, you’re ready to ensure these labels are consistently of high quality, especially for edge cases.

Review computer vision labels

Now that you have a set of image annotations, you’ll want to verify that they are of high quality before using them as training data, especially for rare edge cases. This can either be done image by image, or in batches as long as the data annotations can be easily inspected and compared against the actual image content.

To improve label quality in the next iteration of labeling, it makes sense to mark each individual frame and bounding box as accepted or rejected and later query against these attributes to make sure all items are reviewed. If you encounter a particular issue, such as a badly fitted bounding box, a good option is to add the relevant images to a “slice”, which is a curated subset of data to share with colleagues or to return to at a later point.
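A minimal way to track such a review pass outside of any particular tool (the field names here are hypothetical, not the Nucleus API):

```python
# Review log: image_id -> review outcome.
reviews = {}

def review(image_id, accepted, issue=None):
    """Record an accept/reject decision, optionally tagging the issue found."""
    reviews[image_id] = {
        "status": "accepted" if accepted else "rejected",
        "issue": issue,
    }

review("img_001", accepted=True)
review("img_002", accepted=False, issue="badly fitted bounding box")

# Build a "slice": the curated subset sharing one specific issue.
bad_boxes = [i for i, r in reviews.items()
             if r["status"] == "rejected"
             and r["issue"] == "badly fitted bounding box"]
```

Querying the log for items with no entry yet also gives you the remaining review backlog, so nothing slips through unreviewed.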

Find mistakes in computer vision labels

Instead of going through all your data manually, it is often advisable to compare the labels you are reviewing against the predictions of your most recent computer vision model. There are several so-called model-assisted methods to find and clean labels, such as sort-by-loss, sort-by-confidence where model and ground truth disagree, or confident learning. It is often surprising how effective methods like sort-by-confidence are at improving label quality. ML practitioners typically do this by computing common error metrics such as the aggregated IoU, a confusion matrix, or a precision-recall (PR) curve (e.g. in a Jupyter notebook) and then inspecting the samples where the model’s prediction and the data annotation are least aligned.
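The sort-by-IoU variant of this idea fits in a few lines: match every annotation with the model's best-overlapping prediction and review the worst matches first. A self-contained sketch with toy boxes:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def suspicious_labels(ground_truth, predictions):
    """Sort annotations by how poorly the model's best prediction matches them."""
    scored = [(max((iou(g, p) for p in predictions), default=0.0), g)
              for g in ground_truth]
    return sorted(scored, key=lambda s: s[0])   # worst agreement first

gt = [(0, 0, 10, 10), (50, 50, 60, 60)]
preds = [(1, 1, 10, 10), (100, 100, 110, 110)]
worst = suspicious_labels(gt, preds)   # annotation with no match surfaces first
```

Reviewing from the top of this list concentrates effort on the annotations the model disagrees with most, which is where labeling errors cluster.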

However, with a GUI this process can be 10x faster, because error metrics are pre-computed automatically and can be explored directly with interactive charts. Clicking into any graph automatically queries and surfaces all the samples associated with that data point. For example, clicking into the cell for “bus versus truck” confusions in the matrix displayed below queries your dataset for all samples where the data label is a truck but the computer vision model predicts a bus.

From just a single glance, we can see that these are buses that are incorrectly labeled as trucks. These image annotations need to be corrected to avoid bad model performance for both bus and truck detections.

Find missing labels

Finally, it is critical to ensure that no labels are missing in your annotated image data. A missing label can lead to a performance regression of your model, especially if the object with the missing label is easily detected by your model.

In the above example, all cars in the image are annotated except the single white vehicle, which has clearly been missed. A perception model for vehicles will easily detect this car; however, if it is re-trained with the shown bounding boxes as ground truth, it will learn to disregard white cars, which would likely be a dangerous failure mode.

The confusion matrix is again a good starting point for finding these cases and making sure they are corrected. We can find the row labeled “no ground truth” at the bottom and query for examples where the model is currently still predicting an object.

With this query, we can swiftly find all cases that are obvious misses and make sure that the data annotations are corrected for these images.
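The same matching logic can be inverted to flag likely missing labels: confident predictions that overlap no annotation at all. A toy sketch:

```python
def overlaps(a, b):
    """True if two boxes (x1, y1, x2, y2) intersect at all."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def likely_missing_labels(predictions, ground_truth, min_score=0.8):
    """Confident model predictions with no overlapping annotation:
    strong candidates for objects the annotators missed."""
    return [box for box, score in predictions
            if score >= min_score
            and not any(overlaps(box, g) for g in ground_truth)]

gt = [(0, 0, 10, 10)]                           # one annotated car
preds = [((1, 1, 9, 9), 0.95),                  # matches the annotation
         ((40, 40, 55, 55), 0.91),              # confident but unmatched
         ((80, 80, 90, 90), 0.30)]              # too uncertain to flag
missing = likely_missing_labels(preds, gt)
```

The confidence threshold keeps the review queue short: only predictions the model is sure about, yet nobody annotated, are surfaced.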

Dealing with long tail data distributions: the TL;DR

We talked about two common problems when training high-performance computer vision models: unequally distributed data with rare edge cases and bad image annotations.

Both research and industry agree that picking out the right data to annotate leads to much better model performance, because it corrects for the naturally unequal data distribution encountered in the real world. The model can learn how to deal with edge cases by being trained on more data that contains them. Secondly, almost all datasets contain annotation errors, while computer vision models work on the assumption that all annotations are correct. In practice, this causes the models to perform worse. The problem is most pronounced for rare edge cases, as these are harder to annotate and incorrect annotations on them do more damage.

There are proven methods to cope with these problems. Good strategies for picking high-quality data include starting from difficult images and finding more of the same, or running semantic search for situations that are known to be problematic. Second, we have learned that reviewing the quality of data annotations is essential. Using error metrics that compare annotations with current model predictions (e.g. with a confusion matrix) is a good way to quickly surface the lowest-quality data labels or even missing annotations.

Getting the edge cases right is considered to be the “holy grail” of building high performance computer vision models, and with the right techniques, you don’t have to lose sleep over them any longer.

Ready to try Nucleus yourself?
