How to Tame the Long Tail in Machine Learning

by Sasha Harrison on June 29th, 2021


Many AI systems rely on supervised learning methods in which neural networks train on labeled data. The challenge with supervised methods is getting models to perform well on examples not adequately represented in the training dataset. Typically, as the frequency of a particular category decreases, so does average model performance on this category (see Figure 1). It is often difficult and costly to achieve strong performance on the rare edge cases that make up the long tail of a data distribution. In this blog post, we’ll take a deeper look at how sophisticated data curation tools can help machine learning teams target their experiments toward taming the long tail.
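To see this effect concretely, it helps to break evaluation accuracy out per class and sort by class frequency. Below is a minimal sketch; the toy labels are purely illustrative, not drawn from any real dataset:

```python
from collections import Counter

def per_class_accuracy(y_true, y_pred):
    """Per-class accuracy alongside class frequency, sorted head to tail."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {
        cls: {"frequency": n, "accuracy": correct[cls] / n}
        for cls, n in totals.most_common()  # most frequent classes first
    }

# Toy evaluation set: "car" is common, "rider" is a long-tail class.
y_true = ["car", "car", "car", "car", "car", "car", "rider", "rider"]
y_pred = ["car", "car", "car", "car", "car", "truck", "rider", "car"]
for cls, stats in per_class_accuracy(y_true, y_pred).items():
    print(cls, stats)
```

On real datasets, a report like this makes the correlation between class frequency and accuracy visible at a glance, and flags the tail classes worth targeting.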

Figure 1: This type of distribution, in which there are a few common categories followed by many rare categories, is called a long tail distribution. In the majority of deep learning applications, datasets collected in the real world tend to have this long-tail shape.
Figure 2: Long tail distributions occur frequently in the real world. For example, the frequency of words in English writing follows a long-tail distribution.
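This is easy to verify on any passage of English text: a handful of words dominate the head of the distribution, while most words appear only once or twice. A quick sketch using a made-up sample sentence:

```python
from collections import Counter
import re

sample = (
    "the quick brown fox jumps over the lazy dog while the dog sleeps "
    "and the fox runs through the quiet field past the sleeping dog"
)
counts = Counter(re.findall(r"[a-z]+", sample.lower())).most_common()

# A few words ("the") dominate the head; most words in the tail
# appear only once.
for word, count in counts[:3]:
    print(word, count)
```

Swap in any longer English text and the same shape emerges, with the head even more pronounced.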

The ugly truth is that almost all AI problems worth solving are made difficult by the challenge of a long tail. Imagine you’re an engineer at an autonomous vehicle company, working on an object detection model trained on image data captured from vehicles on the road. This real-world data provides a textbook long-tail distribution: there are many thousands of images of cars on the highway, but very few depicting bicycles at night (see Figure 3). In a preliminary experiment, the resulting model performs poorly when localizing cyclists at night (see Figure 4). Neglecting this rare scenario is likely to result in high-severity errors during testing, and it’s critical for the algorithm to detect cyclists properly in order to prevent collisions. What’s the best method to address this model failure in a targeted way? There are many approaches to improving performance, including experimenting with new architectures or additional hyperparameter tuning. But if the goal is to improve model performance on this specific edge case, the most targeted approach is to add more examples of cyclists to the training dataset. The problem is that unlabeled data is inherently difficult to search: sampling randomly, one is unlikely to find more cyclists at night, and it’s too expensive to simply label all collected data.

Figure 3: In the Berkeley Deep Drive dataset, the category “rider” occurs infrequently compared to the category “car.” It’s no surprise that when we view individual examples of riders, our model trained on Berkeley Deep Drive does not do a good job of localizing these objects.
Figure 4: Object detection predictions from an EfficientDet D7 architecture trained on Berkeley Deep Drive. Examples of poor localization for the class “rider”, a long tail class in the training dataset.

To improve performance on the long tail of edge cases, machine learning practitioners can get caught up in an endless cycle of collecting more training data, subsampling, and retraining the model. While this type of experimentation is essential to machine learning development, it is incredibly costly in time, compute, and data labeling. The cost of improving performance on the long tail often threatens the economics of AI products. This blog post from Andreessen Horowitz details how AI businesses seldom have the attractive economic properties of traditional software businesses. Where software products benefit from economies of scale, AI businesses experience the opposite: as model performance improves over time, the marginal cost of improvement increases exponentially, and it may require ten times the initial training data to yield any significant model improvement. Given that expensive model retraining is unavoidable, machine learning teams should focus on making iterative experimentation as efficient as possible.

One way to focus experiments on improving the long tail is to use model failures to identify gaps in the training dataset and then source additional data to fill those gaps. Think of this approach to machine learning experimentation as “mining the long tail.” With each experiment, identify a failure case, find more examples of this rare scenario, add new examples to the training dataset, and repeat.
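In code, that loop is only a few lines; everything interesting lives in the four callables, which stand in for a team’s own evaluation, search, labeling, and training infrastructure. The function names here are hypothetical placeholders, not a real API:

```python
def mine_long_tail(evaluate, find_similar, annotate, retrain,
                   model, train_set, unlabeled_pool, eval_set, rounds=3):
    """One pass per round: diagnose failures, source look-alike examples
    from the unlabeled pool, label them, fold them into the training
    set, and retrain."""
    for _ in range(rounds):
        failures = evaluate(model, eval_set)              # e.g. cyclists at night
        candidates = find_similar(failures, unlabeled_pool)
        train_set = train_set + annotate(candidates)      # new labeled examples
        model = retrain(model, train_set)
    return model, train_set
```

The skeleton is deliberately generic: the hard engineering work is making `find_similar` fast over millions of raw images, which is exactly what the tooling discussed below provides.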

Figure 5: The ML development lifecycle consists of collecting, generating, annotating, managing or curating, training, and evaluating. The data curation and management step has the biggest impact on an experiment’s success, yet it is the most overlooked step in ML discourse.

It’s difficult to target experiments at the most under-represented scenarios without sophisticated tools for dataset curation. Three capabilities are operationally critical for most AI teams:

  1. Identify long-tail scenarios during model evaluation
  2. Reliably and repeatedly source similar examples from unlabeled data
  3. Add these examples to a labeled training dataset

Some of the most effective teams we work with at Scale face the challenge of the long tail head-on by building their tools, operations, and workflows proactively around an iterative process of data curation. Hussein Mehanna, VP & Head of AI/ML at Cruise, highlighted this at Scale’s Transform conference earlier this year: “You have to have a robust, fast, continuous learning cycle, so when you find something that you haven't seen before, you add it quickly to your data, and you learn from it, and you put it back into the model.” This idea of targeted dataset curation is simple; in practice, however, it’s difficult to achieve because of the substantial engineering work required to index millions of raw images and surface difficult scenarios automatically during model evaluation.

Let’s put ourselves back in the shoes of an ML engineer. Having diagnosed poor performance on cyclists at night, we now want to target the next experiment at improving this scenario. With a complete set of data curation tools, this workflow would be simple: starting from one difficult example, source similar images from the large set of unlabeled data and send them through the labeling pipeline for annotation. Once this new batch of data is annotated, include it in the training dataset for subsequent experiments. We call this data-centric approach to experimentation model-in-the-loop dataset curation. Not only is this a highly targeted approach for addressing long-tail performance, but, unlike random sampling, it also results in better balance across classes and avoids labeling redundant examples that yield little marginal improvement in model performance.
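A common way to implement the “source similar images” step is nearest-neighbor search over embeddings from a pretrained image encoder. The sketch below is a minimal pure-Python illustration on toy 2-D vectors; production systems use high-dimensional embeddings and an approximate-nearest-neighbor index rather than a brute-force scan:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def source_similar(query, pool, k=5):
    """Return indices of the k pool embeddings most similar to the query."""
    ranked = sorted(range(len(pool)),
                    key=lambda i: cosine(query, pool[i]),
                    reverse=True)
    return ranked[:k]

# Toy pool: embeddings 0 and 1 point in roughly the same direction as the
# query (our one hard example of a cyclist at night); embedding 2 does not.
pool = [[1.0, 0.0], [0.9, 0.12], [0.0, 1.0]]
query = [0.95, 0.1]
print(source_similar(query, pool, k=2))  # → [1, 0]
```

Given one failure case, this retrieval step turns a single hard example into a ranked batch of candidates for the labeling pipeline.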

Performance on the long tail is often a make-or-break situation when it comes to AI in production. Especially in high-impact applications like healthcare — where data is difficult to collect and equity is critical — being proactive on edge cases is of utmost importance. Given the ubiquity of long-tail challenges in real-world applications, it’s surprising that there isn’t more discussion around how to tailor machine learning experiments toward these cases. ML teams should not run the risk of leaving precious rare scenarios undiscovered in a sea of unlabeled data. When it comes to the long tail, targeted dataset curation is an important tool for achieving performant AI models robust to the complexities of the real world.