How to Tame the Long Tail in Machine Learning

June 29, 2021

Many AI systems rely on supervised learning methods in which neural networks

train on labeled data. The challenge with supervised methods is getting models

to perform well on examples not adequately represented in the training

dataset. Typically, as the frequency of a particular category decreases, so

does average model performance on this category (see Figure 1). It is often

difficult and costly to achieve strong performance on the rare edge cases that

make up the long tail of a data distribution. In this blog post, we’ll take a

deeper look at how sophisticated data curation tools can help machine learning

teams target their experiments toward taming the long tail.

Figure 1: Long Tail Distribution

Figure 1: This type of distribution, in which there are a few common

categories followed by many rare categories, is called a long tail

distribution. In the majority of deep learning applications, datasets

collected in the real world tend to have this long-tail shape.

Figure 2: Ubiquity of Long Tail Distributions

Figure 2: Long tail distributions occur frequently in the real world. For

example, the frequency of words in English writing follows a long-tail distribution.


The ugly truth is that almost all AI problems worth solving are made

difficult by the challenge of a long tail. Imagine you’re an engineer at an

autonomous vehicle company, working on an object detection model trained on

image data captured from vehicles on the road. This real-world data provides

a textbook long-tail distribution: There are many thousands of images of

cars on the highway, but very few depicting bicycles at night (see Figure

3). In a preliminary experiment, the resulting model performs badly when

localizing cyclists at night (see Figure 4). Neglecting this rare scenario

is likely to result in high-severity errors during testing, and it’s

critical for the algorithm to detect cyclists properly in order to prevent

collisions. What’s the best method to address this model failure in a

targeted way? There are many approaches to improving performance, including

experimenting with new architectures or additional hyperparameter tuning.

But if the goal is to improve model performance on this specific edge case,

the most targeted approach is to add more examples of cyclists to the

training dataset. The problem is that unlabeled data is inherently difficult

to search; sampling randomly, one is unlikely to find more cyclists at

night, and it’s too expensive to simply label all collected data.

Figure 3: Berkeley Deep Drive Class Distribution

Figure 3: In the Berkeley Deep Drive dataset, the category “rider” occurs

infrequently compared to the category “car.” It’s no surprise that when we

view individual examples of riders, our model trained on Berkeley Deep Drive

does not do a good job of localizing these objects.

Figure 4: Examples of poor localization for the class “rider”

Figure 4: Object detection predictions from an EfficientDet D7 architecture

trained on Berkeley Deep Drive. Examples of poor localization for the class

“rider”, a long tail class in the training dataset.

To improve performance on the long tail of edge cases, machine learning

practitioners can get caught up in an endless cycle of collecting more

training data, subsampling, and retraining the model. While this type of

experimentation is essential to machine learning development, it is

incredibly costly with respect to time, compute, and data labeling. The cost

of improving performance on the long tail often threatens the economics of

AI products.

This blog post

from Andreessen Horowitz details how AI businesses seldom have the attractive

economic properties of traditional software businesses. Where software

products benefit from economies of scale, AI businesses experience the

opposite; as model performance improves over time, the marginal cost of

improvement increases exponentially. It may require ten times the initial

training data to yield any significant model improvement. Given that

expensive model retraining is unavoidable, machine learning teams should

focus on making iterative experimentation as efficient as possible.

One way to focus experiments on improving the long tail is to use model

failures to identify gaps in the training dataset and then source additional

data to fill those gaps. Think of this approach to machine learning

experimentation as “mining the long tail.” With each experiment, identify a

failure case, find more examples of this rare scenario, add new examples to

the training dataset, and repeat.
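
As a rough illustration, the loop described above can be sketched in a few lines of Python. This is a minimal sketch, not a description of any particular product: the function `mine_long_tail` and its callable arguments (`train`, `find_failures`, `source_similar`, `annotate`) are hypothetical placeholders for whatever training, evaluation, search, and labeling steps a team already has.

```python
from typing import Any, Callable, List, Sequence


def mine_long_tail(
    labeled: List[Any],
    unlabeled: List[Any],
    train: Callable[[Sequence[Any]], Any],
    find_failures: Callable[[Any], List[Any]],
    source_similar: Callable[[Any, Sequence[Any], int], List[int]],
    annotate: Callable[[List[Any]], List[Any]],
    rounds: int = 3,
    per_failure: int = 50,
) -> List[Any]:
    """Run a few rounds of: train, find failures, source similar data, label it.

    All callables are assumptions supplied by the caller:
      train(labeled) -> model
      find_failures(model) -> list of failure cases (empty when none remain)
      source_similar(failure, pool, k) -> up to k indices into pool
      annotate(items) -> the same items with labels attached
    """
    labeled = list(labeled)   # copy so the caller's list is not mutated
    pool = list(unlabeled)
    for _ in range(rounds):
        model = train(labeled)
        failures = find_failures(model)
        if not failures or not pool:
            break  # nothing left to fix, or nothing left to mine
        chosen = set()
        for failure in failures:
            chosen.update(source_similar(failure, pool, per_failure))
        # Send only the targeted examples to labeling, then grow the training set.
        labeled.extend(annotate([pool[i] for i in sorted(chosen)]))
        # Remove the newly labeled examples from the unlabeled pool.
        pool = [item for i, item in enumerate(pool) if i not in chosen]
    return labeled
```

Each round labels only the examples that were sourced for a known failure, which is what keeps the labeling budget focused on the tail rather than on yet more cars on the highway.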

Figure 5: ML Development Lifecycle

Figure 5: The ML development lifecycle consists of collecting, generating,

annotating, managing or curating, training, and evaluating. The data

curation or management step has the biggest impact on an experiment’s

success, yet it is the most overlooked in ML discourse.

It’s difficult to target experiments at the most under-represented scenarios

without sophisticated tools for dataset curation. There are three

operationally critical capabilities for most AI teams:

  1. Identify long-tail scenarios during model evaluation
  2. Reliably and repeatedly source similar examples from unlabeled data
  3. Add these examples to a labeled training dataset
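
To make the first capability concrete, here is a minimal sketch of one way to surface long-tail classes during evaluation: compare each class's share of the training data with its recall on an evaluation set. The function name `long_tail_report` and both thresholds are assumptions chosen for illustration. The sketch also treats predictions as one label per example; an object detection model would first need its predicted boxes matched to ground truth (for example by IoU) before recall can be counted this way.

```python
from collections import Counter
from typing import Dict, List, Sequence


def long_tail_report(
    train_labels: Sequence[str],
    eval_true: Sequence[str],
    eval_pred: Sequence[str],
    max_share: float = 0.05,   # assumed cutoff for "rare in training"
    max_recall: float = 0.5,   # assumed cutoff for "poorly recalled"
) -> List[Dict[str, object]]:
    """Flag classes that are both rare in training and poorly recalled."""
    train_counts = Counter(train_labels)
    total = sum(train_counts.values())
    eval_counts = Counter(eval_true)
    # Count correct predictions per ground-truth class.
    hits = Counter(t for t, p in zip(eval_true, eval_pred) if t == p)

    report = []
    for cls, n_eval in eval_counts.items():
        share = train_counts[cls] / total if total else 0.0
        recall = hits[cls] / n_eval
        report.append({
            "class": cls,
            "train_share": share,
            "recall": recall,
            "flagged": share <= max_share and recall <= max_recall,
        })
    # Rarest classes first, so the tail is at the top of the report.
    return sorted(report, key=lambda row: row["train_share"])
```

Sorting by training share puts the rarest classes at the top, so a class like “rider” shows up before “car” even when overall accuracy looks healthy.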

Some of the most effective teams we work with at Scale face the challenge of

the long tail head-on by building their tools, operations, and workflows

proactively around an iterative process of data curation. Hussein Mehanna,

VP & Head of AI/ML at Cruise, highlighted this at Scale’s Transform

conference earlier this year: “You have to have a robust, fast, continuous

learning cycle, so when you find something that you haven't seen before, you

add it quickly to your data, and you learn from it, and you put it back into

the model.” This idea of targeted dataset curation is simple; in practice,

however, it’s difficult to achieve because of the substantial engineering

work required to index millions of raw images and surface difficult

scenarios automatically during model evaluation.

Let’s put ourselves back in the shoes of an ML engineer. Having diagnosed

poor performance on cyclists at night, we now want to target the next

experiment at improving this scenario. With a complete set of data curation

tools, this workflow would be simple. From one difficult example, source

similar images from the large set of unlabeled data and send them through the

labeling pipeline for annotation. Once this new batch of data is annotated,

include it in the training dataset for subsequent experiments. We call this

data-centric approach to experimentation model-in-the-loop dataset curation.

Not only is this a highly targeted approach for addressing long-tail

performance but, unlike random sampling, it results in better balance across

classes and avoids labeling redundant examples that yield little marginal

improvement of model performance.
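
One common way to "source similar images" from a single difficult example is nearest-neighbor search over image embeddings. The sketch below assumes that approach; the post does not say which similarity method is used. It further assumes every image has already been mapped to a fixed-length vector by some feature extractor. The helper names `cosine` and `most_similar` are hypothetical, and a real system indexing millions of images would likely use an approximate nearest-neighbor index rather than this linear scan.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(
    query: Sequence[float],
    pool: Sequence[Sequence[float]],
    k: int = 5,
) -> List[Tuple[int, float]]:
    """Return (index, similarity) for the k pool embeddings closest to query."""
    scored = [(cosine(query, emb), idx) for idx, emb in enumerate(pool)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]
```

Given the embedding of one failed "cyclist at night" image as `query` and the embeddings of the unlabeled images as `pool`, the returned indices identify the candidates to send to the labeling pipeline.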

Performance on the long tail is often a make-or-break situation when it

comes to AI in production. Especially in high-impact applications like

healthcare — where data is difficult to collect and equity is critical —

being proactive on edge cases is of utmost importance. Given the ubiquity of

long-tail challenges in real-world applications, it’s surprising that there

isn’t more discussion around how to tailor machine learning experiments

toward these cases. ML teams should not run the risk of leaving precious

rare scenarios undiscovered in a sea of unlabeled data. When it comes to the

long tail, targeted dataset curation is an important tool for achieving

performant AI models robust to the complexities of the real world.
