Many AI systems rely on supervised learning methods in which neural networks
train on labeled data. The challenge with supervised methods is getting models
to perform well on examples not adequately represented in the training
dataset. Typically, as the frequency of a particular category decreases, so
does average model performance on this category (see Figure 1). It is often
difficult and costly to achieve strong performance on the rare edge cases that
make up the long tail of a data distribution. In this blog post, we’ll take a
deeper look at how sophisticated data curation tools can help machine learning
teams target their experiments toward taming the long tail.
Figure 1: This type of distribution, in which there are a few common
categories followed by many rare categories, is called a long-tail
distribution. In the majority of deep learning applications, datasets
collected in the real world tend to have this long-tail shape.
Figure 2: Long tail distributions occur frequently in the real world. For
example, the frequency of words in English writing follows a long-tail
distribution.
The ugly truth is that almost all AI problems worth solving are made
difficult by the challenge of a long tail. Imagine you’re an engineer at an
autonomous vehicle company, working on an object detection model trained on
image data captured from vehicles on the road. This real-world data provides
a textbook long-tail distribution: There are many thousands of images of
cars on the highway, but very few depicting bicycles at night (see Figure
3). In a preliminary experiment, the resulting model performs poorly at
localizing cyclists at night (see Figure 4). Neglecting this rare scenario
is likely to result in high-severity errors during testing, and detecting
cyclists reliably is critical to preventing collisions. What’s the best
method to address this model failure in a
targeted way? There are many approaches to improving performance, including
experimenting with new architectures or additional hyperparameter tuning.
But if the goal is to improve model performance on this specific edge case,
the most targeted approach is to add more examples of cyclists to the
training dataset. The problem is that unlabeled data is inherently difficult
to search; sampling randomly, one is unlikely to find more cyclists at
night, and it’s too expensive to simply label all collected data.
Figure 3: In the Berkeley Deep Drive dataset, the category “rider” occurs
infrequently compared to the category “car.” It’s no surprise that when we
view individual examples of riders, our model trained on Berkeley Deep Drive
does not do a good job of localizing these objects.
Figure 4: Object detection predictions from an EfficientDet D7 architecture
trained on Berkeley Deep Drive. Examples of poor localization for the class
“rider”, a long-tail class in the training dataset.
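Making the long tail visible in your own labeled data usually takes nothing more than a frequency count over annotation categories. The sketch below assumes a BDD100K-style detection label file (a JSON list of frames, each carrying a "labels" list whose entries have a "category" field); the filename and field names are placeholders to adapt to your own annotation format.

```python
import json
from collections import Counter

# Assumed BDD100K-style detection labels: a JSON list of frames, each with a
# "labels" list whose entries carry a "category" field. Filename and field
# names are placeholders; adapt them to your own annotation format.
with open("bdd100k_labels_images_train.json") as f:
    frames = json.load(f)

counts = Counter(
    label["category"]
    for frame in frames
    for label in frame.get("labels") or []
)

# Print classes from most to least frequent to expose the long tail.
for category, count in counts.most_common():
    print(f"{category:>15}  {count}")
```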
To improve performance on the long tail of edge cases, machine learning
practitioners can get caught up in an endless cycle of collecting more
training data, subsampling, and retraining the model. While this type of
experimentation is essential to machine learning development, it is
incredibly costly with respect to time, compute, and data labeling. The cost
of improving performance on the long tail often threatens the economics of
AI products.
An analysis from Andreessen Horowitz details how AI businesses seldom have the attractive
economic properties of traditional software businesses. Where software
products benefit from economies of scale, AI businesses experience the
opposite; as model performance improves over time, the marginal cost of
improvement increases exponentially. It may require ten times the initial
training data to yield any significant model improvement. Given that
expensive model retraining is unavoidable, machine learning teams should
focus on making iterative experimentation as efficient as possible.
One way to focus experiments on improving the long tail is to use model
failures to identify gaps in the training dataset and then source additional
data to fill those gaps. Think of this approach to machine learning
experimentation as “mining the long tail.” With each experiment, identify a
failure case, find more examples of this rare scenario, add new examples to
the training dataset, and repeat.
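In code, this loop can be expressed as a minimal sketch like the one below. It is an outline rather than a prescription: the four callables are hypothetical stand-ins for whatever training, evaluation, similarity-search, and labeling infrastructure a team already has.

```python
from typing import Callable, List


def mine_the_long_tail(
    train_fn: Callable,         # trains a model on a labeled dataset
    evaluate_fn: Callable,      # returns the worst-performing evaluation examples
    find_similar_fn: Callable,  # retrieves similar raw items from the unlabeled pool
    label_fn: Callable,         # sends candidates through the labeling pipeline
    train_set: List,
    unlabeled_pool: List,
    rounds: int = 5,
):
    """One experiment per round: find a failure case, mine similar raw data,
    label it, fold it into the training set, and retrain."""
    model = train_fn(train_set)
    for _ in range(rounds):
        failures = evaluate_fn(model)                           # identify a failure case
        candidates = find_similar_fn(failures, unlabeled_pool)  # find more rare examples
        new_examples = label_fn(candidates)                     # annotate the new batch
        train_set = train_set + new_examples                    # add to the training data
        model = train_fn(train_set)                             # retrain and repeat
    return model
```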
Figure 5: The ML development lifecycle consists of collecting, generating,
annotating, managing or curating, training, and evaluating. The data
curation or management step has the biggest impact on an experiment’s
success, yet it is the most overlooked in ML discourse.
It’s difficult to target experiments at the most under-represented scenarios
without sophisticated tools for dataset curation. There are three
operationally critical capabilities for most AI teams:
- Identify long-tail scenarios during model evaluation
- Reliably and repeatedly source similar examples from unlabeled data (see the sketch below)
- Add these examples to a labeled training dataset
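As a rough illustration of the second capability, the sketch below embeds images with an off-the-shelf pretrained backbone and retrieves the nearest neighbors of a known failure case from the unlabeled pool. The backbone, file paths, and neighbor count are illustrative assumptions; a production system would embed in batches and index millions of embeddings with an approximate nearest-neighbor library.

```python
from glob import glob

import torch
import torchvision.transforms as T
from PIL import Image
from sklearn.neighbors import NearestNeighbors
from torchvision import models

# Pretrained backbone with the classification head removed, so each image
# maps to a 2048-d embedding.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])


@torch.no_grad()
def embed(paths):
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    return backbone(batch).numpy()


# Hypothetical paths: the raw unlabeled pool and one known failure example.
unlabeled_paths = sorted(glob("unlabeled_frames/*.jpg"))
query_paths = ["failures/cyclist_at_night.jpg"]

# Index the unlabeled pool and pull the closest matches to the failure case.
index = NearestNeighbors(n_neighbors=100, metric="cosine")
index.fit(embed(unlabeled_paths))
_, neighbor_ids = index.kneighbors(embed(query_paths))

to_label = [unlabeled_paths[i] for i in neighbor_ids[0]]  # send these for annotation
```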
Some of the most effective teams we work with at Scale face the challenge of
the long tail head-on by building their tools, operations, and workflows
proactively around an iterative process of data curation. Hussein Mehanna,
VP & Head of AI/ML at Cruise, highlighted this at Scale’s Transform
conference earlier this year: “You have to have a robust, fast, continuous
learning cycle, so when you find something that you haven't seen before, you
add it quickly to your data, and you learn from it, and you put it back into
the model.” This idea of targeted dataset curation is simple; in practice,
however, it’s difficult to achieve because of the substantial engineering
work required to index millions of raw images and surface difficult
scenarios automatically during model evaluation.
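To give a concrete flavor of what surfacing difficult scenarios during model evaluation can mean for detection, here is a minimal sketch that flags images where every prediction misses or poorly localizes a ground-truth box of the target class. The data structures and the 0.5 IoU threshold are illustrative assumptions; the flagged images are exactly the seeds to feed into similarity search.

```python
# Assumed data structures: dicts mapping image id -> list of [x1, y1, x2, y2]
# boxes for one class of interest (e.g. "rider"), for ground truth and for
# model predictions. The 0.5 IoU threshold is an illustrative choice.


def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def hard_examples(gt_boxes, pred_boxes, iou_threshold=0.5):
    """Image ids where some ground-truth box of the target class is missed or
    poorly localized by every prediction."""
    hard = []
    for image_id, truths in gt_boxes.items():
        preds = pred_boxes.get(image_id, [])
        if any(max((iou(t, p) for p in preds), default=0.0) < iou_threshold
               for t in truths):
            hard.append(image_id)
    return hard
```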
Let’s put ourselves back in the shoes of an ML engineer. Having diagnosed
poor performance on cyclists at night, we now want to target the next
experiment at improving this scenario. With a complete set of data curation
tools, this workflow would be simple. From one difficult example, source
similar images from the large set of unlabeled data and send them through the
labeling pipeline for annotation. Once this new batch of data is annotated,
include it in the training dataset for subsequent experiments. We call this
data-centric approach to experimentation model-in-the-loop dataset curation.
Not only is this a highly targeted approach to addressing long-tail
performance, but, unlike random sampling, it also results in better balance
across classes and avoids labeling redundant examples that yield little
marginal improvement in model performance.
Performance on the long tail is often a make-or-break situation when it
comes to AI in production. Especially in high-impact applications like
healthcare — where data is difficult to collect and equity is critical —
being proactive about edge cases is of the utmost importance. Given the ubiquity of
long-tail challenges in real-world applications, it’s surprising that there
isn’t more discussion around how to tailor machine learning experiments
toward these cases. ML teams should not run the risk of leaving precious
rare scenarios undiscovered in a sea of unlabeled data. When it comes to the
long tail, targeted dataset curation is an important tool for achieving
performant AI models robust to the complexities of the real world.