Introduction
At Scale AI, we label on the order of 10MM annotations per week. To deliver
high-quality annotations for this enormous volume of data, we’ve developed a
number of advanced techniques to provide rich detail about complex environments,
accelerate the labeling process, and measure and maintain labeler (Tasker)
quality. As we work with more
customers, more Taskers, and more data, we continue to refine these methods to
improve our labeling quality, efficiency, and scalability.
How we use ML
While this vast quantity of data provides Scale AI with invaluable
opportunities to learn and to build upon our annotation processes, it also
enables our Machine Learning team to train models that further augment our
capabilities. We leverage ML models throughout our annotation pipeline
including:
- Pre-labeling to reduce Taskers’ manual work
- Active tooling to maximize Taskers’ speed & efficiency
- ML Linters to check Taskers’ annotations for potential errors
We have observed that ML models offer orders-of-magnitude improvement for
every component of our annotation pipeline. For our customers, this equates
to higher quality, higher throughput, and lower prices. Because ML has such
high potential impact, it is important that our models are not only
highly performant but also highly scalable. This means we have to be very
intentional about how we scale machine learning across our many customers,
task types, and datasets.
Challenges of Vast and Varied Data
The naive approach is to train a model for each customer dataset. But this
doesn’t scale. It would be very expensive for an ML engineer to train and
deploy a new model whenever a customer wants a new dataset. It can also take
significant “ramp time” before we’ve even labeled enough data to train a
useful model. More importantly, research has shown
that “model performance increases logarithmically based on volume of
training data.” Scale AI has a vast and ever-growing quantity of labeled
data. This means training separate models, each on a small fraction of the
available data, would waste enormous potential performance gains.
If we train a model per dataset, we need to wait for the “ramp time” to
annotate a sufficiently large training set before we can actually leverage
ML. Training models across datasets increases performance, improves
generalization, and eliminates this ramp time.
To effectively leverage our data advantage, we need to train models across
customers’ datasets. Doing so reduces the cost of developing a custom model
per dataset, eliminates the ramp time needed to train a useful model, and
drastically improves model performance. Ultimately, this enhances the
product for our customers. However, using models across customer datasets is
complicated because each dataset is unique. The two primary differences
between customer datasets are the data domain and label taxonomy.
Data Domain
Images from different datasets (COCO, Mapillary, etc.) highlighting domain
differences.
In the context of machine learning, a domain is the underlying distribution
your data is sampled from. For example, within the general domain of driving
images, there are a number of factors that affect the data distribution such
as:
- Weather and lighting conditions (e.g. day vs. night, snow vs. rain, sunny vs. overcast)
- The type of sensor. This can affect image perspective, object dimensions, color contrast, etc. Cameras at the front of a car will capture different data than cameras on the sides.
- The location. An urban environment will feature more occluded objects than a highway. Motorcycles will appear more frequently in Thailand than in Germany.
Models perform best when the training data distribution is representative of
the target distribution. Ideally, the training and target data come from the
same domain. However, our customers’ datasets exhibit all these variations
and more.
Label Taxonomy
A NuScenes image labeled according to three different customer-taxonomies,
exhibiting naming and missing-label issues.
Even if the training data comes from the same underlying distribution as the
target data, the training labels must also be representative of the target
labels. However, all of our customers want their data labeled differently.
For a given dataset, the customer defines a set of rules specifying which
objects to label, what names to give them, how tight to make the bounding
boxes, etc. A taxonomy is the combination of all these rules.
Variations among customer-taxonomies pose several challenges for training
models.
The most obvious issue is that customer-taxonomies use different names for
labeling the same object classes. For example, some customers may label any
person as "Human" while others distinguish between "Pedestrianadult" and "Pedestrianchild." Over the course of my internship I’ve learned more nuanced labels
for people, vehicles, and signs than I’d like to admit.
We face an even greater challenge when customer-taxonomies label different
sets of object classes. For example, in the figure above, Taxonomy A only
labels people, Taxonomy C only labels vehicles, and Taxonomy B labels both
people and vehicles. When training an object detector for people and
vehicles, we’d want to use data from all three taxonomies to maximize
performance. However, images from Taxonomy A with unlabeled vehicles will
encourage the model not to detect vehicles while images from Taxonomy C with
unlabeled people will encourage the model not to detect people. This issue
is particularly problematic because these “missing labels” are systematic.
As we’ve mentioned in
Quantity is no Panacea, random labeling errors may be mitigated by increasing the quantity of
training data, but systematic errors will severely impact model performance.
Tackling the Taxonomy Dilemma
In this section, we’ll walk you through the steps we took to overcome these
challenges. Training across customer datasets should provide our models with
a large, varied training set to cope with the differences in data domain.
But we need some more innovative ideas to cope with differences in taxonomy.
For simplicity, we’ll focus on 2D object detection and use EfficientDet, a
SOTA object detection model recently released by Google Brain. We’ll detail
three different approaches we’ve
explored to tackle the taxonomy dilemma — using the model architecture, the
dataset, and the loss function. As you read on, you’ll see why modifying the
loss function via Taxonomy Loss Masking yields the best solution.
Approach 1: Separate Tasks
EfficientDet backbone with three pairs of taxonomy-specific classification
and regression heads (modified from the EfficientDet architecture diagram).
Our initial approach takes the perspective of multitask learning — we treat
prediction for each customer-taxonomy as a separate task. In this setting,
we have a shared backbone which generates a taxonomy-agnostic feature
embedding, and a pair of classification and regression heads for each of the
target customer-taxonomies (Figure above). To distinguish differences in
taxonomies, the taxonomy-specific heads are only trained on examples labeled
according to that customer-taxonomy. For training/inference on an example
from customer dataset A, the Taxonomy A classification and regression heads
perform the prediction.
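To make this concrete, here is a minimal PyTorch-style sketch of the multitask setup. The toy backbone, class counts, and taxonomy names are placeholders (the real model uses an EfficientDet backbone with BiFPN features), so treat it as an illustration of the head structure rather than our production architecture.

```python
import torch
import torch.nn as nn

class MultiTaxonomyDetector(nn.Module):
    """Shared backbone with one (classification, regression) head pair per taxonomy."""

    def __init__(self, feature_dim, num_anchors, classes_per_taxonomy):
        super().__init__()
        # Taxonomy-agnostic feature extractor (placeholder for EfficientDet + BiFPN).
        self.backbone = nn.Conv2d(3, feature_dim, kernel_size=3, padding=1)
        # Taxonomy-specific heads, trained only on examples from that taxonomy.
        self.cls_heads = nn.ModuleDict({
            tax: nn.Conv2d(feature_dim, num_anchors * k, kernel_size=3, padding=1)
            for tax, k in classes_per_taxonomy.items()
        })
        self.reg_heads = nn.ModuleDict({
            tax: nn.Conv2d(feature_dim, num_anchors * 4, kernel_size=3, padding=1)
            for tax in classes_per_taxonomy
        })

    def forward(self, images, taxonomy):
        # Route the shared features through the heads for the example's taxonomy.
        features = self.backbone(images)
        return self.cls_heads[taxonomy](features), self.reg_heads[taxonomy](features)

# Example: three customer-taxonomies with hypothetical class counts.
model = MultiTaxonomyDetector(feature_dim=64, num_anchors=9,
                              classes_per_taxonomy={"A": 1, "B": 2, "C": 1})
cls_out, reg_out = model(torch.randn(2, 3, 256, 256), taxonomy="B")
```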
Unfortunately, like the naive approach of training a model for each dataset,
this multitask approach requires a “ramp time” before we’ve annotated enough
examples to train a new taxonomy-specific head. Additionally, the cost of
this approach scales with the number of customer-taxonomies because we have
to train a pair of heads for each. Furthermore, this multitask approach is
stymied by the classic problem in multitask learning where each head “fights
for capacity” (Andrej Karpathy, ICML 2019). Complex correlations exist between
tasks, so the backbone struggles to learn a representation that satisfies them
all. We observed this phenomenon in practice: our model achieved only mediocre
performance.
This first approach uses the model architecture, namely class-specific
heads, to cope with differences in customer-taxonomies. However, we neglect
to tell our model crucial information: customer-taxonomies are closely
related.
Super Taxonomy
The next two approaches rely on the notion of a “super-taxonomy.” Instead of
directly distinguishing between customer-taxonomies, we can define a general
“super-taxonomy” that encompasses them all. Then, we define a mapping from
labels in the customer-taxonomies to labels in the super-taxonomy. This
allows us to encode our priors about relationships between the
customer-taxonomies by treating each as a subset of the super-taxonomy.
The mapping from customer-taxonomies to the super-taxonomy.
By mapping customer labels to labels in the super-taxonomy, the individual
customer datasets can be combined into a “super-dataset” with a consistent
naming scheme. However, the super-dataset suffers from the systematic
“missing label” issue we discussed in the Label Taxonomy section.
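As an illustration, a super-taxonomy mapping can be as simple as a dictionary from customer labels to super-classes; the class names and taxonomy IDs below are invented for this sketch.

```python
# Hypothetical super-taxonomy and customer-label mappings (illustrative only).
SUPER_TAXONOMY = ["Person", "Vehicle", "Traffic Light"]

CUSTOMER_TO_SUPER = {
    "taxonomy_A": {"Pedestrian_Adult": "Person", "Pedestrian_Child": "Person"},
    "taxonomy_B": {"Human": "Person", "Car": "Vehicle", "Truck": "Vehicle"},
    "taxonomy_C": {"Car": "Vehicle", "Bus": "Vehicle"},
}

def to_super_label(taxonomy: str, customer_label: str) -> str:
    """Map a customer label onto its super-taxonomy class."""
    return CUSTOMER_TO_SUPER[taxonomy][customer_label]

def labeled_classes(taxonomy: str) -> set:
    """Super-taxonomy classes that this customer-taxonomy actually labels."""
    return set(CUSTOMER_TO_SUPER[taxonomy].values())

# taxonomy_A never labels vehicles, so "Vehicle" is systematically missing there.
assert "Vehicle" not in labeled_classes("taxonomy_A")
```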
As a quick baseline, we tried training directly on this super-dataset. We
were optimistic because we read that “(randomly) dropping 30% of the
annotations… only drops (performance) by 5% on the PASCAL VOC dataset” (Soft Sampling for Robust Object Detection). But even after adding extra tricks like
Background Recalibration Loss, the resulting model had abysmal recall — likely because our missing
labels are systematic, not random.
Approach 2: Separate Datasets
In this second approach, we address the missing label problem by dividing
the super-dataset into separate class-specific super-datasets, one for each
label in the super-taxonomy. As shown in the Figure below, we eliminate the
missing label issue in each of the class-specific super-datasets by
including only customer-taxonomies which label the corresponding class. This
enables us to train a class-specific model for each class-specific
super-dataset. For inference, we simply run all of these class-specific
models on an image and combine their outputs.
Class-specific super-datasets only include customer taxonomies which label
the corresponding class. We train a single class-specific model for each
class-specific super-dataset. All of these models are combined for
inference.
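A rough sketch of this approach, building on the mapping helpers above (the annotation format and the model's predict() call are hypothetical stand-ins):

```python
def build_class_datasets(annotations):
    """One dataset per super-class, keeping only taxonomies that label that class."""
    datasets = {}
    for super_class in SUPER_TAXONOMY:
        datasets[super_class] = [
            example for example in annotations              # example: {"taxonomy": ..., ...}
            if super_class in labeled_classes(example["taxonomy"])
        ]
    return datasets

def combined_inference(class_models, image):
    """Run every class-specific model on the image and merge the detections."""
    detections = []
    for super_class, model in class_models.items():
        for box, score in model.predict(image):             # hypothetical predict() API
            detections.append({"class": super_class, "box": box, "score": score})
    return detections
```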
This approach yields good performance because labels within each dataset are
consistent and each model learns a good general representation for its
class. Using so many models would be impractical for most of our
self-driving customers due to the constrained computational resources and
strict latency requirements associated with perception. However, these
limitations don’t apply to Scale AI because labeling operates on a much
longer timescale than perception. Using separate models has the added
benefit of decoupling performance across classes — we can re-train our
Vehicle model without affecting the performance on any other class.
In this approach, the cost of training and inference scales with the number
of classes in our super-taxonomy, rather than the number of
customer-taxonomies. This is a marked improvement over the multitasking
approach if we have a small super-taxonomy. However, our super-taxonomies
are often fairly large. This approach is undesirable because it requires
having many separate models that should theoretically be able to share
features. We’d prefer to use just a single model.
Approach 3: Separate Loss - Taxonomy Loss Masking
What if, instead of telling our model about differences in
customer-taxonomies through the architecture or through separate datasets,
we make this distinction in the loss function? To answer this question,
let’s take a closer look at how our model is trained.
Focal Loss mitigates
class imbalance by re-weighting binary cross entropy loss.
Single stage detectors, like EfficientDet, detect objects by making
predictions at many predefined anchor boxes. Let A be the number of anchor
boxes and K be the number of target object classes. The EfficientDet
classifier head predicts a matrix of probabilities P ∈ ℝ^(A×K),
where P_ij is the probability that there is an instance of class j
at anchor i. The model is trained using
Focal Loss, a standard
loss function for mitigating extreme class imbalance in single stage
detectors. As demonstrated in the Figure below, this loss is applied to
predictions for “positive” and “negative” anchors. The loss is not applied
to predictions for “unassigned” anchors — these values are masked during
training because they provide ambiguous signals to the model.
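In code, the anchor-level masking looks roughly like the sketch below (a simplified, per-image binary focal loss; the IoU thresholds follow the figure, and the real training loop differs in its details):

```python
import torch

def anchor_masked_focal_loss(probs, targets, anchor_ious, alpha=0.25, gamma=2.0):
    """
    probs:       (A, K) predicted probabilities P
    targets:     (A, K) binary ground-truth assignments (floats in {0, 1})
    anchor_ious: (A,)   IoU of each anchor with its closest ground-truth box
    Returns a per-anchor, per-class loss matrix of shape (A, K).
    """
    p_t = targets * probs + (1 - targets) * (1 - probs)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    loss = -alpha_t * (1 - p_t) ** gamma * torch.log(p_t.clamp(min=1e-6))

    # Rows: mask "unassigned" anchors, i.e. IoU in [0.4, 0.5). They are neither
    # positive (IoU >= 0.5) nor negative (IoU < 0.4), so their signal is ambiguous.
    assigned = ((anchor_ious >= 0.5) | (anchor_ious < 0.4)).float()
    return loss * assigned.unsqueeze(1)
```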
We realized that, because the detector is class-aware (as opposed to the
more common class-agnostic Region Proposal Network in two-stage detectors),
we can use a similar loss masking scheme to deal with the taxonomy dilemma.
In addition to masking loss coming from anchors (rows) based on IoU, we mask
loss coming from target object classes (columns) if the class is “missing"
in the corresponding customer-taxonomy. If a training example comes from a
customer-taxonomy that doesn’t include "Vehicle," we know "Vehicle" objects
are not labeled so the probability of "Vehicle" for every anchor box is
ignored and the corresponding loss values are masked. We define a “missing
label mask” ∈ {0,1}^K for each customer-taxonomy that tells us
which classes (columns) to mask when training on an example from that
customer-taxonomy. During inference, the model makes predictions for all
classes. We call this Taxonomy Loss Masking.
Left: Masking in Focal Loss. We depict P ∈ ℝ^(A×K), the matrix of
probabilities predicted by the classification head, where P_ij is
the probability that there is an instance of class j at anchor i. We show
the anchors (rows) sorted by IoU with the nearest ground truth annotation:
green rows have IoU >= 0.5 and red rows have IoU less than 0.4. Focal Loss
applies to these “positive” and “negative” anchors respectively. However,
anchors with IoU between [0.4, 0.5) are neither positive nor negative.
They are “unassigned” and their loss is masked. Right: Taxonomy Loss
Masking. When training on an example from a customer-taxonomy which only
labels Traffic Light and Person, the loss for all other classes (columns)
is masked.
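Continuing the sketches above (the mask-building helper and class names are assumptions, not our exact implementation), Taxonomy Loss Masking only adds a per-example column mask on top of the anchor-masked focal loss:

```python
import torch

def taxonomy_masked_loss(probs, targets, anchor_ious, taxonomy, super_taxonomy):
    # (A, K) loss matrix with "unassigned" anchor rows already masked.
    loss = anchor_masked_focal_loss(probs, targets, anchor_ious)
    # Columns: 1 for classes this customer-taxonomy labels, 0 for "missing" classes.
    labeled = labeled_classes(taxonomy)
    class_mask = torch.tensor([1.0 if c in labeled else 0.0 for c in super_taxonomy])
    return (loss * class_mask.unsqueeze(0)).sum()

# Example: an image from taxonomy_A (people only) contributes no loss for the
# "Vehicle" or "Traffic Light" columns, but the model still predicts all classes
# at inference time.
A, K = 100, len(SUPER_TAXONOMY)
loss = taxonomy_masked_loss(torch.rand(A, K), torch.zeros(A, K), torch.rand(A),
                            "taxonomy_A", SUPER_TAXONOMY)
```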
In essence, Taxonomy Loss Masking is multitasking across super-taxonomy
classes without any task-specific parameters — using a fully-shared
architecture, and only altering the loss function. This approach leverages
maximum information about our priors: the taxonomies are closely related,
some taxonomies are systematically “missing labels,” and the model should be
able to share features across classes. Not only is this approach
simpler than the previous two, but it also yields
excellent performance, it’s cheaper, and it’s
more scalable.
Although we primarily focused on 2D object detection for the domain of
driving images, we can use variants of Taxonomy Loss Masking across domains
and across task types including 3D object detection, 2D/3D semantic
segmentation, etc.
In Case You're Wondering
“The model only predicts labels in the super-taxonomy, not the actual
customer-taxonomies. Isn’t this incomplete?”
Good observation. We can predict labels in customer-taxonomies by using a
hierarchy of classifiers and a hierarchical super-taxonomy. Each level of
the classifier hierarchy will make a more granular prediction in the
super-taxonomy and use a different “missing label mask.” The classifiers can
even predict object attributes: e.g. whether a vehicle is parked or occluded
(Scale AI offers attribute labeling too)!
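For example, a hierarchical super-taxonomy could be expressed as nested classes, with one classifier (and one missing-label mask) per level; the names below are hypothetical.

```python
# Hypothetical two-level hierarchy: a coarse classifier predicts the top-level
# class, and a finer classifier (with its own missing-label mask) refines it.
HIERARCHICAL_SUPER_TAXONOMY = {
    "Person": ["Pedestrian_Adult", "Pedestrian_Child", "Cyclist"],
    "Vehicle": ["Car", "Truck", "Bus"],
}
```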
But remember, our models are used to augment and accelerate our Taskers. For
Taskers doing object detection, the most difficult part of labeling is
detection. Classification is very easy because it’s just a multiple choice
question. Although we try, we don’t actually need to solve the entire
problem; we only need to focus on the expensive part — detection. I found
this distinction particularly interesting and important during my
internship.
“Are there other differences in customer taxonomies that you have to
deal with?”
Some customers have different labeling rules. For example, some want
bounding boxes for vehicles to include the side mirrors while others want
just the main body of the vehicle. We can address this issue by weighting
the regression loss based on labeling rules. Let’s say we want the model to
include side mirrors. If we’re training on data from Taxonomy A which
includes side mirrors and Taxonomy B which doesn’t, we’ll weight the
regression loss higher for instances from Taxonomy A. Another way to teach
the model to include side mirrors is to oversample data from Taxonomy A
during training, or to fine-tune the model on Taxonomy A after training.
Differences in labeling rules. The left bounding box includes side
mirrors. The right does not.
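One way to implement the rule weighting is simply to scale the box-regression loss per taxonomy; the weight values below are illustrative, not tuned numbers from our pipeline.

```python
import torch.nn.functional as F

# Hypothetical weights: boxes from taxonomy_A include side mirrors (the target
# rule), so its examples influence the regression head more strongly.
REGRESSION_WEIGHT = {"taxonomy_A": 1.0, "taxonomy_B": 0.3}

def weighted_box_regression_loss(pred_boxes, target_boxes, taxonomy):
    loss = F.smooth_l1_loss(pred_boxes, target_boxes, reduction="mean")
    return REGRESSION_WEIGHT[taxonomy] * loss
```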
Operation Vacation
Scale’s machine learning models supercharge our data annotation cycle. As
long as our customers continue sending us data and our Taskers continue
labeling, our models will continue improving, accelerating the labeling
process, and perpetuating this virtuous cycle.
As we mentioned earlier, it is important that our models are trained on as
much data as possible. The quantity of our
training data grows along two primary dimensions: the number of
customer datasets and the time we spend labeling these datasets. Taxonomy
Loss Masking enabled us to scale our model’s training data across the
customer dimension. Since our Taskers are continuously labeling
data and the size of our datasets grows over time, it’s important that we
also scale across the time dimension. In other words, we should
continue training our models as we get more labeled data.
In a recent collaboration with PyTorch, Scaliens Daniel Havir and Nathan
demonstrated how we use asynchronous data streaming to train on large,
growing datasets. This technique, along with innovative cloud infrastructure
and distributed training, enables us to automatically train models as we
accumulate more data. We use this system to push the limits of model
performance for both training on super-taxonomies, and fine-tuning on
individual customer datasets.
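The actual streaming system is described in that collaboration; as a toy illustration of the idea, a PyTorch IterableDataset can keep pulling newly labeled examples from a source (the fetch_batch client below is hypothetical):

```python
from torch.utils.data import IterableDataset, DataLoader

class StreamingAnnotations(IterableDataset):
    """Streams labeled examples from a growing dataset instead of a fixed file list."""

    def __init__(self, fetch_batch, batch_size=64):
        self.fetch_batch = fetch_batch      # hypothetical client that reads newly labeled data
        self.batch_size = batch_size

    def __iter__(self):
        while True:
            batch = self.fetch_batch(self.batch_size)
            if not batch:                   # no new labeled examples yet
                break
            yield from batch

# loader = DataLoader(StreamingAnnotations(my_client.fetch), batch_size=None)
```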
Daniel and Nathan also demonstrated how we use hashing to achieve a
consistent train/test split when we have a growing dataset. This means that
our train and test sets should grow at the same rate. Holding the test set
constant, we’d expect increased model performance as we continue training on
a growing train set. On the other hand, holding the model constant, we can
examine performance on cohorts of older vs. newer examples in our test set.
Changes in model performance along this dimension indicate drift in the data
distribution over time, meaning that our customer is sending us different
data. This is an example of how we can automatically monitor model
performance.
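A minimal sketch of the hashing idea, assuming each example has a stable ID: the split assignment depends only on the ID, so an example never migrates between train and test as the dataset grows.

```python
import hashlib

def split_for(example_id: str, test_fraction: float = 0.1) -> str:
    """Deterministically assign an example to 'train' or 'test' based on its ID."""
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 1000
    return "test" if bucket < test_fraction * 1000 else "train"

# The same ID always lands in the same split, no matter when it was labeled.
assert split_for("frame_000123") == split_for("frame_000123")
```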
In the same vein as Andrej Karpathy’s “Operation Vacation”, we use these training and monitoring systems to automate the model
learning process. This saves engineering hours and enables our Machine
Learning team to focus on finding new ways to enhance our annotation
pipeline with ML. As long as our customers continue sending us data and our
Taskers continue labeling, we can “take a vacation.” Our models will
continue improving, accelerating the labeling process, and perpetuating this
virtuous cycle.
Ultimately, scalable machine learning helps us continuously improve our
labeling quality and efficiency while striving toward our mission: “To
accelerate the development of AI by democratizing access to intelligent
data.”