
How we Scale Machine Learning

May 18, 2020

Introduction



At Scale AI, we produce on the order of 10 million annotations per week. To deliver

high-quality annotations for this enormous volume of data, we’ve developed a

number of techniques including advanced

sensor fusion to

provide rich detail about complex environments,

active tooling to

accelerate the labeling process, and

automated benchmarks

to measure and maintain labeler (Tasker) quality. As we work with more

customers, more Taskers, and more data, we continue to refine these methods to

improve our labeling quality, efficiency, and scalability.

How we use ML



While this vast quantity of data provides Scale AI with invaluable

opportunities to learn and to build upon our annotation processes, it also

enables our Machine Learning team to train models that further augment our

capabilities. We leverage ML models throughout our annotation pipeline

including:

  1. Pre-labeling to reduce Taskers’ manual work

  2. Active tooling to maximize Taskers’ speed & efficiency

  3. ML Linters to check Taskers’ annotations for potential errors


We have observed that ML models offer orders-of-magnitude improvement for

every component of our annotation pipeline. For our customers, this equates

to higher quality, higher throughput, and lower prices. Because ML has such

high potential impact, it is important that our models are not only

highly performant, but also highly scalable. This means we have to be very

intentional about how we scale machine learning across our many customers,

task types, and datasets.

Challenges of Vast and Varied Data


The naive approach is to train a model for each customer dataset. But this

doesn’t scale. It would be very expensive for an ML engineer to train and

deploy a new model whenever a customer wants a new dataset labeled. It can also take

significant “ramp time” before we’ve even labeled enough data to train a

useful model. More importantly, research from

Google AI has demonstrated

that “model performance increases logarithmically based on volume of

training data.” Scale AI has a vast and ever-growing quantity of labeled

data. This means training separate models, each on a small fraction of the

available data, would waste enormous potential performance gains.

If we train a model per dataset, we need to wait for the “ramp time” to

annotate a sufficiently large training set before we can actually leverage

ML. Training models across datasets increases performance, improves

generalization, and eliminates this ramp time.


To effectively leverage our data advantage, we need to train models across

customers’ datasets. Doing so reduces the cost of developing a custom model

per dataset, eliminates the ramp time needed to train a useful model, and

drastically improves model performance. Ultimately, this enhances the

product for our customers. However, using models across customer datasets is

complicated because each dataset is unique. The two primary differences

between customer datasets are the data domain and label taxonomy.

Data Domain

Images from different datasets (COCO, Mapillary, KITTI, NuScenes, Waterloo) highlighting domain differences.


In the context of machine learning, a domain is the underlying distribution

your data is sampled from. For example, within the general domain of driving

images, there are a number of factors that affect the data distribution such

as:


  1. Weather and lighting conditions (e.g. day vs. night, snow vs. rain, sunny
  vs. overcast).

  2. The type of sensor. This can affect image perspective, object dimensions,
  color contrast, etc. Cameras at the front of a car will capture different
  data than cameras on the sides.

  3. The location. An urban environment will feature more occluded objects than
  a highway. Motorcycles will appear more frequently in Thailand than in
  Germany.


Models perform best when the training data distribution is representative of

the target distribution. Ideally, the training and target data come from the

same domain. However, our customers’ datasets exhibit all these variations

and more.

Label Taxonomy

A NuScenes image labeled according to three different customer-taxonomies, exhibiting naming and missing-label issues.


Even if the training data comes from the same underlying distribution as the

target data, the training labels must also be representative of the target

labels. However, all of our customers want their data labeled differently.

For a given dataset, the customer defines a set of rules specifying which

objects to label, what names to give them, how tight to make the bounding

boxes, etc. A taxonomy is the combination of all these rules.

Variations among customer-taxonomies pose several challenges for training

models.


The most obvious issue is that customer-taxonomies use different names for

labeling the same object classes. For example, some customers may label any

person as "Human" while others distinguish between "Pedestrianadult" and "Pedestrianchild." Over the course of my internship I’ve learned more nuanced labels

for people, vehicles, and signs than I’d like to admit.


We face an even greater challenge when customer-taxonomies label different

sets of object classes. For example, in the figure above, Taxonomy A only

labels people, Taxonomy C only labels vehicles, and Taxonomy B labels both

people and vehicles. When training an object detector for people and

vehicles, we’d want to use data from all three taxonomies to maximize

performance. However, images from Taxonomy A with unlabeled vehicles will

encourage the model not to detect vehicles while images from Taxonomy C with

unlabeled people will encourage the model not to detect people. This issue

is particularly problematic because these “missing labels” are systematic.

As we’ve mentioned in

Quantity is no Panacea, random labeling errors may be mitigated by increasing the quantity of

training data, but systematic errors will severely impact model performance.

Tackling the Taxonomy Dilemma


In this section, we’ll walk you through the steps we took to overcome these

challenges. Training across customer datasets should provide our models with

a large, varied training set to cope with the differences in data domain.

But we need some more innovative ideas to cope with differences in taxonomy.

For simplicity, we’ll focus on 2D object detection and use the SOTA object detection model EfficientDet, recently released by Google Brain. We’ll detail three different approaches we’ve

explored to tackle the taxonomy dilemma — using the model architecture, the

dataset, and the loss function. As you read on, you’ll see why modifying the

loss function via Taxonomy Loss Masking yields the best solution.

Approach 1: Separate Tasks

EfficientDet backbone with three pairs of taxonomy-specific classification

and regression heads (modified from

EfficientDet).


Our initial approach takes the perspective of multitask learning — we treat

prediction for each customer-taxonomy as a separate task. In this setting,

we have a shared backbone which generates a taxonomy-agnostic feature

embedding, and a pair of classification and regression heads for each of the

target customer-taxonomies (Figure above). To distinguish differences in

taxonomies, the taxonomy-specific heads are only trained on examples labeled

according to that customer-taxonomy. For training/inference on an example

from customer dataset A, the Taxonomy A classification and regression heads

perform the prediction.
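To make this concrete, here is a minimal PyTorch-style sketch of the multitask setup, not our production EfficientDet code: a shared backbone feeds one pair of classification and regression heads per customer-taxonomy, and only the heads matching the example’s taxonomy are used. The `backbone` module, the 256-channel single feature map, and the simple convolutional heads are illustrative assumptions.

```python
import torch.nn as nn

class MultiTaxonomyDetector(nn.Module):
    """Approach 1 sketch: shared backbone, per-taxonomy prediction heads."""

    def __init__(self, backbone, num_anchors, taxonomy_num_classes, feat_channels=256):
        super().__init__()
        self.backbone = backbone  # taxonomy-agnostic feature extractor (hypothetical)
        self.cls_heads = nn.ModuleDict()
        self.reg_heads = nn.ModuleDict()
        for tax_id, num_classes in taxonomy_num_classes.items():
            # One classification + box-regression head per customer-taxonomy.
            self.cls_heads[tax_id] = nn.Conv2d(feat_channels, num_anchors * num_classes, 3, padding=1)
            self.reg_heads[tax_id] = nn.Conv2d(feat_channels, num_anchors * 4, 3, padding=1)

    def forward(self, images, taxonomy_id):
        features = self.backbone(images)  # shared, taxonomy-agnostic embedding
        # Only the heads for this example's customer-taxonomy produce the prediction.
        return self.cls_heads[taxonomy_id](features), self.reg_heads[taxonomy_id](features)
```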


Unfortunately, like the naive approach of training a model for each dataset,

this multitask approach requires a “ramp time” before we’ve annotated enough

examples to train a new taxonomy-specific head. Additionally, the cost of

this approach scales with the number of customer-taxonomies because we have

to train a pair of heads for each. Furthermore, this multitask approach is

stymied by the classic problem in multitask learning where each head “fights

for capacity” (Andrej Karpathy, ICML 2019). There exist complex correlations between tasks so the backbone struggles

to learn a representation that satisfies them all. We observed this phenomenon in practice: the model achieved only mediocre performance.


This first approach uses the model architecture, namely taxonomy-specific heads, to cope with differences in customer-taxonomies. However, we neglect

to tell our model crucial information: customer-taxonomies are closely

related.

Super Taxonomy


The next two approaches rely on the notion of a “super-taxonomy.” Instead of

directly distinguishing between customer-taxonomies, we can define a general

“super-taxonomy” that encompasses them all. Then, we define a mapping from

labels in the customer-taxonomies to labels in the super-taxonomy. This

allows us to encode our priors about relationships between the

customer-taxonomies by treating each as a subset of the super-taxonomy.

The mapping from customer-taxonomies to the super-taxonomy.
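In code, this mapping can be as simple as a lookup table from each customer-taxonomy’s label names to super-taxonomy names. The taxonomy ids and class names below are made up for illustration; the real mappings are defined per customer.

```python
# Hypothetical super-taxonomy and per-customer label mappings.
SUPER_TAXONOMY = ["Person", "Vehicle", "Traffic Light"]

CUSTOMER_TO_SUPER = {
    "taxonomy_a": {"Human": "Person"},                         # labels people only
    "taxonomy_b": {"Pedestrian": "Person", "Car": "Vehicle"},  # labels people and vehicles
    "taxonomy_c": {"Car": "Vehicle", "Truck": "Vehicle"},      # labels vehicles only
}

def to_super_label(taxonomy_id: str, customer_label: str) -> str:
    """Map a customer label onto the shared super-taxonomy."""
    return CUSTOMER_TO_SUPER[taxonomy_id][customer_label]
```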


By mapping customer labels to labels in the super-taxonomy, the individual

customer datasets can be combined into a “super-dataset” with a consistent

naming scheme. However, the super-dataset suffers from the systematic

“missing label issue” we discussed in the Label Taxonomy section.


As a quick baseline, we tried training directly on this super-dataset. We

were optimistic because we read that “(randomly) dropping 30% of the

annotations… only drops (performance) by 5% on the PASCAL VOC dataset” (Soft Sampling for Robust Object Detection). But even after adding extra tricks like

Background Recalibration Loss, the resulting model had abysmal recall — likely because our missing

labels are systematic, not random.

Approach 2: Separate Datasets


In this second approach, we address the missing label problem by dividing

the super-dataset into separate class-specific super-datasets, one for each

label in the super-taxonomy. As shown in the Figure below, we eliminate the

missing label issue in each of the class-specific super-datasets by

including only customer-taxonomies which label the corresponding class. This

enables us to train a class-specific model for each class-specific

super-dataset. For inference, we simply run all of these class-specific

models on an image and combine their outputs.

Class-specific super-datasets only include customer taxonomies which label

the corresponding class. We train a single class-specific model for each

class-specific super-dataset. All of these models are combined for

inference.
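A rough sketch of how these class-specific super-datasets could be assembled, assuming annotations have already been mapped to super-taxonomy labels. The record format and `taxonomy_classes` table are illustrative, not our internal schema.

```python
def build_class_specific_datasets(images, taxonomy_classes):
    """
    images: list of dicts like {"taxonomy_id": "taxonomy_b", "image_id": "...",
            "annotations": [{"super_label": "Vehicle", "bbox": [x, y, w, h]}, ...]}
    taxonomy_classes: which super-classes each customer-taxonomy labels,
            e.g. {"taxonomy_a": {"Person"}, "taxonomy_b": {"Person", "Vehicle"}}
    Returns one dataset per super-class. An image is included only if its
    customer-taxonomy labels that class, so the class is never systematically
    missing, and only that class's boxes are kept as targets.
    """
    all_classes = set().union(*taxonomy_classes.values())
    datasets = {cls: [] for cls in all_classes}
    for img in images:
        for cls in taxonomy_classes[img["taxonomy_id"]]:
            boxes = [a for a in img["annotations"] if a["super_label"] == cls]
            datasets[cls].append({"image_id": img["image_id"], "annotations": boxes})
    return datasets
```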


This approach yields good performance because labels within each dataset are

consistent and each model learns a good general representation for its

class. Using so many models would be impractical for most of our

self-driving customers due to the constrained computational resources and

strict latency requirements associated with perception. However, these

limitations don’t apply to Scale AI because labeling operates on a much

longer timescale than perception. Using separate models has the added

benefit of decoupling performance across classes — we can re-train our

Vehicle model without affecting the performance on any other class.


In this approach, the cost of training and inference scales with the number

of classes in our super-taxonomy, rather than the number of

customer-taxonomies. This is a marked improvement over the multitasking

approach if we have a small super-taxonomy. However, our super-taxonomies

are often fairly large. This approach is undesirable because it requires

having many separate models that should theoretically be able to share

features. We’d prefer to use just a single model.


Approach 3: Separate Loss - Taxonomy Loss Masking


What if, instead of telling our model about differences in

customer-taxonomies through the architecture or through separate datasets,

we make this distinction in the loss function? To answer this question,

let’s take a closer look at how our model is trained.

Focal Loss mitigates

class imbalance by re-weighting binary cross entropy loss.


Single stage detectors, like EfficientDet, detect objects by making

predictions at many predefined anchor boxes. Let A be the number of anchor

boxes and K be the number of target object classes. The EfficientDet

classifier head predicts a matrix of probabilities P ∈ ℝ^(A×K), where P_ij is the probability that there is an instance of class j at anchor i. The model is trained using
at anchor i. The model is trained using

Focal Loss, a standard

loss function for mitigating extreme class imbalance in single stage

detectors. As demonstrated in the Figure below, this loss is applied to

predictions for “positive” and “negative” anchors. The loss is not applied

to predictions for “unassigned” anchors — these values are masked during

training because they provide ambiguous signals to the model.
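As a reference point, here is a simplified, element-wise Focal Loss in PyTorch. It returns a per-anchor, per-class loss matrix so the caller can zero out “unassigned” anchors before reducing; the α/γ defaults follow the Focal Loss paper, but the shapes and the lack of any reduction are simplifications rather than EfficientDet’s exact implementation.

```python
import torch

def focal_loss(probs, targets, alpha=0.25, gamma=2.0):
    """Element-wise Focal Loss.
    probs:   (A, K) predicted probabilities P, one row per anchor box.
    targets: (A, K) binary ground truth for positive/negative anchors.
    Returns an (A, K) loss matrix; the caller masks unassigned anchors
    (and, later, missing classes) before summing.
    """
    p_t = probs * targets + (1 - probs) * (1 - targets)       # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)   # class-balance weight
    ce = -torch.log(p_t.clamp(min=1e-8))                      # binary cross entropy
    # (1 - p_t)^gamma down-weights easy examples so the huge number of easy
    # negative anchors doesn't dominate training.
    return alpha_t * (1 - p_t) ** gamma * ce
```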


We realized that, because the detector is class-aware (as opposed to the

more common class-agnostic Region Proposal Network in two-stage detectors),

we can use a similar loss masking scheme to deal with the taxonomy dilemma.

In addition to masking loss coming from anchors (rows) based on IoU, we mask

loss coming from target object classes (columns) if the class is “missing”

in the corresponding customer-taxonomy. If a training example comes from a

customer-taxonomy that doesn’t include "Vehicle," we know "Vehicle" objects

are not labeled so the probability of "Vehicle" for every anchor box is

ignored and the corresponding loss values are masked. We define a “missing

label mask” ∈ {0,1}^K for each customer-taxonomy that tells us

which classes (columns) to mask when training on an example from that

customer-taxonomy. During inference, the model makes predictions for all

classes. We call this Taxonomy Loss Masking.

Left: Masking in Focal Loss. We depict P ∈ ℝ^(A×K), the matrix of probabilities predicted by the classification head, where P_ij is the probability that there is an instance of class j at anchor i. We show the anchors (rows) sorted by IoU with the nearest ground truth annotation: green rows have IoU ≥ 0.5 and red rows have IoU < 0.4. Focal Loss

applies to these “positive” and “negative” anchors respectively. However,

anchors with IoU in [0.4, 0.5) are neither positive nor negative.

They are “unassigned” and their loss is masked. Right: Taxonomy Loss

Masking. When training on an example from a customer-taxonomy which only

labels Traffic Light and Person, the loss for all other classes (columns)

is masked.
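Concretely, Taxonomy Loss Masking can be implemented as two broadcasted masks over the focal loss matrix sketched above: a per-anchor (row) mask from IoU assignment and a per-class (column) “missing label mask” from the example’s customer-taxonomy. The class list, mask-building helper, and normalization below are a hedged sketch, not our exact training code.

```python
import torch

# Hypothetical super-taxonomy ordering; column j of P corresponds to SUPER_CLASSES[j].
SUPER_CLASSES = ["Person", "Vehicle", "Traffic Light"]

def missing_label_mask(labeled_classes):
    """Per-taxonomy mask in {0,1}^K: 1 if the customer-taxonomy labels the
    super-class, 0 if that class is systematically missing."""
    return torch.tensor([1.0 if c in labeled_classes else 0.0 for c in SUPER_CLASSES])

def taxonomy_masked_focal_loss(probs, targets, anchor_mask, class_mask):
    """
    probs, targets: (A, K), as in the focal loss sketch above.
    anchor_mask: (A,)  1 for positive/negative anchors, 0 for unassigned ones.
    class_mask:  (K,)  missing-label mask for the example's customer-taxonomy.
    """
    per_element = focal_loss(probs, targets)                   # (A, K), from the sketch above
    # Zero out unassigned anchors (rows) and unlabeled classes (columns).
    masked = per_element * anchor_mask[:, None] * class_mask[None, :]
    return masked.sum() / anchor_mask.sum().clamp(min=1.0)     # simplified normalization

# Example: a taxonomy that labels only Traffic Light and Person.
# missing_label_mask({"Person", "Traffic Light"})  ->  tensor([1., 0., 1.])
```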


In essence, Taxonomy Loss Masking is multitasking across super-taxonomy

classes without any task-specific parameters — using a fully-shared

architecture, and only altering the loss function. This approach leverages

maximum information about our priors: the taxonomies are closely related,

some taxonomies are systematically “missing labels,” and the model should be

able to share features across classes. Not only is this approach

simpler than the previous two, but it also yields

excellent performance, it’s cheaper, and it’s

more scalable.


Although we primarily focused on 2D object detection for the domain of

driving images, we can use variants of Taxonomy Loss Masking across domains

and across task types including 3D object detection, 2D/3D semantic

segmentation, etc.

In Case You're Wondering

“The model only predicts labels in the super-taxonomy, not the actual

customer-taxonomies. Isn’t this incomplete?”

Good observation. We can predict labels in customer-taxonomies by using a

hierarchy of classifiers and a hierarchical super-taxonomy. Each level of

the classifier hierarchy will make a more granular prediction in the

super-taxonomy and use a different “missing label mask.” The classifiers can

even predict object attributes: e.g. whether a vehicle is parked or occluded

(Scale AI offers attribute labeling too)!
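One way to picture this is a hierarchical super-taxonomy as a tree, with a per-node classifier (and its own missing-label mask) that refines a coarse detection into a customer-level label or attribute. The hierarchy contents and classifier interface below are purely illustrative.

```python
# Hypothetical hierarchical super-taxonomy: coarse detections are refined by
# per-class classifiers, each trained with its own missing-label mask.
HIERARCHY = {
    "Person":  ["Pedestrian Adult", "Pedestrian Child"],
    "Vehicle": ["Car", "Truck", "Motorcycle"],
}

def refine(coarse_label, crop, fine_classifiers):
    """fine_classifiers: hypothetical dict mapping a coarse class to a
    classifier that picks one of its children from an image crop."""
    children = HIERARCHY.get(coarse_label)
    if not children:
        return coarse_label          # leaf class: nothing finer to predict
    return fine_classifiers[coarse_label](crop, children)
```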


But remember, our models are used to augment and accelerate our Taskers. For

Taskers doing object detection, the most difficult part of labeling is

detection. Classification is very easy because it’s just a multiple choice

question. Although we try, we don’t actually need to solve the entire

problem; we only need to focus on the expensive part — detection. I found

this distinction particularly interesting and important during my

internship.

“Are there other differences in customer taxonomies that you have to

deal with?”

Some customers have different labeling rules. For example, some want

bounding boxes for vehicles to include the side mirrors while others want

just the main body of the vehicle. We can address this issue by weighting

the regression loss based on labeling rules. Let’s say we want the model to

include side mirrors. If we’re training on data from Taxonomy A which

includes side mirrors and Taxonomy B which doesn’t, we’ll weight the

regression loss higher for instances from Taxonomy A. Another way to teach

the model to include side mirrors is to oversample data from Taxonomy A

during training, or to fine-tune the model on Taxonomy A after training.

Differences in labeling rules. The left bounding box includes side

mirrors. The right does not.
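A sketch of the loss-weighting idea, assuming a per-box regression loss (e.g. smooth L1) has already been computed. The weights and taxonomy ids are illustrative; oversampling or fine-tuning on Taxonomy A are the alternatives mentioned above.

```python
import torch

# Illustrative weights: emphasize taxonomies whose boxes follow the rule we
# want the model to learn (here, including side mirrors).
REGRESSION_WEIGHT = {"taxonomy_a": 2.0,   # boxes include side mirrors
                     "taxonomy_b": 1.0}   # boxes exclude side mirrors

def weighted_regression_loss(per_box_loss, taxonomy_ids):
    """per_box_loss: (N,) regression loss per matched box.
    taxonomy_ids: length-N list of the customer-taxonomy each box came from."""
    weights = torch.tensor([REGRESSION_WEIGHT[t] for t in taxonomy_ids],
                           device=per_box_loss.device)
    return (weights * per_box_loss).sum() / weights.sum()
```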

Operation Vacation

Scale’s machine learning models supercharge our data annotation cycle. As

long as our customers continue sending us data and our Taskers continue

labeling, our models will continue improving, accelerating the labeling

process, and perpetuating this virtuous cycle.


As we mentioned earlier, it is important that our models are trained on as

much data as possible. The quantity of our

training data grows along two primary dimensions - the number of

customer datasets and the time we spend labeling these datasets. Taxonomy

Loss Masking enabled us to scale our model’s training data across the

customer dimension. Since our Taskers are continuously labeling

data and the size of our datasets grows over time, it’s important that we

also scale across the time dimension. In other words, we should

continue training our models as we get more labeled data.


In a

recent collaboration with PyTorch, Scaliens

Daniel Havir and

Nathan Hayflick

demonstrated how we use asynchronous data streaming to train on large,

growing datasets. This technique, along with innovative cloud infrastructure

and distributed training, enables us to automatically train models as we

accumulate more data. We use this system to push the limits of model

performance for both training on super-taxonomies, and fine-tuning on

individual customer datasets.
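The collaboration linked above goes into detail; as a rough illustration of the streaming idea, an IterableDataset can pull batches of newly labeled examples from storage instead of indexing a fixed list, so training keeps pace with a growing dataset. `fetch_annotation_batches` is a hypothetical generator over our annotation store, not a real Scale API.

```python
from torch.utils.data import IterableDataset, DataLoader

class StreamingAnnotations(IterableDataset):
    """Stream (image, target) records as they are fetched, rather than
    materializing the whole (ever-growing) dataset up front."""

    def __init__(self, fetch_annotation_batches, transform):
        self.fetch_annotation_batches = fetch_annotation_batches  # hypothetical generator
        self.transform = transform

    def __iter__(self):
        for batch in self.fetch_annotation_batches():
            for record in batch:
                yield self.transform(record)

# loader = DataLoader(StreamingAnnotations(fetch_batches, to_tensors), batch_size=16)
```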


Daniel and Nathan also demonstrated how we use hashing to achieve a

consistent train/test split when we have a growing dataset. This means that

our train and test sets should grow at the same rate. Holding the test set

constant, we’d expect increased model performance as we continue training on

a growing train set. On the other hand, holding the model constant, we can

examine performance on cohorts of older vs. newer examples in our test set.

Changes in model performance along this dimension indicate drift in the data

distribution over time, meaning that our customer is sending us different

data. This is an example of how we can automatically monitor model

performance.
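The hashing trick mentioned above can be sketched in a few lines: hash a stable example id into a bucket and derive the split from the bucket, so an example’s assignment never changes as new data arrives and the train and test sets grow proportionally. The id format and the 10% test fraction are assumptions for illustration.

```python
import hashlib

def split_for(example_id: str, test_fraction: float = 0.1) -> str:
    """Deterministically assign an example to 'train' or 'test' from its id."""
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 1000                  # stable bucket in [0, 1000)
    return "test" if bucket < test_fraction * 1000 else "train"
```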


In the same vein as Andrej Karpathy’s “Operation Vacation”, we use these training and monitoring systems to automate the model

learning process. This saves engineering hours and enables our Machine

Learning team to focus on finding new ways to enhance our annotation

pipeline with ML. As long as our customers continue sending us data and our

Taskers continue labeling, we can “take a vacation.” Our models will

continue improving, accelerating the labeling process, and perpetuating this

virtuous cycle.


Ultimately, scalable machine learning helps us continuously improve our

labeling quality and efficiency while striving toward our mission: “To

accelerate the development of AI by democratizing access to intelligent

data.”

