
Faster Semantic Segmentation with Machine Learning-Assisted Workflows


Why build an ML-based workflow for data labeling?


Delivering high-quality data labels quickly is a challenging task. Even for

humans, image and video data commonly found in self-driving or robotic

applications can be compositionally challenging to parse and correctly

annotate. At Scale, we use machine learning to augment human workflows, improving both the quality and the speed of labeling. Since deep learning models can struggle to perform consistently across diverse data domains like self-driving car scenes, finding the right balance between ML automation and human supervision is necessary to maintain high quality.

Autosegment, an ML-powered tool that can accurately segment instances

within a human-defined bounding box, allows us to delegate high-level

reasoning about semantics and taxonomy to Taskers (human labelers) and basic

pixel-level processing to an ML model. This division of labor enables us to

build a general-purpose and accurate ML labeling tool. Despite using a simple

neural network, Autosegment outperforms prior tools.


The Challenge of Semantic Segmentation Labeling


Semantic segmentation involves human Taskers identifying and labeling every

pixel in a visual scene — a notoriously time-consuming and tedious process.

In this manual approach, about 95% of Tasker time goes to painstaking work such as carefully tracing the outlines of objects in a scene. For example, it takes roughly one minute to trace the outline of a car but only three seconds to click on that outline and select the label “Car.”

What if we could instead use machine learning to automate the most laborious

parts of semantic segmentation? What if an ML model did the semantic

segmentation, and the Tasker only had to select the right label?

Semantic Segmentation Task

A visual scene that has been semantically segmented. Each pixel is color-coded to denote the class of object it represents, e.g., Pedestrian, Road, or Bike.



As ML engineers, we rely heavily on quantitative validation metrics, such as

label accuracy and intersection-over-union (IoU), to determine the quality

of our models. However, when building models that optimize for a particular

business outcome — for instance, reducing the time it takes Taskers to

semantically segment images — we need to think critically about which model

to use and how we measure its performance.
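For binary masks, IoU is simply the overlap between prediction and ground truth divided by their union. A minimal NumPy sketch:

```python
import numpy as np

def iou(pred: np.ndarray, target: np.ndarray) -> float:
    """Intersection-over-union between two binary masks of the same shape."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return float(intersection) / float(union) if union > 0 else 1.0
```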


In this example, our objective is to produce predictions that streamline the

Taskers’ job. This is not necessarily the same as “best model performance.”

For example, an instance segmentation model might be consistently inaccurate

by just a few pixels around an object. This model might have a high IoU

score but very little practical usefulness, because Taskers would need to redo every traced outline.


Conversely, a model that segments an object with pixel-perfect edges, even if it identifies only half of the object, is still extremely useful. Despite its lower IoU validation score, it still improves the overall workflow.

Because it takes so long to trace outlines manually, having a model produce

pixel-accurate partial outlines is still a significant time saver.


The Active Tooling Paradigm and the Autosegment Tool

Autosegment is a machine learning tool that segments the most

salient object within a human-provided box annotation. It incorporates

Tasker feedback at annotation time in a way that is flexible and iterative.


When building Autosegment, we realized early on that the only way

to provide flexibility and high quality was to give Taskers access to Active

Tools.


Active Tools let Taskers make changes to the model inputs and parameters in

real time to generate the best output for their use case. This is similar to

“prompt engineering,” the process of coaxing a neural language model like GPT-3 into generating the desired text with precisely crafted prompts. For

more information, OpenAI hosts

excellent documentation

on how to design prompts for their GPT-3 model.


OpenAI GPT-3 Model

Humans working alongside models can produce better results faster than

models alone



Using Active Tools, we give Taskers as many options as possible to make

their tasks easier. For example, they can adjust the annotation box shape

and the prediction confidence threshold to configure the model to speed up

their particular labeling task.
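As a rough illustration (a hypothetical helper, not the tool's actual implementation), the confidence-threshold control amounts to re-binarizing the model's per-pixel probabilities on the fly:

```python
import numpy as np

def mask_from_probs(probs: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Hypothetical sketch: turn per-pixel probabilities into a mask.

    The Tasker tunes `threshold` live: lowering it grows the mask to
    recover missed pixels; raising it trims false positives.
    """
    return (probs >= threshold).astype(np.uint8)
```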


Autosegment Tool Demo
A demo of the Autosegment tool.



Enabling humans to freely explore the model's output space is a key advantage of Active Tooling, and we have already seen this pay off in practice.


Overall, we have observed that the median annotation time per scene is about 30% lower with Autosegment. We also interviewed Taskers to see how

they are using the tool. While our model is trained to segment cars and

motorcycles, one Tasker used the tool to segment everything in the scene. Although this wasn’t a workflow we had envisioned, it halved their segmentation time per scene, from 40 minutes to 20 minutes. In a discussion we had with

five Taskers, they explained that they consider Autosegment to be more

accurate and efficient than previous ML solutions. Now we can simply supply

robust and general online models and let Taskers optimize the rest.

The Autosegment Model

Autosegment is a convolutional neural network trained to segment

the most salient instance within a Tasker-defined bounding box. This simple

design significantly reduces the operational overhead of deploying ML for our customers while providing pixel-perfect instance segmentations, even on

previously unseen instance types.


The Autosegment model does not have to be particularly complex to

produce compelling results. Autosegment has no notion of different semantic

types; instead, it abstractly identifies and segments based on scene

composition.


The model is trained to always generate a segmentation. The training data

includes a few key classes like pedestrians and vehicles. The task is

binary; it simply determines whether each pixel belongs to the salient

object or not. We don’t train the model with negative samples. Instead, the

model makes its best guess on each pixel to generate a segmentation.
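A minimal PyTorch-style sketch of this per-pixel binary objective (the model, tensor shapes, and training_step helper here are illustrative placeholders, not our production system):

```python
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def training_step(model, crop, salient_mask, optimizer):
    """One step of the binary objective: salient object vs. everything else.

    crop:         (B, 3, H, W) image crop around a bounding box
    salient_mask: (B, 1, H, W) binary target, 1 where the salient object is
    """
    logits = model(crop)  # any fully convolutional net emitting one logit per pixel
    loss = bce(logits, salient_mask.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```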


We define the salient instance as the object that owns the central pixel of

the crop, so even if the scene is cluttered, the model can identify the

target instance to segment. This binary segmentation task across instance

classes and types allows the model to generalize well on new classes and

taxonomies. Indeed, we can transfer not only between objects of the same

class, such as cars in different countries, but between different classes

altogether — for instance, moving from cars to motorcycles to street signs.

This transfer can be achieved with limited retraining or, in some cases, no

retraining at all, achieving zero-shot transfer.
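Given instance-labeled training data, the binary target for a crop follows directly from this definition. A sketch, assuming each crop comes with a per-pixel instance-ID map:

```python
import numpy as np

def salient_target(instance_ids: np.ndarray) -> np.ndarray:
    """Binary training target for a crop, given per-pixel instance IDs.

    The salient instance is whichever instance owns the central pixel,
    so even a cluttered crop yields an unambiguous target mask.
    """
    h, w = instance_ids.shape
    center_id = instance_ids[h // 2, w // 2]
    return (instance_ids == center_id).astype(np.uint8)
```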




Using salient-object detection on specific crops of the scene generates even more accurate segmentations. Compared with off-the-shelf full-scene semantic segmentation models, we doubled the edge-precision score, from 0.26 to 0.52.
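One way to compute an edge-precision score (sketched here as an illustration; not necessarily the exact metric behind the numbers above) is the fraction of predicted boundary pixels that fall within a small tolerance of a ground-truth boundary pixel:

```python
import numpy as np
from scipy import ndimage

def boundary_precision(pred: np.ndarray, target: np.ndarray, tol: float = 2.0) -> float:
    """Fraction of predicted boundary pixels within `tol` px of the
    ground-truth boundary (one common edge-precision formulation)."""
    def boundary(mask: np.ndarray) -> np.ndarray:
        mask = mask.astype(bool)
        return mask & ~ndimage.binary_erosion(mask)

    pred_b, target_b = boundary(pred), boundary(target)
    if pred_b.sum() == 0 or target_b.sum() == 0:
        return 0.0
    # Distance from every pixel to the nearest ground-truth boundary pixel.
    dist = ndimage.distance_transform_edt(~target_b)
    return float((dist[pred_b] <= tol).mean())
```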


Instead of supplying a downsampled version of the whole image to a neural

network, the selected region is fed to the model at full resolution. Because

the input is smaller, fewer pooling operations are required to achieve a global receptive field. As a result, there is less downsampling and

blurring. Our model achieves better fidelity around the edges of instances,

a crucial part of providing high-quality data annotations to our customers.

Improving full-scene labeling to the same degree would require far more effort and complexity.
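A hypothetical sketch of the two input paths makes the difference concrete:

```python
import numpy as np

def full_scene_input(image: np.ndarray, size: int = 512) -> np.ndarray:
    """Typical full-scene path: shrink the whole image to a fixed input
    size, discarding fine edge detail (naive subsampling for illustration)."""
    h, w = image.shape[:2]
    ys = np.linspace(0, h - 1, size).astype(int)
    xs = np.linspace(0, w - 1, size).astype(int)
    return image[np.ix_(ys, xs)]

def crop_input(image: np.ndarray, box: tuple) -> np.ndarray:
    """Autosegment-style path: feed the Tasker's box at native resolution,
    so the network sees object boundaries at full fidelity."""
    x0, y0, x1, y1 = box
    return image[y0:y1, x0:x1].copy()
```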


In summary, we developed Active Tooling to create an ML-based workflow

capable of solving basic computer vision problems. This workflow helps our

Taskers reach higher levels of quality faster than in previous workflows.


Innovations like Autosegment are critical to enable the “Moore's

Law of data labeling.” At Scale AI, we’re excited to push the boundaries of

human-AI interaction for our customers.

