Delivering high-quality data labels quickly is a challenging task. Even for humans, image and video data commonly found in self-driving or robotic applications can be compositionally challenging to parse and correctly annotate. At Scale, we use machine learning to augment human workflows and improve both the quality and the speed of labeling. Since deep learning models can struggle to perform consistently across diverse data domains like self-driving car scenes, finding the right balance between ML automation and human supervision is necessary to maintain high quality.
Autosegment, an ML-powered tool that can accurately segment instances within a human-defined bounding box, allows us to delegate high-level reasoning about semantics and taxonomy to Taskers (human labelers) and basic pixel-level processing to an ML model. This division of labor enables us to build a general-purpose, accurate ML labeling tool. Despite using a simple neural network, Autosegment outperforms prior tools.
The Challenge of Semantic Segmentation Labeling
Semantic segmentation requires human Taskers to identify and label every pixel in a visual scene, a notoriously time-consuming and tedious process. Most Tasker time in this manual approach, about 95%, is consumed by time-intensive work such as carefully tracing the outlines of objects in a scene. For example, it takes roughly one minute to trace the outline of a car and only three seconds to click on that outline and select the label “Car.” What if we could instead use machine learning to automate the most laborious parts of semantic segmentation? What if an ML model did the segmentation, and the Tasker only had to select the right label?
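The 95% figure follows directly from the car example. Here is a minimal sketch of the arithmetic, assuming the one-minute and three-second timings quoted above are representative of a typical object (the constant and function names are illustrative only):

```python
# Per-object timings quoted in the text (assumed representative).
TRACE_SECONDS = 60.0  # tracing the outline of a car
LABEL_SECONDS = 3.0   # clicking the outline and selecting "Car"


def tracing_share(trace_s: float, label_s: float) -> float:
    """Fraction of per-object time spent tracing rather than labeling."""
    return trace_s / (trace_s + label_s)


share = tracing_share(TRACE_SECONDS, LABEL_SECONDS)
print(f"Tracing share of per-object time: {share:.1%}")
```

With these numbers the share is 60 / 63, or roughly 95%, which is why automating the tracing step is where most of the time savings are.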
As ML engineers, we rely heavily on quantitative validation metrics, such as label accuracy and intersection-over-union (IoU), to determine the quality of our models. However, when building models that optimize for a particular business outcome (for instance, reducing the time it takes Taskers to semantically segment images), we need to think critically about which model to use and how we measure its performance.
In this example, our objective is to produce predictions that streamline the Taskers’ job. This is not necessarily the same as “best model performance.” For example, an instance segmentation model might be consistently inaccurate by just a few pixels around an object. This model might have a high IoU score but little practical value, because Taskers would need to redo every trace of the outline.
Conversely, a model that segments part of an object with pixel-accurate edges, even if it captures only half of the object, would still be extremely useful. Despite having lower IoU validation scores, it still improves the overall workflow. Because it takes so long to trace outlines manually, having a model produce pixel-accurate partial outlines is a significant time saver.
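The gap between IoU and practical usefulness can be illustrated on a toy example. The sketch below compares a mask shifted by three pixels with a mask that covers exactly half of the object. The `boundary_precision` function is a hypothetical, strict exact-match metric written for this illustration. It is not necessarily the edge precision score reported later in this post.

```python
import numpy as np


def iou(truth, pred):
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(truth, pred).sum()
    union = np.logical_or(truth, pred).sum()
    return inter / union


def boundary(mask):
    """Mask pixels that have at least one 4-neighbor outside the mask."""
    padded = np.pad(mask, 1, constant_values=False)
    interior = (
        padded[1:-1, 1:-1]
        & padded[:-2, 1:-1] & padded[2:, 1:-1]
        & padded[1:-1, :-2] & padded[1:-1, 2:]
    )
    return mask & ~interior


def boundary_precision(truth, pred):
    """Share of predicted boundary pixels that land exactly on the true boundary."""
    pred_edge = boundary(pred)
    return (pred_edge & boundary(truth)).sum() / pred_edge.sum()


# Toy ground truth: a 60x60 square object in a 100x100 scene.
truth = np.zeros((100, 100), dtype=bool)
truth[20:80, 20:80] = True

# Prediction A: the whole object, but offset by three pixels.
shifted = np.zeros_like(truth)
shifted[23:83, 23:83] = True

# Prediction B: only the left half of the object, with exact edges.
half = np.zeros_like(truth)
half[20:80, 20:50] = True

for name, pred in (("shifted", shifted), ("half", half)):
    print(name, round(iou(truth, pred), 3), round(boundary_precision(truth, pred), 3))
```

In this toy setup the shifted mask scores the higher IoU, yet almost none of its outline can be kept as is. The half mask has an IoU of only 0.5, but most of its outline is reusable and the Tasker only needs to complete the missing side.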
The Active Tooling Paradigm and the Autosegment Tool
Autosegment is a machine learning tool that segments the most salient object within a human-provided box annotation. It incorporates Tasker feedback at annotation time in a way that is flexible and iterative.
When building Autosegment, we realized early on that the only way to provide flexibility and high quality was to give Taskers access to Active Tools.
Active Tools let Taskers make changes to the model inputs and parameters in real time to generate the best output for their use case. This is similar to “prompt engineering,” the process by which we can coerce a neural language model like GPT-3 to generate desired text when given precise prompts. For more information, OpenAI hosts excellent documentation on how to design prompts for their GPT-3 model.
Using Active Tools, we give Taskers as many options as possible to make their tasks easier. For example, they can adjust the annotation box shape and the prediction confidence threshold to configure the model to speed up their particular labeling task.
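To make the confidence threshold concrete, here is a minimal sketch of how a per-pixel probability map could be turned into a mask at different thresholds. The probability map is synthetic and the function name `mask_from_probabilities` is hypothetical. The actual parameters exposed by Autosegment are not described in this post.

```python
import numpy as np


def mask_from_probabilities(probs, threshold=0.5):
    """Keep pixels whose predicted foreground probability meets the threshold."""
    return probs >= threshold


# Synthetic probability map: confidence falls off with distance from the center.
yy, xx = np.mgrid[0:64, 0:64]
dist = np.hypot(yy - 32, xx - 32)
probs = np.clip(1.0 - dist / 32.0, 0.0, 1.0)

# A lower threshold yields a larger, more permissive mask. A higher threshold
# yields a smaller, more conservative one.
for t in (0.3, 0.5, 0.7):
    pixels = int(mask_from_probabilities(probs, t).sum())
    print(f"threshold={t}: {pixels} foreground pixels")
```

Exposing a knob like this lets a Tasker trade coverage for edge confidence on a per-object basis instead of relying on a single global setting.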
Enabling humans to freely explore the output space of the model is a key advantage of active tooling. We have already seen this pay off in practice.
Overall, we have observed that the median annotation time per scene is about 30% shorter with Autosegment. We also interviewed Taskers to see how they are using the tool. Although our model is trained to segment cars and motorcycles, one Tasker used the tool to segment everything. This wasn’t a workflow we had envisioned, but it halved their segmentation time per scene, from 40 minutes to 20 minutes. In a discussion we had with five Taskers, they explained that they consider Autosegment to be more accurate and efficient than previous ML solutions. Now we can simply supply robust and general online models and let Taskers optimize the rest.
The Autosegment Model
Autosegment is a convolutional neural network trained to segment the most salient instance within a Tasker-defined bounding box. This simple design allows us to significantly reduce our customers’ operational overhead to deploy ML, while providing pixel-perfect instance segmentations, even on previously unseen instance types.
The Autosegment model does not have to be particularly complex to produce compelling results. Autosegment has no notion of different semantic types; instead, it abstractly identifies and segments based on scene composition.
The model is trained to always generate a segmentation. The training data includes a few key classes like pedestrians and vehicles. The task is binary; it simply determines whether each pixel belongs to the salient object or not. We don’t train the model with negative samples. Instead, the model makes its best guess on each pixel to generate a segmentation.
We define the salient instance as the object that owns the central pixel of the crop, so even if the scene is cluttered, the model can identify the target instance to segment. This binary segmentation task across instance classes and types allows the model to generalize well on new classes and taxonomies. Indeed, we can transfer not only between objects of the same class, such as cars in different countries, but between different classes altogether — for instance, moving from cars to motorcycles to street signs. This transfer can be achieved with limited retraining or, in some cases, no retraining at all, achieving zero-shot transfer.
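The center-pixel rule can be sketched as follows. This is an illustration under stated assumptions, not the actual training pipeline: it assumes an integer instance-ID map in which 0 means background, a box given as `(top, left, bottom, right)` with exclusive bottom and right edges, and a hypothetical helper named `salient_target`. Rejecting boxes whose center falls on background is also an assumption of this sketch.

```python
import numpy as np


def salient_target(instance_ids, box):
    """Binary target for a crop: True where a pixel belongs to the instance
    that owns the crop's central pixel, False everywhere else."""
    top, left, bottom, right = box
    crop = instance_ids[top:bottom, left:right]
    center_id = crop[crop.shape[0] // 2, crop.shape[1] // 2]
    if center_id == 0:
        # Assumption of this sketch: the central pixel must lie on an instance.
        raise ValueError("central pixel of the crop is background")
    return crop == center_id


# Toy scene with two instances. Instance 2 partly covers instance 1.
ids = np.zeros((8, 8), dtype=int)
ids[2:7, 1:6] = 1
ids[0:3, 5:8] = 2

target = salient_target(ids, (0, 0, 8, 8))
print(target.astype(int))
```

Because the target depends only on which instance sits at the center, not on its class, the same rule applies unchanged to cars, motorcycles, or street signs.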
Using salient object detection on specific crops of the scene can generate even more accurate segmentation. We were able to double the edge precision score from 0.26 to 0.52 when compared to off-the-shelf full-scene semantic segmentation models.
Instead of supplying a downsampled version of the whole image to a neural network, we feed the selected region to the model at full resolution. Because the input is smaller, fewer pooling operations are required to achieve a global receptive field. As a result, there is less downsampling and blurring, and our model achieves better fidelity around the edges of instances, a crucial part of providing high-quality data annotations to our customers. Improving full-scene labeling to the same degree would require far more effort and complexity.
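A back-of-the-envelope sketch of this trade-off is shown below. It is a simplification: it counts only 2x pooling stages and ignores the receptive-field growth contributed by convolutions. All sizes are hypothetical and are not Autosegment's actual configuration.

```python
import math


def pooling_stages(side_px, factor=2):
    """Number of pooling stages (each shrinking the side by `factor`)
    needed to reduce an input of the given side length to one pixel."""
    stages = 0
    while side_px > 1:
        side_px = math.ceil(side_px / factor)
        stages += 1
    return stages


def object_width_at_input(object_px, image_px, network_input_px):
    """Width of an object after the image is resized to fit the network input.
    Images that already fit are left at full resolution."""
    scale = min(1.0, network_input_px / image_px)
    return object_px * scale


# Hypothetical sizes: a 2048 px wide scene versus a 256 px wide crop.
print("stages for full scene:", pooling_stages(2048))
print("stages for crop:", pooling_stages(256))

# A 200 px wide object keeps its full width in a crop that fits the network
# input, but shrinks when the whole scene is resized down to that input size.
print("object width, full scene:", object_width_at_input(200, 2048, 512))
print("object width, crop:", object_width_at_input(200, 256, 512))
```

Under these assumptions, the crop needs fewer pooling stages to cover its whole input, and the object reaches the network with every original pixel along its edges intact.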
In summary, we developed Active Tooling to create an ML-based workflow capable of solving basic computer vision problems. This workflow helps our Taskers reach higher levels of quality faster than in previous workflows.
Innovations like Autosegment are critical to enabling the “Moore's Law of data labeling.” At Scale AI, we’re excited to push the boundaries of human-AI interaction for our customers.