Within the field of computer vision, semantic segmentation is a close approximation to describing the visual world the way humans perceive it: entities with precise boundaries. For use cases where nuanced scene understanding is mission-critical, like autonomous vehicles and intelligent robots, building models requires massive amounts of highly accurate segmentation masks with pixel-level precision.
The exactness required near object edges that makes semantic segmentation valuable also makes it incredibly time-consuming for humans to annotate. Given that taskers annotate thousands of images per day, we wanted to accelerate the annotation process without losing fine-grained precision. To accomplish this, we developed a feature called Autosegment: an annotator draws a box around an object, and a segmentation mask is automatically generated. The annotator can then adjust the mask before marking it as complete. Under the hood, the boxed region of the image is sent through a model, which automatically identifies and segments the object of interest.
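At a high level, that flow can be sketched as follows. This is a minimal illustration only; the function name, padding fraction, and model interface are assumptions, not Scale's actual implementation:

```python
import numpy as np

def autosegment(image, box, model, pad_frac=0.1):
    """Illustrative Autosegment flow: slightly expand the annotator's box,
    crop that region, run the salient-object model on the crop, and paste
    the predicted mask back into a full-size mask."""
    x0, y0, x1, y1 = box
    h, w = image.shape[:2]
    # Expand the box by a small margin so the object doesn't touch the edges.
    px = int((x1 - x0) * pad_frac)
    py = int((y1 - y0) * pad_frac)
    x0, y0 = max(0, x0 - px), max(0, y0 - py)
    x1, y1 = min(w, x1 + px), min(h, y1 + py)
    crop = image[y0:y1, x0:x1]
    mask_crop = model(crop)  # boolean mask with the crop's height and width
    full_mask = np.zeros((h, w), dtype=bool)
    full_mask[y0:y1, x0:x1] = mask_crop
    return full_mask
```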
The Autosegment functionality was originally developed for autonomous vehicle data, which is a core customer use case at Scale. (You can read more about our original Autosegment tool in our previous blog post.) When it was first launched, we saw a 30% decrease in the time it took to produce segmentation masks. However, the use cases for Autosegment have expanded greatly since its introduction, with AV data now constituting only one of many domains for which Scale produces segmentation data. While the underlying model did generalize beyond the autonomous vehicle domain, we observed that it performed noticeably better on objects like street signs, vehicles, and pedestrians than on other types of classes.
This article contains a technical deep dive into the machine learning model that powers Autosegment. We’ll walk through some of the choices we made during the development process as well as our recent effort to overhaul the model for improved generalizability. In particular, we will discuss the data-centric approach we took to boost the usefulness of Autosegment across a wider range of use cases.
The segmentation model developed to power this tool addresses a problem known as salient object detection, wherein the goal is to identify and segment the most “salient,” or prominent, object in an image. There is inherent subjectivity in this task because an image can easily contain more than one prominent item. The intention behind Autosegment is that the user draws an approximate bounding box around the object of interest. Based on this, we can safely make two assumptions:
- The target object will be relatively centered within the box
- The target object will extend nearly to the edges of the box, with some nonzero padding
We encode these assumptions into the training process. Specifically, images in the training set are cropped in such a way that the target object is always centered with some randomly added padding. For any training image containing multiple segmented objects, each object is cropped separately to produce an individual training observation.
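The cropping scheme can be sketched roughly as follows. The padding range and helper name here are illustrative assumptions, not the exact values used in training:

```python
import numpy as np

def crop_training_example(image, mask, pad_frac_range=(0.05, 0.25), rng=None):
    """Crop one segmented object so it is centered with random padding.

    `image` is an HxWxC array and `mask` is an HxW boolean mask for a
    single object; each object in a multi-object image would be cropped
    separately to produce its own training observation."""
    rng = rng or np.random.default_rng()
    ys, xs = np.nonzero(mask)
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    h, w = y1 - y0, x1 - x0
    # Random, nonzero padding on each side, as a fraction of the object size.
    pads = rng.uniform(*pad_frac_range, size=4)
    top = max(0, int(y0 - pads[0] * h))
    bot = min(mask.shape[0], int(y1 + pads[1] * h))
    left = max(0, int(x0 - pads[2] * w))
    right = min(mask.shape[1], int(x1 + pads[3] * w))
    return image[top:bot, left:right], mask[top:bot, left:right]
```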
The architecture we chose is based on U2-Net, proposed by Qin et al. for exactly this type of problem. U2-Net provides two concrete benefits over the other architectures we considered.
First, it is designed to more effectively take into account both local and global contrast features of the image. Most of the popular feature extractors (VGG, ResNet, etc.) were designed and optimized for image classification. While they can work well for segmentation tasks, they do so by identifying features specific to the semantic classes they are trying to predict. The consequence of this is that they tend not to generalize as well to classes absent in the training set. On the other hand, U2-Net is designed to make use of image features that indicate object saliency. Because of this, while U2-Net may not perform as well at outright classification as more traditional architectures (we did not test this), in our testing it does a good job at identifying salient objects pertaining to obscure classes. This is important for Scale’s customers who often need to annotate data for niche use cases not represented in the training data.
Second, we wanted to avoid the need to stretch or downsample images. Because the goal is pixel-perfect accuracy, the model needs to output a high-resolution segmentation mask. Downsampling, stretching, or otherwise modifying the image in any information-reducing manner before feeding it to the model will typically result in blurrier segmentation masks than desired. To this end, using a fully convolutional network allows the model to process an image crop of any size or shape, as long as it is large enough to pass through the encoder phase of the network (and small enough to fit within GPU memory). In addition, unlike some other architectures, U2-Net does not excessively downsample the input with convolution strides or pooling. This helps propagate enough information about the original image through the network to produce high-resolution mask outputs.
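One practical consequence of this design: instead of resizing, an input crop can be padded with edge pixels up to the nearest size the encoder can divide down cleanly, then the output mask cropped back. A minimal sketch, assuming (for illustration) a total encoder stride of 32:

```python
import numpy as np

def pad_to_multiple(image, multiple=32):
    """Pad H and W up to the next multiple of `multiple` by repeating edge
    pixels, rather than stretching or downsampling (which would blur mask
    edges). Returns the padded image and the original (H, W) so the
    predicted mask can be cropped back to the input size."""
    h, w = image.shape[:2]
    ph = (-h) % multiple
    pw = (-w) % multiple
    pads = [(0, ph), (0, pw)] + [(0, 0)] * (image.ndim - 2)
    return np.pad(image, pads, mode="edge"), (h, w)
```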
Neural networks have a tendency to “cheat” by taking shortcuts when learning to solve a problem. Faced with the task of salient object detection, one shortcut is to simply learn how to segment the specific classes present in the training data. For example, if trained exclusively on autonomous vehicle camera data, the Autosegment model only needs to memorize what certain object classes look like, such as cars, trucks, and pedestrians. The resulting model may work for annotating vehicle scenes, but it will not perform well when applied to new object classes because it hasn’t truly learned the concept of “saliency.”
To overcome this challenge, the model needs to be given training examples for which this shortcut does not work. In the case of Autosegment, we expanded the training dataset to include hundreds of new object types. Neural networks have a finite number of parameters, meaning that there is a limit to how much “knowledge” they can hold. By providing more classes than the network can effectively memorize, the hope is that it will be forced to learn a simpler strategy that can detect saliency across a more general set of images.
Our first step toward generalization was to incorporate the open-source Common Objects in Context (COCO) segmentation dataset into the training process. However, while COCO contains a wide variety of object classes, it only contains about 200,000 annotated images, a small fraction of our training data from autonomous vehicle cameras. In addition, COCO segmentation masks are often somewhat imprecise.
We then sought to leverage the vast segmentation data that has been produced using Rapid, our annotation platform for rapid experimentation. Rapid supports both full-scene segmentation and partial-scene segmentation. In full-scene semantic segmentation, every pixel in the image is classified into one of multiple classes. In partial-scene segmentation, pixels may be categorized into zero or more classes. Hundreds of thousands of images have been annotated using Rapid.
Full-scene segmentation often contains classes that are not beneficial to train on. For example, in outdoor scenes the sky is often a distinct class in the segmentation mask but is typically considered less salient than all other objects in the image. While we want to train on a diverse set of object classes, we do not want to train on “background classes” or objects that do not match a reasonable definition of saliency. Neural networks are generally robust to occasional adverse training samples, but in large quantities these samples can negatively affect training. In order to filter out these undesirable classes, we constructed a mapping between customer-specific classes and a common taxonomy. For example, this mapping helps us recognize that what one customer calls a “bicycle” is the same as what another customer calls a “bike.” From this, we filter out segmentations that do not map to a list of approved classes. Partial-scene segmentation is the more common project type on Rapid and mostly avoids this problem of extraneous classes.
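The mapping-and-filtering step might look roughly like this. The class names, mapping, and approved list below are made up for illustration:

```python
# Hypothetical mapping from customer-specific labels to a common taxonomy.
CANONICAL = {
    "bicycle": "bicycle",
    "bike": "bicycle",     # another customer's name for the same class
    "car": "car",
    "auto": "car",
    "sky": "sky",          # background class; present but not approved
}

# Classes that match a reasonable definition of saliency.
APPROVED = {"bicycle", "car"}

def filter_annotations(annotations):
    """Keep only annotations whose label maps to an approved canonical
    class, dropping background-like classes such as 'sky'."""
    kept = []
    for ann in annotations:
        canonical = CANONICAL.get(ann["label"].lower())
        if canonical in APPROVED:
            kept.append({**ann, "label": canonical})
    return kept
```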
An important consideration for us was that because Autosegment is already in use across a large number of projects and use cases, changing the underlying model risks unknowingly worsening its accuracy on specific categories of images. In theory this type of accuracy loss could occur due to the finite information capacity of the model architecture. As mentioned earlier, the network needs to encode its problem-solving process using a limited set of parameters. If the number of parameters is too small, underfitting will occur and impede accuracy. We combat this possibility in two ways.
First, we ensure that the training data is well-balanced. If the overall composition of the dataset is skewed in favor of any particular domain, then the model may tend to overemphasize training samples from that domain. Not only can this result in lower accuracy in other domains, but it could impede our goal of forcing the model to learn a generalized solution to the salient object detection problem.
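One simple way to achieve this kind of balance is to sample uniformly over domains rather than over individual images, so a domain with many more images does not dominate each batch. A sketch, where the `domain` key is a hypothetical metadata field:

```python
import random
from collections import defaultdict

def balanced_sampler(samples, key="domain", rng=None):
    """Yield training samples so that each domain is equally likely,
    regardless of how many images each domain contributes."""
    rng = rng or random.Random()
    by_domain = defaultdict(list)
    for s in samples:
        by_domain[s[key]].append(s)
    domains = list(by_domain)
    while True:
        # Pick a domain uniformly, then a sample uniformly within it.
        d = rng.choice(domains)
        yield rng.choice(by_domain[d])
```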
Second, in addition to benchmarking accuracy on the overall collection of images, we also benchmark the model’s performance on subdomains (such as autonomous vehicle data) to ensure it performs well within each of the strata. This essentially serves as “regression testing” for the model. Regression testing is a software development practice to ensure that code changes do not break existing known use cases. To accomplish this, we validate the model on independent test sets representing distinct customer use cases. The crucial use case we monitored during these improvements was autonomous vehicle data.
Beyond standard pixel classification metrics, we also calculate the precision and recall of pixels within a small neighborhood of the segmentation edges. These metrics help us assess how well this tooling will actually benefit annotators for whom the most time-consuming aspect of semantic segmentation is the object boundary pixels.
To calculate these edge metrics, we start with the predicted and ground-truth masks. We keep only the single-pixel boundary of each mask, then apply a dilation operation with a small tolerance. Metrics are then calculated on the dilated edges. The image above visualizes this process.
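A rough sketch of these edge metrics follows. The tolerance value and the 4-connected morphology are illustrative simplifications of the procedure described above:

```python
import numpy as np

def _dilate(mask, iters):
    """4-connected binary dilation implemented with array shifts."""
    out = mask.copy()
    for _ in range(iters):
        nxt = out.copy()
        nxt[1:, :] |= out[:-1, :]
        nxt[:-1, :] |= out[1:, :]
        nxt[:, 1:] |= out[:, :-1]
        nxt[:, :-1] |= out[:, 1:]
        out = nxt
    return out

def _boundary(mask):
    """Single-pixel boundary: mask pixels with a background 4-neighbor."""
    interior = mask.copy()
    interior[1:, :] &= mask[:-1, :]
    interior[:-1, :] &= mask[1:, :]
    interior[:, 1:] &= mask[:, :-1]
    interior[:, :-1] &= mask[:, 1:]
    return mask & ~interior

def edge_precision_recall(pred, gt, tol=2):
    """Score predicted edge pixels against the dilated ground-truth edge
    (precision), and ground-truth edge pixels against the dilated
    predicted edge (recall)."""
    pred_e, gt_e = _boundary(pred), _boundary(gt)
    precision = (pred_e & _dilate(gt_e, tol)).sum() / max(pred_e.sum(), 1)
    recall = (gt_e & _dilate(pred_e, tol)).sum() / max(gt_e.sum(), 1)
    return float(precision), float(recall)
```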
The feedback loop
Beyond the initial training of the model, data annotated with Scale Rapid continues to get fed back into the training process. This creates a feedback loop: Autosegment helps users to more quickly produce higher quality annotations, and in turn, those annotations serve to continuously improve the automatic segmentation accuracy. Having high-quality training annotations is an indispensable prerequisite to training a high-performing model. This data annotation lifecycle is a unique characteristic of Scale’s annotation platform and fundamental to how we achieve best-in-class annotation speed and accuracy.
How you can use Autosegment
Autosegment is one of the multiple Active Tooling capabilities that exist in Scale’s products. Active Tools provide interactive ML-powered assistance to data annotators. By letting humans make changes to the model inputs in real time, model outputs can be quickly dialed in to match the situation precisely and minimize time spent on the task.
You can supercharge your data annotation using Scale Studio, which offers best-in-class annotation infrastructure to accelerate your in-house annotation team. Studio also offers a free tier to familiarize yourself with the platform and start annotating data. If you’ve decided you’d like to offload more parts of the data annotation pipeline, Scale Rapid provides an easy way to send your data to be annotated by Scale and get production-quality labels with no minimums.