Imagine you’re a machine learning engineer responsible for building a pedestrian detection algorithm. It’s easy to get started, perhaps by training a standard model on a small dataset. You get some preliminary results, and now you want to improve on the baseline. But what do you do next?
In academia, common approaches are new algorithms, different architectures, or larger hyperparameter searches. But academics fight with one hand tied behind their backs: they usually cannot edit their datasets. In practical applications, on the other hand, datasets are malleable and can be extended to suit a problem’s needs. There, the most effective way to improve an ML model is often to improve the dataset.
At Scale, we specialize in creating high-quality, large-scale training datasets for our customers. Expanding beyond our core strength of data labeling, we developed Nucleus because we think it should be easier to visualize, search, and improve computer vision datasets, and because we saw many ML engineers struggling to answer two core questions:
I know my ML model isn’t perfect. But where specifically does it fail today, and why?
How can I improve my dataset to fix these failures?
With Nucleus, ML teams can use their models to discover issues in their datasets, surface the part of the data distribution where models struggle most, compare model runs, and prioritize what new unlabeled data to label next.
In the rest of this post, we’ll walk through a step-by-step tutorial on how to get started with Nucleus. We’ll show you how to upload an image dataset with ground truth annotations, as well as how to integrate the Nucleus Python client into a PyTorch training workflow to analyze model results.
Section 2: Getting Started with the Nucleus Python Client
The complete code for this tutorial is available in this Colab notebook. If you’d like to follow along step by step and create a personal dataset on Nucleus, you will first need to make a free Scale account and obtain a test API key. You can then copy the notebook and add your personal key to upload data for free.
To continue with the example of pedestrian detection, we’ll use the PennFudan pedestrian detection dataset to finetune a pre-trained Faster R-CNN model. PennFudan consists of 170 images labeled with both bounding boxes and segmentation masks. Some code has been omitted for brevity; see this Colab notebook for the complete implementation. Let’s get started by instantiating the Nucleus client and downloading the dataset.
Next, we upload the PennFudan dataset to the Nucleus Dashboard. Uploading data is easy using the Nucleus Python client: specify either the local file path or, if your data is stored in the cloud, the image URL. In this case, we’ve downloaded the PennFudan dataset locally, so for each dataset item we specify the file path of the image. We’ll use the file name as a reference_id that can be used to refer to the images in later API requests.
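As a rough sketch of the bookkeeping behind this step, the snippet below pairs each local image path with a reference_id taken from its file name. It deliberately does not call the Nucleus client: the function name `build_item_records` and the dictionary key `image_location` are assumptions made for illustration, and the records would still need to be converted into whatever item objects the client expects.

```python
from pathlib import Path


def build_item_records(image_paths):
    """Pair each local image path with a reference_id taken from its file name.

    Hypothetical helper: the dictionary keys are illustrative and are not
    the Nucleus client's own data structures.
    """
    records = []
    seen = set()
    for raw_path in image_paths:
        path = Path(raw_path)
        reference_id = path.name  # e.g. "FudanPed00001.png"
        # Two files with the same name in different folders would collide,
        # so fail early instead of silently overwriting an item.
        if reference_id in seen:
            raise ValueError(f"duplicate reference_id: {reference_id}")
        seen.add(reference_id)
        records.append({"image_location": str(path), "reference_id": reference_id})
    return records


records = build_item_records(
    ["PNGImages/FudanPed00001.png", "PNGImages/PennPed00001.png"]
)
```

Using the file name as the identifier only works if file names are unique across the dataset, which is why the sketch checks for collisions.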
Once the images are uploaded, the next step is to upload the ground truth bounding box annotations.
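PennFudan stores its labels as instance masks, in which each pixel holds 0 for background or a positive integer identifying a pedestrian, so bounding boxes have to be derived before they can be uploaded. The sketch below shows one way to do that in plain Python. The output format (`x`, `y`, `width`, `height` in pixels, with inclusive extents) and the function name are assumptions for illustration, not the format required by the Nucleus client.

```python
def boxes_from_instance_mask(mask, label="pedestrian"):
    """Derive one bounding box per instance from an instance mask.

    mask: 2D list of ints, where 0 is background and k > 0 marks the
    pixels belonging to instance k. Hypothetical helper for illustration.
    """
    extents = {}  # instance id -> [x_min, y_min, x_max, y_max]
    for y, row in enumerate(mask):
        for x, instance_id in enumerate(row):
            if instance_id == 0:
                continue
            box = extents.setdefault(instance_id, [x, y, x, y])
            box[0] = min(box[0], x)
            box[1] = min(box[1], y)
            box[2] = max(box[2], x)
            box[3] = max(box[3], y)
    # Extents are inclusive pixel indices, hence the +1 for width and height.
    return [
        {"label": label, "x": x0, "y": y0,
         "width": x1 - x0 + 1, "height": y1 - y0 + 1}
        for _, (x0, y0, x1, y1) in sorted(extents.items())
    ]


toy_mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 2],
    [0, 0, 0, 2],
]
toy_boxes = boxes_from_instance_mask(toy_mask)
```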
With the images and annotations uploaded, we can visualize the PennFudan Dataset in the Nucleus Dashboard.
From the insights tab, we can get a quick overview of the labeled dataset. In particular, you’ll notice that the annotations consist of exactly one class (pedestrian), and that the dataset overall is biased toward images with fewer annotations. The visualization, querying, and insights features in Nucleus make it easy to dig into interesting subsets of the training dataset.
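The same kind of overview can be reproduced offline as a sanity check. The sketch below counts how many images have 0, 1, 2, ... annotations, given the reference_id attached to each annotation. The function name and inputs are hypothetical, and this is not how the insights tab itself is computed.

```python
from collections import Counter


def annotations_per_image_histogram(annotation_reference_ids, all_reference_ids):
    """Map 'number of annotations' -> 'number of images with that many'.

    annotation_reference_ids: one reference_id per annotation.
    all_reference_ids: every image in the dataset, so that images with
    no annotations are counted too.
    """
    per_image = Counter(annotation_reference_ids)
    # Counter returns 0 for images that never appear in the annotations.
    return Counter(per_image[ref] for ref in all_reference_ids)


histogram = annotations_per_image_histogram(
    ["a.png", "a.png", "b.png"], ["a.png", "b.png", "c.png"]
)
```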
Next, we’ll set up the PyTorch DataLoaders needed for training and evaluation.
When evaluating model predictions, we’ll want to differentiate between train and test examples within the Nucleus platform. We can do so using slices. A slice is just a collection of dataset items. Below, we create slices for the train and test sets.
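Since a slice is just a collection of dataset items, the only thing needed to define one is a list of reference_ids. A minimal sketch of producing those lists is shown below, holding out 50 of the 170 images for testing. The helper name, the fixed seed, and the shuffling strategy are assumptions for illustration; the resulting lists would then be passed to the client's slice-creation call.

```python
import random


def split_reference_ids(reference_ids, test_size=50, seed=0):
    """Shuffle reference_ids reproducibly and hold out test_size of them."""
    if not 0 < test_size < len(reference_ids):
        raise ValueError("test_size must be between 1 and len(reference_ids) - 1")
    shuffled = sorted(reference_ids)  # sort first so the split is order-independent
    random.Random(seed).shuffle(shuffled)
    return {"train": shuffled[:-test_size], "test": shuffled[-test_size:]}


# Hypothetical file names standing in for the 170 PennFudan images.
all_ids = [f"image_{i:03d}.png" for i in range(170)]
slices = split_reference_ids(all_ids)
```

Whatever split is used here should be the same one used to build the training and evaluation DataLoaders, so that the slices in the dashboard match what the model actually saw.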
Once a slice is created via the Python client, it will appear in the Nucleus interface. The slicing feature is useful for examining a particular subset of the dataset. When viewing a slice, the search bar queries are applied only to the items within the current slice.
We are now ready to start using Nucleus to analyze experiments. The next step is to upload model predictions. In the code block below, we finetune the Faster R-CNN for 7 epochs before evaluating it on the test set, which consists of 50 held-out examples.
In the code segment below, we’ve factored the details of the training loop into some helper functions imported from vision/references/detection/. Here’s the simplified training loop:
The evaluation output shows that overall, the finetuned model achieves an average precision of 0.827 and average recall of 0.386. Not bad for a first shot!
But now what? These summary statistics don’t tell us very much about how we could make this model better. That’s where Nucleus comes in. Below, we create a model_run, which functions as a container for the predictions from this experiment. Formatting the predictions and sending them to Nucleus takes just a couple lines of code, and can easily be integrated into a PyTorch evaluation function. Once these predictions are visible in the Nucleus dashboard, there are all kinds of error analysis tools at our disposal.
The first step for uploading model predictions is to create a Nucleus model. Below, we name the model fasterrcnn so that we can use it to track all experiments using this architecture.
Then, we create a corresponding model_run. The model_run will keep track of the results of this particular experiment. Once the model_run object is created, we upload the model_predictions.
After model predictions are uploaded, we are ready to commit the model run. On commit, Nucleus automatically matches predictions to ground truth, calculates IoU, and indexes false positives and false negatives.
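To make the idea of matching concrete, here is a generic sketch of IoU computation and greedy prediction-to-ground-truth matching for a single image. It is a common textbook approach, not a description of what Nucleus does internally: the 0.5 IoU threshold, the (x, y, width, height) box format, and the function names are all assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = min(ax + aw, bx + bw) - max(ax, bx)
    inter_h = min(ay + ah, by + bh) - max(ay, by)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union


def match_predictions(predictions, ground_truth, iou_threshold=0.5):
    """Greedily match predictions to ground truth boxes for one image.

    predictions: list of (box, confidence); ground_truth: list of boxes.
    Returns (matches, false_positives, false_negatives), where matches are
    (prediction_index, ground_truth_index, iou) tuples and the other two
    are lists of indices.
    """
    # Visit the most confident predictions first.
    order = sorted(range(len(predictions)),
                   key=lambda i: predictions[i][1], reverse=True)
    unmatched_gt = set(range(len(ground_truth)))
    matches, false_positives = [], []
    for i in order:
        box = predictions[i][0]
        best_j, best_iou = None, 0.0
        for j in sorted(unmatched_gt):
            score = iou(box, ground_truth[j])
            if score > best_iou:
                best_j, best_iou = j, score
        if best_j is not None and best_iou >= iou_threshold:
            matches.append((i, best_j, best_iou))
            unmatched_gt.remove(best_j)
        else:
            false_positives.append(i)
    # Ground truth boxes left unmatched are false negatives.
    return matches, false_positives, sorted(unmatched_gt)


gt_boxes = [(0, 0, 10, 10), (20, 20, 10, 10)]
preds = [((0, 0, 10, 10), 0.9), ((50, 50, 5, 5), 0.8)]
matches, false_positives, false_negatives = match_predictions(preds, gt_boxes)
```

Note that under this scheme a correct detection of an unlabeled pedestrian is indistinguishable from a genuine false positive, which is exactly why missing ground truth annotations distort the metrics.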
Section 3: Analysis
Now that we have the newly created model_run uploaded to the Nucleus platform, we can easily compare the predicted bounding boxes to the ground truth. Using the patches view, we can examine the individual predictions, sorting by best IoU, worst IoU, false positives, and false negatives.
In the patches view, we find a number of interesting observations:
Figure 1 (above): Examining false positives, we see that there are several instances where the model detected a cyclist as a pedestrian, but there was no ground truth bounding box for the cyclist. It would likely improve model performance to add a cyclist class; this would encourage the model to differentiate between the two types of objects, rather than just ignore cyclists.
Figure 2: The treatment of partially-occluded pedestrians is not consistent across different images. The occluded pedestrians are labeled in some places, but not in others.
Figure 3: There are many instances of missing ground truth annotations, particularly for pedestrians in the background.
Figure 4: In general, it seems the model performs best on pedestrians in the foreground, and exhibits poor performance on crowded images.
Nucleus shines a light on the worst-kept secret in machine learning: most datasets are filled with errors. Not only are they often full of bad annotations, but a typical ML engineer would never even notice. All we see is “average precision = 0.827”. Annotation errors are particularly insidious because they put a ceiling on model performance that is invisible to the ML engineer. The fact that these errors remain invisible is a failure of ML tooling. Nucleus helps address it. (In fact, one reason Scale can provide annotations at such high quality is that we leverage tools like Nucleus internally.)
Having the ability to visualize predictions and view errors all in one place gives us much better insight into what is actually going wrong with this model’s performance. Looking through the test set, we find that many of the predictions marked as false positives are actually just missing ground truth annotations. This is concerning — without high-quality labels on the test set, it’s difficult to trust evaluation metrics as an unbiased performance indicator.
Some actionable next steps we can take to improve this model’s performance are:
To continue making progress on this pedestrian detection task, we need to be sure that the evaluation metrics on the test set actually reflect model performance. We can only trust these metrics when the labels are exhaustive and high quality. Using the Nucleus patches view, we corrected missing annotations in both the train and test sets. The mean average precision (mAP) and mean average recall (mAR) from these experiments are summarized in the following table.
We found that evaluating the baseline model against the improved test set yields lower mAP and mAR, indicating that evaluating against noisy labels in the test set overestimated model performance. Moreover, after also correcting labels in the training dataset, the precision on the corrected test set improved by 7.98%, and recall improved by 7.36% relative to the baseline model. Overall, these results demonstrate that concentrating on label quality yields significant gains in overall performance.
One finding that label corrections do not address is the poor performance on crowded scenes. As discussed above, the dataset is heavily skewed towards examples with two or fewer pedestrians. Crowded scenes are rare in our dataset, and one reliable way to address this edge case is to add more crowded scenes to our training dataset. In addition to the analysis tools like the patches and insights views, Nucleus supports a variety of features targeted at mining large sets of unlabeled data for rare scenarios.
Since we’re already using all of the images in PennFudan for training and evaluation, we’ll use another publicly available dataset, WiderPerson, as a source of additional examples.
Putting PennFudan and WiderPerson together in one dataset, we now have a mix of labeled and unlabeled data.
Once we’ve found one example from WiderPerson that we would want to include in our training set, we can use image similarity search to find other, similar examples.
Nucleus features like similarity search are particularly useful on massive datasets where it’s impossible to look through every example. Using similarity search, we can locate additional examples of pedestrians without having to rely on annotations, helping prioritize what unlabeled data to label next.
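For intuition, similarity search can be thought of as a nearest-neighbor lookup over image embeddings. The sketch below ranks images by cosine similarity to a query image. It is a generic illustration rather than Nucleus's implementation: the embeddings, the function names, and the brute-force search are all assumptions, and a real system would use embeddings from a vision model and an approximate index for large datasets.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query_id, embeddings, k=5):
    """Return the k reference_ids whose embeddings are closest to the query.

    embeddings: dict mapping reference_id -> embedding vector (hypothetical).
    """
    query = embeddings[query_id]
    scored = [
        (cosine_similarity(query, vector), ref)
        for ref, vector in embeddings.items()
        if ref != query_id
    ]
    # Highest similarity first; break ties by reference_id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [ref for _, ref in scored[:k]]


# Toy two-dimensional embeddings, made up purely for illustration.
toy_embeddings = {
    "query.png": [1.0, 0.0],
    "a.png": [0.9, 0.1],
    "b.png": [0.0, 1.0],
    "c.png": [-1.0, 0.0],
}
neighbors = most_similar("query.png", toy_embeddings, k=2)
```

Because the ranking depends only on the images themselves, it works equally well on labeled and unlabeled data.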
In this tutorial, we’ve demonstrated how Nucleus can help developers get actionable insights into model performance and dataset management. We’ve shown how to get started by uploading images, annotations, and predictions via the Python API. Using Nucleus features for model analysis, we learned that poor label quality can significantly distort evaluation metrics. If such errors occur in small datasets like PennFudan, you can only imagine how commonly they occur in larger datasets like ImageNet and MS COCO. Using Nucleus to identify issues with annotation taxonomy, quality, and consistency can thus play a huge role in producing quality results.