Peer-Reviewed Doesn’t Mean Perfect Data

by Aerin Kim and Samyak Parajuli on February 14th, 2022


Supervised learning reliably solves a wide range of problems, but its success depends on the assumption that the training data is correctly labeled. Peer-reviewed research demonstrates that canonical datasets such as MNIST and ImageNet contain many mislabeled samples, despite their widespread use in research and industry.

In 2021, Curtis Northcutt et al. published research that revisits the labels in commonly used datasets, identifies images that should be re-labeled, and cross-references the authors’ guess for each mislabeled sample with the wisdom of the “community.” Let’s dive into some industry-standard datasets in Nucleus, and then decide which samples we want to send out to be labeled with Scale Rapid.

Starting with Self-Driving Data:

Berkeley DeepDrive 100K

BDD 100K, published by Berkeley AI Research (BAIR), is the largest and most diverse open source driving dataset released thus far. Autonomous driving is a very complex problem for which data quality is critical. Accordingly, we’ll present an example workflow that can find and correct mislabeled or unlabeled samples using two of our products, Nucleus and Rapid.

Before we dive into this dataset, let’s describe Nucleus. Nucleus provides a unified platform to visualize data and query relevant insights. Through the Nucleus Python API, we will upload data, annotations, and metadata. To learn how to do this yourself, follow the guide here. We follow a similar process to upload our data:

Since this particular task is object detection, we need to specify the coordinates for our BoxAnnotation instance, making sure to convert from corner format (x1, y1, x2, y2) to top-left corner plus size (x1, y1, width, height):
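
As a minimal sketch of that conversion, the helper below turns two opposite corners into a top-left corner plus width and height. The function name and the shape of the `label` record (a `box2d` dictionary with `x1`, `y1`, `x2`, `y2` keys) are assumptions made for illustration, not the code from the original post.

```python
def corners_to_xywh(x1, y1, x2, y2):
    """Convert a box given by two opposite corners into (x, y, width, height)."""
    # Normalise first so the result is valid even if the corners are swapped.
    left, right = min(x1, x2), max(x1, x2)
    top, bottom = min(y1, y2), max(y1, y2)
    return left, top, right - left, bottom - top


# Hypothetical corner-format label record, used only for illustration.
label = {
    "category": "car",
    "box2d": {"x1": 100.0, "y1": 50.0, "x2": 180.0, "y2": 110.0},
}

x, y, width, height = corners_to_xywh(**label["box2d"])
print(x, y, width, height)  # 100.0 50.0 80.0 60.0
```

The resulting four values are what a box annotation with a top-left origin and explicit size would expect.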

Then, we upload the model predictions and commit the model run in order to get insights about the data:

Queries and Class Insights:

Let’s open the Insights section of our ground truth dataset and look at the distribution of annotations per image. As we can see, 21337 images have 0 annotations.

By querying annotations.count = 0 (or just by clicking on the 0 bar), we can see exactly what these images are and create a slice to send them for labeling:
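
The same check can be reproduced offline. The sketch below builds the annotations-per-image histogram and lists the images with no annotations; the record layout (`reference_id`, `label`) and the sample data are hypothetical and only stand in for an exported dataset.

```python
from collections import Counter


def annotation_histogram(image_ids, annotations):
    """Return (histogram of annotation counts, ids of images with zero annotations)."""
    per_image = Counter(a["reference_id"] for a in annotations)
    counts = {image_id: per_image.get(image_id, 0) for image_id in image_ids}
    histogram = Counter(counts.values())
    unlabeled = [image_id for image_id in image_ids if counts[image_id] == 0]
    return histogram, unlabeled


# Hypothetical data for illustration.
image_ids = ["img_a", "img_b", "img_c", "img_d"]
annotations = [
    {"reference_id": "img_a", "label": "car"},
    {"reference_id": "img_a", "label": "person"},
    {"reference_id": "img_c", "label": "car"},
]

histogram, unlabeled = annotation_histogram(image_ids, annotations)
print(dict(histogram))  # maps an annotation count to the number of images with that count
print(unlabeled)        # the images that would match annotations.count = 0
```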

Examining False Positives:

After uploading model predictions, we can click on the patches view and filter by false positives:

This view of our dataset shows us many examples of images that haven’t been thoroughly labeled.
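
This post does not describe how Nucleus matches predictions to ground truth. As a rough sketch, assuming the common intersection-over-union (IoU) definition, a prediction counts as a false positive when no unused ground-truth box of the same label overlaps it by at least a chosen threshold. All names and the 0.5 threshold are assumptions. When the ground truth is incomplete, confident "false positives" of this kind are often real objects that were never labeled.

```python
def iou(a, b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def false_positives(predictions, ground_truth, iou_threshold=0.5):
    """Return predictions that match no unused ground-truth box of the same label."""
    unmatched = list(ground_truth)
    unmatched_predictions = []
    # Greedy matching, most confident prediction first.
    for pred in sorted(predictions, key=lambda p: p["confidence"], reverse=True):
        candidates = [g for g in unmatched if g["label"] == pred["label"]]
        best = max(candidates, key=lambda g: iou(pred["box"], g["box"]), default=None)
        if best is not None and iou(pred["box"], best["box"]) >= iou_threshold:
            unmatched.remove(best)  # each ground-truth box can be matched once
        else:
            unmatched_predictions.append(pred)
    return unmatched_predictions


# Hypothetical boxes for illustration.
ground_truth = [{"label": "car", "box": (0, 0, 10, 10)}]
predictions = [
    {"label": "car", "box": (1, 1, 10, 10), "confidence": 0.9},    # overlaps the labeled car
    {"label": "car", "box": (50, 50, 10, 10), "confidence": 0.8},  # no labeled box nearby
]

fps = false_positives(predictions, ground_truth)
print([p["box"] for p in fps])
```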

Image Autotag:

Another convenient feature Nucleus provides is Image Autotag. Image Autotag allows you to select a particular subset of images and then uses embeddings from a model to find other images/objects in your dataset that are visually similar, which automatically creates a new metadata label. We can iteratively refine this search and then use a query like autotag.image.autotag_label > threshold AND annotations.true_label = 0 to find potential inconsistencies.

Or we can simply look for a class that hasn’t been labeled that we want to add to our annotations. For example, in the video below we use Autotag to automatically find all the police cars in our dataset. We first look through our dataset and manually find a few images of interest (in our case, police cars). Upon finding them, we click on the plus sign located at the bottom right-hand corner of each image. After a cursory look through our dataset, we click on the button in the top right-hand corner of the page:

We can now use Autotag to query for only police car examples with the query autotag.image = police_cars:
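
The internals of Autotag are not covered here. As a generic illustration of embedding-based similarity search, the sketch below averages the embeddings of a few hand-picked seed images and scores every image by cosine similarity to that centroid. The three-dimensional vectors, the image ids, and the 0.9 threshold are all made up for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def autotag_scores(embeddings, seed_ids):
    """Score every image by similarity to the centroid of the seed embeddings."""
    dim = len(next(iter(embeddings.values())))
    centroid = [
        sum(embeddings[seed][k] for seed in seed_ids) / len(seed_ids)
        for k in range(dim)
    ]
    return {image_id: cosine(vec, centroid) for image_id, vec in embeddings.items()}


# Hypothetical embeddings; a real model would produce far longer vectors.
embeddings = {
    "police_1": [0.9, 0.1, 0.0],
    "police_2": [0.8, 0.2, 0.1],
    "police_3": [0.85, 0.15, 0.05],
    "bus_1": [0.0, 0.1, 0.9],
}

scores = autotag_scores(embeddings, seed_ids=["police_1", "police_2"])
threshold = 0.9
similar = sorted(image_id for image_id, score in scores.items() if score > threshold)
print(similar)
```

Refining the search iteratively amounts to adding or removing seeds and re-scoring.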

Once we have identified these potentially mislabeled and unlabeled samples, we can send this slice to Scale Rapid to get them labeled.

Sending to Rapid

On the top right hand corner, there is a button that looks like:

After clicking on it, we get a prompt that looks like this:

From here, we can choose the particular Rapid project we want to send this data to and have it labeled. We present an example of the whole pipeline from Nucleus to Rapid below.

Confident Learning

“Confident Learning: Estimating Uncertainty in Dataset Labels” by Northcutt et al. presents a thorough investigation into the problem of mislabels in datasets. They provide a theoretically sound method for determining potential mislabeled examples and empirically show effective results at

They find many errors in the dataset, some of which are particularly egregious, such as labeling a lion as a monkey:

However, because the final guess relies on Amazon Mechanical Turk consensus, their relabeling approach introduces errors of its own. For instance, these are actually Asian elephants and not tuskers.

Below, we highlight a workflow to integrate cleanlab functionality with Nucleus and Scale Rapid to identify and correct incorrectly labeled samples.

Confident Learning in Nucleus

Through one API call, we use cleanlab to get the indices in ImageNet’s validation set for potentially mislabeled samples.
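
To give a feel for what such a call does, here is a heavily simplified, dependency-free sketch of the thresholding idea behind confident learning. It is not cleanlab's API or its full pruning procedure. Each class gets a threshold equal to the model's mean predicted probability for that class over the examples carrying that label, and an example is flagged when the class it confidently belongs to differs from its given label. The function names and toy probabilities are assumptions.

```python
def class_thresholds(labels, pred_probs, num_classes):
    """Mean predicted probability of class k over the examples labeled k."""
    sums = [0.0] * num_classes
    counts = [0] * num_classes
    for y, probs in zip(labels, pred_probs):
        sums[y] += probs[y]
        counts[y] += 1
    # A class with no examples gets threshold 1.0, so it is almost never "confident".
    return [s / c if c else 1.0 for s, c in zip(sums, counts)]


def find_label_issue_candidates(labels, pred_probs, num_classes):
    """Return (index, guessed_label) pairs for examples that look mislabeled."""
    thresholds = class_thresholds(labels, pred_probs, num_classes)
    issues = []
    for i, (y, probs) in enumerate(zip(labels, pred_probs)):
        confident = [k for k in range(num_classes) if probs[k] >= thresholds[k]]
        if not confident:
            continue  # the model is not confident about any class for this example
        guess = max(confident, key=lambda k: probs[k])
        if guess != y:
            issues.append((i, guess))
    return issues


# Toy data: example 2 is labeled 0, but the model is confident it is class 1.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
    [0.1, 0.9],
    [0.3, 0.7],
]

issues = find_label_issue_candidates(labels, pred_probs, num_classes=2)
print(issues)  # [(2, 1)]
```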

We can then add Dataset Items with the metadata field mislabel_score, a metric that estimates how badly each potentially mislabeled example is mislabeled. At a high level, it compares the model’s average confidence for a particular class (“self-confidence”) with its output probability for the particular sample:

We add this to the image metadata so we can later sort in descending order, so that the examples at the front are the ones estimated most likely to be labeled incorrectly.
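
The exact formula behind mislabel_score is not reproduced in this post. One plausible form, assumed here purely for illustration, is the class's average self-confidence minus the model's probability for the given label on that sample. A large positive value then means the model is far less sure about this label than it usually is for the class. The item layout and ids are hypothetical.

```python
def mislabel_scores(labels, pred_probs, num_classes):
    """Assumed score: class average self-confidence minus the sample's own probability."""
    sums = [0.0] * num_classes
    counts = [0] * num_classes
    for y, probs in zip(labels, pred_probs):
        sums[y] += probs[y]
        counts[y] += 1
    avg_self_confidence = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return [avg_self_confidence[y] - probs[y] for y, probs in zip(labels, pred_probs)]


# Toy data: the third example is labeled 0 but predicted as class 1.
labels = [0, 0, 0, 1]
pred_probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]

scores = mislabel_scores(labels, pred_probs, num_classes=2)

# Attach the score as metadata and sort so the most suspicious items come first.
items = [
    {"reference_id": f"val_{i}", "metadata": {"mislabel_score": score}}
    for i, score in enumerate(scores)
]
ranked = sorted(items, key=lambda item: item["metadata"]["mislabel_score"], reverse=True)
print(ranked[0]["reference_id"])  # val_2
```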

We now add annotations with the original ground truth label, a metadata attribute flagging whether the sample is a potential mislabel, and a final attribute holding our most likely “true label” guess for the example:

After uploading all this information to Nucleus, we are ready for some analysis. After we run the query annotations.metadata.potential_mislabel = 1, we can see that there are 6278 potential results.

From here, we can go to the top right and make a slice of all results of this query to segment out our relevant set:

This allows us to get a specific partition of our dataset that we can directly get insights for. For example, this is the class distribution within just our potentially mislabeled slice:
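
For readers working from an export rather than the dashboard, the same breakdown is a filter followed by a count. In the sketch below, the annotation layout and the sample records are hypothetical; only the `potential_mislabel` flag name comes from the query shown earlier.

```python
from collections import Counter


def slice_class_distribution(annotations, flag="potential_mislabel"):
    """Count original labels among annotations whose metadata flag equals 1."""
    flagged = [a for a in annotations if a["metadata"].get(flag) == 1]
    return Counter(a["label"] for a in flagged)


# Hypothetical annotations for illustration.
annotations = [
    {"reference_id": "val_1", "label": "red panda", "metadata": {"potential_mislabel": 1}},
    {"reference_id": "val_2", "label": "red panda", "metadata": {"potential_mislabel": 1}},
    {"reference_id": "val_3", "label": "tusker", "metadata": {"potential_mislabel": 1}},
    {"reference_id": "val_4", "label": "lion", "metadata": {"potential_mislabel": 0}},
]

distribution = slice_class_distribution(annotations)
print(distribution.most_common())  # classes in the slice, most frequent first
```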

As mentioned previously, we can now sort our slice by mislabel_score:

Let’s look at one of the examples near the beginning:

We see that the original ground truth label was red panda, but as we can verify, the actual label should be giant panda!

So, from here we can manually go through all 6278 items ourselves, or we can provide instructions for taskers to do the task.

One easy way to do this is through Scale Rapid, which requires no long-term commitment and has a pay-as-you-go model that lets you send unlabeled data for a wide variety of task types, including text annotation, image classification, and semantic segmentation.

Sending for (Re-)Labeling with Scale Rapid

Scale Rapid is the fastest way to production-quality labels, without any data minimums. Getting labeled data is often the blocker in AI applications. Rapid offers a fast way to get this done while providing granular insights about your data.

We can create a Rapid project through the Python API with the following code:

And then we create references to the particular dataset and slice from Nucleus we want to work with:

We then reference the relevant metadata information and upload the images we want to be relabeled to Rapid:

Checking the Rapid dashboard, we see that all of our data has been uploaded. We were able to add metadata indicating the original ground truth, the predicted label, the coarse label, the predicted coarse label, and an estimate of how likely the sample is to be mislabeled.
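
As an illustration of how that metadata might travel with each image, the sketch below assembles one plain dictionary per slice item before any upload call is made. It deliberately stops short of calling a client library. The field names (`attachment`, `unique_id`, and the metadata keys), the project name, the URL, and the sample values are assumptions for the example, not the payload used in the original post.

```python
def build_relabel_payloads(slice_items, project_name):
    """Turn slice items into plain task dictionaries, one per image."""
    payloads = []
    for item in slice_items:
        meta = item["metadata"]
        payloads.append({
            "project": project_name,
            "attachment": item["image_url"],
            "unique_id": item["reference_id"],  # lets a re-run skip already-sent items
            "metadata": {
                "original_label": meta["original_label"],
                "predicted_label": meta["predicted_label"],
                "coarse_label": meta["coarse_label"],
                "predicted_coarse_label": meta["predicted_coarse_label"],
                "mislabel_score": meta["mislabel_score"],
            },
        })
    return payloads


# Hypothetical slice item; the URL and score are placeholders.
slice_items = [
    {
        "reference_id": "val_0001",
        "image_url": "https://example.com/val_0001.jpg",
        "metadata": {
            "original_label": "red panda",
            "predicted_label": "giant panda",
            "coarse_label": "mammal",
            "predicted_coarse_label": "mammal",
            "mislabel_score": 0.5,
        },
    }
]

payloads = build_relabel_payloads(slice_items, project_name="imagenet_relabel_demo")
print(payloads[0]["metadata"]["predicted_label"])  # giant panda
```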

Now that we have the data that we want relabeled uploaded, we need to provide instructions for our taskers.

First, we fill out a description of the Summary and Workflow for our project. Here, it’s important to provide a good reference for what a correct label is. Scale Rapid makes this easy by encouraging you to provide a description of your labels and to upload example images for each class:

Rapid has a simple interface to provide well-labeled examples for individual classes.

After adding our example images and text descriptions:

Upon saving these instructions, we send a calibration batch. This represents a sample of our sent data, which we can audit and provide feedback on, before our full data set gets labeled.

After we get the results of our calibration batch, we can audit the responses, choosing which ones to accept or reject:

Taskers also give feedback that we can use to update our instructions. For example, some taskers were unsure what to do if they believed an image did not belong to any category. In response, we added another label for “None” and updated the instructions accordingly.

After this calibration batch has been audited, we can send the full set of data we want to be re-labeled. Currently, Rapid only supports batches of 3000 items or fewer, but we can split our data into multiple batches and send them separately. After our batches have been labeled, we can see different metrics relating to the task itself and the quality:
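
The batch-size limit mentioned above is easy to handle with a small chunking helper. This is a generic sketch: the function name and the 5000 example ids are hypothetical, and the 3000 default mirrors the limit stated in the text.

```python
def split_into_batches(items, max_batch_size=3000):
    """Split a list into consecutive chunks of at most max_batch_size items."""
    if max_batch_size <= 0:
        raise ValueError("max_batch_size must be positive")
    return [items[i:i + max_batch_size] for i in range(0, len(items), max_batch_size)]


# Hypothetical task ids for illustration.
task_ids = [f"task_{i}" for i in range(5000)]
batches = split_into_batches(task_ids)
print([len(batch) for batch in batches])  # [3000, 2000]
```

Each chunk can then be submitted as its own batch, in order, so no item is sent twice.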


Below, we highlight some images and compare them to the labels that Northcutt et al. received with Amazon Mechanical Turk. With Rapid’s hierarchical system, which has different levels of taskers and reviewers, we can achieve higher-quality labels. These examples also indicate the increased level of complexity we can submit for our project. The Northcutt Mechanical Turk (“MTurk”) task only asked taskers to choose between two labels. As we saw above, we were able to submit tasks that allowed taskers to choose any one of the 1000 possible labels.


Wrapping up:

As you develop an affinity for building accurate, business-relevant machine learning models, you typically come to learn that model performance scales not only with dataset size but also with label quality. As this overview shows, many canonical datasets, Berkeley DeepDrive and ImageNet included, contain numerous errors. Of course, dataset quality can be subjective: if all label errors seem to be in minority classes that don’t relate to the business problem you’re solving, you might deem a certain dataset of sufficient quality. Similarly, datasets that contain multiple label classes may only be missing labels of a specific class. Digging deep into errors in ImageNet is a great start, but we invite the academic community to develop error profiles of more open source datasets.

As you embark on training a model that suits your computer vision needs, perhaps on ImageNet, consider identifying mislabeled images with Scale Nucleus, and then sending the relevant images (in your classes of interest, of course) out for labeling with Scale Rapid. If you’re an academic looking to augment this research or iterate on it, please check out our university research program or our open datasets page. You can find the code repository for this blog on GitHub here.