Getting Started with Nucleus

February 10, 2021

Section 1: What is Nucleus?

Imagine you’re a machine learning engineer responsible for building a

pedestrian detection algorithm. It’s easy to get started, perhaps by training

a standard model on a small dataset. You get some preliminary results, and now

you want to improve on the baseline. But what do you do next?

In academia, common approaches are new algorithms, different architectures, or

larger hyperparameter searches. But academics fight with one hand tied behind

their backs: they usually cannot edit their datasets. In practical

applications, on the other hand, datasets are malleable and can be extended to

suit a problem’s needs. In practice, the most effective way to improve an ML

model is often to improve the dataset.


Graphic adapted from Andrej Karpathy’s slide deck [Train AI 2018]

At Scale, we specialize in creating high-quality, large-scale training

datasets for our customers. Expanding beyond our core strength of data

labeling, we developed Nucleus because we think it should be easier to

visualize, search, and improve computer vision datasets, and because we saw

many ML engineers struggling to answer two core questions:

  1. I know my ML model isn’t perfect. But where specifically does it fail today,
     and why?
  2. How can I improve my dataset to fix these failures?

With Nucleus, ML teams can use their models to discover issues in their

datasets, surface the part of the data distribution where models struggle

most, compare model runs, and prioritize what new unlabeled data to label next.


In the rest of this post, we’ll walk through a step-by-step tutorial on how to

get started with Nucleus. We’ll show you how to upload an image dataset with

ground truth annotations, as well as how to integrate the Nucleus Python

client into a PyTorch training workflow to analyze model results.

Section 2: Getting Started with the Nucleus Python Client

The complete code for this tutorial is available in

this Colab notebook. If you’d like to follow along step by step and create a personal dataset on

Nucleus, you will first need to make a

free Scale account

and obtain a test API key. You can then copy the notebook and add your

personal key to upload data for free.

To continue with the example of pedestrian detection, we’ll use the PennFudan

pedestrian detection dataset to finetune a pre-trained Faster R-CNN model for

a pedestrian detection task. PennFudan consists of 170 images labeled with

both bounding boxes and segmentation masks. Some code has been omitted for

brevity; see

this Colab notebook

for the complete implementation. Let’s get started by instantiating the

Nucleus client and downloading the dataset.

Download the Dataset and Upload to Nucleus

import os
import nucleus
from nucleus import BoxAnnotation, BoxPrediction, DatasetItem

nucleus = nucleus.NucleusClient(API_KEY)
# download the Penn-Fudan dataset
wget .
# extract it in the current folder

To get started with Nucleus, we will first upload the PennFudan dataset to the

Nucleus Dashboard. Uploading data is easy using the Nucleus Python client:

specify either the local file path or, if your data is stored in the cloud,

the image URL. In this case, we’ve downloaded the PennFudan dataset locally, so for

each dataset item we specify the file path of the image. We’ll use the file

name as a reference_id that can be used to refer to the images in later API calls.


nucleus_dataset = nucleus.create_dataset("PennFudan")
DATASET_ID =['dataset_id'] # Save unique ID for future use
def format_img_data_for_upload(img_dir: str):
  """Instantiates a Nucleus DatasetItem for all images in the PennFudan dataset.

  Parameters
  ----------
  img_dir : str
      The filepath to the image directory

  Returns
  -------
      A list of DatasetItem that can be uploaded via the Nucleus API
  """
  img_list = []
  for img_filename in os.listdir(img_dir):
      # use the file name without its extension as the reference_id
      img_id = img_filename.split('.')[0]
      item = DatasetItem(
          image_location=os.path.join(img_dir, img_filename),
          reference_id=img_id,
      )
      img_list.append(item)
  return img_list

img_list = format_img_data_for_upload(IMG_DIR)

response = nucleus_dataset.append(img_list)
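
Because every reference_id is derived from a file name, it can be worth checking that the derived IDs are unique before uploading. The following is a minimal sketch of such a check using only the standard library. The helper name build_reference_ids is hypothetical and not part of the Nucleus client, and it uses os.path.splitext, which differs from split('.')[0] only for file names that contain more than one dot.

```python
import os
from collections import Counter


def build_reference_ids(filenames):
    """Map each image file name to a reference ID (hypothetical helper).

    The ID is the file name without its final extension. Raises ValueError
    if two different files would end up with the same ID.
    """
    ids = {name: os.path.splitext(os.path.basename(name))[0] for name in filenames}
    duplicates = sorted(i for i, n in Counter(ids.values()).items() if n > 1)
    if duplicates:
        raise ValueError(f"duplicate reference IDs: {duplicates}")
    return ids


if __name__ == "__main__":
    files = ["FudanPed00001.png", "FudanPed00002.png", "PennPed00001.png"]
    print(build_reference_ids(files))
```

A collision would occur, for example, if the same image were stored as both a .png and a .jpg file in the directory.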

Once the images are uploaded, the next step is to upload the ground truth

bounding box annotations.

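def format_annotations_for_upload(annotations_dir: str):
  """Instantiates a Nucleus BoxAnnotation for all ground truth annotations
  in the PennFudan dataset.

  Parameters
  ----------
  annotations_dir : str
      The filepath to the directory storing mask annotations

  Returns
  -------
      A list of BoxAnnotation items that can be uploaded via the Nucleus API
  """
  annotation_list = []
  for mask_filename in os.listdir(annotations_dir):
      # strip the extension and the trailing "_mask" to recover the image ID
      img_filename = mask_filename.split(".")[0][:-5]
      mask = Image.open(os.path.join(annotations_dir, mask_filename))
      mask = np.array(mask)
      # get_boxes is a helper (omitted here) that returns the boxes in the mask
      boxes = get_boxes(
          mask, use_height_width_format=True
      )
      for i in range(len(boxes)):
          x = int(boxes[i][0])
          y = int(boxes[i][1])
          width = int(boxes[i][2])
          height = int(boxes[i][3])
          annotation = BoxAnnotation(
              label=PEDESTRIAN_LABEL, x=x, y=y,
              width=width, height=height, reference_id=img_filename
          )
          annotation_list.append(annotation)
  return annotation_list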

annotations = format_annotations_for_upload(ANNOTATIONS_DIR)
response = nucleus_dataset.annotate(annotations)
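
The get_boxes helper used above is part of the code omitted from this post. As an illustration of what such a helper might do, the sketch below derives one box per instance from a PennFudan-style mask, where 0 marks the background and each pedestrian has its own positive integer ID. The function name, the pure-Python implementation, and the convention that width is xmax - xmin are assumptions, not the notebook's actual code.

```python
def boxes_from_instance_mask(mask, use_height_width_format=True):
    """Return one bounding box per instance ID found in a 2D mask.

    `mask` is a 2D sequence of integers (nested lists or a NumPy array):
    0 marks background and every other value marks one instance.
    Boxes are [x, y, width, height] when use_height_width_format is True,
    otherwise [xmin, ymin, xmax, ymax]. Width is xmax - xmin.
    """
    extents = {}  # instance ID -> [xmin, ymin, xmax, ymax]
    for row_idx, row in enumerate(mask):
        for col_idx, value in enumerate(row):
            value = int(value)
            if value == 0:
                continue
            if value not in extents:
                extents[value] = [col_idx, row_idx, col_idx, row_idx]
            else:
                box = extents[value]
                box[0] = min(box[0], col_idx)
                box[1] = min(box[1], row_idx)
                box[2] = max(box[2], col_idx)
                box[3] = max(box[3], row_idx)

    boxes = []
    for instance_id in sorted(extents):
        xmin, ymin, xmax, ymax = extents[instance_id]
        if use_height_width_format:
            boxes.append([xmin, ymin, xmax - xmin, ymax - ymin])
        else:
            boxes.append([xmin, ymin, xmax, ymax])
    return boxes
```

On full-size images a vectorized NumPy version would be much faster; the loop form is used here only to keep each step visible.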

With the images and annotations uploaded, we can visualize the PennFudan

Dataset in the Nucleus Dashboard.


From the insights tab, we can get a quick overview of the labeled dataset. In

particular, you’ll notice that the annotations consist of exactly one class

(pedestrian), and that the dataset overall is biased toward images with fewer

annotations. The visualization, querying, and insights features in Nucleus

make it easy to dig into interesting subsets of the training dataset.
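
The skew toward sparsely annotated images can also be checked offline. The sketch below counts how many images have 0, 1, 2, ... annotations, given one reference ID per annotation. The function name and the example IDs are made up for illustration; this is not how the insights tab computes its charts.

```python
from collections import Counter


def annotations_per_image_histogram(annotation_reference_ids, all_reference_ids):
    """Count how many images have 0, 1, 2, ... annotations.

    annotation_reference_ids: one reference ID per annotation (repeats allowed).
    all_reference_ids: every image in the dataset, so that images without
    any annotation are counted in the 0 bucket.
    """
    per_image = Counter(annotation_reference_ids)
    histogram = Counter(per_image.get(ref_id, 0) for ref_id in all_reference_ids)
    return dict(sorted(histogram.items()))


if __name__ == "__main__":
    # Made-up IDs for illustration only
    boxes = ["img_a", "img_a", "img_b", "img_c"]
    images = ["img_a", "img_b", "img_c", "img_d"]
    print(annotations_per_image_histogram(boxes, images))
```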

Next, we’ll set up the PyTorch DataLoaders needed for training and evaluation.

Setting up transforms and DataLoaders

def get_transform(train):
    transforms = []
    # converts the image, a PIL image, into a PyTorch Tensor
    transforms.append(T.ToTensor())
    if train:
        # during training, randomly flip the training images
        # and ground truth for data augmentation
        transforms.append(T.RandomHorizontalFlip(0.5))
    return T.Compose(transforms)

# use our dataset and defined transformations
dataset_train = PennFudanDataset('PennFudanPed', get_transform(train=True))
dataset_test = PennFudanDataset('PennFudanPed', get_transform(train=False))
# split the dataset in train and test set
N_TEST = 50
indices = torch.randperm(len(dataset_train)).tolist()
dataset_train = torch.utils.data.Subset(dataset_train, indices[:-N_TEST])
dataset_test = torch.utils.data.Subset(dataset_test, indices[-N_TEST:])

# define training and test data loaders
data_loader = torch.utils.data.DataLoader(
    dataset_train, batch_size=2, shuffle=True, num_workers=0,
    collate_fn=utils.collate_fn)

data_loader_test = torch.utils.data.DataLoader(
    dataset_test, batch_size=1, shuffle=False, num_workers=0,
    collate_fn=utils.collate_fn)
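
Note that torch.randperm draws a new permutation on every run unless the random seed is fixed (for example with torch.manual_seed), so the train/test membership can change between runs. If the split should stay stable, one option is to seed the shuffle. The sketch below shows the idea with the standard library's random module rather than torch; the function name and the seed value are hypothetical.

```python
import random


def split_indices(n_items, n_test, seed=0):
    """Return (train_indices, test_indices) from a seeded shuffle.

    The last n_test shuffled indices form the test set, mirroring the
    indices[:-N_TEST] / indices[-N_TEST:] slicing used above.
    """
    if not 0 < n_test < n_items:
        raise ValueError("n_test must be between 1 and n_items - 1")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return indices[:-n_test], indices[-n_test:]


if __name__ == "__main__":
    train_idx, test_idx = split_indices(n_items=170, n_test=50, seed=0)
    print(len(train_idx), len(test_idx))  # 120 50
```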

Use Slices to keep track of test and train sets within Nucleus

When evaluating model predictions, we’ll want to differentiate between train

and test examples within the Nucleus platform. We can do so using slices. A

slice is just a collection of dataset items. Below, we create slices for the

train and test sets.

def get_slice_item_reference_ids(dataset):
    """Given a PyTorch dataset, returns a list of reference_ids corresponding
    to the dataset's contents.

    Parameters
    ----------
    dataset : torch.utils.data.Dataset
        A PyTorch dataset.

    Returns
    -------
        A list of reference_ids for items to be uploaded to a Nucleus Slice.
    """
    reference_ids = []
    for i in range(len(dataset)):
        _, _, img_file_name = dataset[i]
        ref_id = img_file_name.split(".")[0]  # remove file extension
        reference_ids.append(ref_id)
    return reference_ids

train_reference_ids = get_slice_item_reference_ids(dataset_train)
slice_train = nucleus_dataset.create_slice(name='training_set', reference_ids=train_reference_ids)

test_reference_ids = get_slice_item_reference_ids(dataset_test)
slice_test = nucleus_dataset.create_slice(name='test_set', reference_ids=test_reference_ids)

Once a slice is created via the Python client, it will appear in the Nucleus

interface. The slicing feature is useful for examining a particular subset of

the dataset. When viewing a slice, the search bar queries are applied only to

the items within the current slice.
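
Before creating slices, it can be worth confirming that the train and test ID lists do not overlap, since an accidental overlap would leak training images into the evaluation. The following is a small sketch of such a check; the helper name is hypothetical and it operates on plain lists of reference IDs, not on Nucleus objects.

```python
def check_disjoint_split(train_reference_ids, test_reference_ids):
    """Raise ValueError if any reference ID appears in both splits.

    Returns the number of distinct IDs in each split when they are disjoint.
    """
    train_ids = set(train_reference_ids)
    test_ids = set(test_reference_ids)
    overlap = sorted(train_ids & test_ids)
    if overlap:
        raise ValueError(
            f"{len(overlap)} reference IDs appear in both splits, e.g. {overlap[:3]}"
        )
    return len(train_ids), len(test_ids)
```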

Define the model & training loop

We are now ready to start using Nucleus to analyze experiments. The next step

is to upload model predictions. In the code block below, we finetune the

Faster R-CNN for 7 epochs before evaluating it on the test set consisting of

50 held-out examples.

In the code segment below, we’ve factored the details of the training loop

into some helper functions imported from vision/references/detection/. Here’s

the simplified training loop:

from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def get_model(num_classes):
    # load an object detection model pre-trained on COCO
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
    # get number of input features for the classifier
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    # replace the pre-trained head with a new one
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

    return model

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
# our dataset has two classes only - background and person
num_classes = 2
# get the model using our helper function
model = get_model(num_classes)
# move model to the right device
model.to(device)
# construct an optimizer
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005,
                            momentum=0.9, weight_decay=0.0005)
# and a learning rate scheduler which decreases the learning rate by
# 10x every 3 epochs
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                               step_size=3,
                                               gamma=0.1)
# let's train it for 7 epochs
num_epochs = 7

for epoch in range(num_epochs):
    # train for one epoch, printing every 20 iterations
    train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=20)
    # update the learning rate
    lr_scheduler.step()
    # evaluate on the test dataset
    evaluate(model, data_loader_test, device=device)
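
To make the schedule concrete: decreasing the learning rate by 10x every 3 epochs from a starting value of 0.005 means the rate in epoch e is 0.005 × 0.1^(e // 3). The sketch below reproduces that arithmetic in plain Python. It does not call PyTorch, and it assumes the scheduler is stepped once per epoch.

```python
def step_lr(initial_lr, epoch, step_size=3, gamma=0.1):
    """Learning rate used during `epoch` (0-indexed) under step decay."""
    return initial_lr * gamma ** (epoch // step_size)


if __name__ == "__main__":
    for epoch in range(7):
        print(epoch, f"{step_lr(0.005, epoch):.6f}")
```

Under these assumptions, epochs 0 to 2 train at 0.005, epochs 3 to 5 at 0.0005, and the final epoch at 0.00005.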

The evaluation output shows that overall, the finetuned model achieves an

average precision of 0.827 and average recall of 0.386. Not bad for a first attempt.


But now what? These summary statistics don’t tell us very

much about how we could make this model better. That’s where Nucleus comes in.

Below, we create a model_run, which functions as a container for the

predictions from this experiment. Formatting the predictions and sending them

to Nucleus takes just a couple lines of code, and can easily be integrated

into a PyTorch evaluation function. Once these predictions are visible in the

Nucleus dashboard, there are all kinds of error analysis tools at our disposal.


Adding model predictions to Nucleus

The first step for uploading model predictions is to create a Nucleus model.

Below, we give the model the reference ID faster_r_cnn so that we can use it to track

all experiments using this architecture.

# Create a new model to track all of our Faster R-CNN experiments
nucleus_model = nucleus.add_model(name="Pytorch Faster R-CNN", reference_id="faster_r_cnn")

Then, we create a corresponding model_run. The model_run will keep

track of the results of this particular experiment. Once the model_run object is created, we upload the model predictions.

# Predictions for entire dataset
def get_model_predictions(model, dataset):
  """Gets bounding box predictions on provided dataset, returns
  BoxPredictions that can be uploaded to Nucleus via API

  Parameters
  ----------
  model :
      Pytorch Faster R-CNN model
  dataset :
      Dataset on which to conduct inference.

  Returns
  -------
      A list of BoxPredictions that can be uploaded to Nucleus via API.
  """
  predictions = []
  model.eval()  # inference mode: the model returns detections, not losses
  for img, _, img_name in dataset:
      pred = model([img.to(device)])
      boxes = pred[0]["boxes"].cpu().detach().numpy()
      img_id = img_name.split(".")[0]
      for box in boxes:
          # convert (xmin, ymin, xmax, ymax) to (x, y, width, height)
          (xmin, ymin, xmax, ymax) = box
          x = int(xmin)
          y = int(ymin)
          width = int(xmax - xmin)
          height = int(ymax - ymin)
          curr_pred = BoxPrediction(label=PEDESTRIAN_LABEL, x=x, y=y,
                                    width=width, height=height, reference_id=img_id)
          predictions.append(curr_pred)
  return predictions

dataset_test_all = PennFudanDataset('PennFudanPed', get_transform(train=False))
preds = get_model_predictions(model, dataset_test_all)

model_run = nucleus_model.create_run(name="Faster R-CNN Object Detection (Finetuning)", dataset=nucleus_dataset, predictions=preds)
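
The model returns boxes as (xmin, ymin, xmax, ymax), while the annotations and predictions uploaded here use (x, y, width, height). The conversion inside get_model_predictions can be factored into a small standalone function, sketched below together with its inverse. The function names are hypothetical, and the int() truncation mirrors the code above rather than any requirement of the Nucleus API.

```python
def xyxy_to_xywh(box):
    """Convert (xmin, ymin, xmax, ymax) to integer (x, y, width, height).

    Coordinates are truncated with int(), and width and height are the
    truncated differences, as in get_model_predictions.
    """
    xmin, ymin, xmax, ymax = box
    if xmax < xmin or ymax < ymin:
        raise ValueError(f"malformed box: {box}")
    return int(xmin), int(ymin), int(xmax - xmin), int(ymax - ymin)


def xywh_to_xyxy(box):
    """Inverse conversion, useful when comparing against corner-format boxes."""
    x, y, width, height = box
    return x, y, x + width, y + height
```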

After model predictions are uploaded, we are ready to commit the model run. On

commit, Nucleus automatically matches predictions to ground truth, calculates

IoU, and indexes false positives and false negatives.
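
For intuition about what this matching step involves, here is a minimal sketch of IoU and a greedy one-to-one matching between predicted and ground truth boxes. It illustrates the general technique only: this post does not describe Nucleus's actual matching procedure, and the 0.5 threshold and the function names are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


def match_boxes(predictions, ground_truth, iou_threshold=0.5):
    """Greedily pair predictions with ground truth boxes by descending IoU.

    Returns (matches, false_positives, false_negatives), where matches is a
    list of (prediction_index, ground_truth_index, iou) tuples and the other
    two are lists of unmatched indices.
    """
    candidates = sorted(
        (
            (iou(pred, truth), p_idx, t_idx)
            for p_idx, pred in enumerate(predictions)
            for t_idx, truth in enumerate(ground_truth)
        ),
        reverse=True,
    )
    matches, used_preds, used_truths = [], set(), set()
    for score, p_idx, t_idx in candidates:
        if score < iou_threshold:
            break  # all remaining candidates overlap even less
        if p_idx in used_preds or t_idx in used_truths:
            continue
        matches.append((p_idx, t_idx, score))
        used_preds.add(p_idx)
        used_truths.add(t_idx)
    false_positives = [i for i in range(len(predictions)) if i not in used_preds]
    false_negatives = [i for i in range(len(ground_truth)) if i not in used_truths]
    return matches, false_positives, false_negatives
```

In this sketch, a prediction with no sufficiently overlapping ground truth box becomes a false positive, and a ground truth box with no matching prediction becomes a false negative.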


Section 3: Analysis

Now that we have the newly created model_run uploaded to the Nucleus platform,

we can easily compare the predicted bounding boxes to the ground truth. Using

the patches view, we can examine the individual predictions, sorting by best

IoU, worst IoU, false positives, and false negatives.


In the patches view, we find a number of interesting observations:


Figure 1 (above): Examining false positives, we see that

there are several instances where the model detected a cyclist as a

pedestrian, but there was no ground truth bounding box for the cyclist. It

would likely improve model performance to add a cyclist class; this would

encourage the model to differentiate between the two types of objects, rather

than just ignore cyclists.


Figure 2: The treatment of partially-occluded pedestrians is

not consistent across different images. The occluded pedestrians are labeled

in some places, but not in others.


Figure 3: There are many instances of missing ground truth

annotations, particularly for pedestrians in the background.


Figure 4: In general, it seems the model performs best on

pedestrians in the foreground, and exhibits poor performance on crowded scenes.


The worst-kept secret in machine learning

Nucleus shines a light on the worst-kept secret in machine learning:

most datasets are filled with errors. Not only are they often

full of bad annotations, but a typical ML engineer

would never even notice. All we see is “average precision =

0.827”. Annotation errors are particularly insidious because they put a

ceiling on model performance that is invisible to the ML engineer. The fact

that these errors remain invisible is a failure of ML tooling. Nucleus helps

address it. (In fact, one reason Scale can provide annotations at such high

quality is that we leverage tools like Nucleus internally.)

Addressing the problem

Having the ability to visualize predictions and view errors all in one place

gives us much better insight into what is actually going wrong with this

model’s performance. Looking through the test set, we find that many of the

predictions marked as false positives are actually just missing ground truth annotations.


This is concerning: without high-quality labels on the test set, it’s

difficult to trust evaluation metrics as an unbiased performance estimate.


Some actionable next steps we can take to improve this model’s performance include:


  • Add a new label class for cyclist
  • Correct missing annotations
  • Source additional examples of crowded scenes
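
One way to triage the missing-annotation corrections is to review the false positives the model is most confident about first, since these are plausible candidates for missing labels. The sketch below assumes each false positive carries a confidence score, which the upload code in this post does not include. The dictionary keys and the 0.8 threshold are hypothetical.

```python
def missing_annotation_candidates(false_positives, min_confidence=0.8):
    """Return false positives worth reviewing as possible missing labels.

    false_positives: list of dicts with "reference_id" and "confidence" keys.
    Results are sorted from most to least confident.
    """
    candidates = [fp for fp in false_positives if fp["confidence"] >= min_confidence]
    return sorted(candidates, key=lambda fp: fp["confidence"], reverse=True)


if __name__ == "__main__":
    # Made-up detections for illustration only
    detections = [
        {"reference_id": "img_a", "confidence": 0.95},
        {"reference_id": "img_b", "confidence": 0.40},
        {"reference_id": "img_c", "confidence": 0.85},
    ]
    for candidate in missing_annotation_candidates(detections):
        print(candidate["reference_id"], candidate["confidence"])
```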

Correcting missing annotations

In order to continue making progress on this pedestrian detection task, we

need to be sure that the evaluation metrics on the test set are actually

reflective of good model performance. We can only trust these metrics when the

labels are exhaustive and high quality. Using the Nucleus patches view, we correct

missing annotations in both the train and test sets. The mean average

precision (mAP) and mean average recall (mAR) from these experiments are

summarized in the following table.


We found that evaluating the baseline model against the improved test set

yields lower mAP and mAR, indicating that

evaluating against noisy labels in the test set overestimated model performance.


Moreover, after also correcting labels in the training dataset, the precision

on the corrected test set improved by 7.98%, and recall improved by 7.36%

relative to the baseline model. Overall, these results demonstrate that

concentrating on label quality yields significant gains in overall model performance.


Finding additional examples for the training dataset

One finding that label corrections do not address is the poor performance on

crowded scenes. As discussed above, the dataset is heavily skewed towards

examples with two or fewer pedestrians. Crowded scenes are rare in our

dataset; one reliable way to address this edge case is to include more crowded

scenes in our training dataset. In addition to the analysis tools like the

patches and insights views, Nucleus supports a variety of features targeted at

mining large sets of unlabeled data for rare scenarios.

Since we’re already using all of the images in PennFudan for training and

evaluation, we’ll use another publicly available dataset, WiderPerson, as a

source of additional examples.

Putting PennFudan and WiderPerson together in one dataset, we now have a mix

of labeled and unlabeled data.

Once we’ve found one example from WiderPerson that we would want to include in

our training set, we can use image similarity search to find other, similar examples.


Nucleus features like similarity search are particularly useful on massive

datasets where it’s impossible to look through every example. Using similarity

search, we can locate additional examples of pedestrians without having to

rely on annotations, helping prioritize what unlabeled data to label next.
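
A common idea behind similarity search is to represent each image as an embedding vector and rank other images by their cosine similarity to a query image. This post does not describe how Nucleus implements the feature, so the sketch below is a generic illustration with made-up three-dimensional embeddings and hypothetical function names.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query_id, embeddings, top_k=3):
    """Rank all other items by cosine similarity to the query item.

    embeddings: dict mapping reference ID -> embedding vector.
    Returns a list of (reference_id, similarity) pairs, best first.
    """
    query = embeddings[query_id]
    scored = [
        (ref_id, cosine_similarity(query, vector))
        for ref_id, vector in embeddings.items()
        if ref_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Made-up embeddings for illustration; real ones come from a vision model
    embeddings = {
        "crowd_1": [0.9, 0.1, 0.0],
        "crowd_2": [0.8, 0.2, 0.1],
        "empty_street": [0.0, 0.1, 0.9],
    }
    print(most_similar("crowd_1", embeddings, top_k=2))
```

Because the ranking uses only the embeddings, it works equally well on labeled and unlabeled images.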


In this tutorial, we’ve demonstrated how Nucleus can help developers get

actionable insights into model performance and dataset management. We’ve shown

how to get started by uploading images, annotations, and predictions via the

Python API. Using Nucleus features for model analysis, we learned that

labeling quality can really interfere with evaluation metrics. If such errors

occur in small datasets like PennFudan, you can only imagine how commonly they

occur in larger datasets like ImageNet and MS COCO. Using Nucleus to identify

issues with annotation taxonomy, quality, and consistency can thus play a huge

role in producing quality results.


This tutorial is adapted from this PyTorch tutorial and uses the PennFudan Dataset.
