Imagine you’re a machine learning engineer responsible for building a pedestrian detection algorithm. It’s easy to get started, perhaps by training a standard model on a small dataset. You get some preliminary results, and now you want to improve on the baseline. But what do you do next?
In academia, common approaches are new algorithms, different architectures, or larger hyperparameter searches. But academics fight with one hand tied behind their backs: they usually cannot edit their datasets. In practical applications, on the other hand, datasets are malleable and can be extended to suit a problem’s needs. In practice, the most effective way to improve an ML model is often to improve the dataset.
At Scale, we specialize in creating high-quality, large-scale training datasets for our customers. Expanding beyond our core strength of data labeling, we developed Nucleus because we think it should be easier to visualize, search, and improve computer vision datasets, and because we saw many ML engineers struggling to answer two core questions:
I know my ML model isn’t perfect. But where specifically does it fail today, and why?
How can I improve my dataset to fix these failures?
With Nucleus, ML teams can use their models to discover issues in their datasets, surface the part of the data distribution where models struggle most, compare model runs, and prioritize what new unlabeled data to label next.
In the rest of this post, we’ll walk through a step-by-step tutorial on how to get started with Nucleus. We’ll show you how to upload an image dataset with ground truth annotations, as well as how to integrate the Nucleus Python client into a PyTorch training workflow to analyze model results.
Section 2: Getting Started with the Nucleus Python Client
The complete code for this tutorial is available in this Colab notebook. If you’d like to follow along step by step and create a personal dataset on Nucleus, you will first need to make a free Scale account and obtain a test API key. You can then copy the notebook and add your personal key to upload data for free.
To continue with the example of pedestrian detection, we’ll use the PennFudan pedestrian detection dataset to finetune a pre-trained Faster R-CNN model. PennFudan consists of 170 images labeled with both bounding boxes and segmentation masks. Some code has been omitted for brevity; see this Colab notebook for the complete implementation. Let’s get started by instantiating the Nucleus client and downloading the dataset.
# download the Penn-Fudan dataset
wget https://www.cis.upenn.edu/~jshi/ped_html/PennFudanPed.zip
# extract it in the current folder
unzip PennFudanPed.zip
To get started with Nucleus, we will first upload the PennFudan dataset to the Nucleus Dashboard. Uploading data is easy using the Nucleus Python client: for each item, specify either a local file path or, if your data is stored in the cloud, an image URL. In this case, we’ve downloaded the PennFudan dataset locally, so for each dataset item we specify the file path of the image. We’ll use the file name (without its extension) as a reference_id that can be used to refer to the images in later API requests.
import os

nucleus_dataset = nucleus.create_dataset("PennFudan")
DATASET_ID = nucleus_dataset.info()['dataset_id']  # Save unique ID for future use

def format_img_data_for_upload(img_dir: str):
    """Instantiates a Nucleus DatasetItem for all images in the PennFudan dataset.

    Parameters
    ----------
    img_dir : str
        The filepath to the image directory

    Returns
    -------
    List[DatasetItem]
        A list of DatasetItem that can be uploaded via the Nucleus API
    """
    img_list = []
    for img_filename in os.listdir(img_dir):
        # use the file name without its extension as the reference_id
        img_id = img_filename.split('.')[0]
        item = DatasetItem(
            image_location=os.path.join(img_dir, img_filename),
            reference_id=img_id,
        )
        img_list.append(item)
    return img_list

img_list = format_img_data_for_upload(IMG_DIR)

response = nucleus_dataset.append(img_list)
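Because every later annotation and prediction is tied to an image through its reference_id, it can be worth checking the IDs before uploading. The following is a minimal standard-library sketch, not part of the Nucleus client: the helper names `reference_id_from_filename` and `check_unique` are hypothetical, and the file names are only illustrative.

```python
from pathlib import Path


def reference_id_from_filename(filename: str) -> str:
    """Return the file name without its directory or final extension."""
    return Path(filename).stem


def check_unique(reference_ids):
    """Raise ValueError if two files would map to the same reference_id."""
    seen = set()
    for ref_id in reference_ids:
        if ref_id in seen:
            raise ValueError(f"duplicate reference_id: {ref_id}")
        seen.add(ref_id)
    return reference_ids


# Illustrative file names only; in the tutorial they come from os.listdir(IMG_DIR).
filenames = ["PNGImages/FudanPed00001.png", "PNGImages/PennPed00096.png"]
ids = check_unique([reference_id_from_filename(f) for f in filenames])
print(ids)
```

Note that `str.split('.')` on its own returns a list, so the stem has to be selected explicitly (or obtained with `Path.stem`, as here) before it can serve as an ID.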
Once the images are uploaded, the next step is to upload the ground truth bounding box annotations.
import numpy as np
from PIL import Image

def format_annotations_for_upload(annotations_dir: str):
    """Instantiates a Nucleus BoxAnnotation for all ground truth annotations
    in the PennFudan dataset.

    Parameters
    ----------
    annotations_dir : str
        The filepath to the directory storing mask annotations

    Returns
    -------
    List[BoxAnnotation]
        A list of BoxAnnotation items that can be uploaded via the Nucleus API
    """
    annotation_list = []
    for mask_filename in os.listdir(annotations_dir):
        # drop the extension, then the trailing "_mask" (5 characters),
        # to recover the reference_id of the corresponding image
        img_filename = mask_filename.split(".")[0][:-5]
        mask = Image.open(os.path.join(annotations_dir, mask_filename))
        mask = np.array(mask)
        # get_boxes is a helper from the notebook (omitted here); with
        # use_height_width_format=True each box is read as [x, y, width, height]
        boxes = get_boxes(mask, use_height_width_format=True)
        for i in range(len(boxes)):
            x = int(boxes[i][0])
            y = int(boxes[i][1])
            width = int(boxes[i][2])
            height = int(boxes[i][3])
            annotation = BoxAnnotation(
                label=PEDESTRIAN_LABEL,
                x=x,
                y=y,
                width=width,
                height=height,
                reference_id=img_filename,
            )
            annotation_list.append(annotation)
    return annotation_list

annotations = format_annotations_for_upload(ANNOTATIONS_DIR)
response = nucleus_dataset.annotate(annotations)
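The `get_boxes` helper is not shown in this post. As a rough illustration of what such a helper might do, here is a pure-Python sketch that derives one (x, y, width, height) box per pedestrian from an instance mask. It assumes each pedestrian is encoded with its own positive integer id and that 0 is background; the name `boxes_from_mask` is hypothetical, and the notebook's NumPy-based version may differ. Width and height are computed as `xmax - xmin` and `ymax - ymin`, the same convention the prediction code in this post uses.

```python
def boxes_from_mask(mask):
    """Compute one (x, y, width, height) box per instance id in a 2D mask.

    mask is a list of rows; each entry is 0 (background) or an instance id.
    Boxes are returned sorted by instance id.
    """
    extents = {}  # instance id -> [xmin, ymin, xmax, ymax]
    for row_idx, row in enumerate(mask):
        for col_idx, instance_id in enumerate(row):
            if instance_id == 0:
                continue  # skip background pixels
            if instance_id not in extents:
                extents[instance_id] = [col_idx, row_idx, col_idx, row_idx]
            else:
                ext = extents[instance_id]
                ext[0] = min(ext[0], col_idx)
                ext[1] = min(ext[1], row_idx)
                ext[2] = max(ext[2], col_idx)
                ext[3] = max(ext[3], row_idx)
    return [
        (xmin, ymin, xmax - xmin, ymax - ymin)
        for _, (xmin, ymin, xmax, ymax) in sorted(extents.items())
    ]


# Toy mask with two instances (ids 1 and 2); not real PennFudan data.
toy_mask = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
]
print(boxes_from_mask(toy_mask))
```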
With the images and annotations uploaded, we can visualize the PennFudan Dataset in the Nucleus Dashboard.
From the insights tab, we can get a quick overview of the labeled dataset. In particular, you’ll notice that the annotations consist of exactly one class (pedestrian), and that the dataset overall is biased toward images with fewer annotations. The visualization, querying, and insights features in Nucleus make it easy to dig into interesting subsets of the training dataset.
Next, we’ll set up the PyTorch DataLoaders needed for training and evaluation.
def get_transform(train):
    transforms = []
    # converts the image, a PIL image, into a PyTorch Tensor
    transforms.append(T.ToTensor())
    if train:
        # during training, randomly flip the training images
        # and ground truth for data augmentation
        transforms.append(T.RandomHorizontalFlip(0.5))
    return T.Compose(transforms)

# use our dataset and defined transformations
dataset_train = PennFudanDataset('PennFudanPed', get_transform(train=True))
dataset_test = PennFudanDataset('PennFudanPed', get_transform(train=False))

# split the dataset in train and test set
torch.manual_seed(1)
N_TEST = 50
indices = torch.randperm(len(dataset_train)).tolist()
dataset_train = torch.utils.data.Subset(dataset_train, indices[:-N_TEST])
dataset_test = torch.utils.data.Subset(dataset_test, indices[-N_TEST:])

# define training and test data loaders
data_loader = torch.utils.data.DataLoader(
    dataset_train, batch_size=2, shuffle=True, num_workers=0,
    collate_fn=utils.collate_fn)

data_loader_test = torch.utils.data.DataLoader(
    dataset_test, batch_size=1, shuffle=False, num_workers=0,
    collate_fn=utils.collate_fn)
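The hold-out logic above can be looked at in isolation. This sketch uses the standard library's `random` module instead of `torch.randperm`, so the actual permutation will differ from the notebook's; the function name `split_indices` is hypothetical. It only illustrates that shuffling once and slicing from the end yields disjoint train and test sets.

```python
import random


def split_indices(n_items, n_test, seed=1):
    """Shuffle indices 0..n_items-1 and hold out the last n_test for testing.

    n_test must be at least 1, because indices[:-0] would be an empty list.
    """
    if not 0 < n_test < n_items:
        raise ValueError("n_test must be between 1 and n_items - 1")
    rng = random.Random(seed)  # seeded so the split is reproducible
    indices = list(range(n_items))
    rng.shuffle(indices)
    return indices[:-n_test], indices[-n_test:]


# PennFudan has 170 images; the tutorial holds out 50 of them.
train_idx, test_idx = split_indices(n_items=170, n_test=50)
print(len(train_idx), len(test_idx))
```

With 170 images and 50 held out, 120 images remain for training.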
When evaluating model predictions, we’ll want to differentiate between train and test examples within the Nucleus platform. We can do so using slices. A slice is just a collection of dataset items. Below, we create slices for the train and test sets.
def get_slice_item_reference_ids(dataset: torch.utils.data.Subset):
    """Given a PyTorch dataset, returns a list of reference_ids corresponding
    to the dataset's contents.

    Parameters
    ----------
    dataset : torch.utils.data.Subset
        A PyTorch dataset.

    Returns
    -------
    List[str]
        A list of reference_ids for items to be uploaded to a Nucleus Slice.
    """
    reference_ids = []
    for i in range(len(dataset)):
        _, _, img_file_name = dataset[i]
        ref_id = img_file_name.split(".")[0]  # remove file extension
        reference_ids.append(ref_id)
    return reference_ids

train_reference_ids = get_slice_item_reference_ids(dataset_train)
slice_train = nucleus_dataset.create_slice(name='training_set', reference_ids=train_reference_ids)

test_reference_ids = get_slice_item_reference_ids(dataset_test)
slice_test = nucleus_dataset.create_slice(name='test_set', reference_ids=test_reference_ids)
Once a slice is created via the Python client, it will appear in the Nucleus interface. The slicing feature is useful for examining a particular subset of the dataset. When viewing a slice, the search bar queries are applied only to the items within the current slice.
We are now ready to start using Nucleus to analyze experiments. The next step is to produce model predictions to upload. We finetune the Faster R-CNN for 7 epochs and evaluate it on the test set consisting of 50 held-out examples.
In the code segment below, we’ve factored the details of the training loop into some helper functions imported from vision/references/detection/. Here’s the simplified training loop:
def get_model(num_classes):
    # load an object detection model pre-trained on COCO
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
    # get number of input features for the classifier
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    # replace the pre-trained head with a new one
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

    return model

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
# our dataset has two classes only - background and person
num_classes = 2
# get the model using our helper function
model = get_model(num_classes)
# move model to the right device
model.to(device)
# construct an optimizer
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005,
                            momentum=0.9, weight_decay=0.0005)
# and a learning rate scheduler which decreases the learning rate by
# 10x every 3 epochs
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                               step_size=3,
                                               gamma=0.1)
# let's train it for 7 epochs
num_epochs = 7

model.train()
for epoch in range(num_epochs):
    # train for one epoch, printing every 20 iterations
    train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=20)
    # update the learning rate
    lr_scheduler.step()
    # evaluate on the test dataset
    evaluate(model, data_loader_test, device=device)
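To make the scheduler settings concrete, the learning rate under a step schedule with `step_size=3` and `gamma=0.1` can be worked out by hand: it is the initial rate multiplied by `gamma` once for every completed block of three epochs. The small sketch below reproduces that arithmetic without PyTorch; `step_lr` is a hypothetical helper, not a library function.

```python
def step_lr(initial_lr, epoch, step_size=3, gamma=0.1):
    """Learning rate in effect during a 0-indexed epoch under a step schedule."""
    return initial_lr * gamma ** (epoch // step_size)


# Learning rate used in each of the 7 training epochs.
for epoch in range(7):
    print(epoch, f"{step_lr(0.005, epoch):.6f}")
```

Epochs 0 to 2 train at 0.005, epochs 3 to 5 at 0.0005, and the final epoch at 0.00005.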
The evaluation output shows that overall, the finetuned model achieves an average precision of 0.827 and average recall of 0.386. Not bad for a first shot!
But now what? These summary statistics don’t tell us very much about how we could make this model better. That’s where Nucleus comes in. Below, we create a model_run, which functions as a container for the predictions from this experiment. Formatting the predictions and sending them to Nucleus takes just a couple lines of code, and can easily be integrated into a PyTorch evaluation function. Once these predictions are visible in the Nucleus dashboard, there are all kinds of error analysis tools at our disposal.
The first step for uploading model predictions is to create a Nucleus model. Below, we give the model the reference_id faster_r_cnn so that we can use it to track all experiments using this architecture.
# Create a new model to track all of our Faster R-CNN experiments
nucleus_model = nucleus.add_model(name="Pytorch Faster R-CNN", reference_id="faster_r_cnn")
Then, we create a corresponding model_run. The model_run will keep track of the results of this particular experiment. Once the model_run object is created, we upload the model predictions.
# Predictions for entire dataset
def get_model_predictions(model, dataset):
    """Gets bounding box predictions on provided dataset, returns
    BoxPredictions that can be uploaded to Nucleus via API

    Parameters
    ----------
    model :
        PyTorch Faster R-CNN model
    dataset :
        Dataset on which to conduct inference.

    Returns
    -------
    List[BoxPrediction]
        A list of BoxPredictions that can be uploaded to Nucleus via API.
    """
    predictions = []
    model.eval()
    for i, (img, _, img_name) in enumerate(dataset):
        # the model takes a list of images and returns one dict per image
        pred = model([img.to(device)])
        boxes = pred[0]["boxes"].cpu().detach().numpy()
        img_id = img_name.split(".")[0]
        for box in boxes:
            # convert corner format to (x, y, width, height)
            (xmin, ymin, xmax, ymax) = box
            x = int(xmin)
            y = int(ymin)
            width = int(xmax - xmin)
            height = int(ymax - ymin)
            curr_pred = BoxPrediction(
                label=PEDESTRIAN_LABEL,
                x=x,
                y=y,
                width=width,
                height=height,
                reference_id=img_id,
            )
            predictions.append(curr_pred)
    return predictions

dataset_test_all = PennFudanDataset('PennFudanPed', get_transform(train=False))
preds = get_model_predictions(model, dataset_test_all)

model_run = nucleus_model.create_run(name="Faster R-CNN Object Detection (Finetuning)", dataset=nucleus_dataset, predictions=preds)
After model predictions are uploaded, we are ready to commit the model run. On commit, Nucleus automatically matches predictions to ground truth, calculates IoU, and indexes false positives and false negatives.
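This post does not describe how Nucleus performs the matching internally, so the following is only a generic sketch of the idea: compute IoU between (x, y, width, height) boxes, greedily pair each prediction with the best remaining ground truth box, and treat leftovers as false positives or false negatives. The function names and the 0.5 IoU threshold are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


def match_predictions(predictions, ground_truth, iou_threshold=0.5):
    """Greedy one-to-one matching of predictions to ground truth boxes.

    Returns (matches, false_positives, false_negatives), where matches holds
    (prediction index, ground truth index, IoU) and the other two lists hold
    indices of unmatched predictions and unmatched ground truth boxes.
    """
    unmatched_gt = list(range(len(ground_truth)))
    matches, false_positives = [], []
    for pred_idx, pred in enumerate(predictions):
        best_gt, best_iou = None, iou_threshold
        for gt_idx in unmatched_gt:
            score = iou(pred, ground_truth[gt_idx])
            if score >= best_iou:
                best_gt, best_iou = gt_idx, score
        if best_gt is None:
            false_positives.append(pred_idx)
        else:
            matches.append((pred_idx, best_gt, best_iou))
            unmatched_gt.remove(best_gt)
    return matches, false_positives, unmatched_gt


# Toy boxes, not taken from PennFudan.
ground_truth = [(10, 10, 20, 40), (200, 10, 10, 30)]
predictions = [(12, 10, 20, 40), (100, 50, 20, 40)]
matches, false_positives, false_negatives = match_predictions(predictions, ground_truth)
print(matches, false_positives, false_negatives)
```

In this toy case the first prediction overlaps the first ground truth box, the second prediction has no counterpart (a false positive), and the second ground truth box goes undetected (a false negative). A missing ground truth annotation would show up exactly like the false positive here, which is why label quality matters for these metrics.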
Section 3: Analysis
Now that we have the newly created model_run uploaded to the Nucleus platform, we can easily compare the predicted bounding boxes to the ground truth. Using the patches view, we can examine the individual predictions, sorting by best IoU, worst IoU, false positives, and false negatives.
In the patches view, we make a number of interesting observations:
Figure 1 (above): Examining false positives, we see that there are several instances where the model detected a cyclist as a pedestrian, but there was no ground truth bounding box for the cyclist. It would likely improve model performance to add a cyclist class; this would encourage the model to differentiate between the two types of objects, rather than just ignore cyclists.
Figure 2: The treatment of partially-occluded pedestrians is not consistent across different images. The occluded pedestrians are labeled in some places, but not in others.
Figure 3: There are many instances of missing ground truth annotations, particularly for pedestrians in the background.
Figure 4: In general, it seems the model performs best on pedestrians in the foreground, and exhibits poor performance on crowded images.
Nucleus shines a light on the worst-kept secret in machine learning: most datasets are filled with errors. Not only are they often full of bad annotations, but a typical ML engineer would never even notice. All we see is “average precision = 0.827”. Annotation errors are particularly insidious because they put a ceiling on model performance that is invisible to the ML engineer. The fact that these errors remain invisible is a failure of ML tooling. Nucleus helps address it. (In fact, one reason Scale can provide annotations at such high quality is that we leverage tools like Nucleus internally.)
Having the ability to visualize predictions and view errors all in one place gives us much better insight into what is actually going wrong with this model’s performance. Looking through the test set, we find that many of the predictions marked as false positives are actually just missing ground truth annotations. This is concerning — without high-quality labels on the test set, it’s difficult to trust evaluation metrics as an unbiased performance indicator.
Some actionable next steps we can take to improve this model’s performance are:
In order to continue making progress on this pedestrian detection task, we need to be sure that the evaluation metrics on the test set are actually reflective of good model performance. We can only trust these metrics when the labels are exhaustive and high quality. Using Nucleus patches view, we correct missing annotations in both the train and test sets. The mean average precision (mAP) and mean average recall (mAR) from these experiments are summarized in the following table.
We found that evaluating the baseline model against the improved test set yields lower mAP and mAR, indicating that evaluating against noisy labels in the test set overestimated model performance. Moreover, after also correcting labels in the training dataset, the precision on the corrected test set improved by 7.98%, and recall improved by 7.36% relative to the baseline model. Overall, these results demonstrate that concentrating on label quality yields significant gains in overall performance.
One finding that label corrections do not address is the poor performance on crowded scenes. As discussed above, the dataset is heavily skewed towards examples with two or fewer pedestrians. Crowded scenes are rare in our dataset, so one reliable way to address this edge case is to include more crowded scenes in our training dataset. In addition to analysis tools like the patches and insights views, Nucleus supports a variety of features targeted at mining large sets of unlabeled data for rare scenarios.
Since we’re already using all of the images in PennFudan for training and evaluation, we’ll use another publicly available dataset, WiderPerson, as a source of additional examples.
Putting PennFudan and WiderPerson together in one dataset, we now have a mix of labeled and unlabeled data.
Once we’ve found one example from WiderPerson that we would want to include in our training set, we can use image similarity search to find other, similar examples.
Nucleus features like similarity search are particularly useful on massive datasets where it’s impossible to look through every example. Using similarity search, we can locate additional examples of pedestrians without having to rely on annotations, helping prioritize what unlabeled data to label next.
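How Nucleus implements similarity search is not covered in this post. A common general approach, sketched here purely as an assumption, is to represent each image as an embedding vector and rank other images by cosine similarity to a query image. The embeddings, item names, and the `most_similar` helper below are all made up for illustration; in practice the vectors would come from a pretrained image model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query_id, embeddings, k=2):
    """Return the ids of the k items most similar to the query item."""
    query = embeddings[query_id]
    scored = [
        (cosine_similarity(query, vec), item_id)
        for item_id, vec in embeddings.items()
        if item_id != query_id
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [item_id for _, item_id in scored[:k]]


# Hypothetical 3-dimensional embeddings; real ones have hundreds of dimensions.
embeddings = {
    "crowd_a": [0.9, 0.1, 0.0],
    "crowd_b": [0.8, 0.2, 0.1],
    "street_empty": [0.0, 0.1, 0.9],
    "single_person": [0.3, 0.9, 0.1],
}
print(most_similar("crowd_a", embeddings, k=2))
```

Starting from one crowded scene, a ranking like this surfaces other crowded scenes first without using any labels, which is the property that makes similarity search useful for choosing what to label next.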
In this tutorial, we’ve demonstrated how Nucleus can help developers get actionable insights into model performance and dataset management. We’ve shown how to get started by uploading images, annotations, and predictions via the Python API. Using Nucleus features for model analysis, we learned that poor labeling quality can distort evaluation metrics. If such errors occur in small datasets like PennFudan, you can only imagine how commonly they occur in larger datasets like ImageNet and MS COCO. Using Nucleus to identify issues with annotation taxonomy, quality, and consistency can thus play a huge role in producing quality results.