Human Annotations Remain Indispensable for Developing Deep Learning Models

on October 4th, 2021

It is undeniable that high-quality labeled datasets play a crucial role in the development of new deep learning algorithms. However, the world's understanding of ML and deep learning remains in its early days. That’s why Scale's applied ML and ML research teams work to better understand the latest in ML research and how we may overcome the biggest challenges facing AI development today – both for our customers and the industry at large.

Recently, our team of researchers conducted an in-depth analysis of the state of data in computer vision. The research paper, which was accepted to the Human-in-the-Loop Learning Workshop at ICML 2021, found that high-quality annotations remain indispensable for developing accurate deep learning models.

The full paper can be found here. We've summarized the key findings below, including:

  1. the impact of dataset size and quality on model accuracy and robustness;
  2. the effectiveness of pretraining;
  3. the advancement of and hurdles facing synthetic datasets;
  4. self-supervised learning;
  5. continual learning and its relationship with large datasets; and
  6. the phenomenon of double descent.

Our key finding is that high-quality annotations remain indispensable for developing accurate deep learning models.

Tradeoff Between Pretraining and Fine-Tuning Dataset Sizes

Consider a fixed labeling budget. Would it be more advantageous to spend it labeling a general-purpose dataset, for example one for classification, or a more task-specific dataset, such as one for segmentation or object detection?

In our survey, we found that pretraining on ImageNet does not help when fine-tuning on a different task, namely object detection on MSCOCO. In particular, we saw significant performance deterioration when using less than 10% of the COCO dataset. But by comparing the AP@50 and AP metrics, we found that pretraining can help the model achieve better classification (same task as pretraining) but not localization (different task). If models are trained to saturation, they exhibit a similar level of accuracy; nonetheless, the pretrained model could achieve the same accuracy with fewer iterations.
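To make the AP@50 versus AP distinction concrete: AP@50 counts a detection as correct when its intersection-over-union (IoU) with the ground-truth box is at least 0.5, while COCO-style AP averages over IoU thresholds from 0.50 to 0.95 in steps of 0.05, so it rewards tighter localization. The sketch below is illustrative only. The boxes and the helper names `iou` and `thresholds_passed` are hypothetical and do not come from the paper.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# COCO-style IoU thresholds: 0.50, 0.55, ..., 0.95
COCO_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(pred, truth):
    """Return the IoU thresholds at which `pred` would count as a match."""
    score = iou(pred, truth)
    return [t for t in COCO_THRESHOLDS if score >= t]


# Hypothetical boxes: the prediction is shifted 20 px to the right.
truth = (0, 0, 100, 100)
loose_pred = (20, 0, 120, 100)

passed = thresholds_passed(loose_pred, truth)
print(f"IoU = {iou(loose_pred, truth):.3f}, matches at thresholds {passed}")
```

A loosely localized box like this one still counts fully toward AP@50 but fails the stricter thresholds that AP averages over, which is why a gain in AP@50 without a matching gain in AP points to better classification rather than better localization.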

Effectiveness of Pretraining

In this section, we survey the research using JFT-300M and the Instagram dataset because they are among the largest datasets. JFT-300M contains 300 million images with 375 million noisy labels across 18,291 classes. The Instagram dataset contains 3.5 billion images with 17,000 classes and an unknown amount of label noise. The major takeaway is that if the pretraining task is similar to the target task, then pretraining on massive datasets unequivocally helps. We also observed three phenomena:

  1. The performance of the model grows as a logarithmic function of the pretraining dataset size.
  2. A sufficiently large model is crucial to capture the performance boost offered by pretraining on massive datasets.
  3. The hyperparameters used during pretraining can differ drastically from those used when training on target tasks.
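The first observation can be read as score ≈ a + b · log10(N), where N is the pretraining dataset size: every tenfold increase in data buys roughly the same additive gain b. The sketch below fits that form by least squares. The data points are made up for illustration (generated from a = 40, b = 5) and are not results from the paper, and `fit_log_trend` is a hypothetical helper name.

```python
import math


def fit_log_trend(sizes, scores):
    """Least-squares fit of score = intercept + slope * log10(size)."""
    xs = [math.log10(s) for s in sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical, synthetic points that follow score = 40 + 5 * log10(size).
sizes = [1e6, 1e7, 1e8, 3e8]
scores = [40 + 5 * math.log10(s) for s in sizes]

intercept, slope = fit_log_trend(sizes, scores)
print(f"gain per 10x more pretraining data: {slope:.2f} points")
```

Under this trend, going from 30 million to 300 million images yields the same gain as going from 3 million to 30 million, so each additional point of accuracy becomes progressively more expensive in data.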

Synthetic Datasets

Synthetic datasets are a promising avenue to pursue when collecting real-world data is impossible. However, it is well known that deep neural networks are sensitive to distribution change. A model trained on a synthetic dataset still must be fine-tuned on real data. For self-driving car applications, Richter et al. found that a model trained with just one-third real data alongside synthetic renders from a video game engine could outperform training on all real data. Real-world data, however, was critical to achieving this performance, accounting for an increase of over 20 percentage points from the zero-shot synthetic data baseline.

Self-Supervised Learning

In recent works, self-supervised learning has been shown to produce excellent representations. Combined with a simple linear classifier or regressor, the representations can be used for downstream tasks with high accuracy. The question we aim to answer is whether self-supervision is beneficial when large labeled datasets exist for downstream tasks. In Hendrycks et al., it was found that self-supervised learning does not improve accuracy, but does enhance robustness against noisy labels, adversarial examples, and input corruptions.

For NLP applications, self-supervision has been the key for the latest model structures, such as BERT, RoBERTa, and XLM. These models are trained to predict words or tokens that are masked or replaced. In computer vision, contrastive learning has been shown to produce a good starting point for classifiers. Ultimately, self-supervised learning can reduce the number of annotations required to train, but it is best used in tandem with fully supervised learning.
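As a rough illustration of the masked-token objective mentioned above, the sketch below hides a random subset of tokens and records the originals as prediction targets, so the text itself supplies the labels. It is a simplification: the 15% default follows the rate commonly cited for BERT, the real models use subword tokenizers and more elaborate replacement rules, and `mask_tokens` is a hypothetical helper rather than any library's API.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with MASK.

    Returns (inputs, targets). targets[i] holds the original token where
    inputs[i] was masked, and None elsewhere (no loss at that position).
    """
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(tok)
        else:
            inputs.append(tok)
            targets.append(None)
    return inputs, targets


sentence = "human annotations remain indispensable for deep learning".split()
inputs, targets = mask_tokens(sentence, mask_prob=0.3, seed=1)
print(inputs)
print(targets)
```

No human annotation is needed to build these training pairs, which is what lets self-supervised pretraining scale to very large unlabeled corpora before a smaller labeled set is used for the downstream task.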

Continual Learning

The autonomous vehicle industry is a prominent example of lifelong learning with deep learning. As vehicles collect new data, images with a certain failure mode can be queried from the stream of data and relabeled by humans. The model then can learn these failure modes and be redeployed to the fleet of vehicles. This cycle allows the model to continuously improve its accuracy.

When models are trained continually, however, catastrophic forgetting can occur: the model loses accuracy on data it previously handled well. Pseudo-rehearsal and regularization-based methods can alleviate catastrophic forgetting, but no current research solves this problem.
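One common shape for a regularization-based method is a quadratic penalty that pulls each parameter back toward its value after the previous task, weighted by how important that parameter was (elastic weight consolidation is a well-known example of this form). The sketch below shows only the loss arithmetic. It is an assumption about what such a method looks like in general, not necessarily the method covered in the paper, and all names and numbers are hypothetical.

```python
def regularized_loss(new_task_loss, params, old_params, importance, strength=1.0):
    """New-task loss plus a penalty for drifting from the old parameters.

    penalty = 0.5 * strength * sum(importance_i * (param_i - old_param_i) ** 2)
    """
    drift = sum(
        f * (p - p_old) ** 2
        for p, p_old, f in zip(params, old_params, importance)
    )
    return new_task_loss + 0.5 * strength * drift


# Hypothetical values: the first parameter mattered a lot for the old task.
old_params = [1.0, 2.0]
importance = [10.0, 0.1]

moved_important = regularized_loss(0.5, [2.0, 2.0], old_params, importance)
moved_unimportant = regularized_loss(0.5, [1.0, 3.0], old_params, importance)
print(moved_important, moved_unimportant)
```

Moving an important parameter is penalized far more than moving an unimportant one by the same amount, which steers the optimizer toward updates that disturb the old task least.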

Double Descent

Double descent is the observation that medium-sized models are more susceptible to overfitting than their smaller and larger counterparts. As model complexity increases, the test error starts high, decreases, increases, and then finally decreases again.

Figure: Double descent. Image courtesy of OpenAI.

When early stopping is used, the double descent phenomenon becomes less pronounced. While there are still many open questions regarding optimal model sizing in small data regimes, we are far from saturating model capacity in massive data regimes. We conclude that, given sufficient labeled data with little label noise, larger models trained with early stopping are more likely to outperform smaller models.
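For readers unfamiliar with early stopping, a minimal patience-based version can be sketched as follows: training halts once the validation loss has failed to improve for a set number of epochs, and the best checkpoint seen so far is kept. The loss values, the `patience` setting, and the function name are hypothetical and chosen only for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch whose checkpoint would be kept.

    Scans per-epoch validation losses and stops once `patience` consecutive
    epochs pass without a new best loss.
    """
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` epochs
    return best_epoch


# Hypothetical validation curve: improves, then starts to overfit.
val_losses = [1.00, 0.80, 0.70, 0.75, 0.90, 1.10]
print(early_stopping_epoch(val_losses, patience=3))
```

In practice the same check runs inside the training loop against a held-out validation set, which is one reason clean validation labels matter as much as clean training labels.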


Ultimately, training on seemingly infinite data is hard, and we’re only beginning to understand the benefits and pitfalls. While training in this regime is costly and requires significant compute resources, we believe that, in the long run, the cost of each step will come down. We see promise in new techniques like large-scale self-supervised learning. But given where the research and technology stand today, human-in-the-loop learning, including data annotation, will remain indispensable for training deep learning models on a wide range of machine learning tasks.

Our full paper can be found on arXiv.