The 10 Best Public Datasets for Object Detection in 2022
Today, it is widely recognized that the most important ingredient for building a great machine learning model is training data. The practice of systematically building datasets to achieve better model performance is often referred to as data-centric AI. When starting out, however, most users don't have much data of their own. So how can you get started? A good place to kick off is typically a large public dataset.
There are thousands of amazing public datasets to choose from, however, which is why we took it upon ourselves to curate the best ones for the most common problems and tasks in machine learning.
In this list, we are covering the top 10 datasets for object detection. Object detection is one of the most common tasks in computer vision and is applied across use cases ranging from retail and facial recognition to autonomous driving and medical imaging.
ImageNet
Size
14 million images, annotated in 20,000 categories (1.2M subset freely available on Kaggle)
License
Custom, see details
Cite
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition (pp. 248–255).
Has annotations?
Yes
Benchmarks
Papers (8763) / Benchmarks (85) / Papers with Code
Get started
View this dataset in Scale Nucleus / dataset website / download
ImageNet is one of the most famous public datasets for visual object recognition. Building on top of WordNet, Prof. Fei-Fei Li of Stanford started to work on ImageNet in 2007.
The dataset contains more than 14 million images that have been manually labeled in more than 20,000 categories, representing one of the richest taxonomies in any computer vision dataset.
The dataset came to prominence in 2012 when AlexNet achieved a top-5 error rate of about 16% in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). This was almost 11 percentage points better than the closest competitor and made headlines, as it became clear that neural networks might soon be able to outperform human perception on object recognition tasks.
Today, ImageNet is still used as a common benchmark for any researcher or practitioner working on visual object detection.
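To explore the dataset programmatically, here is a minimal sketch using torchvision. It assumes the ILSVRC2012 archives have already been downloaded into a local folder (ImageNet's license terms require a manual download, so torchvision will not fetch it for you):

```python
from torchvision import datasets, transforms

# Assumes the ILSVRC2012 archives were manually downloaded into ./imagenet;
# torchvision only unpacks and indexes them.
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

train_set = datasets.ImageNet(root="./imagenet", split="train", transform=transform)
image, class_index = train_set[0]
print(image.shape, train_set.classes[class_index])
```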
COCO (Microsoft Common Objects in Context)
Size
330K images, 1.5M objects, annotated in 91 categories
License
Creative Commons 4.0 (details)
Cite
Lin, T.-Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Zitnick, C. L. & Dollár, P. (2014). Microsoft COCO: Common Objects in Context. arXiv:1405.0312.
Has annotations?
Yes
Benchmarks
Papers (6211) / Benchmarks (76) / Papers with Code
Get started
View this dataset in Scale Nucleus / dataset website / download
COCO (from Microsoft) is a large-scale dataset showing Common Objects in their natural COntext.
Originally published in 2014, the dataset comprises 1.5M objects in more than 330,000 images. The objects are shown in natural, often complex environments. More than 200,000 images are fully annotated.
COCO contains annotations of many types, such as human keypoints, panoptic segmentation and bounding boxes. Due to its comprehensive size and rich annotations, it is a go-to dataset for challenges and benchmarks.
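The annotations ship as JSON and are most easily read through the official pycocotools API. A minimal sketch, assuming the 2017 annotation files have been downloaded and unzipped locally:

```python
from pycocotools.coco import COCO

# Load the validation annotations (downloaded separately from cocodataset.org).
coco = COCO("annotations/instances_val2017.json")

# Find all images containing the category "dog" and print its box annotations
# for the first match. COCO boxes are stored as [x, y, width, height].
cat_ids = coco.getCatIds(catNms=["dog"])
img_ids = coco.getImgIds(catIds=cat_ids)
ann_ids = coco.getAnnIds(imgIds=img_ids[0], catIds=cat_ids)
for ann in coco.loadAnns(ann_ids):
    print(ann["category_id"], ann["bbox"])
```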
PASCAL VOC
Size
1,464 images for training, 1,449 images for validation
License
Custom (details)
Cite
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J. & Zisserman, A. (2012). The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results.
Has annotations?
Yes
Benchmarks
Papers (250) / Benchmarks (47) / Papers with Code
Get started
View this dataset in Scale Nucleus / dataset website / download
PASCAL VOC consists of standardized image datasets for object class recognition that have been part of challenges since 2005 (see leaderboards).
The most recent version of the challenge is the 2012 edition, containing 20 object categories such as animals, vehicles and household objects. Each image is annotated with the object class, a bounding box and a pixel-wise semantic segmentation.
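VOC2012 is small enough that torchvision can download it on the fly. As a sketch (the target is simply the per-image XML annotation parsed into a dict):

```python
from torchvision import datasets

# Downloads and extracts VOC2012 (roughly 2 GB) on the first run.
voc = datasets.VOCDetection(root="./voc", year="2012", image_set="train", download=True)

image, target = voc[0]  # (PIL image, annotation XML parsed into a dict)
for obj in target["annotation"]["object"]:
    print(obj["name"], obj["bndbox"])  # class label and box corner coordinates
```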
BDD100K (UC Berkeley "Deep Drive")
Size
100K videos (1.8TB)
License
BSD 3-Clause (details)
Cite
Yu, F., Chen, H., Wang, X., Xian, W., Chen, Y., Liu, F., ... & Darrell, T. (2020). BDD100K: A diverse driving dataset for heterogeneous multitask learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 2636-2645).
Has annotations?
Yes
Benchmarks
Papers (125) / Benchmarks (12) / Papers with Code
Get started
View this dataset in Scale Nucleus / dataset website / download
Berkeley Deep Drive, commonly referred to as BDD100K, is a highly popular autonomous driving dataset. It contains 100,000 videos split into training, validation and test sets. There are also image versions, comprising 100,000 or 10,000 frames sampled from the videos and split analogously.
The dataset includes ground truth annotations in JSON format for all common road objects, as well as lane markings, pixel-wise semantic segmentation, instance segmentation, panoptic segmentation and even pose-estimation labels. On top of ground truth labels, the dataset also features metadata such as time of day and weather. As the cherry on top, there is even a library of more than 300 pre-trained models for the dataset, which can be explored in the BDD Model Zoo.
To this day, its size and extensive features have made BDD100K a go-to option for multitask learning and computer vision challenges (see latest ECCV 2022 and CVPR 2022 challenges).
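A minimal sketch of reading the detection labels with plain Python. The field names below follow BDD100K's published label format, but verify them against your specific download, since the label files have been reorganized between releases:

```python
import json

# Assumes the 100K detection labels were downloaded as one JSON file.
with open("bdd100k_labels_images_train.json") as f:
    frames = json.load(f)

for frame in frames[:3]:
    attrs = frame["attributes"]  # per-frame metadata: weather, scene, timeofday
    print(frame["name"], attrs["weather"], attrs["timeofday"])
    for label in frame.get("labels", []):
        if "box2d" in label:  # other label types (e.g. lanes) carry no box
            box = label["box2d"]
            print("  ", label["category"], box["x1"], box["y1"], box["x2"], box["y2"])
```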
Visual Genome
Size
108,077 images, 3.8 million object instances
License
Creative Commons 4.0 (details)
Cite
Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., ... & Fei-Fei, L. (2017). Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123(1), 32-73.
Has annotations?
Yes (including object relationships)
Benchmarks
Papers (772) / Benchmarks (17) / Papers with Code
Get started
View this dataset in Scale Nucleus / dataset website / download
Visual Genome is a large image dataset based on MS COCO. With more than 100,000 annotated images, it is a standard benchmark not only for object detection but also, and especially, for scene description and question answering tasks.
Beyond object annotations, Visual Genome is designed for question answering and for describing the relationships between all the objects. The ground truth annotations include more than 1.7 million question-answer pairs, an average of 17 questions per image. Questions are evenly distributed between What, Where, When, Who, Why and How.
For example, in an image showing a pizza and people around it, question-answer pairs include: “What color is the plate?”, ”How many people are eating?”, or “Where is the pizza?”. All relationships, attributes and metadata are mapped to WordNet synsets.
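The annotations are distributed as large JSON dumps on the dataset website. A sketch of reading the object annotations with plain Python; the file and field names follow the published v1.4 dumps, so verify them against your download, as the schema has changed between releases:

```python
import json

# objects.json is one of the raw annotation dumps from the Visual Genome site.
with open("objects.json") as f:
    images = json.load(f)

first = images[0]
print("image_id:", first["image_id"])
for obj in first["objects"][:5]:
    # Each object carries one or more names, a box, and its WordNet synsets.
    print(obj["names"], (obj["x"], obj["y"], obj["w"], obj["h"]), obj["synsets"])
```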
nuScenes
Size
1,000 driving scenes, containing 1.4M images, 1.4M bounding boxes and 390K lidar sweeps
License
Custom (details)
Cite
Caesar, H., Bankiti, V., Lang, A. H., Vora, S., Liong, V. E., Xu, Q., ... & Beijbom, O. (2020). nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11621-11631).
Has annotations?
Yes
Benchmarks
Papers (563) / Benchmarks (12) / Papers with Code
Get started
View this dataset in Scale Nucleus / dataset website / download
Developed by the team at Motional, nuScenes is one of the most comprehensive large-scale datasets for autonomous driving.
It contains 1,000 driving scenes from Boston and Singapore, which comprise an astonishing 1.4M images, 1.4M bounding boxes, 390,000 lidar sweeps and 1.4M radar sweeps. The object detections include both 2D and 3D bounding boxes in 23 object classes.
While most other datasets in the autonomous driving domain are solely focused on camera-based perception, nuScenes aims to cover the entire spectrum of sensors, much like the original KITTI dataset, but with a higher volume of data.
While developed by a private company, the dataset is free to use in any non-commercial setting. For commercial purposes, licenses can be obtained from Motional.
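Motional also maintains the nuscenes-devkit, which is the usual way to traverse the dataset's relational schema. A minimal sketch against the v1.0-mini split, which is small enough to download quickly but exposes the full schema:

```python
from nuscenes.nuscenes import NuScenes

# Requires `pip install nuscenes-devkit` and the v1.0-mini split,
# extracted into the dataroot folder below.
nusc = NuScenes(version="v1.0-mini", dataroot="./data/nuscenes", verbose=True)

# Walk from a scene to its first sample and print a few 3D box annotations.
scene = nusc.scene[0]
sample = nusc.get("sample", scene["first_sample_token"])
for ann_token in sample["anns"][:5]:
    ann = nusc.get("sample_annotation", ann_token)
    print(ann["category_name"], ann["translation"], ann["size"])
```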
DOTA v2.0
Size
11,268 images, 1.8M object instances
License
Custom (details)
Cite
Xia, G. S., Bai, X., Ding, J., Zhu, Z., Belongie, S., Luo, J., ... & Zhang, L. (2018). DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3974-3983).
Has annotations?
Yes
Benchmarks
Papers (142) / Benchmarks (4) / Papers with Code
Get started
View this dataset in Scale Nucleus / dataset website / download
DOTA is a highly popular dataset for object detection in aerial images, collected from a variety of sources, sensors and platforms.
The images range from 800x800 up to 20,000x20,000 pixels in resolution and contain objects of many different types, shapes and sizes. The dataset continues to be updated regularly and is expected to grow further.
The ground truth annotations were created by experts in aerial imaging and cover 18 categories, with a total of 1.8M object instances. Annotations use the oriented bounding box (OBB) format, and each one also includes a difficulty score describing how hard the object in question is to detect.
The dataset is split into training, validation and test sets and is commonly used in computer vision challenges such as LUAI 2021.
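The OBB annotations are plain text files, one per image, where each object line holds the four corner points of an oriented box followed by the category and difficulty flag. A minimal parser sketch (the filename is hypothetical):

```python
# Parse one DOTA annotation file into (corners, category, difficulty) tuples.
def parse_dota_annotation(path):
    objects = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 10:  # skip header lines such as 'imagesource:'/'gsd:'
                continue
            corners = [float(v) for v in parts[:8]]  # x1 y1 x2 y2 x3 y3 x4 y4
            category, difficult = parts[8], int(parts[9])
            objects.append((corners, category, difficult))
    return objects

for corners, category, difficult in parse_dota_annotation("P0000.txt"):
    print(category, difficult, corners)
```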
KITTI Vision Benchmark Suite
Size
100,000 images across multiple hours of driving; 7.5K images in the object detection benchmark
License
Creative Commons Attribution-NonCommercial-ShareAlike 3.0 (details)
Cite
Geiger, A., Lenz, P., & Urtasun, R. (2012, June). Are we ready for autonomous driving? The KITTI vision benchmark suite. In 2012 IEEE conference on computer vision and pattern recognition (pp. 3354-3361). IEEE.
Has annotations?
Yes
Benchmarks
Papers (2199) / Benchmarks (115) / Papers with Code
Get started
View this dataset in Scale Nucleus / dataset website / download
KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) is one of the most famous datasets in autonomous driving and computer vision. Recorded in Karlsruhe in 2012, it was the first large-scale dataset containing a full sensor suite for autonomous driving, and it was fully annotated to serve as a benchmark.
The dataset contains two camera streams (high-resolution RGB and grayscale stereo), a lidar with 100k points per frame, GPS/IMU readings, object tracklets and calibration data. It can be used for a variety of tasks in autonomous driving. The 2D and 3D object detection benchmarks each contain roughly 7,500 training and 7,500 test images, which can be downloaded and accessed in a variety of formats.
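The 2D object detection split is also wrapped in torchvision, which makes it one of the easier ways to get started. A sketch; the targets are the parsed per-image label files:

```python
from torchvision import datasets

# Downloads the KITTI 2D object detection split (several GB) on the first run.
kitti = datasets.Kitti(root="./kitti", train=True, download=True)

image, targets = kitti[0]
for obj in targets:
    # 'bbox' is the 2D pixel box; 'dimensions' and 'location' describe the 3D box.
    print(obj["type"], obj["bbox"], obj["dimensions"], obj["location"])
```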
DAVIS 2017
Size
150 scenes
License
Unknown
Cite
Pont-Tuset, J., Perazzi, F., Caelles, S., Arbeláez, P., Sorkine-Hornung, A., & Van Gool, L. (2017). The 2017 DAVIS challenge on video object segmentation. arXiv preprint arXiv:1704.00675.
Has annotations?
Yes
Benchmarks
Papers (166) / Benchmarks (8) / Papers with Code
Get started
View this dataset in Scale Nucleus / dataset website / download
DAVIS stands for Densely Annotated VIdeo Segmentation. It is a state-of-the-art benchmark dataset for object segmentation in videos and has been part of several challenges.
The dataset contains 150 short scenes with about 13,000 individual frames, split into training, validation and test sets. Challenge evaluations are available for supervised (human-annotated), semi-supervised and unsupervised approaches.
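Frames ship as JPEGs and the ground truth masks as indexed PNGs, where each pixel value is an object ID (0 is background). A sketch of recovering per-object bounding boxes from one mask; the paths follow the standard DAVIS folder layout:

```python
import numpy as np
from PIL import Image

# Read one frame and its mask; 'bear' is one of the DAVIS sequences.
frame = Image.open("DAVIS/JPEGImages/480p/bear/00000.jpg")
mask = np.array(Image.open("DAVIS/Annotations/480p/bear/00000.png"))

object_ids = [i for i in np.unique(mask) if i != 0]  # 0 is background
for obj_id in object_ids:
    ys, xs = np.nonzero(mask == obj_id)
    print(f"object {obj_id}: box = ({xs.min()}, {ys.min()}, {xs.max()}, {ys.max()})")
```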
SUN RGB-D
Size
10,335 images with 700 object classes
License
Unknown
Cite
Song, S., Lichtenberg, S. P., & Xiao, J. (2015). SUN RGB-D: A RGB-D scene understanding benchmark suite. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 567-576).
Has annotations?
Yes
Benchmarks
Papers (281) / Benchmarks (12) / Papers with Code
Get started
View this dataset in Scale Nucleus / dataset website / download
SUN RGB-D is a common benchmark dataset for object detection. Released by researchers from Princeton University in 2015, it contains more than 10,000 hand-labeled images that are split equally into training and testing. The images come from indoor scenes and contain common objects found in offices and homes.
The objects in the images are fully annotated across 700 distinct object classes, including both 2D and 3D bounding boxes, semantic segmentation and room layout. Both 2D and 3D object detection challenges are available.
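The official toolbox is MATLAB-based, but the raw data can be read directly: depth maps are stored as 16-bit PNGs alongside the RGB images. A sketch in Python; the paths are illustrative, and the bit-rotation decoding mirrors the official toolbox, so verify it against your copy before relying on the values:

```python
import numpy as np
from PIL import Image

# Illustrative paths into the extracted SUNRGBD folder; adjust to your layout.
rgb = np.array(Image.open("SUNRGBD/kv1/NYUdata/NYU0001/image/NYU0001.jpg"))
depth_raw = np.array(
    Image.open("SUNRGBD/kv1/NYUdata/NYU0001/depth/NYU0001.png")
).astype(np.uint16)

# Assumption: depth is millimeters stored with a 3-bit circular shift, as in
# the MATLAB toolbox (bitor(bitshift(d,-3), bitshift(d,16-3))); verify locally.
depth_m = ((depth_raw >> 3) | (depth_raw << 13)).astype(np.float32) / 1000.0
print(rgb.shape, depth_m.shape, float(depth_m.max()))
```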
What’s next?
Are you ready to dive into one of the above datasets? Here are a few resources to get you started on computer vision datasets and data annotation:
- Data Labeling: The Authoritative Guide
- Explore open datasets in Scale Nucleus
- Use natural language to search ImageNet