Date: 2023-01-18

There are many subfields within computer vision. Some of the most common are object classification, object detection, object recognition, object tracking, event detection, pose estimation, depth estimation, and generation. This chapter gives a brief overview of each, with examples.

Object Classification

With object classification, models are trained to identify the class of a single object in an image. For example, given an image of an animal, the model would return which animal appears in the image (e.g. an image of a cat should come back as “cat”).
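
The last step of classification can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical model has already produced one raw score (logit) per class; the class names and scores are made up, and a real classifier would compute the scores from the image pixels.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits, class_names):
    """Return the most likely class name and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]

# Hypothetical classes and scores, for illustration only.
class_names = ["cat", "dog", "horse"]
logits = [2.0, 0.5, -1.0]
label, confidence = classify(logits, class_names)
print(label, round(confidence, 3))
```

The model returns a single label for the whole image, which is what separates classification from the localization tasks described next.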

Object Detection

With object detection, models are trained to identify occurrences of specific objects in a given image or video. For example, given an image of a street scene and an object class of “pedestrian”, the model would return the locations of all pedestrians in the image.
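
The pedestrian example can be sketched as a filter over a detector's raw output. This is only an illustration: the `Detection` record, the `find_objects` helper, the 0.5 confidence threshold, and the sample boxes are all assumptions, not the output format of any particular library.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str    # predicted class name
    score: float  # model confidence in [0, 1]
    box: tuple    # (x_min, y_min, x_max, y_max) in pixels

def find_objects(detections, target_label, min_score=0.5):
    """Keep the boxes of one class whose confidence meets the threshold."""
    return [d.box for d in detections
            if d.label == target_label and d.score >= min_score]

# Hypothetical raw detector output for a street scene.
raw = [
    Detection("pedestrian", 0.91, (40, 60, 80, 180)),
    Detection("car", 0.88, (120, 90, 300, 200)),
    Detection("pedestrian", 0.32, (310, 70, 340, 160)),  # below threshold
    Detection("pedestrian", 0.77, (400, 65, 440, 185)),
]
pedestrian_boxes = find_objects(raw, "pedestrian")
print(pedestrian_boxes)
```

The threshold trades missed objects against false alarms, so in practice it would be tuned per application.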

Object Recognition

With object recognition, models are trained to identify all relevant objects in a given image or video. For example, given an image of a street scene, an object recognition model would return the locations of all objects it has been trained to recognize (e.g. pedestrians, cars, buildings, and street signs).

Object Tracking

Object tracking is the task of taking an initial set of object detections, creating unique IDs for each detection, and then tracking each object as it moves around in a video. For example, given a video of a fulfillment center, an object tracking model would first identify a product, tag it, then track it over time as it is moved around the facility.
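
The detect, tag, and track loop described above can be sketched with a simple overlap-based matcher. This is a simplified illustration under stated assumptions, not a production tracker: boxes are `(x_min, y_min, x_max, y_max)` tuples, matching is greedy on intersection over union (IoU), the 0.3 threshold is arbitrary, and a track that finds no match in a frame is simply dropped. Practical trackers usually add motion prediction and a more careful assignment step.

```python
from itertools import count

def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def update_tracks(tracks, boxes, next_id, min_iou=0.3):
    """Give each new box the ID of the best-overlapping unused track,
    or a fresh ID when nothing overlaps enough."""
    updated = {}
    unused = dict(tracks)  # tracks not yet claimed in this frame
    for box in boxes:
        best = max(unused, key=lambda t: iou(unused[t], box), default=None)
        if best is not None and iou(unused[best], box) >= min_iou:
            updated[best] = box
            del unused[best]
        else:
            updated[next(next_id)] = box
    return updated  # unmatched old tracks are dropped

# Hypothetical detections over three video frames.
ids = count(1)
frame1 = update_tracks({}, [(0, 0, 10, 10), (50, 50, 60, 60)], ids)
frame2 = update_tracks(frame1, [(52, 51, 62, 61), (1, 0, 11, 10)], ids)
frame3 = update_tracks(frame2, [(2, 0, 12, 10), (100, 100, 110, 110)], ids)
print(frame1, frame2, frame3, sep="\n")
```

Each object keeps its ID from frame to frame as long as its box overlaps its previous position, regardless of the order in which the detector reports the boxes.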

Event Detection

With event detection, models are trained to determine when a particular event has occurred. For example, given a video of a retail store, an event detection model would flag the moment a customer picks up or bags a product, which is what makes autonomous checkout systems possible.
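
One simple way to turn per-frame model outputs into discrete events is to look for sustained runs of high scores. The sketch below assumes a hypothetical model that emits, for every frame, a score for "customer is picking up a product"; the threshold, the minimum run length, and the scores themselves are illustrative assumptions.

```python
def detect_events(scores, threshold=0.8, min_frames=3):
    """Return (start, end) frame indices of runs where the score stays at or
    above the threshold for at least min_frames consecutive frames."""
    events = []
    start = None  # index where the current run began, if any
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_frames:
                events.append((start, i - 1))
            start = None
    # Close a run that lasts until the final frame.
    if start is not None and len(scores) - start >= min_frames:
        events.append((start, len(scores) - 1))
    return events

# Hypothetical per-frame scores from a video clip.
scores = [0.1, 0.2, 0.9, 0.95, 0.85, 0.3, 0.9, 0.2, 0.8, 0.9, 0.99]
events = detect_events(scores)
print(events)
```

Requiring several consecutive frames filters out single-frame spikes, such as the isolated high score at frame 6 in the sample data.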

Pose Estimation

With pose estimation, models are trained to detect and predict the position and orientation of a person, an object, or a set of keypoints in a given scene. For example, given an egocentric (first-person) video of someone opening a door, a pose estimation model would detect and predict the position and orientation of that person's hands as they unlock and open the door.
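
To make "position and orientation" concrete, the sketch below derives both from two hand keypoints. It assumes a hypothetical keypoint model has already located the wrist and the index fingertip in pixel coordinates; the `hand_pose` helper and the coordinates are illustrative, and real systems typically predict many more keypoints, often in 3D.

```python
import math

def hand_pose(wrist, fingertip):
    """Estimate a 2D hand position and orientation from two keypoints.

    Position is the midpoint of the two keypoints. Orientation is the angle of
    the wrist-to-fingertip vector in degrees, in image coordinates where y
    grows downward, so a negative angle points up the image.
    """
    dx = fingertip[0] - wrist[0]
    dy = fingertip[1] - wrist[1]
    position = ((wrist[0] + fingertip[0]) / 2, (wrist[1] + fingertip[1]) / 2)
    angle_deg = math.degrees(math.atan2(dy, dx))
    return position, angle_deg

# Hypothetical pixel coordinates, as a keypoint model might output them.
position, angle_deg = hand_pose(wrist=(100, 200), fingertip=(150, 150))
print(position, angle_deg)
```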

Depth Estimation

With depth estimation, models are trained to measure the distance from a camera or sensor to an object. Depth can be estimated from either monocular (single-view) or stereo (multi-view) images. Depth estimation is critical for applications such as autonomous driving, augmented reality, and robotics.
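
For the stereo case, depth follows from the standard pinhole relation Z = f * B / d for a rectified camera pair, where f is the focal length in pixels, B is the baseline (the distance between the two cameras), and d is the disparity (how far a point shifts between the left and right images). The sketch below applies that relation; the camera parameters are made-up values, and finding the disparity itself is the hard part that a stereo matching model would handle.

```python
def stereo_depth(focal_length_px, baseline_m, disparity_px):
    """Depth in metres of a point seen by a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        # Zero disparity corresponds to a point at infinity; negative is invalid.
        raise ValueError("disparity must be positive")
    return focal_length_px * baseline_m / disparity_px

# Hypothetical camera: 700 px focal length, 12 cm baseline.
depth_near = stereo_depth(700.0, 0.12, 35.0)  # large disparity, close object
depth_far = stereo_depth(700.0, 0.12, 7.0)    # small disparity, distant object
print(depth_near, depth_far)
```

With these numbers the first point is roughly 2.4 metres away and the second roughly 12 metres: depth is inversely proportional to disparity, so distant objects shift very little between the two views and their depth is harder to measure precisely.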

Generation

Generative models, of which diffusion models are one widely used family, are a class of machine learning models that generate new data resembling the data they were trained on. For more on diffusion models, take a look at our Practical Guide to Diffusion Models.