Here at Scale, one of the image annotation services we offer is Cuboid Annotation, which annotates your two-dimensional images with projections of cuboids enclosing objects such as cars, trucks, pedestrians, traffic cones, you name it. With some additional information, we can turn those two-dimensional box annotations into full, three-dimensional boxes, with height, width, depth, rotation, and relative positioning info.
Manual cuboid annotation
With manual cuboid annotation, scalers simply draw a 2D box representing one face of the cuboid, plus one additional side of the cuboid:
This often does not result in a "true" cuboid, because it is mathematically imprecise. The front face of a "true" cuboid is unlikely to project to a perfect rectangle with 90-degree corners, especially if it isn't facing the camera head-on. Furthermore, edges of a cuboid that are parallel to the ground should converge toward a point on the horizon, whereas the top and bottom edges of the right side of the above annotation are parallel.
In our task response, we provide the image coordinates of the points and edges:
Given the cuboid annotation above, and some additional information (namely, camera parameters and orientation), we automatically generate a more accurate annotation:
The front face is no longer perfectly rectangular but trapezoidal, with the left edge slightly shorter, which better reflects the orientation of the car relative to the camera. The top and bottom edges of the right side now converge to a point on the horizon, as they should.
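The convergence property can be checked numerically. The following is a minimal sketch, not Scale's implementation: it intersects two image-space edges using homogeneous coordinates and reports no vanishing point when the edges are parallel, as they are in the manual annotation. The function names and the pixel values are hypothetical.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def line_through(p, q):
    """Homogeneous line through two pixel points p and q."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))


def vanishing_point(edge_a, edge_b, eps=1e-9):
    """Intersection of the two lines that contain edge_a and edge_b.

    Each edge is a pair of (x, y) pixel points. Returns None when the
    edges are parallel in the image, i.e. they never converge.
    """
    x, y, w = cross(line_through(*edge_a), line_through(*edge_b))
    if abs(w) < eps:
        return None
    return (x / w, y / w)


# Hypothetical top and bottom edges of a side face (pixel coordinates).
manual_top = ((400.0, 200.0), (600.0, 200.0))
manual_bottom = ((400.0, 350.0), (600.0, 350.0))
adjusted_top = ((400.0, 200.0), (600.0, 220.0))
adjusted_bottom = ((400.0, 350.0), (600.0, 330.0))

print(vanishing_point(manual_top, manual_bottom))      # parallel edges
print(vanishing_point(adjusted_top, adjusted_bottom))  # converging edges
```

For a cuboid resting on level ground, the point returned for the converging pair should lie on the horizon line of the image.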
After cuboid adjustment, we add two new fields, points_3d and points_2d, to the original annotation:
which represent the 3D coordinates of the vertices of the cuboid in space (relative to the position of the camera, in meters) and the pixel coordinates of the resulting 2D projection.
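The relationship between the two fields can be illustrated with a standard pinhole camera model. This is a sketch under assumptions, not the actual pipeline: points are plain (x, y, z) tuples in a camera frame with x pointing right, y pointing down, and z pointing forward, the intrinsics fx, fy, cx, cy are hypothetical values, and lens distortion is ignored. The real response format and camera model may differ.

```python
def project_points(points_3d, fx, fy, cx, cy):
    """Project camera-frame 3D points (meters) to pixel coordinates.

    Assumes an ideal pinhole camera: focal lengths fx, fy and principal
    point (cx, cy), all in pixels, with no lens distortion.
    """
    points_2d = []
    for x, y, z in points_3d:
        if z <= 0:
            raise ValueError("point is behind the camera")
        points_2d.append((fx * x / z + cx, fy * y / z + cy))
    return points_2d


# Hypothetical vertex 2 m in front of the camera, 1 m right, 0.5 m down.
example = project_points([(1.0, 0.5, 2.0)], fx=1000.0, fy=1000.0, cx=640.0, cy=360.0)
print(example)
```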
From these eight points in 3D space, it's easy to infer the position, dimensions, and orientation of the resulting cuboid. With these attributes, one can train models such as Magic Leap's Deep Cuboid Detection, which predicts 3D coordinates of cuboid-like objects from a single image. In this way, one can create a system to identify the locations of cars in the world, using only camera images and annotations generated by Scale.
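As a rough illustration of that inference, the sketch below derives a center, dimensions, and heading from eight vertices. The vertex ordering is an assumption made for this example only (indices 0 to 3 are the front face as bottom-left, bottom-right, top-right, top-left, and indices 4 to 7 are the back face in the same order); the ordering in an actual response may differ. The camera frame is again assumed to have x right, y down, z forward, and the function name is hypothetical.

```python
import math


def cuboid_attributes(points_3d):
    """Infer center, dimensions, and yaw from eight cuboid vertices.

    Assumed (hypothetical) ordering: 0-3 front face, 4-7 back face,
    each listed bottom-left, bottom-right, top-right, top-left.
    """
    if len(points_3d) != 8:
        raise ValueError("expected eight vertices")
    # Position: the centroid of the eight vertices.
    center = tuple(sum(p[i] for p in points_3d) / 8.0 for i in range(3))
    v0, v1, v3, v4 = points_3d[0], points_3d[1], points_3d[3], points_3d[4]
    width = math.dist(v0, v1)   # along the bottom edge of the front face
    height = math.dist(v0, v3)  # along the vertical edge of the front face
    depth = math.dist(v0, v4)   # from the front face to the back face
    # Heading: direction from back to front, measured in the x-z plane
    # as an angle from the camera's forward (z) axis.
    yaw = math.atan2(v0[0] - v4[0], v0[2] - v4[2])
    return {
        "center": center,
        "width": width,
        "height": height,
        "depth": depth,
        "yaw": yaw,
    }


# Hypothetical axis-aligned cuboid whose front face points away from the camera.
front = [(-1.0, 1.0, 8.0), (1.0, 1.0, 8.0), (1.0, -0.5, 8.0), (-1.0, -0.5, 8.0)]
back = [(x, y, 4.0) for x, y, _ in front]
print(cuboid_attributes(front + back))
```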
[Image comparison: Initial Annotation / After Adjustment]
Because it's impossible to visually distinguish between one object and another object twice the size and twice as far away, it's only possible to calculate 3D coordinates up to a scaling factor (i.e. you can multiply the points_3d of the resulting cuboid by an arbitrary positive value, and the result would be just as valid). With additional data, such as depth data from LIDAR or from stereo imaging, one knows the exact distance to the front face of the cuboid and can scale the points accordingly. Alternatively, if the height of the camera above the ground is provided, we scale the resulting cuboids so that the bottom face of each cuboid touches the ground.
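Both points can be sketched in a few lines. The example below first shows that scaling every 3D point by the same factor leaves the 2D projection unchanged, then resolves the scale from a known camera height. It is a simplified illustration under assumptions, not the production method: the camera is taken to be level (no pitch or roll), the frame has x right, y down, z forward, so the ground is the plane y = camera_height, and the intrinsics and function names are hypothetical.

```python
def project(points_3d, f=1000.0, cx=640.0, cy=360.0):
    """Ideal pinhole projection with hypothetical intrinsics."""
    return [(f * x / z + cx, f * y / z + cy) for x, y, z in points_3d]


def scale_points(points_3d, s):
    """Multiply every coordinate of every point by the factor s."""
    return [(s * x, s * y, s * z) for x, y, z in points_3d]


def scale_to_ground(points_3d, camera_height):
    """Rescale a cuboid so that its lowest vertices sit on the ground.

    Assumes a level camera with y pointing down, so the ground plane
    is y = camera_height (meters) in the camera frame.
    """
    bottom_y = max(y for _, y, _ in points_3d)
    if bottom_y <= 0:
        raise ValueError("bottom face must lie below the camera")
    return scale_points(points_3d, camera_height / bottom_y)


# Hypothetical cuboid known only up to scale.
cuboid = [
    (-1.0, 1.0, 8.0), (1.0, 1.0, 8.0), (1.0, -0.5, 8.0), (-1.0, -0.5, 8.0),
    (-1.0, 1.0, 4.0), (1.0, 1.0, 4.0), (1.0, -0.5, 4.0), (-1.0, -0.5, 4.0),
]

# Twice the size and twice as far away projects to the same pixels.
doubled = scale_points(cuboid, 2.0)
same_image = all(
    abs(a - b) < 1e-9
    for p, q in zip(project(cuboid), project(doubled))
    for a, b in zip(p, q)
)
print(same_image)

# With the camera mounted 1.5 m above the ground, the scale is fixed.
grounded = scale_to_ground(cuboid, camera_height=1.5)
print(max(y for _, y, _ in grounded))
```

If the camera is tilted, the ground plane is no longer a constant y in the camera frame, and the same idea applies only after rotating the points into a gravity-aligned frame using the camera orientation.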