At Tesla’s 2019 Autonomy Day, Elon Musk made headlines when he took a high-profile swipe at LiDAR (Light Detection and Ranging) technology, predicting that "anyone relying on LiDAR is doomed". While LiDAR has been widely embraced by self-driving vehicle developers for over a decade, Musk declared that the only hardware Tesla needs is the existing suite of cameras and sensors already installed on their vehicles. The road to autonomy, he argued, lies not in adding more sensors but in the massive amounts of real-world training data Tesla’s fleet collects coupled with state-of-the-art advances in computer vision.
Musk's prediction cast a spotlight on a rapidly growing divide in the world of AV (autonomous vehicle) development: whether to aim for vehicles that, like human drivers, can navigate the world through sight alone, or whether sensors like LiDAR are still necessary to counterbalance some of the limitations of computer vision. Currently this debate is far from settled, as neither approach has yielded large-scale deployments of self-driving vehicles and public data that could be used to compare the technologies is scarce. Since Scale has a suite of data labeling products built for AV developers, we wanted to use our own tools to put the two self-driving philosophies to the test.
First, some background: perception systems don’t just learn to understand a highway on their own; they are trained using large datasets of human-verified data. This is known as ‘supervised learning’. The training data consists of some output from the car’s sensors (like video imagery of the road ahead or a 3D LiDAR point cloud of the environment around the vehicle) that is sent to a human who annotates the position and type of every object we would like the AV to learn to ‘see’. Since the labeled data is the primary input by which the car is trained to perceive the world, we can often see clues to how well an AV’s perception system will function just by looking at the quality of the training data it is generating. To build a sensor perception system that performs with high enough accuracy to drive a multi-ton robot, having a lot of data is not enough: we must be able to annotate this data at extremely high accuracy levels or the perception system’s performance will begin to regress.
This is where our experiment begins! Scale has the tooling to produce datasets from any combination of sensors, and we recently provided annotations for nuScenes by Aptiv, a 3D dataset created in collaboration with Aptiv from an AV equipped with both cameras and LiDAR. Could we predict how well a camera-based system might work versus a LiDAR-based system by comparing the training data generated from each? To answer this, we took a collection of driving scenes from our 3D dataset, extracted just the 2D video imagery and then re-labeled it to create a 2D dataset similar to what a perception system without LiDAR might be trained on. We then projected these 2D annotations back onto the original 3D data and compared them object-by-object to see if any accuracy loss occurred.
With the two datasets all labeled and ready to go, we started comparing the results — and immediately noticed some big differences. Many of the annotations that look perfectly reasonable when overlaid over video produced obviously flawed representations when scaled up to 3D. Take the example below — if you only looked at the 2D images on the left both datasets may seem accurate at first glance. But moving around the scene in 3D reveals that the video-only dataset's annotation is both too long and missing an entire side of the car.
Why is the accuracy so much worse in 2D? Is the person who drew the bounding box for this car just doing a bad job? No, it’s actually very challenging to infer precise 3D measurements from a flat image. To draw an orderly 3D cuboid around an irregularly-sized object (cars come in many shapes with all sorts of variations of accessories, not to mention the many other forms of vehicles, pedestrians and wildlife an AV will encounter) you need to know where all the extreme points of the object lie. In a 2D perspective it is guaranteed that some of these points will either blend into or be hidden by the object itself. Take our minivan example: the positions of the far-left and near-right edges of the vehicle are easy to find since they are silhouetted against the background of the scene, but there is no obvious visual guideline for where to draw the back-left corner of the vehicle. Even worse for us, the back of the minivan is both sloped and rounded. The annotator can use intuition to try to fill in the blanks, but here she ends up underestimating the object’s width, which misaligns the cuboid’s rotation and cuts out the left side of the car when viewed in 3D.
If we take a closer look at the far-left edge we see the 2D annotator also undershot the height because she had to estimate how far out the curved hood of the minivan extends beyond the roof, leading to dramatically overestimated depth in 3D. This leads us to the mathematical basis of another source of inaccuracy — depth information is naturally ‘downscaled’ in a 2D image. As an object gets closer to perpendicular with the horizon, moving the far edge by just a few pixels can massively shift the perceived depth of a cuboid.
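To put rough numbers on that sensitivity, here is a minimal sketch under a deliberately simplified model: a level pinhole camera looking down a flat road. The focal length, horizon row and camera height are hypothetical values chosen for illustration, not the parameters of any real sensor, and the function names are our own.

```python
def ground_depth_m(row_px, focal_px=1250.0, horizon_row_px=450.0, camera_height_m=1.5):
    """Distance along a flat road to the ground point imaged at row_px.

    Assumed model: a level pinhole camera, so a ground point at distance Z
    appears (camera_height_m * focal_px / Z) pixel rows below the horizon.
    All default values are hypothetical.
    """
    rows_below_horizon = row_px - horizon_row_px
    if rows_below_horizon <= 0:
        raise ValueError("row is at or above the horizon; it never meets the ground")
    return camera_height_m * focal_px / rows_below_horizon


def depth_shift_m(row_px, pixel_error=3.0, **camera):
    """Change in estimated depth if an edge is drawn pixel_error rows too high."""
    return ground_depth_m(row_px - pixel_error, **camera) - ground_depth_m(row_px, **camera)


if __name__ == "__main__":
    for row in (637.5, 487.5):  # ground points 10 m and 50 m away in this model
        print(f"depth {ground_depth_m(row):5.1f} m, 3 px slip moves it {depth_shift_m(row):.2f} m")
```

With these made-up camera values, a 3-pixel slip moves a corner by roughly 0.16 m when the object is 10 m away, but by more than 4 m when it is 50 m away. The exact figures depend entirely on the assumed camera, but the shape of the relationship is the point: the further rows are packed together near the horizon, the more depth each pixel stands for.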
To solve for this we could force hard-coded dimensions onto our system (making all cars a certain size) but we’d quickly run into countless exceptions (vehicles with cargo) or classes of objects that have no standard size (like construction zones). This pushes us towards a machine learning-based solution to derive sizes from objects, but this approach has its own challenges that we discuss later.
In our experience the best way around these issues is to reference high resolution 3D data — this is where LiDAR comes in. Looking at the LiDAR point cloud takes almost all this guesswork out since the captured points trace the edges of both the left and rear wall and we can use them as guidelines to set our cuboid edge against. The depth distortion issue mentioned above is also non-existent in 3D.
Our previous example was comparatively easy when we consider the full range of scenes an AV developer must handle. Throw in some common real-world complexities and the difference between the two datasets gets even more extreme. Here we see a vehicle captured at night and making an unprotected turn in front of the AV: visibility is poor, it is partially occluded by a traffic sign and the oncoming glare of headlights doesn’t help. The labeler with access to 3D data is able to get a much more accurate rotation and depth for the vehicle because the high-mounted LiDAR sensor lets them see ‘over’ the occluding traffic sign to measure where the car ends and how it’s angled.
Of course cars are just one of the many types of objects we encounter while driving — we also need to be aware of smaller vehicles such as bicycles or e-scooters that may lack lighting at night. Here’s a simple test — look at the image below and see if you can find the camouflaged vehicle:
There is indeed an e-scooter and rider, hiding between the pole and some roadside foliage on the right-side of the image. The rider’s dark outfit combined with grainy low-light imagery makes it hard to tell if we are looking at a shadow or a person. This object was entirely missing in our video-only training data and only discovered in the LiDAR annotations.
It would be extremely dangerous for an AV to fail to identify a vehicle like this. Given the ambiguity of the video imagery, a car without access to LiDAR has two options: it’s either going to ignore these objects and risk driving blind to small vehicles when visibility is poor, or it can be programmed to be overly cautious and will frequently hallucinate that moving shadows on foliage are also scooter riders (causing the car to erratically brake or swerve in efforts to avoid collisions with the imaginary objects around it). Both of these behaviors are obviously reckless to the other people the AV will be sharing the road with.
We’ve seen that predicting 3D labels purely from 2D sensor data has challenges, but how widespread are these issues? We charted all the cuboid rotation inaccuracies across our dataset and found that the average video-only annotation is 0.19 radians (10.8 degrees) off from its LiDAR-verified counterpart. After further slicing the data, we found that night annotations have a higher average error (0.22 radians) than annotations from day scenes (0.16 radians) and that accuracy decreases as distance from the camera increases.
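For readers who want to run this kind of rotation comparison on their own annotation pairs, here is a minimal sketch. It assumes each cuboid carries a single yaw angle in radians and that headings are directed (a car facing backwards counts as a large error); the function names are hypothetical, not part of any Scale tooling.

```python
import math


def yaw_error(yaw_a, yaw_b):
    """Smallest absolute angle between two headings, in radians (0 to pi).

    Wrapping matters: headings of 3.1 and -3.1 radians are nearly identical,
    even though their raw difference is 6.2.
    """
    diff = (yaw_a - yaw_b + math.pi) % (2 * math.pi) - math.pi
    return abs(diff)


def mean_yaw_error(pairs):
    """Average yaw error over (video_only_yaw, lidar_yaw) pairs."""
    errors = [yaw_error(a, b) for a, b in pairs]
    return sum(errors) / len(errors)
```

Slicing by condition (day versus night, near versus far) is then just a matter of calling `mean_yaw_error` on each subset of pairs.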
To further quantify this pattern, we graded all of the 2D and 3D annotation pairs with a standard metric of quality for object detection tasks: IOU scoring. (IOU, or Intersection Over Union, is a popular metric for object detection tasks since it measures how closely two shapes overlap, taking into account location, pose and size errors.) The mean score for the entire dataset came out to be 32.1%. For context, we generally aim for an IOU score of 90%+ to consider an annotation ‘correct’.
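As an illustration of how the metric behaves, here is a minimal IOU sketch for axis-aligned 3D boxes. This is a simplification: the cuboids in our comparison are rotated, and scoring those requires intersecting rotated footprints, which this example does not attempt. The box layout `(xmin, ymin, zmin, xmax, ymax, zmax)` is an assumption made for the example.

```python
def iou_3d(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Each box is (xmin, ymin, zmin, xmax, ymax, zmax) in metres.
    Returns a value between 0.0 (no overlap) and 1.0 (identical boxes).
    """
    intersection = 1.0
    for axis in range(3):
        low = max(box_a[axis], box_b[axis])
        high = min(box_a[axis + 3], box_b[axis + 3])
        if high <= low:
            return 0.0  # the boxes do not overlap along this axis
        intersection *= high - low

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    union = volume(box_a) + volume(box_b) - intersection
    return intersection / union


lidar_box = (0.0, 0.0, 0.0, 4.0, 2.0, 1.0)
video_box = (1.0, 0.0, 0.0, 5.0, 2.0, 1.0)  # same size, shifted 1 m along its length
print(iou_3d(lidar_box, video_box))
```

Two boxes of identical size, offset by a quarter of their length, already score only 0.6. That is why combined errors in depth, width and rotation pull scores down so quickly.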
What does it mean that humans struggle to annotate these types of 2D scenes (while managing to walk around our houses and drive to work without a second thought)? It's really a testament to just how differently our brains go about perceiving the world compared to the software of a self-driving car. When it comes to planning physical movements, we don’t need to perform mathematical calculations in our head about the environment to know to hit the brakes when a driver runs a red-light in front of us.
Self-driving cars, on the other hand, do need to perform these calculations, and that’s largely by design. It would be extremely dangerous to let a predictive system like a neural net (infamously cryptic to debug and prone to confusion) take control of a self-driving car directly (the end-to-end learning approach). The ‘brain’ of a self-driving car is instead separated into stacks of smaller systems with perception at the base, followed by prediction, planning and finally action. Perception is the foundation because later steps like prediction and planning all rely on accurate representations of the current world to correctly forecast where actors will be and how they will interact with both the AV and each other. If your perception system is weak or inaccurate, your ability to forecast the future will be dramatically reduced.
This might matter less in simple cases like highway driving, where the range of possible behaviors is relatively small, but there are many situations in fully autonomous driving that require longer-term forecasting to maneuver safely (e.g. understanding when it is safe to take an unprotected left or move around a double-parked vehicle). Additionally, almost all self-driving stacks visualize the world in top-down perspective for planning purposes, so misjudging the width of a car (as we saw in the first example) can lead the planning system to incorrectly predict what maneuvers other cars on the road have the space to perform or even to propose a path that would lead to the AV side-swiping the other vehicle. While many anti-LiDAR arguments boil down to ‘I can get around fine without strapping a spinning LiDAR sensor to my head, so a good neural net should be able to as well’, AV software architecture is built in a way that requires better predictions and therefore higher accuracy perception than a human can get by with.
Figuring out how to get great annotation precision from 2D data will be a major challenge for developers of non-LiDAR systems. This explains why a chunk of Tesla’s Autonomy Day presentation focused on research they are doing into experimental systems to predict object sizes and positions. One approach that has been discussed recently is to create a point cloud using stereo cameras (similar to how our eyes use parallax to judge distance). So far this hasn't proved to be a great alternative, since you would need unrealistically high-resolution cameras to measure objects at any significant distance. Another approach demoed was to use an additional layer of machine learning to understand size and depth. Ultimately this means leaning even more heavily on neural nets, with their unpredictable and extreme failure cases, for safety-critical systems. LiDAR-based systems can directly measure distance and size, making the vehicle’s perception system much more robust to neural net failures. Tesla did briefly show an example of top-down depth predictions made using their neural net-powered systems. Even in this fairly simple scene (daytime, on a highway), the predicted vehicles display significant size and angle distortions.
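The resolution problem with stereo cameras mentioned above can be seen with a little arithmetic. The sketch below uses the textbook rectified-stereo relation (depth = focal length × baseline / disparity). The 1000-pixel focal length and 0.3 m baseline are hypothetical values picked for illustration, not the specification of any production camera rig.

```python
def stereo_depth_m(disparity_px, focal_px, baseline_m):
    """Depth of a point from its disparity between two rectified cameras."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def depth_error_m(depth_m, focal_px=1000.0, baseline_m=0.3, disparity_error_px=1.0):
    """How far off the depth estimate lands if disparity is under-measured
    by disparity_error_px for a point that is really depth_m away.
    Default camera values are hypothetical.
    """
    true_disparity = focal_px * baseline_m / depth_m
    measured = stereo_depth_m(true_disparity - disparity_error_px, focal_px, baseline_m)
    return measured - depth_m


if __name__ == "__main__":
    for depth in (10.0, 60.0):
        print(f"at {depth:4.0f} m a 1 px disparity error costs {depth_error_m(depth):.2f} m")
```

For this assumed rig, a single pixel of disparity error costs about a third of a metre at 10 m but 15 m at 60 m, because the error grows with the square of the distance. Shrinking it means a wider baseline or many more pixels, which is the trade-off behind the 'unrealistically high-resolution' remark.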
While 2D annotations may look superficially accurate, they often have deeper inaccuracies hiding beneath the surface. Inaccurate data will harm the confidence of ML models whose outputs cascade down into the vehicle’s prediction and planning software. Without a breakthrough in computer vision research, it will likely be a struggle for these types of driving systems to achieve true autonomy, where vehicles must perform thousands of predictions per mile with no room for error.
However, 2D annotations probably still have a place, either as part of a larger sensor system or for simpler tasks such as classifying objects for basic lane-keeping/collision avoidance or highway driving.
It’s always better to have multiple sensor modalities available. One of the benefits of combining data from camera and LiDAR is that in situations when one sensor type is ‘blind’ (such as a car hidden by a traffic sign or the time it takes a camera to adjust exposure when going under a bridge) we can rely on the other sensor to fill in missing information.
At a broader level, our findings support the idea of a virtuous cycle in ML development: more robust sensors lead to much higher accuracy training data which means your perception model will perform better, which in turn may eventually lower your reliance on any one sensor type. But the inverse effect exists too: even if a safe self-driving system is theoretically possible without LiDAR, getting great training data will be much tougher with cameras alone. Barring a paradigm change in ML techniques, a high volume of mediocre training data will only take you so far. Developers who are deprived of higher quality data will face an uphill battle to train their perception systems to the level of accuracy it will take to get true autonomy safely off the ground.
^ We calibrate the 2D cuboids to pseudo-3D using the nuScenes camera intrinsics data and then use the camera extrinsics to scale up the cuboids proportionally in all directions to a known reference point in 3D (in our case this is the closest ground point) to get comparable object pairs.
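For intuition only, here is a rough sketch of that kind of lift-and-rescale step. It is not the actual pipeline: it assumes a plain pinhole camera, works entirely in the camera frame, and takes the depth of the reference point as given rather than deriving it from the extrinsics. All names and parameters are hypothetical.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) to a camera-frame point at the given depth.

    fx, fy, cx, cy are pinhole intrinsics (focal lengths and principal
    point, in pixels); depth is measured along the optical axis.
    """
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def scale_to_reference(points, ref_index, ref_depth_m):
    """Scale pseudo-3D corner points uniformly about the camera origin so
    that points[ref_index] ends up at the known depth ref_depth_m.

    Scaling about the camera origin keeps every corner on its original
    viewing ray, so the cuboid still projects onto the same pixels.
    """
    scale = ref_depth_m / points[ref_index][2]
    return [(x * scale, y * scale, z * scale) for x, y, z in points]
```

In use, each cuboid corner would be back-projected with a relative depth, and the whole set rescaled so that the chosen reference corner (for example, the closest ground point) sits at its known distance.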