At Tesla’s 2019 Autonomy Day, Elon Musk made headlines when he took a
high-profile swipe at LiDAR (Light Detection and Ranging) technology,
predicting that "anyone relying on LiDAR is doomed". While LiDAR has been
used by self-driving vehicle developers for over a decade, Musk declared that the
only hardware Tesla needs is the existing suite of cameras and sensors already
installed on their vehicles. The road to autonomy, he argued, lies not in
adding more sensors but in the massive amounts of real-world training data
Tesla’s fleet collects coupled with state-of-the-art advances in computer
vision.
Musk's prediction cast a spotlight on a rapidly growing divide in the world of
AV (autonomous vehicle) development: whether to aim for vehicles that, like
human drivers, can navigate the world through sight alone or if sensors like
LiDAR are still necessary to counter-balance some of the limitations of
computer vision. Currently this debate is far from settled as neither approach
has yielded large-scale deployments of self-driving vehicles and public data
that could be used to compare the technologies is scarce. Since Scale has a
suite of data labeling products built for AV developers, we wanted to use our
own tools to put the two self-driving philosophies to the test.
The Devil Is In The Data
First some background: perception systems don’t just learn to understand a
highway on their own — they are trained using large datasets of human-verified
data, an approach known as 'supervised learning'. The training data
consists of some output from the car’s sensors (like video imagery of the road
ahead or a 3D LiDAR point cloud of the environment around the vehicle) that is
sent to a human who annotates the position and type of every object we would
like the AV to learn to ‘see’. Since the labeled data is the primary input by
which the car is trained to perceive the world, we can often glean clues about how well
an AV’s perception system will function just by looking at the quality of the
training data it is generating. To build a sensor perception system that
performs with high enough accuracy to drive a multi-ton robot, having a lot of
data is not enough — we must be able to annotate this data at extremely high
accuracy levels or the perception system’s performance will begin to regress.
This is where our experiment begins! Scale has the tooling to produce data
datasets from any combination of sensors, and we recently provided annotations for
a 3D dataset created in collaboration with Aptiv from an AV equipped with both
cameras and LiDAR. Could we predict how well a camera-based system
might work versus a LiDAR-based system by comparing the training data
generated from each? To answer this, we took a collection of driving scenes
from our 3D dataset, extracted just the 2D video imagery and then re-labeled
it to create a 2D dataset similar to what a perception system without LiDAR
might be trained on. We then projected[1] these 2D annotations back onto the
original 3D data and compared them object-by-object to see if any accuracy loss
occurred.
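To give a concrete sense of the geometry involved, here is a simplified sketch of the kind of back-projection this comparison relies on, assuming a pinhole camera model and a flat ground plane at an assumed camera height; the intrinsics values and helper function below are illustrative, and the real comparison uses the dataset's own calibration as described in the footnote.

```python
import numpy as np

# Illustrative nuScenes-style pinhole intrinsics (focal lengths and principal
# point in pixels); real calibration comes from the dataset's sensor metadata.
K = np.array([[1266.4,    0.0, 816.3],
              [   0.0, 1266.4, 491.5],
              [   0.0,    0.0,   1.0]])

def backproject_to_ground(pixel_uv, K, camera_height_m=1.5):
    """Cast a ray through an image pixel and intersect it with a flat ground
    plane located camera_height_m below the camera (a simplifying assumption).
    Returns the point in the camera frame (x right, y down, z forward)."""
    u, v = pixel_uv
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])  # ray direction, up to scale
    scale = camera_height_m / ray[1]                # hit the plane y = camera_height_m
    return ray * scale

# The closest ground-contact pixel of a 2D cuboid gives a reference depth that
# the rest of the pseudo-3D cuboid can be scaled against.
print(backproject_to_ground((950.0, 620.0), K))
```

Every remaining dimension of the cuboid then has to be inferred from appearance alone, which is where the differences described below start to show up.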
Squaring Off: Camera vs LiDAR
With the two datasets all labeled and ready to go, we started comparing the
results — and immediately noticed some big differences. Many of the
annotations that looked perfectly reasonable when overlaid on video produced
obviously flawed representations when scaled up to 3D. Take the example below
— if you only looked at the 2D images on the left, both datasets might seem
accurate at first glance. But moving around the scene in 3D reveals that the
video-only dataset's annotation is both too long and missing an entire side of
the car.
A vehicle annotated from video-only.
Top-down view overlaid on LiDAR.
The same vehicle annotated from a combo of LiDAR + Video more accurately
captures the width and length of the vehicle.
Top-down view with LiDAR.
Why is the accuracy so much worse in 2D? Is the person who drew the bounding
box for this car just doing a bad job? No — it’s actually very challenging to
infer precise 3D measurements from a flat image. To draw an orderly 3D cuboid
around an irregularly shaped object (cars come in many shapes with all sorts of
variations of accessories, not to mention the many other forms of vehicles,
pedestrians and wildlife an AV will encounter), you need to know where all the
extreme points of the object lie. From a 2D perspective, it is guaranteed that
some of these points will either blend into or be hidden by the object itself.
Take our minivan example: the positions of the far-left and near-right edges of
the vehicle are easy to find since they are silhouetted against the background
of the scene, but there is no obvious visual guideline for where to draw the
back-left corner of the vehicle. Even worse for us, the back of the minivan is
both sloped and rounded. The annotator can use intuition to try to fill in
the blanks, but here she ends up underestimating the object's width, which
misaligns the cuboid's rotation and cuts off the left side of the car when
viewed in 3D.
If we take a closer look at the far-left edge, we see that the 2D annotator also
undershot the height because she had to estimate how far out the curved hood
of the minivan extends beyond the roof, leading to dramatically overestimated
depth in 3D. This leads us to the mathematical basis of another source of
inaccuracy — depth information is naturally ‘downscaled’ in a 2D image. As an
object gets closer to perpendicular with the horizon, moving the far edge by
just a few pixels can massively shift the perceived depth of a cuboid.
In 2D, a few pixels of inaccuracy between the LiDAR + Camera annotation
(white) and the Camera-only annotation (orange) become a much larger error
when projected into 3D.
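A quick back-of-the-envelope calculation with the same pinhole model makes the effect concrete (the focal length and camera height below are illustrative values):

```python
# A ground point at depth z projects to image row v = cy + f * h / z, where f is
# the focal length in pixels and h the camera height. Inverting that relation
# shows how a fixed pixel error translates into depth error at different ranges.
f_px, cam_height_m, cy = 1266.0, 1.5, 491.5  # illustrative values

def depth_from_row(v):
    return f_px * cam_height_m / (v - cy)

for true_depth in (5.0, 15.0, 30.0):
    v = cy + f_px * cam_height_m / true_depth   # row where the edge truly sits
    misread = depth_from_row(v - 3.0)           # the annotator is off by 3 pixels
    print(f"{true_depth:4.1f} m edge read as {misread:5.2f} m ({misread - true_depth:+.2f} m)")
```

The same three pixels that cost a few centimetres up close cost well over a metre further out, which is why small 2D errors balloon once the cuboid is lifted into 3D.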
To work around this, we could force hard-coded dimensions onto our system (making
all cars a certain size), but we'd quickly run into countless exceptions
(vehicles with cargo) or classes of objects that have no standard size (like
construction zones). This pushes us towards a machine learning-based solution
to derive sizes from objects — but this approach has its own challenges that
we discuss later.
In our experience, the best way around these issues is to reference high-resolution
3D data — this is where LiDAR comes in. Looking at the LiDAR point
cloud takes almost all of the guesswork out, since the captured points trace the
edges of both the left and rear walls and we can use them as guidelines to set
our cuboid edges against. The depth distortion issue mentioned above is also
non-existent in 3D.
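As a rough sketch of why the points make this so much easier (an illustration of the idea, not our annotation tooling), a top-down oriented box can be fitted almost mechanically to an object's LiDAR returns:

```python
import numpy as np

def fit_topdown_box(points_xy):
    """Fit a rough top-down oriented box to an object's LiDAR returns using PCA.
    Real tooling has to handle partial views more carefully, but the captured
    points themselves anchor the box edges instead of human guesswork."""
    center_guess = points_xy.mean(axis=0)
    centered = points_xy - center_guess
    _, _, vt = np.linalg.svd(centered, full_matrices=False)  # principal axes
    aligned = centered @ vt.T                                 # rotate into the box frame
    mins, maxs = aligned.min(axis=0), aligned.max(axis=0)
    length, width = maxs - mins
    yaw = np.arctan2(vt[0, 1], vt[0, 0])                      # heading of the long axis
    center = center_guess + ((mins + maxs) / 2.0) @ vt        # back to the world frame
    return center, length, width, yaw
```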
Round 2: Night Drive
Our previous example was comparatively easy when we consider the full range of
scenes AV developers must handle. Throw in some common real-world
complexities and the difference between the two datasets gets even more
extreme. Here we see a vehicle captured at night and making an unprotected
turn in front of the AV — visibility is poor, it is partially occluded by a
traffic sign and the oncoming glare of headlights doesn’t help. The labeler
with access to 3D data is able to get a much more accurate rotation and depth
for the vehicle because the high-mounted LiDAR sensor lets them see ‘over’ the
occluding traffic sign to measure where the car ends and how it’s angled.
A vehicle at night annotated from video-only.
Top-down view projected onto LiDAR.
The same vehicle annotated from a combo of LiDAR + Video for comparison.
Top-down view projected onto LiDAR.
Of course cars are just one of the many types of objects we encounter while
driving — we also need to be aware of smaller vehicles such as bicycles or
e-scooters that may lack lighting at night. Here’s a simple test — look at the
image below and see if you can find the camouflaged vehicle:
There is indeed an e-scooter and rider, hiding between the pole and some
roadside foliage on the right side of the image. The rider's dark outfit
combined with grainy low-light imagery makes it hard to tell if we are looking
at a shadow or a person. This object was entirely missing in our video-only
training data and only discovered in the LiDAR annotations.
Viewed by camera the scooter and rider are largely hidden by the
background foliage.
Viewed top-down in LiDAR, the object points indicate the presence of the
scooter and rider.
It would be extremely dangerous for an AV to fail to identify a vehicle like
this. Given the ambiguity of the video imagery, a car without access to LiDAR
has two options: it can either ignore these objects and risk driving
blind to small vehicles when visibility is poor, or it can be programmed to be
overly cautious and frequently hallucinate that moving shadows on foliage
are scooter riders (causing the car to erratically brake or swerve in an
effort to avoid collisions with the imaginary objects around it). Both of
these behaviors are obviously reckless towards the other people the AV will be
sharing the road with.
Perception and Prediction
We’ve seen that predicting 3D labels purely from 2D sensor data has
challenges, but how widespread are these issues? We charted all the cuboid
rotation inaccuracies across our dataset and found that the average video-only
annotation is .19 radians (10.8 degrees) off from its LiDAR-verified
counterpart. After further slicing the data, we found that night annotations
have a higher average error (.22 radians) than annotations from day scenes
(.16 radians), and that accuracy decreases as distance from the camera increases.
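For readers curious how such slices are computed, a toy version of the aggregation looks like this (the handful of records below are made-up placeholders, not rows from our dataset):

```python
import numpy as np
import pandas as pd

def yaw_error(a, b):
    """Smallest absolute difference between two headings, in radians."""
    return np.abs(np.arctan2(np.sin(a - b), np.cos(a - b)))

# Made-up example records pairing a camera-only label with its LiDAR-verified twin.
df = pd.DataFrame({
    "yaw_2d": [0.10, 1.55, -2.90, 0.42],
    "yaw_3d": [0.31, 1.40, -2.61, 0.38],
    "scene":  ["day", "night", "night", "day"],
    "dist_m": [12.0, 28.0, 41.0, 9.0],
})
df["rot_err"] = yaw_error(df["yaw_2d"], df["yaw_3d"])

print(df["rot_err"].mean())                    # overall mean rotation error
print(df.groupby("scene")["rot_err"].mean())   # day vs night slices
print(df.groupby(pd.cut(df["dist_m"], bins=[0, 20, 40, 60]),
                 observed=False)["rot_err"].mean())  # error by distance band
```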
To further quantify this pattern, we graded all of the 2D and 3D annotation
pairs with a standard quality metric for object detection tasks: IOU scoring.
(IOU, or Intersection over Union, is a popular metric for such tasks
since it measures the 'difference' between two shapes, taking into
account location, pose and size errors.) The mean score for the entire dataset
came out to be 32.1%. For context, we generally aim for an IOU score of 90%+
to consider an annotation ‘correct’.
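The metric itself is straightforward to compute. For axis-aligned boxes (a simplification of the oriented cuboids we actually score, which also need rotation handled), a sketch looks like this:

```python
def iou_3d_axis_aligned(box_a, box_b):
    """IOU of two axis-aligned 3D boxes given as (xmin, ymin, zmin, xmax, ymax, zmax).
    A simplified stand-in for scoring oriented cuboids."""
    overlap = 1.0
    for lo_a, lo_b, hi_a, hi_b in zip(box_a[:3], box_b[:3], box_a[3:], box_b[3:]):
        side = min(hi_a, hi_b) - max(lo_a, lo_b)
        if side <= 0:          # the boxes don't overlap along this axis
            return 0.0
        overlap *= side
    volume = lambda b: (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return overlap / (volume(box_a) + volume(box_b) - overlap)

# Two identically sized boxes offset by half a metre along their length.
print(iou_3d_axis_aligned((0, 0, 0, 4, 2, 1.5), (0.5, 0, 0, 4.5, 2, 1.5)))  # ~0.78
```

Even a modest half-metre offset between otherwise identical boxes already drops the score to roughly 78%, which helps explain how the camera-only cuboids average so far below our 90% bar.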
Implications
What does it mean that humans struggle to annotate these types of 2D scenes
(while managing to walk around our houses and drive to work without a second
thought)? It's really a testament to just how differently our brains go about
perceiving the world compared to the software of a self-driving car. When it
comes to planning physical movements, we don't need to perform mathematical
calculations about the environment in our heads to know to hit the brakes when
a driver runs a red light in front of us.
Self-driving cars, on the other hand, do need to perform these calculations —
and that’s largely by design. It would be extremely dangerous to let a
predictive system like a neural net (infamously cryptic to debug and prone to
confusion) take control of a self-driving car directly (the end-to-end
learning approach). The 'brain' of a self-driving car is instead separated
into a stack of smaller systems, with perception at the base, followed by
prediction, planning and finally action. Perception is the foundation because
later steps like prediction and planning all rely on accurate representations
of the current world to correctly forecast where actors will be and how they
will interact with both the AV and each other. If your perception system is
weak or inaccurate, your ability to forecast the future will be dramatically
reduced.
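To put a rough number on that, here is a toy constant-velocity forecast showing how the average camera-only rotation error we measured earlier turns into metres of drift over a short planning horizon (the speed and horizon values are illustrative):

```python
import math

def forecast_drift(heading_err_rad, speed_mps, horizon_s):
    """Distance between the positions forecast from the true heading and from a
    heading that is off by heading_err_rad, assuming straight-line constant speed."""
    travelled = speed_mps * horizon_s
    return 2.0 * travelled * math.sin(heading_err_rad / 2.0)  # chord length

# .19 radians of rotation error, a car doing roughly 50 km/h, a 3-second horizon.
print(f"{forecast_drift(0.19, 14.0, 3.0):.1f} m of drift in the predicted position")
```

Roughly eight metres of drift is more than two lane widths, so even the average error we measured is enough to mislead downstream planning.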
This might matter less in simple cases like highway driving, where the range of
possible behaviors is relatively small, but there are many situations in fully
autonomous driving that require longer-term forecasting to maneuver safely
(e.g. understanding when it is safe to take an unprotected left or move around
a double-parked vehicle). Additionally, almost all self-driving stacks
visualize the world in top-down perspective for planning purposes, so
misjudging the width of a car (as we saw in the first example) can lead the
planning system to incorrectly predict what maneuvers other cars on the road
have the space to perform or even to propose a path that would lead to the AV
side-swiping the other vehicle. While many anti-LiDAR arguments boil down to
‘I can get around fine without strapping a spinning LiDAR sensor to my head,
so a good neural net should be able to as well’, AV software architecture is
built in a way that requires better predictions and therefore higher accuracy
perception than a human can get by with.
Figuring out how to get great annotation precision from 2D data will be a
major challenge for developers of non-LiDAR systems. This explains why a chunk
of Tesla’s Autonomy Day presentation focused on research they are doing into
experimental systems to predict object sizes and positions. One approach that
has been explored is to create a point cloud using stereo cameras (similar to how our eyes use
parallax to judge distance). So far this hasn't proved to be a great
alternative since you would need unrealistically high-resolution cameras to
measure objects at any significant distance. Another approach demoed was to
use an additional layer of machine learning to understand size and depth.
Ultimately this means leaning even more heavily on neural nets — with their
unpredictable and extreme failure modes
— for safety-critical systems. LiDAR-based systems can directly measure
distance and size, making the vehicle’s perception system much more robust to
neural net failures. Tesla did briefly show an example of top-down depth
predictions made using their neural net-powered systems. Even in this fairly
simple scene (daytime, on a highway), the predicted vehicles display
significant size and angle distortions.
Tesla’s predicted bounding boxes for vehicles based on camera data. When
viewed from top-down perspective, the vehicle in the left lane exhibits
depth distortions as it moves away from the camera and the vehicle in the
right lane shows width and rotation inaccuracies. Source: Tesla Autonomy Day
2019
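The resolution problem with the stereo approach mentioned above falls straight out of the standard disparity relation; the baseline and focal length below are illustrative numbers, not Tesla's hardware.

```python
# Stereo depth from disparity: z = f * B / d, with f the focal length in pixels,
# B the baseline in metres and d the disparity in pixels. A one-pixel disparity
# error therefore costs roughly z**2 / (f * B) metres of depth, growing
# quadratically with range.
f_px, baseline_m = 1266.0, 0.3

for z in (10.0, 30.0, 60.0):
    depth_err = z**2 / (f_px * baseline_m)
    print(f"at {z:4.0f} m, one pixel of disparity error ≈ {depth_err:4.1f} m of depth error")
```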
Takeaways
- While 2D annotations may look superficially accurate, they often have deeper inaccuracies hiding beneath the surface. Inaccurate data will harm the confidence of ML models whose outputs cascade down into the vehicle's prediction and planning software. Without a breakthrough in computer vision research, it will likely be a struggle for these types of driving systems to achieve true autonomy, where vehicles must perform thousands of predictions per mile with no room for error.
- However 2D annotations probably still have a place either as part of a larger sensor system or for simpler tasks such as classifying objects for basic lane-keeping/collision avoidance or highway driving.
- It's always better to have multiple sensor modalities available. One of the benefits of combining data from camera and LiDAR is that in situations when one sensor type is 'blind' (such as a car hidden by a traffic sign or the time it takes a camera to adjust exposure when going under a bridge) we can rely on the other sensor to fill in missing information.
- At a broader level, our findings support the idea of a virtuous cycle in ML development: more robust sensors lead to much higher accuracy training data which means your perception model will perform better, which in turn may eventually lower your reliance on any one sensor type. But the inverse effect exists too: even if a safe self-driving system is theoretically possible without LiDAR, getting great training data will be much tougher with cameras alone. Barring a paradigm change in ML techniques, a high volume of mediocre training data will only take you so far. Developers who are deprived of higher quality data will face an uphill battle to train their perception systems to the level of accuracy it will take to get true autonomy safely off the ground.
Notes
[1] We calibrate the 2D cuboids to pseudo-3D using the nuScenes camera intrinsics data and then use the camera extrinsics to scale up the cuboids proportionally in all directions to a known reference point in 3D (in our case this is the closest ground point) to get comparable object pairs.