
Is Elon Wrong About LiDAR?

August 12, 2019


At Tesla’s 2019 Autonomy Day, Elon Musk made headlines when he took a

high-profile swipe at LiDAR (Light Detection and Ranging) technology,

predicting that "anyone relying on LiDAR is doomed". While LiDAR has been

widely embraced

by self-driving vehicle developers for over a decade, Musk declared that the

only hardware Tesla needs is the existing suite of cameras and sensors already

installed on their vehicles. The road to autonomy, he argued, lies not in

adding more sensors but in the massive amounts of real-world training data

Tesla’s fleet collects coupled with state-of-the-art advances in computer

vision.


Musk's prediction cast a spotlight on a rapidly growing divide in the world of

AV (autonomous vehicle) development: whether to aim for vehicles that, like human drivers, can navigate the world through sight alone, or whether sensors like LiDAR are still necessary to counterbalance some of the limitations of computer vision. Currently the debate is far from settled: neither approach has yielded large-scale deployments of self-driving vehicles, and public data that could be used to compare the technologies is scarce. Since Scale has a

suite of data labeling products built for AV developers, we wanted to use our

own tools to put the two self-driving philosophies to the test.

The Devil Is In The Data



First some background: perception systems don’t just learn to understand a

highway on their own — they are trained using large datasets of human-verified

data. This approach is known as ‘supervised learning’. The training data

consists of some output from the car’s sensors (like video imagery of the road

ahead or a 3D LiDAR point cloud of the environment around the vehicle) that is

sent to a human who annotates the position and type of every object we would

like the AV to learn to ‘see’. Since the labeled data is the primary input by

which the car is trained to perceive the world, we can often see clues about how well

an AV’s perception system will function just by looking at the quality of the

training data it is generating. To build a sensor perception system that

performs with high enough accuracy to drive a multi-ton robot, having a lot of

data is not enough — we must be able to annotate this data at extremely high

accuracy levels or the perception system’s performance will begin to regress.
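As a concrete illustration, a single labeled sample in such a dataset might look roughly like the following. This is a hypothetical, simplified schema in Python; the field names, paths, and categories are made up for illustration and are not the actual nuScenes or Scale format:

```python
# Hypothetical, simplified example of one labeled training sample.
# Real formats (e.g. nuScenes) carry far more metadata and calibration info.
labeled_frame = {
    "camera_image": "samples/CAM_FRONT/frame_0001.jpg",   # raw sensor output
    "lidar_points": "samples/LIDAR_TOP/frame_0001.bin",   # 3D point cloud, if available
    "annotations": [                                      # human-verified labels
        {
            "category": "vehicle.car",
            "center": [12.4, -2.1, 0.8],   # meters, in the vehicle frame
            "size": [4.6, 1.9, 1.7],       # length, width, height in meters
            "yaw": 0.19,                   # heading in radians
        },
        {
            "category": "pedestrian",
            "center": [5.2, 3.0, 0.9],
            "size": [0.6, 0.7, 1.8],
            "yaw": -1.4,
        },
    ],
}
```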

This is where our experiment begins! Scale has the tooling to produce data

sets from any combination of sensors, and we recently provided annotations for nuScenes, a 3D dataset created in collaboration with Aptiv from an AV equipped with both cameras and LiDAR. Could we predict how well a camera-based system

might work versus a LiDAR-based system by comparing the training data

generated from each? To answer this we took a collection of driving scenes

from our 3D dataset, extracted just the 2D video imagery and then re-labeled

it to create a 2D dataset similar to what a perception system without LiDAR

might be trained on. We then projected

[1] these 2D annotations back onto the original

3D data and compared them object-by-object to see if any accuracy loss

occurred.
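The note at the end of this post describes the calibration we used; as a rough sketch of the underlying idea (in Python with NumPy, with hypothetical function names, not our exact pipeline), a pixel can be lifted to a viewing ray using the camera intrinsics and then scaled out to a reference depth, such as the nearest ground point, before being moved into the world frame with the extrinsics:

```python
import numpy as np

def backproject_pixel(K, pixel, depth):
    """Back-project a 2D pixel into a 3D point in the camera frame.

    K     : 3x3 camera intrinsics matrix (e.g. from the nuScenes calibration data)
    pixel : (u, v) image coordinates
    depth : distance along the camera's z-axis at which to place the point,
            e.g. taken from a known reference such as the closest ground point
    """
    uv1 = np.array([pixel[0], pixel[1], 1.0])
    ray = np.linalg.inv(K) @ uv1        # direction of the viewing ray
    return ray * depth                  # scale the ray out to the chosen depth

def camera_to_world(T_cam_to_world, point_cam):
    """Move a 3D point from the camera frame into the world frame
    using a 4x4 extrinsics (pose) matrix."""
    return (T_cam_to_world @ np.append(point_cam, 1.0))[:3]
```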

Squaring Off: Camera vs LiDAR



With the two datasets all labeled and ready to go, we started comparing the

results — and immediately noticed some big differences. Many of the

annotations that look perfectly reasonable when overlaid on video produced obviously flawed representations when scaled up to 3D. Take the example below: if you only looked at the 2D images on the left, both datasets might seem accurate at first glance. But moving around the scene in 3D reveals that the

video-only dataset's annotation is both too long and missing an entire side of

the car.

A vehicle annotated from video-only.
Top-down view overlaid on LiDAR.

The same vehicle annotated from a combo of LiDAR + Video more accurately captures the width and length of the vehicle.
Top-down view with LiDAR.



Why is the accuracy so much worse in 2D? Is the person who drew the bounding

box for this car just doing a bad job? No — it’s actually very challenging to

infer precise 3D measurements from a flat image. To draw an orderly 3D cuboid

around an irregularly-sized object (cars come in many shapes with all sorts of

variations of accessories, not to mention the many other forms of vehicles,

pedestrians and wildlife an AV will encounter) you need to know where all the

extreme points of the object lie. In a 2D perspective it is guaranteed that

some of these points will either blend into or be hidden by the object itself.

Take our minivan example: the position of the far-left and near-right edges of

the vehicle are easy to find since they are silhouetted against the background

of the scene, but there is no obvious visual guideline for where to draw the

back-left corner of the vehicle. Even worse for us, the back of the minivan is

both sloped and rounded. The annotator can use intuition to try and fill in

the blanks, but here she ends up underestimating the object’s width, which misaligns the cuboid’s rotation and cuts off the left side of the car when

viewed in 3D.


If we take a closer look at the far-left edge we see the 2D annotator also

undershot the height because she had to estimate how far out the curved hood

of the minivan extends beyond the roof, leading to dramatically overestimated

depth in 3D. This leads us to the mathematical basis of another source of inaccuracy: depth information is naturally ‘downscaled’ in a 2D image. As an object becomes oriented closer to perpendicular with the horizon (pointing toward or away from the camera), moving its far edge by just a few pixels can massively shift the perceived depth of a cuboid.

In 2D a few pixels of inaccuracy between the LiDAR + Camera annotation

(white) and the Camera-only annotation (orange) becomes a much larger error

when projected into 3D.
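To get a feel for how quickly this error grows, here is a minimal sketch using a simple flat-ground pinhole model. The camera height, focal length, and pixel rows are illustrative assumptions (the focal length is roughly in the range of the nuScenes front camera), not measurements from our dataset:

```python
# Toy pinhole-camera example: for a point on a flat road, depth is
# focal_length * camera_height / (pixel rows below the horizon line).
f_px = 1266.0   # focal length in pixels (assumed)
cam_h = 1.5     # camera height above the road in meters (assumed)
v0 = 800.0      # image row of the horizon / principal point (assumed)

def ground_depth(v_row):
    """Depth of a road point that projects to image row v_row (below the horizon)."""
    return f_px * cam_h / (v_row - v0)

# An edge drawn at row 830 vs. row 828 -- a 2-pixel labeling error:
print(ground_depth(830))   # ~63.3 m
print(ground_depth(828))   # ~67.8 m -> roughly 4.5 m of depth error from 2 pixels
```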



To work around this we could force hard-coded dimensions onto our system (making all cars a certain size), but we’d quickly run into countless exceptions (vehicles with cargo) or classes of objects that have no standard size (like construction zones). This pushes us towards a machine learning-based solution to infer object sizes, but that approach has its own challenges, which

we discuss later.


In our experience the best way around these issues is to reference high-resolution 3D data; this is where LiDAR comes in. Looking at the LiDAR point cloud takes almost all of this guesswork out, since the captured points trace the edges of both the left and rear walls and we can use them as guidelines to set

our cuboid edge against. The depth distortion issue mentioned above is also

non-existent in 3D.
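As a sketch of how those points can anchor a cuboid, one simple approach is to estimate a heading for the point cluster and then take the extreme points along the rotated axes as the box edges. The snippet below is an illustration of that idea with assumed inputs, not a description of our production tooling:

```python
import numpy as np

def fit_bev_box(points_xy):
    """Fit a yaw-aligned bounding box (bird's-eye view) to a 2D point cluster.

    points_xy : (N, 2) array of LiDAR points for one object, projected to the
                ground plane. Returns (center, length, width, yaw).
    """
    center = points_xy.mean(axis=0)
    centered = points_xy - center
    # Principal axis of the cluster gives a rough estimate of the object's heading.
    cov = np.cov(centered.T)
    eigvals, eigvecs = np.linalg.eigh(cov)
    heading = eigvecs[:, np.argmax(eigvals)]
    yaw = np.arctan2(heading[1], heading[0])
    # Rotate points into the box frame and take the extreme points as the edges.
    rot = np.array([[np.cos(-yaw), -np.sin(-yaw)],
                    [np.sin(-yaw),  np.cos(-yaw)]])
    local = centered @ rot.T
    length = local[:, 0].max() - local[:, 0].min()
    width = local[:, 1].max() - local[:, 1].min()
    return center, length, width, yaw
```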

Round 2: Night Drive



Our previous example was comparatively easy when we consider the full range of

scenes an AV developer must handle. Throw in some common real-world

complexities and the difference between the two datasets gets even more

extreme. Here we see a vehicle captured at night and making an unprotected

turn in front of the AV — visibility is poor, it is partially occluded by a

traffic sign and the oncoming glare of headlights doesn’t help. The labeler

with access to 3D data is able to get a much more accurate rotation and depth

for the vehicle because the high-mounted LiDAR sensor lets them see ‘over’ the

occluding traffic sign to measure where the car ends and how it’s angled.

A vehicle at night annotated from video-only.
Top-down view projected onto LiDAR.

The same vehicle annotated from a combo of LiDAR + Video for comparison.
Top-down view projected onto LiDAR.



Of course cars are just one of the many types of objects we encounter while

driving — we also need to be aware of smaller vehicles such as bicycles or

e-scooters that may lack lighting at night. Here’s a simple test — look at the

image below and see if you can find the camouflaged vehicle:



There is indeed an e-scooter and rider, hiding between the pole and some

roadside foliage on the right side of the image. The rider’s dark outfit

combined with grainy low-light imagery makes it hard to tell if we are looking

at a shadow or a person. This object was entirely missing in our video-only

training data and only discovered in the LiDAR annotations.

Viewed by camera, the scooter and rider are largely hidden by the background foliage.
Viewed top-down in LiDAR, the object points indicate the presence of the scooter and rider.



It would be extremely dangerous for an AV to fail to identify a vehicle like

this. Given the ambiguity of the video imagery, a car without access to LiDAR

has two options: it can either ignore these objects and risk driving blind to small vehicles when visibility is poor, or it can be programmed to be overly cautious and frequently hallucinate that moving shadows on foliage are scooter riders (causing the car to brake or swerve erratically to avoid collisions with imaginary objects around it). Both of these behaviors are obviously reckless toward the other people the AV will be

sharing the road with.

Perception and Prediction



We’ve seen that predicting 3D labels purely from 2D sensor data has

challenges, but how widespread are these issues? We charted all the cuboid

rotation inaccuracies across our dataset and found that the average video-only annotation is 0.19 radians (10.8 degrees) off from its LiDAR-verified counterpart. After further slicing the data, we found that night annotations have a higher average error (0.22 radians) than annotations from day scenes (0.16 radians), and that accuracy decreases as distance from the camera increases.
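For reference, the comparison behind these numbers boils down to a wrapped angular difference between paired annotations. A minimal sketch, assuming each annotation has already been reduced to a yaw (heading) angle:

```python
import numpy as np

def mean_yaw_error(yaw_camera_only, yaw_lidar_verified):
    """Mean absolute rotation error between paired annotations, in radians.

    Both inputs are arrays of yaw angles, one entry per matched object. The
    wrapping step keeps each difference in [-pi, pi] so that, say, 359 degrees
    vs. 1 degree counts as a 2-degree error rather than 358 degrees.
    """
    diff = np.asarray(yaw_camera_only) - np.asarray(yaw_lidar_verified)
    wrapped = (diff + np.pi) % (2 * np.pi) - np.pi
    return np.abs(wrapped).mean()
```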



To further quantify this pattern, we graded all of the 2D and 3D annotation pairs with IOU scoring, a standard quality metric for object detection tasks. (IOU, or Intersection over Union, is popular for these tasks since it measures the overlap between two shapes, taking into account location, pose, and size errors.) The mean score for the entire dataset

came out to be 32.1%. For context, we generally aim for an IOU score of 90%+

to consider an annotation ‘correct’.
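To illustrate the metric, here is a minimal sketch that scores the top-down (bird's-eye-view) footprints of two boxes using shapely; our grading compared the full annotations, and the box values in the example are made up:

```python
import numpy as np
from shapely.geometry import Polygon

def bev_corners(cx, cy, length, width, yaw):
    """Corner points of a yaw-rotated box in the bird's-eye view (ground plane)."""
    dx, dy = length / 2.0, width / 2.0
    corners = np.array([[dx, dy], [dx, -dy], [-dx, -dy], [-dx, dy]])
    rot = np.array([[np.cos(yaw), -np.sin(yaw)],
                    [np.sin(yaw),  np.cos(yaw)]])
    return corners @ rot.T + np.array([cx, cy])

def bev_iou(box_a, box_b):
    """IOU of two (cx, cy, length, width, yaw) boxes in the ground plane."""
    poly_a = Polygon(bev_corners(*box_a))
    poly_b = Polygon(bev_corners(*box_b))
    inter = poly_a.intersection(poly_b).area
    union = poly_a.union(poly_b).area
    return inter / union if union > 0 else 0.0

# Example: a 5 m x 2 m car vs. the same box shifted 1 m and rotated ~11 degrees.
print(bev_iou((0.0, 0.0, 5.0, 2.0, 0.0),
              (1.0, 0.0, 5.0, 2.0, 0.19)))   # well below the 90% bar we aim for
```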


Implications



What does it mean that humans struggle to annotate these types of 2D scenes

(while managing to walk around our houses and drive to work without a second

thought)? It's really a testament to just how differently our brains go about

perceiving the world compared to the software of a self-driving car. When it

comes to planning physical movements, we don’t need to perform mathematical calculations about the environment in our heads to know to hit the brakes when a driver runs a red light in front of us.




Self-driving cars, on the other hand, do need to perform these calculations —

and that’s largely by design. It would be extremely dangerous to let a

predictive system like a neural net (infamously cryptic to debug and prone to

confusion) take control of a self-driving car directly (the end-to-end

learning approach). The ‘brain’ of a self-driving car is instead separated

into stacks of smaller systems with perception at the base, followed by

prediction, planning and finally action. Perception is the foundation because

later steps like prediction and planning all rely on accurate representations

of the current world to correctly forecast where actors will be and how they

will interact with both the AV and each other. If your perception system is

weak or inaccurate, your ability to forecast the future will be dramatically

reduced.
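As a rough illustration of that layered design, here is a minimal sketch with hypothetical, heavily simplified interfaces (real stacks are far more elaborate); the point is the one-directional flow, where an error introduced by perception propagates into every later stage:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class DetectedObject:                  # produced by the perception layer
    position: Tuple[float, float]      # (x, y) in the ground plane, meters
    size: Tuple[float, float]          # (length, width), meters
    yaw: float                         # heading, radians
    category: str

@dataclass
class PredictedTrajectory:             # produced by the prediction layer
    obj: DetectedObject
    future_positions: List[Tuple[float, float]]

def drive_one_tick(sensor_frame,
                   perceive: Callable,   # sensor frame     -> List[DetectedObject]
                   predict: Callable,    # detected objects -> List[PredictedTrajectory]
                   plan: Callable,       # trajectories     -> planned path
                   act: Callable):       # planned path     -> steering/brake commands
    """One cycle through the stack: perception feeds prediction, prediction
    feeds planning, and planning feeds the controls."""
    objects = perceive(sensor_frame)
    trajectories = predict(objects)
    path = plan(trajectories)
    return act(path)
```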


This might matter less in simple cases like highway driving, where the range of possible behaviors is relatively small, but there are many situations in fully autonomous driving that require longer-term forecasting to maneuver safely (e.g. understanding when it is safe to take an unprotected left or move around

a double-parked vehicle). Additionally, almost all self-driving stacks

visualize the world in top-down perspective for planning purposes, so

misjudging the width of a car (as we saw in the first example) can lead the

planning system to incorrectly predict what maneuvers other cars on the road

have the space to perform or even to propose a path that would lead to the AV

side-swiping the other vehicle. While many anti-LiDAR arguments boil down to

‘I can get around fine without strapping a spinning LiDAR sensor to my head,

so a good neural net should be able to as well’, AV software architecture is

built in a way that requires better predictions and therefore higher accuracy

perception than a human can get by with.


Figuring out how to get great annotation precision from 2D data will be a

major challenge for developers of non-LiDAR systems. This explains why a chunk

of Tesla’s Autonomy Day presentation focused on research they are doing into

experimental systems to predict object sizes and positions. One approach that

has been

discussed recently

is to create a point cloud using stereo cameras (similar to how our eyes use parallax to judge distance). So far this hasn’t proved to be a great alternative, since you would need unrealistically high-resolution cameras to measure objects at any significant distance (see the sketch at the end of this section). Another approach demoed was to

use an additional layer of machine learning to understand size and depth.

Ultimately this means leaning even more heavily on neural nets — with their

unpredictable and extreme

failure cases

— for safety-critical systems. LiDAR-based systems can directly measure

distance and size, making the vehicle’s perception system much more robust to

neural net failures. Tesla did briefly show an example of top-down depth

predictions made using their neural net-powered systems. Even in this fairly

simple scene (daytime, on a highway), the predicted vehicle sizes display

significant size and angle distortions.

Tesla’s predicted bounding boxes for vehicles based on camera data. When

viewed from top-down perspective, the vehicle in the left lane exhibits

depth distortions as it moves away from the camera and the vehicle in the

right lane shows width and rotation inaccuracies. Source: Tesla Autonomy Day

2019
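For intuition on the stereo limitation mentioned above, here is a minimal sketch under a simple pinhole stereo model: depth is recovered as Z = f·B/d (focal length in pixels times the camera baseline, divided by the disparity in pixels), so a fixed one-pixel disparity error produces a depth error that grows roughly with the square of the distance. The focal length, baseline, and pixel error below are illustrative assumptions, not figures from any particular vehicle:

```python
# Stereo depth: Z = f * B / d. A small disparity error dd therefore causes a
# depth error of roughly Z**2 / (f * B) * dd, growing quadratically with range.
f_px = 1300.0      # focal length in pixels (assumed)
baseline = 0.30    # distance between the two cameras in meters (assumed)
pixel_error = 1.0  # disparity matching error in pixels (assumed)

for depth in (10.0, 50.0, 100.0):
    error = depth ** 2 / (f_px * baseline) * pixel_error
    print(f"at {depth:5.1f} m, a 1-px disparity error is ~{error:.1f} m of depth error")
# -> ~0.3 m at 10 m, ~6.4 m at 50 m, ~25.6 m at 100 m
```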


Takeaways



  • While 2D annotations may look superficially accurate, they often have deeper inaccuracies hiding beneath the surface. Inaccurate data will harm the confidence of ML models whose outputs cascade down into the vehicle’s prediction and planning software. Without a breakthrough in computer vision research, it will likely be a struggle for these types of driving systems to achieve true autonomy, where vehicles must perform thousands of predictions per mile with no room for error.

  • However, 2D annotations probably still have a place, either as part of a larger sensor system or for simpler tasks such as classifying objects for basic lane-keeping/collision avoidance or highway driving.

  • It’s always better to have multiple sensor modalities available. One of the benefits of combining data from camera and LiDAR is that, in situations where one sensor type is ‘blind’ (such as a car hidden by a traffic sign or the time it takes a camera to adjust exposure when going under a bridge), we can rely on the other sensor to fill in missing information.

  • At a broader level, our findings support the idea of a virtuous cycle in ML development: more robust sensors lead to much higher-accuracy training data, which means your perception model will perform better, which in turn may eventually lower your reliance on any one sensor type. But the inverse effect exists too: even if a safe self-driving system is theoretically possible without LiDAR, getting great training data will be much tougher with cameras alone. Barring a paradigm change in ML techniques, a high volume of mediocre training data will only take you so far. Developers who are deprived of higher-quality data will face an uphill battle to train their perception systems to the level of accuracy it will take to get true autonomy safely off the ground.

Notes


  1. ^ We calibrate the 2D cuboids to pseudo-3D using the nuScenes camera intrinsics data and then use the camera extrinsics to scale up the cuboids proportionally in all directions to a known reference point in 3D (in our case this is the closest ground point) to get comparable object pairs.


