Introducing PandaSet: the world’s most advanced LiDAR dataset for commercial use
In these unprecedented times, COVID-19 has brought out a renewed and inspiring
sense of collaboration in AI and research communities as we work toward
solving pressing issues. But the pandemic has also exacerbated some of the
difficulties of developing new technologies at scale.
For example, as we shelter in place around the world, the promise of
autonomous vehicles (AVs) to improve access to critical goods and services has
never felt more
relevant. But as we realize more ways these technologies could improve our lives, the
essential data collection and testing that power them have rightly been
suspended to ensure the safety of those involved.
That’s why today we’re launching PandaSet:
a new open-source dataset for training machine learning (ML) models for
autonomous driving, released in partnership with the LiDAR manufacturer Hesai.
While many AV companies are turning to complementary techniques and simulated
data to continue their work, there is often no substitute for high-quality
data that captures the complex and often messy reality of driving in the real
world.
High-quality data is crucial to building safe and effective AV systems.
PandaSet is the world’s first publicly available dataset to include both
mechanical spinning and forward-facing LiDARs (Hesai’s Pandar64 and
PandarGT)—allowing ML teams to take advantage of the latest technologies. It
is also the first to be released without any major restrictions on its
commercial use.
There are three reasons why we hope AV teams will find PandaSet to be a
valuable resource: its content, its quality, and its no-cost commercial
license.
Content
Covering some of the most challenging driving conditions for full Level 5
autonomy, PandaSet includes complex urban environments with dense traffic
and pedestrians, steep hills, construction, and a variety of lighting
conditions across day, dusk, and evening.
The dataset contains more than 48,000 camera images and over 16,000 LiDAR
sweeps, spread across more than 100 scenes of 8 seconds each. Capturing
sequences in busy urban areas also means there is a high density of useful
information, with many more objects in each frame than in other datasets.
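For a concrete sense of how these scenes are packaged, here is a minimal sketch of browsing one sequence with the pandaset-devkit (the dataset support tools linked at the end of this post). The dataset path and sequence ID are placeholders, and the attribute names follow the devkit’s documented API; treat the details as assumptions rather than guarantees.

```python
# Minimal sketch: browsing one PandaSet sequence with the pandaset-devkit.
# The dataset root and the sequence ID "002" are placeholders.
from pandaset import DataSet

dataset = DataSet("/data/pandaset")   # root folder of the extracted dataset
print(dataset.sequences())            # IDs of the 100+ available scenes

seq = dataset["002"]                  # one ~8-second scene
seq.load()                            # loads LiDAR sweeps, images, and labels

# Each LiDAR sweep is a pandas DataFrame of 3D points.
sweep = seq.lidar[0]
print(len(sweep), "points in the first sweep")

# Camera frames are grouped by camera name.
front_image = seq.camera["front_camera"][0]
front_image.show()
```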
Quality
By combining the strengths of both mechanical spinning and forward-facing
LiDARs, PandaSet captures the complex variables of urban driving in rich
detail.
It also includes 28 annotation classes for each scene, as well as
37 semantic segmentation labels for the majority of scenes. Because LiDAR
captures phenomena far beyond the reach of traditional cuboid labeling,
PandaSet features Scale’s Point Cloud Segmentation, which enables
high-precision, high-quality annotation of complex objects such as smoke or rain.
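To show what those labels look like in practice, the sketch below inspects the cuboid and point-level annotations of a loaded sequence, reusing the seq object from the previous snippet; the column names (label, class) are assumptions based on the devkit’s documentation.

```python
# Sketch: inspecting annotations on a loaded sequence `seq`
# (see the previous snippet). Column names are assumptions.
cuboids = seq.cuboids[0]                   # DataFrame, one row per 3D box
print(cuboids["label"].unique())           # drawn from the 28 annotation classes

# Point-level semantic segmentation is present for the majority of scenes.
if seq.semseg is not None:
    semseg = seq.semseg[0]                 # one class ID per LiDAR point
    print(semseg["class"].value_counts())  # distribution over the 37 labels
```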
PandaSet also features Scale’s market-leading Sensor Fusion technology,
which lets ML teams blend multiple LiDAR, radar, and camera inputs into a
single point cloud and perform semantic segmentation of different objects
directly in the LiDAR data. By letting teams exploit their LiDAR data far
more systematically, this makes PandaSet well suited to building
high-performance autonomous systems.
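As a rough illustration of what this kind of fusion involves, the sketch below projects world-space LiDAR points into a camera image with a standard pinhole model. It is a generic example, not Scale’s implementation: the extrinsic matrix and the intrinsics are hypothetical stand-ins for the per-frame calibration shipped with each PandaSet camera.

```python
import numpy as np

def project_points(points_xyz, world_to_cam, fx, fy, cx, cy):
    """Project Nx3 world-space LiDAR points into pixel coordinates.

    world_to_cam: 4x4 extrinsic matrix mapping world to camera frame.
    fx, fy, cx, cy: pinhole intrinsics of the target camera.
    Returns (u, v) pixel coordinates and a mask of points in front of
    the image plane.
    """
    # Lift to homogeneous coordinates and move into the camera frame.
    ones = np.ones((points_xyz.shape[0], 1))
    pts_cam = (world_to_cam @ np.hstack([points_xyz, ones]).T).T[:, :3]

    # Only points with positive depth can appear in the image.
    in_front = pts_cam[:, 2] > 0.0

    # Perspective division followed by intrinsic scaling.
    u = fx * pts_cam[:, 0] / pts_cam[:, 2] + cx
    v = fy * pts_cam[:, 1] / pts_cam[:, 2] + cy
    return np.stack([u, v], axis=1), in_front
```

Once points land on pixels, per-pixel labels or colors from the camera can be attached back to each LiDAR point, which is the basic mechanism that makes cross-modal segmentation possible.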
No-cost commercial license
Many existing open-source datasets have restrictive licensing terms that
allow only research or limited commercial uses. While that can be important
for helping ensure data is used appropriately, we wanted to make PandaSet
available to the entire community, democratizing access to the latest LiDAR
technologies for ML teams around the world at a time when the barriers to
data collection are higher.
We all want to accelerate the safe deployment of AVs—and the need for the
right data has never been more pressing. By filling the gap for AI and ML
developers who might otherwise be unable to build and test new technologies,
we hope PandaSet will provide a useful resource for teams building a future
for mobility that is safer and more accessible for everyone.
You can find more information about PandaSet here, with dataset
support tools available on GitHub.