MVLidarNet: Real-Time Multi-Class Scene Understanding for Autonomous Driving Using Multiple Views (2006.05518v2)

Published 9 Jun 2020 in cs.CV and cs.RO

Abstract: Autonomous driving requires the inference of actionable information such as detecting and classifying objects, and determining the drivable space. To this end, we present Multi-View LidarNet (MVLidarNet), a two-stage deep neural network for multi-class object detection and drivable space segmentation using multiple views of a single LiDAR point cloud. The first stage processes the point cloud projected onto a perspective view in order to semantically segment the scene. The second stage then processes the point cloud (along with semantic labels from the first stage) projected onto a bird's eye view, to detect and classify objects. Both stages use an encoder-decoder architecture. We show that our multi-view, multi-stage, multi-class approach is able to detect and classify objects while simultaneously determining the drivable space using a single LiDAR scan as input, in challenging scenes with more than one hundred vehicles and pedestrians at a time. The system operates efficiently at 150 fps on an embedded GPU designed for a self-driving car, including a postprocessing step to maintain identities over time. We show results on both KITTI and a much larger internal dataset, thus demonstrating the method's ability to scale by an order of magnitude.

Citations (32)

Summary

  • The paper presents a dual-stage network that fuses perspective semantic segmentation with bird’s eye view detection for multi-class LiDAR scene understanding.
  • It employs encoder-decoder architectures to achieve competitive KITTI benchmark performance while processing frames at 150 fps on an embedded GPU.
  • The system demonstrates robust detection of pedestrians and vehicles in dense urban scenes, enhancing decision-making in autonomous driving.

MVLidarNet: Advancements in Real-Time Multi-Class Scene Understanding with LiDAR

The paper "MVLidarNet: Real-Time Multi-Class Scene Understanding for Autonomous Driving Using Multiple Views" introduces a dual-stage, deep neural network system designed to enhance multi-class scene comprehension in autonomous vehicles. The system, known as MVLidarNet, leverages multi-view representations of LiDAR data to simultaneously perform object detection, classification, and drivable space determination in real-time, directly addressing key challenges in the field of autonomous driving.

System Overview

MVLidarNet employs a two-stage architecture:

  1. Perspective View Semantic Segmentation: The first stage projects the LiDAR point cloud onto a perspective view and semantically segments the scene, using an encoder-decoder architecture akin to a feature pyramid network (FPN). Segmenting in this view gives a more granular understanding of objects such as pedestrians and cyclists, whose detection benefits from perspective shape information.
  2. Bird's Eye View Object Detection and Classification: The semantic labels from the first stage are reprojected onto a bird's eye view (BEV), along with height information from the LiDAR data, and processed by the second stage. This stage is also an FPN-based encoder-decoder and outputs both class predictions and bounding box parameters. The BEV representation supports spatial reasoning while keeping computational overhead low, since it remains a 2D grid. The sketches following this list illustrate the data flow and the shared encoder-decoder skeleton.
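
The listing below is a minimal sketch of this multi-view data flow, assuming a PyTorch-style implementation. The image resolution, field-of-view bounds, BEV grid size, and extent are illustrative placeholders rather than values from the paper, and `to_range_image` / `to_bev` are hypothetical helper names.

```python
import torch


def to_range_image(points, h=64, w=2048, fov_up=3.0, fov_down=-25.0):
    """Project (N, 4) points [x, y, z, intensity] onto a perspective (range) image."""
    x, y, z, intensity = points.unbind(dim=1)
    r = torch.sqrt(x ** 2 + y ** 2 + z ** 2).clamp(min=1e-6)
    yaw = torch.atan2(y, x)                                   # azimuth angle
    pitch = torch.asin((z / r).clamp(-1.0, 1.0))              # elevation angle
    fov_up_r = torch.deg2rad(torch.tensor(fov_up))
    fov_down_r = torch.deg2rad(torch.tensor(fov_down))
    u = ((1.0 - (yaw / torch.pi + 1.0) / 2.0) * w).long().clamp(0, w - 1)
    v = ((1.0 - (pitch - fov_down_r) / (fov_up_r - fov_down_r)) * h).long().clamp(0, h - 1)
    img = torch.zeros(3, h, w)                                # channels: range, height, intensity
    img[0, v, u] = r                                          # later points overwrite earlier ones
    img[1, v, u] = z
    img[2, v, u] = intensity
    return img, (u, v)


def to_bev(points, sem_logits, uv, grid=512, extent=50.0):
    """Scatter per-point semantic scores and height into an ego-centered BEV grid."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    u, v = uv
    point_sem = sem_logits[:, v, u]                           # (num_classes, N), gathered from the range image
    gx = ((x + extent) / (2 * extent) * grid).long().clamp(0, grid - 1)
    gy = ((y + extent) / (2 * extent) * grid).long().clamp(0, grid - 1)
    bev = torch.zeros(sem_logits.shape[0] + 1, grid, grid)
    bev[:-1, gy, gx] = point_sem                              # semantic channels
    bev[-1, gy, gx] = z                                       # height channel
    return bev


# Illustrative end-to-end flow (networks omitted here; see the next sketch):
#   range_img, uv = to_range_image(points)
#   sem_logits = seg_net(range_img.unsqueeze(0)).squeeze(0)   # (num_classes, h, w)
#   bev_input = to_bev(points, sem_logits, uv)
```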
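A second sketch shows an FPN-style encoder-decoder skeleton of the kind the summary attributes to both stages, with the BEV stage adding class and box-regression heads. Channel widths, the number of semantic classes, and the box parameterization are assumptions made for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvBlock(nn.Module):
    """3x3 convolution + batch norm + ReLU."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)


class EncoderDecoder(nn.Module):
    """Two downsampling steps, then upsampling with skip connections (FPN-like)."""
    def __init__(self, c_in, c_mid=64):
        super().__init__()
        self.enc1 = ConvBlock(c_in, c_mid)
        self.enc2 = ConvBlock(c_mid, c_mid * 2, stride=2)
        self.enc3 = ConvBlock(c_mid * 2, c_mid * 4, stride=2)
        self.dec2 = ConvBlock(c_mid * 4 + c_mid * 2, c_mid * 2)
        self.dec1 = ConvBlock(c_mid * 2 + c_mid, c_mid)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)
        d2 = self.dec2(torch.cat([F.interpolate(e3, scale_factor=2.0), e2], dim=1))
        d1 = self.dec1(torch.cat([F.interpolate(d2, scale_factor=2.0), e1], dim=1))
        return d1


class BEVDetectionHead(nn.Module):
    """Per-cell class scores and box parameters (e.g. center offset, size, heading)."""
    def __init__(self, c_feat=64, num_classes=3, box_params=6):
        super().__init__()
        self.cls = nn.Conv2d(c_feat, num_classes, 1)
        self.box = nn.Conv2d(c_feat, box_params, 1)

    def forward(self, feat):
        return self.cls(feat), self.box(feat)


# Stage 1 segments the range image; stage 2 detects objects on the BEV grid.
seg_net = nn.Sequential(EncoderDecoder(c_in=3), nn.Conv2d(64, 7, 1))  # 7 semantic classes (illustrative)
bev_net = EncoderDecoder(c_in=8)                                      # 7 class channels + 1 height channel
bev_head = BEVDetectionHead(c_feat=64)
```

Reusing the same backbone skeleton for both stages mirrors the paper's stated design, in which each stage is an encoder-decoder and only the output heads differ.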

Experimental Results

MVLidarNet achieves competitive scene-understanding performance at 150 frames per second on an embedded GPU, an order of magnitude faster than many contemporary methods. The paper reports results on both the KITTI benchmark and a much larger internal dataset. Notably, the system remains effective in densely populated urban environments, with scenes containing more than 100 vehicles and pedestrians at a time.

  1. KITTI Benchmark: On the KITTI dataset, MVLidarNet's performance was competitive with state-of-the-art solutions in BEV object detection for cars, with a notable speed advantage (7 ms per frame). The simplified architecture, which avoids the computational complexity intrinsic to voxel-based methods, plays a critical role in achieving this efficiency.
  2. Internal Dataset Evaluation: The internal dataset showcases the system's prowess in detecting vehicles and especially pedestrians, where challenges escalate due to scale and occlusion. The results highlight the efficacy of semantic input in pedestrian detection when combined with height data, reflecting a substantial improvement over systems relying solely on geometric information.

Implications and Future Directions

MVLidarNet's architecture presents substantive implications for real-time perception systems in autonomous driving. The ability to accurately and quickly identify multiple classes of objects and drivable surfaces enhances decision-making capabilities in autonomous systems, particularly in complex scenarios. Future research directions could include:

  • End-to-End Training: Integrating the stages for end-to-end training could potentially enhance system coherence and performance, provided a unified dataset becomes available.
  • Extended Class Support: Expanding the supported object classes could further enrich the semantic map of the environment, aiding comprehensive navigation tasks.
  • Real-Time Adaptations: Further optimizations and architectural adjustments could assist in maintaining accuracy while expanding the operational capabilities to new hardware platforms or increased environmental complexity.

The paper provides a compelling foundation for the acceleration and refinement of perception modules for autonomous vehicles, promising advancements in their real-time operational intelligence.
