
What You See is What You Get: Exploiting Visibility for 3D Object Detection (1912.04986v3)

Published 10 Dec 2019 in cs.CV and cs.RO

Abstract: Recent advances in 3D sensing have created unique challenges for computer vision. One fundamental challenge is finding a good representation for 3D sensor data. Most popular representations (such as PointNet) are proposed in the context of processing truly 3D data (e.g. points sampled from mesh models), ignoring the fact that 3D sensored data such as a LiDAR sweep is in fact 2.5D. We argue that representing 2.5D data as collections of (x, y, z) points fundamentally destroys hidden information about freespace. In this paper, we demonstrate such knowledge can be efficiently recovered through 3D raycasting and readily incorporated into batch-based gradient learning. We describe a simple approach to augmenting voxel-based networks with visibility: we add a voxelized visibility map as an additional input stream. In addition, we show that visibility can be combined with two crucial modifications common to state-of-the-art 3D detectors: synthetic data augmentation of virtual objects and temporal aggregation of LiDAR sweeps over multiple time frames. On the NuScenes 3D detection benchmark, we show that, by adding an additional stream for visibility input, we can significantly improve the overall detection accuracy of a state-of-the-art 3D detector.

Authors (4)
  1. Peiyun Hu (13 papers)
  2. Jason Ziglar (4 papers)
  3. David Held (81 papers)
  4. Deva Ramanan (152 papers)
Citations (114)

Summary

Exploiting Visibility in 3D Object Detection

The paper "What You See is What You Get: Exploiting Visibility for 3D Object Detection" addresses challenges associated with processing 3D sensor data, specifically from LiDAR, for effective 3D object detection tasks. This paper contributes by harnessing visibility information intrinsic to 2.5D data for improving detection frameworks.

Key Contributions

  • Visibility Representation: The authors reconceptualize LiDAR data as 2.5D, emphasizing the importance of visibility due to occlusion effects. Traditional representations treat LiDAR returns as unordered (x, y, z) point clouds, as if sampled from full 3D models, and thereby discard the visibility and freespace constraints encoded in the sensing process, which can be crucial for applications like autonomous navigation.
  • Integration with Voxel-based Networks: The paper proposes augmenting voxel-based networks with a voxelized visibility map as an additional input stream. The authors compute visibility maps via raycasting and use them alongside synthetic data augmentation and temporal aggregation to enhance 3D detection pipelines (a minimal fusion sketch follows this list).
  • Detection Framework Enhancements: Through three main innovations — raycasting for visibility computation, augmenting input streams with visibility data, and combining visibility with virtual object augmentation and temporal aggregation — the paper demonstrates significant improvements in object detection accuracy on the NuScenes dataset.
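
The fusion itself can be as simple as concatenating the visibility volume with the detector's bird's-eye-view feature map. The following PyTorch-style sketch is illustrative only: it assumes a PointPillars-like encoder producing a (B, C, H, W) pseudo-image, and the class name, tensor shapes, and 1x1 fusion convolution are assumptions, not the paper's actual code.

```python
import torch
import torch.nn as nn

class VisibilityAugmentedInput(nn.Module):
    """Fuse a voxelized visibility map with BEV voxel/pillar features.

    Hypothetical sketch: `bev_features` is the (B, C, H, W) pseudo-image from
    a PointPillars-style encoder; `visibility` is a (B, Z, H, W) volume whose
    entries encode free / occupied / unknown evidence per voxel column.
    """

    def __init__(self, feat_channels: int, vis_channels: int, out_channels: int):
        super().__init__()
        # A 1x1 convolution mixes the concatenated streams before the
        # detection backbone; the paper's exact fusion layer may differ.
        self.fuse = nn.Conv2d(feat_channels + vis_channels, out_channels, kernel_size=1)

    def forward(self, bev_features: torch.Tensor, visibility: torch.Tensor) -> torch.Tensor:
        # Concatenate along the channel dimension: each BEV cell now carries
        # both learned point features and raycast visibility evidence.
        x = torch.cat([bev_features, visibility], dim=1)
        return self.fuse(x)
```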

Technical Approach

  1. Visibility through Raycasting: LiDAR sensing is inherently a raycasting process: each return determines line-of-sight occlusion along its ray. By computationally replicating this process, visibility can be recovered efficiently as a volumetric map that labels each voxel as occupied, free, or unknown (see the raycasting sketch after this list).
  2. Augmented Object Handling: To keep virtual object placement physically plausible in augmented datasets, the paper proposes "culling" (removing virtual objects that end up occluded) and "drilling" (removing original scene points that occlude them), effectively leveraging visibility for realistic data augmentation.
  3. Temporal Aggregation: By extending the visibility representation over time, the authors adopt a strategy akin to online occupancy mapping, aggregating raycast evidence from LiDAR sweeps across multiple frames and maintaining per-voxel occupancy probabilities (a log-odds-style update is sketched below).
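
To make the raycasting step concrete, here is a simplified, sampling-based sketch in NumPy. The paper uses an efficient exact voxel-traversal raycast; the fixed-step marching, function name, and state encoding below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

UNKNOWN, FREE, OCCUPIED = 0, 1, 2

def visibility_volume(points, origin, voxel_size, grid_min, grid_shape, step=0.05):
    """Approximate per-voxel visibility by marching along each LiDAR ray.

    points: (N, 3) LiDAR returns; origin: (3,) sensor origin.
    Returns an integer grid labeling voxels UNKNOWN / FREE / OCCUPIED.
    Note: fixed-step sampling is a simplified stand-in for exact voxel traversal.
    """
    grid = np.full(grid_shape, UNKNOWN, dtype=np.uint8)

    def to_index(p):
        idx = np.floor((p - grid_min) / voxel_size).astype(int)
        if np.all(idx >= 0) and np.all(idx < grid_shape):
            return tuple(idx)
        return None  # point falls outside the voxel grid

    for p in points:
        direction = (p - origin).astype(float)
        length = np.linalg.norm(direction)
        if length == 0:
            continue
        direction /= length
        # March from the sensor toward the return, marking traversed voxels free.
        for t in np.arange(0.0, length, step):
            idx = to_index(origin + t * direction)
            if idx is not None and grid[idx] != OCCUPIED:
                grid[idx] = FREE
        # The voxel containing the return itself is occupied.
        idx = to_index(p)
        if idx is not None:
            grid[idx] = OCCUPIED
    return grid
```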
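For temporal aggregation, an online occupancy-mapping style update accumulates per-voxel evidence across sweeps. The sketch below uses a standard log-odds update with clipping; the increment and clamp values are illustrative defaults, not the paper's settings.

```python
import numpy as np

def update_occupancy(log_odds, free_mask, occupied_mask,
                     l_free=-0.4, l_occ=0.85, l_min=-4.0, l_max=4.0):
    """One online occupancy-mapping update from a single sweep's raycast.

    log_odds: per-voxel log-odds of occupancy accumulated so far.
    free_mask / occupied_mask: boolean grids from the current sweep.
    """
    log_odds = log_odds + l_free * free_mask + l_occ * occupied_mask
    return np.clip(log_odds, l_min, l_max)

def occupancy_probability(log_odds):
    # Convert accumulated log-odds back to a probability in [0, 1].
    return 1.0 / (1.0 + np.exp(-log_odds))
```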

Results

The experimental results demonstrate notable accuracy improvements across multiple object classes on the NuScenes benchmark. The gains are most pronounced for objects observed under varied visibility conditions. The paper reports a higher mean Average Precision (mAP) when incorporating the visibility representation than the baseline PointPillars detector.

Implications and Future Directions

The paper's findings have practical implications for autonomous vehicle perception, where understanding freespace and recognizing occlusions are critical. They also suggest future directions in which visibility can be further exploited, such as integration with SLAM frameworks or visibility-aware 3D data processing for real-time navigation. Additionally, the work points to ways of adapting and refining data augmentation techniques to better preserve geometric consistency in training data, potentially improving the robustness and generalizability of models in complex environments.

Conclusion

By articulating LiDAR data's unique 2.5D characteristics and integrating visibility effectively into neural detection frameworks, the paper demonstrates significant improvements in 3D object detection capability. Extending this domain knowledge could deepen the theoretical understanding of depth data representation and foster advances across AI applications that require spatial awareness and perception.
