- The paper introduces a two-stage framework that reconstructs visible structures from 2D images before completing and labeling occluded regions in 3D space.
- The method achieves relative improvements of 20.0% in geometric completion and 18.1% in semantic accuracy over the prior camera-based state of the art on the SemanticKITTI dataset.
- The approach leverages sparse voxel representations and deformable attention to improve efficiency and reduce computational demands.
The paper introduces VoxFormer, a novel framework for camera-based 3D semantic scene completion (SSC). VoxFormer uses a sparse voxel transformer to infer complete volumetric semantics from 2D images alone, addressing a core challenge in autonomous vehicle (AV) perception: building a holistic 3D understanding of the environment from cameras.
Framework Design
VoxFormer employs a two-stage architecture:
- Class-Agnostic Query Proposal: The first stage uses depth estimation to propose a sparse set of visible, occupied voxels as queries (a minimal sketch of this step follows the list). The rationale is that only visible structures should drive the initial feature extraction; lifting every 2D feature into 3D along its camera ray introduces depth ambiguity, which the sparse proposals avoid.
- Class-Specific Segmentation: The second stage follows a masked-autoencoder-like design. The proposed queries are first enriched with image features via deformable cross-attention; the remaining voxels are initialized with a learnable mask token, and self-attention then propagates information from the sparse queries to all voxels, completing the scene geometrically and semantically.
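The query-proposal step can be pictured as back-projecting a depth map into 3D and keeping the occupied cells. Below is a minimal NumPy sketch under simplifying assumptions: points stay in camera coordinates (the camera-to-scene transform is omitted), the grid extents and all names are illustrative rather than taken from the released code, and the paper additionally corrects this depth-derived occupancy with a lightweight occupancy network before selecting queries.

```python
import numpy as np

def propose_voxel_queries(depth, K, voxel_size=0.2, grid_min=(-25.6, -2.0, 0.0),
                          grid_shape=(256, 32, 256)):
    """Back-project a depth map through the pinhole model, voxelize the
    resulting points, and return occupied voxel indices as sparse queries."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.ravel()
    valid = z > 0
    # Pixel -> camera coordinates (camera-to-scene transform omitted here).
    x = (u.ravel() - K[0, 2]) * z / K[0, 0]
    y = (v.ravel() - K[1, 2]) * z / K[1, 1]
    pts = np.stack([x, y, z], axis=1)[valid]
    # Voxelize: snap points to integer grid cells, keep those inside the grid.
    idx = np.floor((pts - np.array(grid_min)) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    return np.unique(idx[inside], axis=0)  # (M, 3) visible, occupied voxels
```

Only these proposed voxels become queries; everything else in the volume is left for the second stage to complete.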
Key Ideas and Contributions
- Reconstruction-before-Hallucination: Reconstruct the visible scene first, then infer the occluded parts. Grounding completion in reliably observed geometry improves accuracy over hallucinating the whole scene at once.
- Sparsity-in-3D Space: By operating on sparse rather than dense voxel representations, VoxFormer is efficient and scalable, cutting computation and memory so that training fits in under 16 GB of GPU memory.
- Innovative Use of Transformers: Adopting a masked-autoencoder-style transformer enables effective information propagation from observed voxels to the full scene representation (a simplified sketch follows this list).
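To make the MAE-like propagation concrete, here is a hedged PyTorch sketch: features of the image-refined sparse queries are scattered into a dense grid whose remaining cells hold a learnable mask token, and attention then spreads information to every voxel. The module name, the single attention layer, and the use of vanilla `nn.MultiheadAttention` (the paper uses deformable attention) are simplifications, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SparseToDenseCompletion(nn.Module):
    """Scatter sparse query features into a dense voxel grid initialized
    with a learnable mask token, then let self-attention propagate
    information from observed voxels to all voxels."""
    def __init__(self, dim=128, heads=8, num_voxels=4096):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.num_voxels = num_voxels

    def forward(self, query_feats, query_idx):
        # query_feats: (B, M, dim) image-refined features of proposed voxels
        # query_idx:   (B, M) int64 flat indices of those voxels in the grid
        B, M, dim = query_feats.shape
        dense = self.mask_token.expand(B, self.num_voxels, dim).clone()
        dense.scatter_(1, query_idx.unsqueeze(-1).expand(-1, -1, dim), query_feats)
        out, _ = self.attn(dense, dense, dense)  # every voxel attends to all others
        return self.norm(dense + out)            # (B, num_voxels, dim)
```

In the full model, this dense feature volume would then be upsampled and passed through a per-voxel classification head to produce the semantic scene.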
Experimental Evaluation
VoxFormer was evaluated on the SemanticKITTI dataset, where it significantly improves both geometry and semantics over existing camera-based methods: a 20.0% relative improvement in geometric completion and an 18.1% relative improvement in semantic accuracy over the previous state of the art, MonoScene. This not only surpasses existing camera-based SSC systems but also narrows the gap to methods that use LiDAR data.
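Note that these gains are relative, not absolute percentage points. A minimal illustration of the arithmetic with hypothetical scores (not the paper's exact numbers):

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative (not absolute) improvement of a score over a baseline, in %."""
    return 100.0 * (new - baseline) / baseline

# Hypothetical IoU-style scores, chosen only to illustrate the arithmetic:
print(relative_improvement(42.0, 35.0))  # 20.0 -> "a 20.0% relative improvement"
```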
Implications and Future Directions
VoxFormer’s ability to generate accurate 3D semantic scenes from 2D images marks a substantial advance for AV systems, reducing reliance on expensive, resource-heavy LiDAR sensors. The efficiency of sparse voxel representations and the adaptation of transformer architectures hold promise for real-time perception systems more broadly.
Practically, this approach can improve safety and decision-making in AVs through richer scene understanding, especially in safety-critical scenarios. Theoretically, VoxFormer pushes the boundaries of deep learning for spatial reasoning and understanding. Future research might explore more refined depth estimation or cooperative perception across multiple cameras to further improve SSC accuracy.
VoxFormer-T already demonstrates the value of temporal information by fusing image features from previous frames (a hedged sketch of this idea follows); future work could investigate more advanced temporal reasoning to keep improving semantic segmentation performance, especially in long-range scenarios.
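As an illustration of temporal aggregation in the spirit of VoxFormer-T, the sketch below projects voxel queries into feature maps from several timesteps and averages the sampled features. The projection function, the simple averaging, and all names are assumptions made for illustration; the paper instead applies deformable attention over multi-frame features.

```python
import torch
import torch.nn.functional as F

def aggregate_temporal_features(query_xyz, frame_feats, project_fn):
    """query_xyz: (M, 3) voxel centers in the current frame's coordinates;
    frame_feats: list of (C, H, W) feature maps, one per timestep;
    project_fn(xyz, t) -> (M, 2) normalized [-1, 1] image coords in frame t,
    assumed to account for camera intrinsics and egomotion."""
    samples = []
    for t, feats in enumerate(frame_feats):
        grid = project_fn(query_xyz, t).view(1, -1, 1, 2)    # grid_sample layout
        sampled = F.grid_sample(feats.unsqueeze(0), grid,
                                align_corners=False)          # (1, C, M, 1)
        samples.append(sampled.squeeze(-1).squeeze(0).t())    # (M, C)
    return torch.stack(samples).mean(dim=0)                   # (M, C) fused
```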
In conclusion, VoxFormer represents a significant step forward in using camera-based systems for 3D semantic scene completion, providing a robust and scalable framework that aligns well with ongoing advancements in computer vision and autonomous vehicle technology.