Fully Sparse 3D Occupancy Prediction (2312.17118v5)

Published 28 Dec 2023 in cs.CV

Abstract: Occupancy prediction plays a pivotal role in autonomous driving. Previous methods typically construct dense 3D volumes, neglecting the inherent sparsity of the scene and suffering from high computational costs. To bridge the gap, we introduce a novel fully sparse occupancy network, termed SparseOcc. SparseOcc initially reconstructs a sparse 3D representation from camera-only inputs and subsequently predicts semantic/instance occupancy from the 3D sparse representation by sparse queries. A mask-guided sparse sampling is designed to enable sparse queries to interact with 2D features in a fully sparse manner, thereby circumventing costly dense features or global attention. Additionally, we design a thoughtful ray-based evaluation metric, namely RayIoU, to solve the inconsistency penalty along the depth axis raised in traditional voxel-level mIoU criteria. SparseOcc demonstrates its effectiveness by achieving a RayIoU of 34.0, while maintaining a real-time inference speed of 17.3 FPS, with 7 history frames inputs. By incorporating more preceding frames to 15, SparseOcc continuously improves its performance to 35.1 RayIoU without bells and whistles.

Summary

The paper introduces SparseOcc, a method that leverages scene sparsity to reduce computational overhead in 3D occupancy prediction.
It employs a sparse voxel decoder and a mask transformer for efficient parsing and semantic segmentation of 3D data.
The approach reports a RayIoU of 34.0 at 17.3 FPS with 7 history frames, rising to 35.1 with 15 frames, which makes it a candidate for real-time autonomous driving.

Analysis of "Fully Sparse 3D Occupancy Prediction"

The paper "Fully Sparse 3D Occupancy Prediction" proposes SparseOcc, a camera-only occupancy network for autonomous driving that exploits the sparsity of 3D scenes. It targets the main inefficiency of earlier 3D occupancy prediction methods: they build dense 3D volumes, which incurs high computational cost even though most of the volume is empty.

Introduction and Key Contributions

Earlier 3D occupancy prediction methods typically represent the scene as a dense volumetric grid, which ignores the inherent sparsity of driving scenes, where most of the space is empty. SparseOcc instead uses a fully sparse architecture that reduces computation by processing only non-empty voxels. It combines a sparse voxel decoder with a mask transformer, and the authors additionally introduce RayIoU, a ray-based evaluation metric designed to avoid the inconsistent penalty along the depth axis that voxel-level mIoU imposes.

SparseOcc reaches a RayIoU of 34.0 at a real-time inference speed of 17.3 FPS using 7 history frames. Performance continues to improve with more temporal context, reaching 35.1 RayIoU with 15 history frames.

SparseOcc Components

SparseOcc comprises two primary components:

Sparse Voxel Decoder: This module reconstructs the scene geometry coarse-to-fine, following the actual geometric sparsity of the scene. It uses transformer-based operations that avoid the redundant computation of dense voxel processing. At each stage it queries the current voxels, estimates their occupancy probabilities, and prunes those judged empty, so that computation is spent only on non-empty space.
Mask Transformer: Given the sparse geometry, this component uses sparse queries to predict occupancy masks and class labels. It introduces a mask-guided sparse sampling mechanism that lets the queries interact with 2D image features without dense cross-attention or global attention, which keeps semantic and instance-level parsing of the scene efficient.
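
The coarse-to-fine pruning idea behind the sparse voxel decoder can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `refine_sparse_voxels`, the `keep_ratio` parameter, the top-k pruning rule, and the toy scoring function are all assumptions made for illustration, and a real model would score voxels with a learned occupancy head.

```python
import numpy as np

def refine_sparse_voxels(coords, score_fn, keep_ratio=0.5):
    """One coarse-to-fine step: split each voxel into 8 children, keep the best.

    coords   : (N, 3) integer voxel indices at the current resolution
    score_fn : hypothetical callable mapping (M, 3) child indices to (M,) scores
    """
    # The 8 child offsets of a voxel when resolution doubles along each axis.
    offsets = np.array([[i, j, k] for i in (0, 1) for j in (0, 1) for k in (0, 1)])
    children = (coords[:, None, :] * 2 + offsets[None, :, :]).reshape(-1, 3)
    scores = score_fn(children)
    # Prune: keep only the top-scoring fraction of children (assumed rule).
    k = max(1, int(len(children) * keep_ratio))
    keep = np.argsort(-scores)[:k]
    return children[keep]

def toy_score(children):
    # Toy stand-in for a learned occupancy head: lower voxels score higher.
    return -children[:, 2].astype(float)

# Start from a dense 4 x 4 x 2 coarse grid and refine twice.
coarse = np.array([[x, y, z] for x in range(4) for y in range(4) for z in range(2)])
level1 = refine_sparse_voxels(coarse, toy_score)
level2 = refine_sparse_voxels(level1, toy_score)
dense_count = 16 * 16 * 8  # size of the equivalent dense grid at the finest level
print(len(level2), "sparse voxels vs", dense_count, "dense voxels")
```

In this toy setting the finest level keeps 512 voxels where a dense grid would hold 2048, because pruned voxels are never subdivided again.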

RayIoU: A Novel Evaluation Metric

Voxel-level mIoU penalizes predictions inconsistently along the depth axis, which prompted the authors to propose RayIoU. RayIoU casts query rays through the predicted and ground-truth occupancy in a way that mimics a LiDAR sensor and compares the two at the first occupied voxel each ray hits. A prediction is therefore judged by the surface it places along each ray rather than voxel by voxel, which gives a more realistic assessment.
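
A ray-level IoU of this kind can be sketched as follows, assuming ray casting has already produced, for each query ray, the class and depth of the first occupied voxel in both the prediction and the ground truth. The function name `ray_iou`, the default depth thresholds of 1, 2 and 4 m, and the exact false-positive and false-negative bookkeeping are assumptions for illustration and may differ from the official evaluation code.

```python
import numpy as np

def ray_iou(pred_cls, pred_depth, gt_cls, gt_depth, num_classes,
            thresholds=(1.0, 2.0, 4.0)):
    """Mean IoU over classes and depth thresholds, computed per ray.

    Each array has one entry per query ray: the class and depth of the
    first occupied voxel the ray hits. Thresholds (metres) are assumed.
    """
    per_threshold = []
    for t in thresholds:
        # A ray can only be a true positive if its depth error is below t.
        depth_ok = np.abs(pred_depth - gt_depth) < t
        class_ious = []
        for c in range(num_classes):
            tp = np.sum((pred_cls == c) & (gt_cls == c) & depth_ok)
            fp = np.sum(pred_cls == c) - tp
            fn = np.sum(gt_cls == c) - tp
            denom = tp + fp + fn
            if denom > 0:  # skip classes absent from both pred and gt
                class_ious.append(tp / denom)
        per_threshold.append(np.mean(class_ious) if class_ious else 0.0)
    return float(np.mean(per_threshold))

# Four toy rays, two classes.
pred_cls = np.array([0, 0, 1, 1])
gt_cls = np.array([0, 0, 1, 0])
pred_depth = np.array([10.0, 12.5, 20.0, 30.0])
gt_depth = np.array([10.5, 10.0, 20.0, 30.0])
score = ray_iou(pred_cls, pred_depth, gt_cls, gt_depth, num_classes=2)
print(round(score, 4))
```

The second ray has the correct class but a 2.5 m depth error, so it counts as a true positive only under the loosest threshold. This shows how the metric penalizes a misplaced surface once per ray, not once per voxel along the depth axis.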

Experimental Validation and Implications

SparseOcc is evaluated on the Occ3D-nus dataset, where it is reported to combine competitive accuracy with high computational efficiency. Both properties matter for real-time use in autonomous vehicles, where compute is limited and decisions must be made quickly. The experiments benchmark SparseOcc against other state-of-the-art methods, and the reported results are obtained, in the authors' words, "without bells and whistles".

Conclusions and Future Directions

SparseOcc shows that exploiting spatial sparsity can make occupancy prediction considerably more efficient, and it offers a template for future sparse designs. RayIoU may also change how 3D occupancy models are evaluated, moving toward metrics that better reflect downstream use.

Possible future work includes extending the SparseOcc framework to richer temporal data and integrating other sensor modalities to further refine 3D scene understanding in autonomous systems. Overall, SparseOcc is a useful step toward more efficient perception for autonomous navigation.

GitHub

GitHub - MCG-NJU/SparseOcc: Fully Sparse 3D Occupancy Prediction & RayIoU Evaluation Metric (113 stars)