- The paper presents a novel 3D semantic Gaussian representation that significantly reduces memory and computational overhead compared to voxel-based methods.
- The paper develops the GaussianFormer model, leveraging sparse convolutions and cross-attention to transform multi-view 2D images into detailed 3D semantic maps.
- The paper validates its approach on nuScenes and KITTI-360, achieving competitive IoU scores while reducing memory consumption by up to 82%.
GaussianFormer: A Novel Approach to Vision-Based 3D Semantic Occupancy Prediction
The paper “GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction,” by Yuanhui Huang et al., proposes a new methodology for 3D semantic occupancy prediction that leverages sparse Gaussian representations and attention mechanisms to translate multi-view 2D images into 3D semantic maps. The authors introduce an object-centric representation termed "3D semantic Gaussians," which captures fine-grained 3D scene structure while significantly reducing computational and memory overhead compared to traditional voxel-based methods.
Key Contributions
- Object-Centric 3D Gaussian Representation: The authors represent a 3D scene sparsely as a set of 3D Gaussians, each describing a flexible region of interest through its mean, covariance, and semantic features (a minimal sketch of the representation and the splatting step follows this list). This addresses the inefficiency of dense voxel grids, which ignore the sparsity of actual 3D occupancy and the variability of object scales.
- GaussianFormer Model: GaussianFormer transforms multi-view 2D images into these 3D semantic Gaussians. The model uses sparse convolution for self-encoding interactions among the Gaussians, cross-attention to integrate visual information from the images, and iterative refinement of the Gaussian properties (a schematic refinement block is also sketched after this list).
- Gaussian-to-Voxel Splatting Module: To generate dense 3D occupancy maps, the authors design an efficient Gaussian-to-voxel splatting module. Exploiting the locality of Gaussian distributions, it aggregates only the Gaussians neighboring each voxel, so the sparse set of Gaussians can be converted into dense occupancy predictions efficiently.
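To make the representation and the splatting step concrete, below is a minimal NumPy sketch of 3D semantic Gaussians (mean, scale, rotation, semantic logits) and of Gaussian-to-voxel splatting by local aggregation. It illustrates the idea only; the class and function names, the neighborhood radius, and the plain Python loop are assumptions made for this example, whereas the paper performs the local aggregation with an efficient GPU implementation.

```python
# Minimal sketch: 3D semantic Gaussians and Gaussian-to-voxel splatting by
# local aggregation. Illustrative only -- not the authors' implementation.
import numpy as np

class SemanticGaussians:
    """P Gaussians, each with a mean, per-axis scale, rotation, and semantic logits."""
    def __init__(self, means, scales, rotations, logits):
        self.means = means          # (P, 3) Gaussian centers in world coordinates
        self.scales = scales        # (P, 3) per-axis extents
        self.rotations = rotations  # (P, 4) unit quaternions (w, x, y, z)
        self.logits = logits        # (P, C) semantic logits per Gaussian

    def covariances(self):
        """Sigma = R S S^T R^T, built from rotation and scale."""
        w, x, y, z = self.rotations.T
        R = np.stack([
            np.stack([1 - 2*(y**2 + z**2), 2*(x*y - w*z),       2*(x*z + w*y)],       axis=-1),
            np.stack([2*(x*y + w*z),       1 - 2*(x**2 + z**2), 2*(y*z - w*x)],       axis=-1),
            np.stack([2*(x*z - w*y),       2*(y*z + w*x),       1 - 2*(x**2 + y**2)], axis=-1),
        ], axis=-2)                                   # (P, 3, 3) rotation matrices
        S = self.scales[:, :, None] * np.eye(3)       # (P, 3, 3) diagonal scale matrices
        RS = R @ S
        return RS @ RS.transpose(0, 2, 1)             # (P, 3, 3) covariances

def splat_to_voxels(g, voxel_centers, radius=3.0):
    """Accumulate semantic contributions of nearby Gaussians at each voxel center."""
    P, C = g.logits.shape
    occupancy = np.zeros((voxel_centers.shape[0], C))
    cov_inv = np.linalg.inv(g.covariances() + 1e-6 * np.eye(3))  # (P, 3, 3)
    for i in range(P):  # a sketch of the local aggregation; not an efficient kernel
        d = voxel_centers - g.means[i]                # (V, 3) offsets to this Gaussian
        near = np.linalg.norm(d, axis=-1) < radius    # locality: only nearby voxels contribute
        dn = d[near]
        # Gaussian density exp(-1/2 d^T Sigma^{-1} d) weights this Gaussian's semantics
        w = np.exp(-0.5 * np.einsum('vi,ij,vj->v', dn, cov_inv[i], dn))
        occupancy[near] += w[:, None] * g.logits[i]
    return occupancy.argmax(axis=-1)                  # per-voxel semantic label
```

The key design choice is locality: each voxel only accumulates contributions from Gaussians whose centers lie nearby, so the cost scales with the number of Gaussians rather than with a dense voxel grid.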
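The GaussianFormer model itself alternates three operations per block: self-interaction among the Gaussian queries, cross-attention to the multi-view image features, and an update of the Gaussian properties. The PyTorch sketch below shows that structure only; standard multi-head attention stands in for the paper's sparse convolution and image cross-attention, and the property-update head, class count, and residual update rule are illustrative assumptions.

```python
# Schematic sketch of one GaussianFormer-style refinement block (structure only).
import torch
import torch.nn as nn

class GaussianRefinementBlock(nn.Module):
    def __init__(self, dim=128, num_classes=18, heads=8):
        super().__init__()
        # Stand-ins: plain multi-head attention instead of sparse convolution
        # (self-encoding) and instead of the paper's image cross-attention.
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # Decodes each refined query into Gaussian property updates:
        # 3 (mean offset) + 3 (scale) + 4 (rotation) + num_classes (semantics).
        self.update = nn.Linear(dim, 3 + 3 + 4 + num_classes)

    def forward(self, queries, gaussian_props, image_feats):
        # queries:        (B, P, dim) one feature vector per Gaussian
        # gaussian_props: (B, P, 10 + num_classes) current mean/scale/rotation/logits
        # image_feats:    (B, N, dim) flattened multi-view image features
        q = self.norm1(queries + self.self_attn(queries, queries, queries)[0])
        q = self.norm2(q + self.cross_attn(q, image_feats, image_feats)[0])
        delta = self.update(q)
        # Simple residual update of all properties; the actual update rule in the
        # paper is more structured, this just conveys the iterative refinement idea.
        return q, gaussian_props + delta
```

Stacking several such blocks and decoding the final properties yields the 3D semantic Gaussians that are handed to the splatting module.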
Experimental Validation
The proposed methods were evaluated on the nuScenes and KITTI-360 datasets, achieving strong performance on 3D semantic occupancy prediction. Notably, GaussianFormer outperforms methods based on planar representations, such as BEVFormer and TPVFormer, and performs comparably to voxel-based approaches like OccFormer and SurroundOcc while dramatically reducing memory consumption (by 75.2% to 82.2%).
Quantitative Results on nuScenes:
- GaussianFormer achieves a semantic scene completion mean IoU (SSC-mIoU) of 19.10% and a scene completion IoU (SC-IoU) of 29.83%.
- For comparison, OccFormer reports an SSC-mIoU of 19.03% and SurroundOcc 20.30%. GaussianFormer thus matches or closely approaches these voxel-based methods while consuming only a fraction of their memory.
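For readers unfamiliar with these metrics, the sketch below shows how per-class IoU and mIoU are conventionally computed on voxelized predictions; actual benchmark code additionally handles ignore labels and visibility masks, which are omitted here.

```python
# Conventional per-class IoU / mIoU on voxel label grids (simplified).
import numpy as np

def iou_per_class(pred, gt, num_classes):
    """pred, gt: integer label arrays of the same shape (e.g. flattened voxel grids)."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        ious.append(inter / union if union > 0 else np.nan)
    return np.array(ious)

# SSC-mIoU averages the per-class IoUs over the semantic classes:
#   miou = np.nanmean(iou_per_class(pred, gt, num_classes))
# SC-IoU is the IoU of the binary occupied-vs-empty prediction.
```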
Quantitative Results on KITTI-360:
- GaussianFormer maintains competitive performance, with particular strength in predicting fine details of smaller and more general categories, because the Gaussian covariances adapt flexibly to object shapes.
Implications and Future Directions
The implications of this research are twofold:
- Practical Implications: The notable reduction in memory consumption positions GaussianFormer as a highly viable candidate for deployment in real-world applications such as autonomous driving, where computational resources are often constrained. The efficient representation and prediction method foster real-time applicability while maintaining high accuracy.
- Theoretical Implications: The introduction of 3D semantic Gaussians suggests a shift from grid-centric to object-centric paradigms in scene representation and opens avenues for further optimizing sparse representations. It encourages a reconsideration of how three-dimensional scenes are modeled in computer vision, underscoring the potential of adaptive, flexible feature representations.
Conclusion
This paper presents an innovative approach to 3D semantic occupancy prediction through Gaussian representations, addressing long-standing inefficiencies of traditional methods. By integrating sparse convolutions, cross-attention, and a novel splatting module, GaussianFormer establishes a framework that is both efficient and effective. The reported results on benchmark datasets support its potential for practical deployment and further theoretical development. Future research might optimize the interaction mechanisms within sparse representations and explore broader applications where this methodology could be beneficial.