
OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction (2304.05316v1)

Published 11 Apr 2023 in cs.CV

Abstract: The vision-based perception for autonomous driving has undergone a transformation from the bird-eye-view (BEV) representations to the 3D semantic occupancy. Compared with the BEV planes, the 3D semantic occupancy further provides structural information along the vertical direction. This paper presents OccFormer, a dual-path transformer network to effectively process the 3D volume for semantic occupancy prediction. OccFormer achieves a long-range, dynamic, and efficient encoding of the camera-generated 3D voxel features. It is obtained by decomposing the heavy 3D processing into the local and global transformer pathways along the horizontal plane. For the occupancy decoder, we adapt the vanilla Mask2Former for 3D semantic occupancy by proposing preserve-pooling and class-guided sampling, which notably mitigate the sparsity and class imbalance. Experimental results demonstrate that OccFormer significantly outperforms existing methods for semantic scene completion on SemanticKITTI dataset and for LiDAR semantic segmentation on nuScenes dataset. Code is available at \url{https://github.com/zhangyp15/OccFormer}.

Citations (132)

Summary

  • The paper introduces OccFormer, a dual-path transformer that splits 3D occupancy prediction into local and global pathways to boost computational efficiency.
  • It demonstrates superior performance on SemanticKITTI and nuScenes, achieving a 1.06% mIoU boost in semantic scene completion over baselines.
  • The research validates the practicality of transformer models in cost-effective, vision-based autonomous driving, paving the way for real-time applications.

Dual-path Transformers for 3D Semantic Occupancy Prediction in Autonomous Driving: OccFormer

The paper introduces OccFormer, a novel architecture aimed at advancing vision-based 3D semantic occupancy prediction for autonomous driving. The proposed dual-path transformer network addresses the challenges inherent in processing 3D voxel data, leveraging the transformer architecture to efficiently encode the spatial and semantic information derived from camera inputs, which are less expensive and more versatile than LiDAR sensors.

Methodological Innovations

OccFormer incorporates a dual-path transformer encoder designed to improve both the efficiency and the efficacy of 3D semantic occupancy prediction. The dual-path mechanism decomposes the computationally intensive 3D processing into manageable local and global pathways along the horizontal plane, preserving fine-grained semantic structure while retaining global context.

The paper describes the dual-path transformer block as a hybrid structure that shares windowed attention weights between the local and global paths. The local path targets fine-grained semantic structures within each bird's-eye-view (BEV) slice, while the global path captures scene-level layout through BEV features collapsed along the height axis. Outputs from the two pathways are adaptively combined to enhance the feature representation, as the sketch below illustrates. These architectural choices underscore the potential for transformers to outperform classic 3D convolutional networks in both parameter efficiency and computational cost.
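The following PyTorch sketch illustrates the decomposition idea only. The tensor layout (B, H, W, Z, C), the `DualPathBlock` name, the plain (non-windowed) attention, and the sigmoid-gated fusion are illustrative assumptions, not the authors' implementation; the real windowed-attention layers live in the linked repository.

```python
# A minimal sketch of the dual-path idea, not the authors' exact code.
import torch
import torch.nn as nn


class DualPathBlock(nn.Module):
    """Decompose 3D processing into a local and a global pathway
    along the horizontal (BEV) plane, then fuse them adaptively."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # One attention module used by both pathways, mirroring the
        # paper's shared windowed attention (windowing omitted here
        # for brevity; the paper attends within local windows).
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Hypothetical sigmoid gate standing in for "adaptive fusion".
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, Z, C) camera-generated voxel features.
        B, H, W, Z, C = x.shape

        # Local pathway: attend within each horizontal (BEV) slice,
        # folding the height index z into the batch dimension.
        local = x.permute(0, 3, 1, 2, 4).reshape(B * Z, H * W, C)
        local, _ = self.attn(local, local, local)
        local = local.reshape(B, Z, H, W, C).permute(0, 2, 3, 1, 4)

        # Global pathway: collapse the height axis to a BEV map,
        # attend over the whole plane, then broadcast back along z.
        g = x.mean(dim=3).reshape(B, H * W, C)
        g, _ = self.attn(g, g, g)
        g = g.reshape(B, H, W, 1, C).expand(-1, -1, -1, Z, -1)

        # Adaptively combine the two pathway outputs.
        w = self.gate(torch.cat([local, g], dim=-1))
        return w * local + (1.0 - w) * g
```

Sharing the attention weights between the two pathways keeps the parameter count close to that of a single-path block while still exposing the model to both slice-level detail and scene-level layout.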

OccFormer also adapts the transformer-based Mask2Former decoder for 3D occupancy. Notable modifications include preserve-pooling and class-guided sampling, which tackle the sparsity and class imbalance commonly encountered in 3D scene data.
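At a high level, preserve-pooling downsamples the decoder's attention masks with max pooling so that sparse occupied voxels are not averaged away, and class-guided sampling biases the supervised points toward rare classes. Below is a hedged sketch of both ideas; the function names and the inverse-frequency weighting are assumptions rather than the paper's exact formulation:

```python
import torch
import torch.nn.functional as F


def preserve_pool(mask: torch.Tensor, stride: int = 2) -> torch.Tensor:
    """Downsample a soft occupancy/attention mask with max pooling so
    sparse occupied voxels survive, instead of being diluted by
    average- or bilinear-style downsampling.

    mask: (B, 1, H, W, Z) mask volume.
    """
    return F.max_pool3d(mask, kernel_size=stride, stride=stride)


def class_guided_sample(labels: torch.Tensor, n_points: int) -> torch.Tensor:
    """Sample voxel indices with per-class weights inversely
    proportional to class frequency, so rare classes are not drowned
    out by dominant ones (e.g. road, vegetation).

    labels: (N,) flat voxel labels. Returns indices of sampled voxels.
    """
    counts = torch.bincount(labels)
    weights = 1.0 / counts.clamp(min=1).float()  # rare classes weigh more
    per_voxel = weights[labels]                  # weight for each voxel
    return torch.multinomial(per_voxel, n_points, replacement=True)
```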

Empirical Performance

The empirical analysis covers the SemanticKITTI and nuScenes datasets. OccFormer outperforms existing state-of-the-art camera-based approaches on semantic scene completion, improving SSC mIoU by 1.06% over the strongest baseline on SemanticKITTI, which underlines the effectiveness of the dual-path strategy for extracting precise semantic inferences from 3D data. On the nuScenes LiDAR semantic segmentation benchmark, the camera-only model also yields competitive results against LiDAR-based methods and produces more complete and realistic occupancy predictions.

Implications and Future Directions

OccFormer provides significant insights into leveraging transformers for 3D semantic prediction tasks and effectively addresses the trade-offs between detail preservation and computational cost. From a practical standpoint, this research lays the groundwork for more efficient, cost-effective autonomous systems that rely on vision over LiDAR. Theoretically, it further validates the potential of transformers in capturing complex spatial hierarchies and semantic relations within 3D environments.

Looking forward, several research directions are worth exploring. Improving the real-time applicability of such models could substantially impact autonomous vehicle navigation in dynamic environments. Moreover, extending these methodologies to seamlessly incorporate sensor fusion could offer even richer environmental understanding, possibly beyond current autonomous driving use cases.

The open-source code ensures reproducibility and encourages community engagement, fostering advances in related research areas. The preserve-pooling and class-guided sampling techniques also present promising avenues for future exploration: given their effectiveness in mitigating data sparsity and class imbalance, they may transfer to other domains with imbalanced or sparse datasets.

In summary, OccFormer reflects a notable advance in applying transformer architectures to 3D perception tasks, achieving commendable empirical results and offering a compelling foundation for future developments in vision-based perception systems.