FB-OCC: 3D Occupancy Prediction based on Forward-Backward View Transformation (2307.01492v1)

Published 4 Jul 2023 in cs.CV and cs.RO

Abstract: This technical report summarizes the winning solution for the 3D Occupancy Prediction Challenge, which is held in conjunction with the CVPR 2023 Workshop on End-to-End Autonomous Driving and the CVPR 2023 Workshop on Vision-Centric Autonomous Driving. Our proposed solution FB-OCC builds upon FB-BEV, a cutting-edge camera-based bird's-eye view perception design using forward-backward projection. On top of FB-BEV, we further study novel designs and optimization tailored to the 3D occupancy prediction task, including joint depth-semantic pre-training, joint voxel-BEV representation, model scaling up, and effective post-processing strategies. These designs and optimization result in a state-of-the-art mIoU score of 54.19% on the nuScenes dataset, ranking the 1st place in the challenge track. Code and models will be released at: https://github.com/NVlabs/FB-BEV.

FB-OCC: 3D Occupancy Prediction based on Forward-Backward View Transformation

The paper "FB-OCC: 3D Occupancy Prediction based on Forward-Backward View Transformation" presents a comprehensive approach to 3D occupancy prediction, showcasing state-of-the-art methodologies in the context of autonomous driving. The research addresses the task of predicting the occupancy status and semantic class of each voxel within a 3D space, crucial for the planning and perception aspects of autonomous vehicles (AVs).

Overview of FB-OCC Solution

The FB-OCC model builds on FB-BEV, a sophisticated bird's-eye view (BEV) perception framework that leverages forward-backward projection to lift camera inputs into 3D. Notably, the paper explores advancements through:

  1. Joint Depth-Semantic Pre-training: Combining depth estimation with semantic segmentation to enrich geometrical and semantic understanding.
  2. Joint Voxel-BEV Representation: Merging voxel-level data with BEV features for refined occupancy prediction.
  3. Model Scaling and Optimization: Scaling up the model while mitigating the overfitting issues typical of large 3D perception models.
  4. Effective Post-Processing Strategies: Including test-time augmentation and ensemble techniques for performance enhancement.

Methodological Insights

Model Design

FB-OCC integrates both forward and backward projection strategies into a cohesive framework, improving model perception by exploiting the strengths of each approach. The method begins with forward projection to derive an initial voxel representation and continues with backward projection to refine these representations using BEV features. This duality yields a robust understanding of the 3D space, critical for occupancy prediction.
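
To make the flow concrete, the sketch below traces the two stages in PyTorch: a lift-splat-style forward projection that scatters depth-weighted image features into a voxel grid, followed by a refinement pass over the resulting volume. All tensor shapes and module names are illustrative assumptions, and the backward step is reduced to a 3D convolution standing in for FB-BEV's depth-aware backward projection; this is a minimal sketch, not the released implementation.

```python
# Minimal sketch of a forward-backward view transformation for one camera.
# Shapes, names, and the simplified backward step are assumptions.
import torch
import torch.nn as nn

class ForwardBackwardViewTransform(nn.Module):
    def __init__(self, feat_dim=64, depth_bins=32, voxel_grid=(16, 100, 100)):
        super().__init__()
        self.voxel_grid = voxel_grid  # (Z, Y, X)
        # Forward path: predict a categorical depth distribution per pixel
        # (lift-splat style), then scatter features into the voxel grid.
        self.depth_head = nn.Conv2d(feat_dim, depth_bins, 1)
        # Backward path: refine the initial volume (here a plain 3D conv,
        # standing in for FB-BEV's backward projection with voxel queries).
        self.refine = nn.Conv3d(feat_dim, feat_dim, 3, padding=1)

    def forward(self, img_feat, pixel_to_voxel):
        """
        img_feat:       (B, C, H, W) camera features.
        pixel_to_voxel: (B, D, H, W, 3) long tensor of voxel indices for each
                        (pixel, depth-bin) pair, precomputed from the camera
                        intrinsics and extrinsics.
        """
        B, C, H, W = img_feat.shape
        Z, Y, X = self.voxel_grid
        depth_prob = self.depth_head(img_feat).softmax(dim=1)     # (B, D, H, W)
        # "Lift": weight image features by the depth probability.
        lifted = depth_prob.unsqueeze(1) * img_feat.unsqueeze(2)  # (B, C, D, H, W)
        # "Splat": scatter-add lifted features into the flattened volume.
        vol = img_feat.new_zeros(B, C, Z * Y * X)
        idx = (pixel_to_voxel[..., 0] * Y * X +
               pixel_to_voxel[..., 1] * X +
               pixel_to_voxel[..., 2]).view(B, 1, -1).expand(B, C, -1)
        vol.scatter_add_(2, idx, lifted.reshape(B, C, -1))
        vol = vol.view(B, C, Z, Y, X)
        # Backward refinement over the initial voxel representation.
        return vol + self.refine(vol)
```

Precomputing `pixel_to_voxel` from the calibrated camera geometry keeps the per-frame cost of the forward projection to a softmax, an outer product, and one scatter-add.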

Model Scaling and Pre-Training

To address scaling challenges, FB-OCC employs the InternImage-H backbone with roughly one billion parameters, underscoring the value of extensive pre-training on large datasets such as Object365. A subsequent joint pre-training stage, pairing depth estimation with semantic segmentation, then strengthens both the geometric and the semantic quality of the learned features.
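
A minimal sketch of what such a joint objective could look like follows: a shared 2D feature map feeds two lightweight heads, with depth supervised as classification over discretized bins alongside standard per-pixel semantic cross-entropy. The head names, bin count, and equal loss weighting are illustrative assumptions, not the paper's exact recipe.

```python
# Hedged sketch of a joint depth-semantic pre-training objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthSemanticPretrainHead(nn.Module):
    def __init__(self, feat_dim=256, depth_bins=64, num_classes=17):
        super().__init__()
        self.depth_head = nn.Conv2d(feat_dim, depth_bins, 1)  # categorical depth
        self.seg_head = nn.Conv2d(feat_dim, num_classes, 1)   # per-pixel semantics

    def forward(self, feats):
        # feats: (B, C, H, W) backbone features; both heads share them.
        return self.depth_head(feats), self.seg_head(feats)

def joint_loss(depth_logits, seg_logits, depth_target, seg_target,
               w_depth=1.0, w_seg=1.0):
    # Depth as classification over discretized bins; semantics as plain CE.
    # Pixels without labels carry target -1 and are ignored.
    l_depth = F.cross_entropy(depth_logits, depth_target, ignore_index=-1)
    l_seg = F.cross_entropy(seg_logits, seg_target, ignore_index=-1)
    return w_depth * l_depth + w_seg * l_seg
```

Sharing one backbone across both targets is what lets the depth supervision shape geometry while the segmentation supervision preserves semantic discriminability.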

Post-Processing Techniques

Test-time augmentation and model ensembling round out the pipeline. By averaging predictions across augmented inputs and across multiple models, the approach counters the accuracy degradation observed for distant regions and delivers a further improvement in mIoU.
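
The sketch below illustrates the general pattern of combining test-time augmentation with ensembling: class probabilities are averaged over the original and flipped views and over several models before the per-voxel argmax. The `model(x)` interface returning (B, K, Z, Y, X) logits, and the assumption that a flip on the input corresponds to a flip on the output along the same axis, are hypothetical simplifications rather than the released API.

```python
# Illustrative TTA + ensemble averaging for voxel occupancy prediction.
import torch

@torch.no_grad()
def predict_with_tta(models, x, flip_dim=-1):
    probs, n = 0.0, 0
    for model in models:
        model.eval()
        # Original view.
        probs = probs + model(x).softmax(dim=1)
        n += 1
        # Flipped view: flip the input, then flip the prediction back so
        # the voxels line up with the original frame before averaging.
        flipped = model(torch.flip(x, dims=(flip_dim,))).softmax(dim=1)
        probs = probs + torch.flip(flipped, dims=(flip_dim,))
        n += 1
    probs = probs / n
    return probs.argmax(dim=1)  # per-voxel semantic class
```

Averaging probabilities rather than raw logits keeps each model's contribution on a comparable scale, the usual choice when ensembling heterogeneously trained models.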

Experimental Outcomes

The research substantiates its claims through robust experimental evaluations using the nuScenes dataset. The proposed FB-OCC model achieved a leading mIoU score of 54.19%, outperforming existing models and securing the top position in the 3D Occupancy Prediction Challenge.

Implications and Future Work

While the FB-OCC method illustrates the potential for enhanced AV perception, this work invites further exploration into scalable models that maintain efficiency without compromising on detail. The findings also underscore the growing importance of integrating large-scale 2D pre-training with 3D tasks, suggesting avenues for further advancements in semantic understanding and geometry consistency in AV systems.

Future developments could focus on refining model interpretation in complex scenarios and minimizing computational demands through optimized frameworks, potentially integrating multi-sensor data for enriched spatial understanding.

In conclusion, the research in FB-OCC makes significant contributions to the field of autonomous driving, emphasizing the role of sophisticated view transformation and extensive pre-training in enhancing 3D occupancy prediction. Its implications are far-reaching, offering valuable insights for researchers and industry practitioners aiming to advance autonomous vehicle technologies.

References (20)
  1. Planning-oriented autonomous driving. In CVPR, 2023.
  2. BEVFormer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In ECCV, 2022.
  3. BEVDet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv:2112.11790, 2021.
  4. BEVDepth: Acquisition of reliable depth for multi-view 3d object detection. In AAAI, 2023.
  5. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In ECCV, 2020.
  6. M²BEV: Multi-camera joint 3d detection and segmentation with unified birds-eye view representation. arXiv:2204.05088, 2022.
  7. MonoScene: Monocular 3d semantic scene completion. In CVPR, 2022.
  8. OpenOccupancy: A large scale benchmark for surrounding semantic occupancy perception. arXiv:2303.03991, 2023.
  9. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
  10. A convnet for the 2020s. In CVPR, 2022.
  11. An energy and gpu-computation efficient backbone network for real-time object detection. In CVPR Workshops, 2019.
  12. InternImage: Exploring large-scale vision foundation models with deformable convolutions. In CVPR, 2023.
  13. nuScenes: A multimodal dataset for autonomous driving. In CVPR, 2020.
  14. Objects365: A large-scale, high-quality dataset for object detection. In ICCV, 2019.
  15. Segment anything. arXiv:2304.02643, 2023.
  16. Microsoft. Neural Network Intelligence. https://github.com/microsoft/nni, 2011.
  17. Occ3D: A large-scale 3d occupancy prediction benchmark for autonomous driving. arXiv:2304.14365, 2023.
  18. Time will tell: New outlooks and a baseline for temporal multi-view 3d object detection. In ICLR, 2023.
  19. Deep residual learning for image recognition. In CVPR, 2016.
  20. Vision transformer adapter for dense predictions. In ICLR, 2023.
Authors (7)
  1. Zhiqi Li (42 papers)
  2. Zhiding Yu (94 papers)
  3. David Austin (5 papers)
  4. Mingsheng Fang (3 papers)
  5. Shiyi Lan (38 papers)
  6. Jan Kautz (215 papers)
  7. Jose M. Alvarez (90 papers)
Citations (69)