FB-OCC: 3D Occupancy Prediction based on Forward-Backward View Transformation (2307.01492v1)
Abstract: This technical report summarizes the winning solution for the 3D Occupancy Prediction Challenge, held in conjunction with the CVPR 2023 Workshop on End-to-End Autonomous Driving and the CVPR 2023 Workshop on Vision-Centric Autonomous Driving. Our proposed solution, FB-OCC, builds upon FB-BEV, a cutting-edge camera-based bird's-eye-view perception design that uses forward-backward projection. On top of FB-BEV, we further study novel designs and optimizations tailored to the 3D occupancy prediction task, including joint depth-semantic pre-training, joint voxel-BEV representation, model scaling up, and effective post-processing strategies. These designs and optimizations result in a state-of-the-art mIoU score of 54.19% on the nuScenes dataset, ranking 1st in the challenge track. Code and models will be released at: https://github.com/NVlabs/FB-BEV.
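The abstract reports its headline result as mIoU (mean intersection-over-union). As a rough illustration of what that number measures, the sketch below computes mIoU over flattened per-voxel class labels. It is a generic, minimal version and not the challenge's official evaluation code: the function name `mean_iou`, the `ignore_label` value of 255, and the choice to skip classes absent from both prediction and ground truth are all assumptions made for this example.

```python
from typing import Sequence


def mean_iou(
    pred: Sequence[int],
    target: Sequence[int],
    num_classes: int,
    ignore_label: int = 255,  # assumed marker for unlabeled voxels
) -> float:
    """Mean IoU over classes present in the prediction or the ground truth.

    `pred` and `target` are flattened per-voxel class indices. Predictions
    are assumed to lie in [0, num_classes).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    intersection = [0] * num_classes
    union = [0] * num_classes

    for p, t in zip(pred, target):
        if t == ignore_label:
            continue  # voxel carries no ground-truth label
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            # A mismatch enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1

    # Classes with an empty union never occur and are left out of the mean.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    pred = [0, 0, 1, 1, 2, 2]
    target = [0, 1, 1, 1, 2, 0]
    # Per-class IoU: class 0 = 1/3, class 1 = 2/3, class 2 = 1/2
    print(mean_iou(pred, target, num_classes=3))
```

A benchmark's own evaluator may differ in details such as which voxels are masked out and how empty classes are handled, so scores from this sketch are not directly comparable to reported leaderboard numbers.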
- Zhiqi Li
- Zhiding Yu
- David Austin
- Mingsheng Fang
- Shiyi Lan
- Jan Kautz
- Jose M. Alvarez