OccNeRF: Advancing 3D Occupancy Prediction in LiDAR-Free Environments (2312.09243v3)

Published 14 Dec 2023 in cs.CV

Abstract: Occupancy prediction reconstructs 3D structures of surrounding environments. It provides detailed information for autonomous driving planning and navigation. However, most existing methods heavily rely on the LiDAR point clouds to generate occupancy ground truth, which is not available in the vision-based system. In this paper, we propose an OccNeRF method for training occupancy networks without 3D supervision. Different from previous works which consider a bounded scene, we parameterize the reconstructed occupancy fields and reorganize the sampling strategy to align with the cameras' infinite perceptive range. The neural rendering is adopted to convert occupancy fields to multi-camera depth maps, supervised by multi-frame photometric consistency. Moreover, for semantic occupancy prediction, we design several strategies to polish the prompts and filter the outputs of a pretrained open-vocabulary 2D segmentation model. Extensive experiments for both self-supervised depth estimation and 3D occupancy prediction tasks on nuScenes and SemanticKITTI datasets demonstrate the effectiveness of our method.

Introduction to Occupancy Prediction

Occupancy prediction is a critical component of vision-based perception systems, particularly for autonomous driving planning and navigation. The task is to reconstruct the 3D structure of the surrounding environment, giving a detailed volumetric picture of the scene. Most existing methods, however, rely on LiDAR (Light Detection and Ranging) point clouds to generate occupancy ground truth, and LiDAR has clear drawbacks: the sensors are expensive, the returns can be sparse, and purely camera-based systems have no LiDAR data at all.

Self-Supervised Multi-Camera Approach

To remove the need for LiDAR and exploit abundant image data, the paper introduces OccNeRF, a self-supervised method for multi-camera occupancy prediction that trains on raw images without 3D labels or LiDAR data. To handle unbounded scenes, the reconstructed occupancy fields are parameterized and the sampling strategy is reorganized to match the cameras' effectively infinite perceptive range. Neural rendering, in the spirit of neural radiance fields (NeRF), then converts the occupancy fields into multi-camera depth maps, which are supervised with a multi-frame photometric consistency loss of the kind widely used in self-supervised depth estimation.
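
As a rough illustration of these ideas, the sketch below (PyTorch, not the authors' code) shows a Mip-NeRF-360-style contraction that maps unbounded coordinates into a bounded volume, volume rendering of an expected depth per ray from sampled occupancy values, and the L1 part of a photometric consistency loss. The exact parameterization, sampling strategy, and loss weighting used in OccNeRF may differ.

```python
# Minimal sketch (assumptions, not the authors' implementation) of the two ingredients
# described above: a contraction for unbounded scenes and occupancy-to-depth rendering.
import torch


def contract(points: torch.Tensor, inner_radius: float = 1.0) -> torch.Tensor:
    """Map unbounded 3D points into a bounded ball (Mip-NeRF-360-style contraction).

    Points inside `inner_radius` are unchanged; points outside are compressed so the
    whole space lands within radius 2 * inner_radius. The paper's exact
    parameterization may differ; this is one common choice.
    """
    norm = points.norm(dim=-1, keepdim=True).clamp(min=1e-6)
    squashed = (2.0 * inner_radius - inner_radius**2 / norm) * points / norm
    return torch.where(norm <= inner_radius, points, squashed)


def render_depth(occupancy: torch.Tensor, sample_depths: torch.Tensor) -> torch.Tensor:
    """Volume-render an expected depth per ray from sampled occupancy values.

    occupancy:     (num_rays, num_samples) values in [0, 1] along each camera ray
    sample_depths: (num_rays, num_samples) distance of each sample from the camera
    """
    alpha = occupancy.clamp(1e-6, 1.0 - 1e-6)
    # Transmittance: probability that the ray reaches sample i without being blocked.
    ones = torch.ones_like(alpha[:, :1])
    transmittance = torch.cumprod(torch.cat([ones, 1.0 - alpha[:, :-1]], dim=1), dim=1)
    weights = alpha * transmittance                      # (num_rays, num_samples)
    return (weights * sample_depths).sum(dim=1)          # expected depth per ray


def photometric_l1(warped: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """L1 part of a multi-frame photometric consistency loss.

    `warped` is an adjacent frame reprojected into the target view using the rendered
    depth and known camera poses (the warping itself is omitted here); in practice an
    SSIM term is typically combined with this L1 term.
    """
    return (warped - target).abs().mean()
```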

Advancements in Semantic Occupancy Prediction

For semantic occupancy prediction, which additionally asks what kinds of objects occupy the scene and how they are laid out, the method relies on a pretrained open-vocabulary 2D segmentation model. Several strategies are introduced to polish the text prompts fed to this model and to filter its outputs, turning 2D segmentation results into pseudo-labels that supervise the 3D semantic occupancy prediction.
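
As a similarly hedged sketch, the snippet below shows one way filtered 2D pseudo-labels could supervise rendered semantics, reusing the volume-rendering weights from the previous sketch. The confidence threshold and filtering rule here are illustrative placeholders, not the paper's exact pipeline.

```python
# Hypothetical supervision of rendered semantics with filtered 2D pseudo-labels.
import torch
import torch.nn.functional as F


def semantic_rendering_loss(
    sem_logits: torch.Tensor,     # (num_rays, num_samples, num_classes) per-sample logits
    weights: torch.Tensor,        # (num_rays, num_samples) volume-rendering weights
    pseudo_labels: torch.Tensor,  # (num_rays,) class index per pixel from the 2D model
    pseudo_scores: torch.Tensor,  # (num_rays,) confidence of each 2D prediction
    min_score: float = 0.5,       # hypothetical confidence threshold for filtering
) -> torch.Tensor:
    """Cross-entropy between rendered per-pixel semantics and filtered 2D pseudo-labels."""
    # Accumulate per-sample logits along each ray with the rendering weights.
    rendered = (weights.unsqueeze(-1) * sem_logits).sum(dim=1)  # (num_rays, num_classes)
    keep = pseudo_scores > min_score  # drop low-confidence pseudo-labels
    if keep.sum() == 0:
        return rendered.new_zeros(())
    return F.cross_entropy(rendered[keep], pseudo_labels[keep])
```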

Validation and Potential

OccNeRF's effectiveness is demonstrated through extensive experiments on the nuScenes and SemanticKITTI datasets, standard benchmarks for autonomous driving perception. The comparisons cover both self-supervised depth estimation and 3D occupancy prediction, where the method performs strongly, including on semantic occupancy prediction. The work is a step toward understanding 3D scenes from image data alone with self-supervised training, offering a less expensive alternative to LiDAR-based pipelines and broadening the range of vision-only systems that can adopt occupancy prediction.

Authors (8)
  1. Chubin Zhang (4 papers)
  2. Juncheng Yan (3 papers)
  3. Yi Wei (60 papers)
  4. Jiaxin Li (57 papers)
  5. Li Liu (311 papers)
  6. Yansong Tang (81 papers)
  7. Yueqi Duan (47 papers)
  8. Jiwen Lu (192 papers)