SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction (2311.12754v2)

Published 21 Nov 2023 in cs.CV, cs.AI, and cs.LG

Abstract: 3D occupancy prediction is an important task for the robustness of vision-centric autonomous driving, which aims to predict whether each point is occupied in the surrounding 3D space. Existing methods usually require 3D occupancy labels to produce meaningful results. However, it is very laborious to annotate the occupancy status of each voxel. In this paper, we propose SelfOcc to explore a self-supervised way to learn 3D occupancy using only video sequences. We first transform the images into the 3D space (e.g., bird's eye view) to obtain a 3D representation of the scene. We directly impose constraints on the 3D representations by treating them as signed distance fields. We can then render 2D images of previous and future frames as self-supervision signals to learn the 3D representations. We propose an MVS-embedded strategy to directly optimize the SDF-induced weights with multiple depth proposals. Our SelfOcc outperforms the previous best method SceneRF by 58.7% using a single frame as input on SemanticKITTI and is the first self-supervised work that produces reasonable 3D occupancy for surround cameras on nuScenes. SelfOcc produces high-quality depth and achieves state-of-the-art results on novel depth synthesis, monocular depth estimation, and surround-view depth estimation on SemanticKITTI, KITTI-2015, and nuScenes, respectively. Code: https://github.com/huang-yh/SelfOcc.

References (86)
  1. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In ICCV, pages 5855–5864, 2021.
  2. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In CVPR, pages 5470–5479, 2022.
  3. SemanticKITTI: A dataset for semantic scene understanding of lidar sequences. In ICCV, pages 9297–9307, 2019.
  4. nuScenes: A multimodal dataset for autonomous driving. In CVPR, 2020.
  5. Semantic scene completion via integrating instances and scene in-the-loop. In CVPR, pages 324–333, 2021.
  6. Anh-Quan Cao and Raoul de Charette. Monoscene: Monocular 3d semantic scene completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3991–4001, 2022.
  7. Anh-Quan Cao and Raoul de Charette. Scenerf: Self-supervised monocular 3d scene reconstruction with radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9387–9398, 2023.
  8. Tensorf: Tensorial radiance fields. In ECCV, pages 333–350. Springer, 2022a.
  9. End-to-end autonomous driving: Challenges and frontiers. arXiv preprint arXiv:2306.16927, 2023.
  10. 3d sketch-aware semantic scene completion via semi-supervised structure prior. In CVPR, 2020.
  11. Mppnet: Multi-frame feature intertwining with proxy points for 3d temporal object detection. In ECCV, pages 680–697. Springer, 2022b.
  12. 3d semantic scene completion from a single depth image using adversarial training. In ICIP, pages 1835–1839. IEEE, 2019.
  13. S3cnet: A sparse semantic scene completion network for lidar point clouds. In CoRL, pages 2148–2161. PMLR, 2021.
  14. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255. IEEE, 2009.
  15. Depth-supervised NeRF: Fewer views and faster training for free. In CVPR, 2022.
  16. Semantic scene completion from a single 360-degree image and depth map. In VISIGRAPP (5: VISAPP), pages 36–46, 2020.
  17. Edgenet: Semantic scene completion from a single rgb-d image. In ICPR, pages 503–510. IEEE, 2021.
  18. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV, pages 2650–2658, 2015.
  19. Plenoxels: Radiance fields without neural networks. In CVPR, pages 5501–5510, 2022.
  20. Two stream 3d semantic scene completion. In CVPR Workshops, 2019.
  21. Are we ready for autonomous driving? the kitti vision benchmark suite. In CVPR, pages 3354–3361. IEEE, 2012.
  22. Unsupervised monocular depth estimation with left-right consistency. In CVPR, pages 270–279, 2017.
  23. Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3828–3838, 2019.
  24. Implicit geometric regularization for learning shapes. arXiv preprint arXiv:2002.10099, 2020.
  25. 3d packing for self-supervised monocular depth estimation. In CVPR, pages 2485–2494, 2020.
  26. Full surround monodepth from multiple cameras. arXiv preprint arXiv:2104.00152, 2021.
  27. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016.
  28. Fiery: Future instance prediction in bird’s-eye view from surround monocular cameras. In ICCV, 2021.
  29. Goal-oriented autonomous driving. arXiv preprint arXiv:2212.10156, 2022.
  30. Bevdet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv preprint arXiv:2112.11790, 2021.
  31. Tri-perspective view for vision-based 3d semantic occupancy prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9223–9232, 2023.
  32. Vad: Vectorized scene representation for efficient autonomous driving. arXiv preprint arXiv:2303.12077, 2023.
  33. Self-supervised monocular trained depth estimation using self-attention and discrete disparity volume. In CVPR, 2020.
  34. Rgbd based dimensional decomposition residual network for 3d semantic scene completion. In CVPR, pages 7693–7702, 2019a.
  35. Depth based semantic scene completion with position importance aware loss. IEEE Robotics and Automation Letters, 5(1):219–226, 2019b.
  36. Anisotropic convolutional networks for 3d semantic scene completion. In CVPR, 2020a.
  37. Mine: Towards continuous depth mpi with nerf for novel view synthesis. In ICCV, 2021.
  38. Attention-based multi-modal fusion network for semantic scene completion. In AAAI, pages 11402–11409, 2020b.
  39. Bevdepth: Acquisition of reliable depth for multi-view 3d object detection. arXiv preprint arXiv:2206.10092, 2022a.
  40. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. arXiv preprint arXiv:2203.17270, 2022b.
  41. Vision transformer for nerf-based view synthesis from a single input image. In WACV, 2023.
  42. Microsoft coco: Common objects in context. In ECCV, pages 740–755. Springer, 2014.
  43. Feature pyramid networks for object detection. In CVPR, 2017.
  44. See and think: Disentangling semantic scene completion. NeurIPS, 31, 2018.
  45. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  46. Nelson Max. Optical models for direct volume rendering. IEEE Transactions on Visualization and Computer Graphics, 1(2):99–108, 1995.
  47. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
  48. Autorf: Learning 3d object radiance fields from single view observations. In CVPR, pages 3971–3980, 2022a.
  49. Instant neural graphics primitives with a multiresolution hash encoding. ACM TOG, 41(4):1–15, 2022b.
  50. Bev-seg: Bird’s eye view semantic segmentation using geometry and semantic point cloud. arXiv preprint arXiv:2006.11436, 2020.
  51. Regnerf: Regularizing neural radiance fields for view synthesis from sparse inputs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5480–5490, 2022.
  52. Unisurf: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction. In ICCV, pages 5589–5599, 2021.
  53. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In ECCV, 2020.
  54. Offboard 3d object detection from point cloud sequences. In CVPR, pages 6134–6144, 2021.
  55. Urban radiance fields. In CVPR, 2022.
  56. Semantic scene completion using local deep implicit functions on lidar data. TPAMI, 44(10):7205–7218, 2021.
  57. Dense depth priors for neural radiance fields from sparse input views. In CVPR, 2022.
  58. Lmscnet: Lightweight multiscale 3d semantic completion. In 3DV, 2020.
  59. R3d3: Dense 3d reconstruction of dynamic scenes from multiple cameras. In ICCV, pages 3216–3226, 2023.
  60. Objects365: A large-scale, high-quality dataset for object detection. In ICCV, pages 8430–8439, 2019.
  61. Feature-metric loss for self-supervised learning of depth and egomotion. In ECCV, 2020.
  62. Semantic scene completion from a single depth image. In CVPR, pages 1746–1754, 2017.
  63. Occ3d: A large-scale 3d occupancy prediction benchmark for autonomous driving. arXiv preprint arXiv:2304.14365, 2023.
  64. Scene as occupancy. In ICCV, pages 8406–8415, 2023.
  65. NeuS: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. arXiv preprint arXiv:2106.10689, 2021.
  66. Openoccupancy: A large scale benchmark for surrounding semantic occupancy perception. arXiv preprint arXiv:2303.03991, 2023.
  67. Forknet: Multi-branch volumetric semantic completion from a single depth image. In ICCV, pages 8608–8617, 2019.
  68. Nerfingmvs: Guided optimization of neural radiance fields for indoor multi-view stereo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5610–5619, 2021.
  69. Surrounddepth: Entangling surrounding views for self-supervised multi-camera depth estimation. In CoRL, pages 539–549. PMLR, 2023.
  70. SynSin: End-to-end view synthesis from a single image. In CVPR, 2020.
  71. Behind the scenes: Density fields for single view reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9076–9086, 2023.
  72. Scfusion: Real-time incremental scene reconstruction with semantic completion. In 3DV, pages 801–810. IEEE, 2020.
  73. Sparse single sweep lidar point cloud segmentation via learning contextual shape priors from scene completion. In AAAI, pages 3101–3109, 2021.
  74. Auto4d: Learning to label 4d objects from sequential point clouds. arXiv preprint arXiv:2101.06586, 2021.
  75. pixelNeRF: Neural radiance fields from one or few images. In CVPR, 2021.
  76. Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In CVPR, 2018.
  77. A simple framework for open-vocabulary segmentation and detection. In ICCV, pages 1020–1031, 2023a.
  78. Efficient semantic scene completion network with spatial group convolution. In ECCV, pages 733–749, 2018.
  79. Critical regularizations for neural surface reconstruction in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6270–6279, 2022a.
  80. Cascaded context pyramid for full-resolution 3d semantic scene completion. In ICCV, pages 7801–7810, 2019.
  81. Beverse: Unified perception and prediction in birds-eye-view for vision-centric autonomous driving. arXiv preprint arXiv:2205.09743, 2022b.
  82. Occformer: Dual-path transformer for vision-based 3d semantic occupancy prediction. arXiv preprint arXiv:2304.05316, 2023b.
  83. Semantic point completion network for 3d semantic scene completion. In ECAI 2020, pages 2824–2831. IOS Press, 2020.
  84. Devnet: Self-supervised monocular depth learning via density volume construction. In ECCV, pages 125–142. Springer, 2022.
  85. Unsupervised learning of depth and ego-motion from video. In CVPR, pages 1851–1858, 2017.
  86. R-MSFM: Recurrent multi-scale feature modulation for monocular depth estimating. In ICCV, 2021.
Authors (5)
  1. Yuanhui Huang (14 papers)
  2. Wenzhao Zheng (64 papers)
  3. Borui Zhang (15 papers)
  4. Jie Zhou (687 papers)
  5. Jiwen Lu (192 papers)
Citations (47)

Summary

An Expert Overview of "SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction"

The paper "SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction" presents a notable contribution to the domain of vision-centric autonomous driving. It addresses the challenge of 3D occupancy prediction using only video sequences, stepping away from traditional methods reliant on labor-intensive 3D annotations. This research builds on the premise that a self-supervised approach leveraging temporal and spatial data from video inputs can yield high-precision 3D space reconstructions, a crucial requirement for the autonomous driving systems.

Key Contributions and Methodology

The paper's central offering is SelfOcc, a framework that predicts 3D occupancy from video sequences in a self-supervised manner. The approach lifts 2D image features into 3D scene representations organized as a bird's-eye view (BEV) or tri-perspective view (TPV). Crucially, it treats these representations as signed distance fields (SDFs) rather than density fields, a choice that allows meaningful geometric constraints to be imposed directly on the 3D representation. A sketch of the SDF rendering mechanics follows.
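To make the SDF parameterization concrete, here is a minimal sketch of the NeuS-style conversion from per-ray SDF samples to volume-rendering weights, plus the standard Eikonal regularizer that an SDF makes available. The fixed sharpness `s` and the standalone regularizer are illustrative assumptions rather than the paper's exact configuration.

```python
import torch

def sdf_to_weights(sdf, s=20.0):
    """Convert SDF values sampled along each ray into volume-rendering
    weights, following the NeuS-style formulation: opacity is derived from
    a logistic CDF of the SDF so that weight concentrates near the zero
    level set (the surface).
    sdf: (num_rays, num_samples); s: logistic sharpness (NeuS learns it;
    fixed here for illustration)."""
    cdf = torch.sigmoid(s * sdf)                       # Phi_s(sdf_i)
    # discrete opacity between consecutive samples, clamped to [0, 1]
    alpha = ((cdf[:, :-1] - cdf[:, 1:]) / (cdf[:, :-1] + 1e-6)).clamp(0.0, 1.0)
    ones = torch.ones_like(alpha[:, :1])
    trans = torch.cumprod(torch.cat([ones, 1.0 - alpha + 1e-7], dim=1), dim=1)
    return trans[:, :-1] * alpha                       # rendering weights

def eikonal_loss(grad_sdf):
    """Standard Eikonal regularizer pushing |grad SDF| toward 1, the kind
    of geometric constraint an SDF parameterization makes available.
    grad_sdf: (..., 3) spatial gradients of the SDF at sampled points."""
    return ((grad_sdf.norm(dim=-1) - 1.0) ** 2).mean()
```

The weights returned by `sdf_to_weights` can then composite colors or depths along each ray, so rendered previous and future frames become differentiable functions of the SDF.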

Central to SelfOcc is an MVS-embedded strategy that directly optimizes the SDF-induced rendering weights over multiple depth proposals per ray, which improves both depth fidelity and occupancy accuracy. The rendering pipeline builds on neural implicit surface reconstruction, notably NeuS, to convert the SDF into weights for volume rendering and thereby refine the occupancy predictions.
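The paper's exact MVS-embedded objective is not reproduced here, but the idea, as we read it, can be sketched as follows: evaluate the SDF-induced rendering weights at several candidate depths per ray, score each candidate with a multi-view matching cost, and push the weight distribution toward the best-scoring proposal. All names below are hypothetical.

```python
import torch

def proposal_alignment_loss(weights, ray_cost):
    """Hypothetical sketch of the MVS-embedded idea: `weights` are the
    SDF-induced rendering weights evaluated at the depth proposals, and
    `ray_cost` is a multi-view matching cost per proposal (e.g., from
    photometric comparison against neighboring frames, lower is better).
    weights:  (num_rays, num_proposals)
    ray_cost: (num_rays, num_proposals)"""
    best = ray_cost.argmin(dim=1)                    # cheapest proposal per ray
    logp = torch.log(weights.clamp_min(1e-6))
    # cross-entropy-style term concentrating weight on the best proposal
    return -logp.gather(1, best[:, None]).mean()
```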

The proposed framework surpasses the previous best-performing method, SceneRF, by 58.7% in IoU on the SemanticKITTI benchmark when given a single frame as input. Furthermore, SelfOcc is the first self-supervised framework to achieve reasonable 3D occupancy predictions from surround-view cameras on the nuScenes dataset, underlining its robustness across diverse datasets.

Results and Implications

SelfOcc sets a new standard for self-supervised 3D occupancy prediction and shows versatility across several tasks. It achieves state-of-the-art results in novel depth synthesis on SemanticKITTI, monocular depth estimation on KITTI-2015, and surround-view depth estimation on nuScenes. As a self-supervised method, it obviates the need for 3D labels, promising significant cost reductions in dataset preparation for real-world applications.

These results carry substantial theoretical implications, chiefly the potential to redefine current paradigms of 3D spatial reasoning for autonomous systems. The approach's ability to infer occluded scene geometry and to exploit temporal consistency across video frames reflects a nuanced treatment of dynamic environments, which is critical for autonomous vehicle safety and efficiency.

Conclusion and Future Directions

The self-supervised nature of SelfOcc opens new pathways for resource-efficient training of autonomous driving models. The research takes a foundational step toward scalable, adaptable autonomous systems built on self-supervised learning. Future investigations could incorporate motion awareness to better handle dynamic objects and refine object tracking, and improving view synthesis quality remains a promising yet challenging direction for ongoing research.

Overall, this paper presents a comprehensive and technically rigorous approach to advancing self-supervised methodologies in 3D perception, contributing substantial theoretical and practical advancements to the field of computer vision and autonomous driving.
