
Scene as Occupancy (2306.02851v3)

Published 5 Jun 2023 in cs.CV and cs.RO

Abstract: Human drivers can easily describe a complex traffic scene through their visual system. Such precise perception is essential for a driver's planning. To achieve this, a geometry-aware representation that quantizes the physical 3D scene into a structured grid map with semantic labels per cell, termed 3D Occupancy, would be desirable. Compared to bounding boxes, a key insight behind occupancy is that it can capture the fine-grained details of critical obstacles in the scene and thereby facilitate subsequent tasks. Prior or concurrent literature mainly concentrates on a single scene completion task, whereas we argue that this occupancy representation might have a broader impact. In this paper, we propose OccNet, a multi-view vision-centric pipeline with a cascade and temporal voxel decoder to reconstruct 3D occupancy. At the core of OccNet is a general occupancy embedding to represent the 3D physical world. Such a descriptor could be applied to a wide span of driving tasks, including detection, segmentation and planning. To validate the effectiveness of this new representation and our proposed algorithm, we propose OpenOcc, the first dense high-quality 3D occupancy benchmark built on top of nuScenes. Empirical experiments show evident performance gains across multiple tasks, e.g., motion planning sees a collision rate reduction of 15%-58%, demonstrating the superiority of our method.
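
The representation the abstract calls 3D Occupancy, a structured grid with one semantic label per cell, can be illustrated with a minimal sketch. This is not OccNet's decoder or the OpenOcc labeling pipeline; it only shows one simple way to quantize labeled 3D points into such a grid. The function names, the label ids, the majority-vote rule, and the grid parameters are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

FREE = 0  # hypothetical label id for unoccupied cells


def voxelize(points, labels, origin, voxel_size, grid_shape):
    """Quantize labeled 3D points into a sparse semantic occupancy grid.

    Returns a dict mapping (i, j, k) cell indices to the majority label of
    the points that fall in that cell. Cells absent from the dict are free.
    """
    votes = defaultdict(Counter)
    for point, label in zip(points, labels):
        # Integer cell index along each axis, relative to the grid origin.
        idx = tuple(
            math.floor((c - o) / voxel_size) for c, o in zip(point, origin)
        )
        # Ignore points that land outside the grid bounds.
        if all(0 <= i < n for i, n in zip(idx, grid_shape)):
            votes[idx][label] += 1
    # Assumed rule: each occupied cell takes its most frequent label.
    return {idx: counts.most_common(1)[0][0] for idx, counts in votes.items()}


def lookup(grid, idx):
    """Return the semantic label of a cell, or FREE if it is unoccupied."""
    return grid.get(idx, FREE)


# Toy input: label ids 1, 2, 3 are hypothetical semantic classes.
points = [(0.1, 0.1, 0.1), (0.2, 0.3, 0.1), (0.4, 0.4, 0.4),
          (1.2, 0.1, 0.1), (9.0, 0.0, 0.0)]
labels = [1, 1, 2, 3, 1]
grid = voxelize(points, labels, origin=(0.0, 0.0, 0.0),
                voxel_size=0.5, grid_shape=(4, 4, 2))
```

In this toy input the first three points share one cell and the last point lies outside the grid, so only two cells end up occupied. Unlike a bounding box, the grid records which individual cells are occupied, which is the fine-grained detail the abstract refers to.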
