
SOGDet: Semantic-Occupancy Guided Multi-view 3D Object Detection (2308.13794v3)

Published 26 Aug 2023 in cs.CV

Abstract: In the field of autonomous driving, accurate and comprehensive perception of the 3D environment is crucial. Bird's Eye View (BEV) based methods have emerged as a promising solution for 3D object detection from multi-view images. However, existing 3D object detection methods often ignore physical context in the environment, such as sidewalks and vegetation, resulting in sub-optimal performance. In this paper, we propose a novel approach called SOGDet (Semantic-Occupancy Guided Multi-view 3D Object Detection) that leverages a 3D semantic-occupancy branch to improve the accuracy of 3D object detection. In particular, the physical context modeled by semantic occupancy helps the detector perceive scenes in a more holistic way. SOGDet is flexible and can be seamlessly integrated with most existing BEV-based methods. To evaluate its effectiveness, we apply the approach to several state-of-the-art baselines and conduct extensive experiments on the nuScenes dataset. Our results show that SOGDet consistently enhances the performance of three baseline methods in terms of nuScenes Detection Score (NDS) and mean Average Precision (mAP). This indicates that combining 3D object detection with 3D semantic occupancy leads to a more comprehensive perception of the 3D environment, thereby helping to build more robust autonomous driving systems. The code is available at: https://github.com/zhouqiu/SOGDet.
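The two-branch design the abstract describes can be sketched schematically: a shared BEV feature map feeds both a detection head and a semantic-occupancy head, and the two losses are combined for joint training. All shapes, class counts, the placeholder losses, and the balancing weight below are illustrative assumptions for the sketch, not values or code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shared BEV feature map: 64 channels on a 128 x 128 grid,
# standing in for features lifted from multi-view camera images.
C, H, W = 64, 128, 128
bev_feat = rng.standard_normal((C, H, W))

def detection_head(feat, num_classes=10):
    """Toy per-cell class logits standing in for a BEV detection head."""
    w = rng.standard_normal((num_classes, feat.shape[0])) * 0.01
    return np.einsum("kc,chw->khw", w, feat)

def occupancy_head(feat, num_semantics=17, z_bins=16):
    """Toy semantic-occupancy head: per-cell logits over z_bins height
    slices and num_semantics classes (sidewalk, vegetation, ...)."""
    w = rng.standard_normal((num_semantics * z_bins, feat.shape[0])) * 0.01
    out = np.einsum("kc,chw->khw", w, feat)
    return out.reshape(num_semantics, z_bins, feat.shape[1], feat.shape[2])

det_logits = detection_head(bev_feat)   # (10, 128, 128)
occ_logits = occupancy_head(bev_feat)   # (17, 16, 128, 128)

# Joint objective: detection loss plus a weighted occupancy loss, so the
# occupancy branch guides the detector during training.
lam = 1.0                               # balancing weight (hypothetical)
det_loss = np.mean(det_logits ** 2)     # placeholder loss terms
occ_loss = np.mean(occ_logits ** 2)
total_loss = det_loss + lam * occ_loss
```

The key design point is that the occupancy branch only shares the BEV backbone and contributes a loss term, which is why the approach can be bolted onto most existing BEV-based detectors.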

Authors (6)
  1. Qiu Zhou (2 papers)
  2. Jinming Cao (7 papers)
  3. Hanchao Leng (2 papers)
  4. Yifang Yin (24 papers)
  5. Yu Kun (1 paper)
  6. Roger Zimmermann (76 papers)
Citations (4)
