SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction (2404.09502v1)

Published 15 Apr 2024 in cs.CV

Abstract: Vision-based perception for autonomous driving requires explicit modeling of the 3D space, into which 2D latent representations are mapped and on which subsequent 3D operators are applied. However, operating on dense latent spaces introduces cubic time and space complexity, which limits scalability in terms of perception range or spatial resolution. Existing approaches compress the dense representation using projections like Bird's Eye View (BEV) or Tri-Perspective View (TPV). Although efficient, these projections result in information loss, especially for tasks like semantic occupancy prediction. To address this, we propose SparseOcc, an efficient occupancy network inspired by sparse point cloud processing. It utilizes a lossless sparse latent representation with three key innovations. Firstly, a 3D sparse diffuser performs latent completion using spatially decomposed 3D sparse convolutional kernels. Secondly, a feature pyramid and sparse interpolation enhance each scale with information from the others. Finally, the transformer head is redesigned as a sparse variant. SparseOcc achieves a remarkable 74.9% reduction in FLOPs over the dense baseline. Interestingly, it also improves accuracy, from 12.8% to 14.1% mIoU, which in part can be attributed to the sparse representation's ability to avoid hallucinations on empty voxels.
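
To make the "spatially decomposed 3D sparse convolution" idea behind the diffuser more concrete, below is a minimal, hypothetical PyTorch sketch built on the spconv library. It assumes that decomposition means replacing a 3x3x3 kernel with three 1-D sparse convolutions (3x1x1, 1x3x1, 1x1x3), and that regular (non-submanifold) sparse convolutions are used so that latent completion can activate new voxels. The SparseDiffuserBlock name, channel sizes, and toy input are illustrative, not the authors' implementation.

```python
# Hypothetical sketch of a spatially decomposed sparse "diffuser" block.
# Assumption: the 3x3x3 neighbourhood is covered by chaining three 1-D
# sparse convolutions; regular (non-submanifold) convs let new voxels appear,
# which is what "latent completion" suggests. Not the authors' code.
import torch
import torch.nn as nn
import spconv.pytorch as spconv


class SparseDiffuserBlock(nn.Module):
    """Dilates the set of active voxels axis by axis (latent completion)."""

    def __init__(self, channels: int):
        super().__init__()

        def axis_conv(kernel, pad):
            # Regular sparse conv (not SubMConv3d) so empty neighbours become active.
            return spconv.SparseSequential(
                spconv.SparseConv3d(channels, channels, kernel_size=kernel,
                                    padding=pad, bias=False),
                nn.BatchNorm1d(channels),
                nn.ReLU(inplace=True),
            )

        self.conv_z = axis_conv((3, 1, 1), (1, 0, 0))
        self.conv_y = axis_conv((1, 3, 1), (0, 1, 0))
        self.conv_x = axis_conv((1, 1, 3), (0, 0, 1))

    def forward(self, x: spconv.SparseConvTensor) -> spconv.SparseConvTensor:
        # Each 1-D conv spreads features to neighbouring empty voxels along one
        # axis; chaining the three covers the full 3x3x3 neighbourhood while
        # touching far fewer weight-voxel pairs than a dense 3D convolution.
        return self.conv_x(self.conv_y(self.conv_z(x)))


if __name__ == "__main__":
    # Toy sparse latent volume: 4 occupied voxels in a 32^3 grid, 64 channels.
    # Note: spconv kernels execute on CUDA, so everything is moved to the GPU.
    feats = torch.randn(4, 64).cuda()
    coords = torch.tensor([[0, 1, 2, 3],      # (batch, z, y, x)
                           [0, 5, 6, 7],
                           [0, 9, 9, 9],
                           [0, 30, 1, 15]], dtype=torch.int32).cuda()
    x = spconv.SparseConvTensor(feats, coords,
                                spatial_shape=[32, 32, 32], batch_size=1)
    block = SparseDiffuserBlock(64).cuda()
    out = block(x)
    print(x.indices.shape[0], "->", out.indices.shape[0], "active voxels")
```

The design intuition, under these assumptions, is that dilating the active set axis by axis keeps the cost proportional to the number of occupied voxels rather than to the full dense grid, which is consistent with the FLOP reduction reported in the abstract.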
