
Context and Geometry Aware Voxel Transformer for Semantic Scene Completion (2405.13675v4)

Published 22 May 2024 in cs.CV

Abstract: Vision-based Semantic Scene Completion (SSC) has gained much attention due to its widespread applications in various 3D perception tasks. Existing sparse-to-dense approaches typically employ shared, context-independent queries across different input images; these queries fail to capture distinctions among inputs, as the focal regions of different inputs vary, and may result in undirected feature aggregation in cross-attention. Additionally, the absence of depth information may lead to points projected onto the image plane sharing the same 2D position or similar sampling points in the feature map, resulting in depth ambiguity. In this paper, we present a novel context and geometry aware voxel transformer. It utilizes a context aware query generator to initialize context-dependent queries tailored to individual input images, effectively capturing their unique characteristics and aggregating information within the region of interest. Furthermore, it extends deformable cross-attention from 2D to 3D pixel space, enabling the differentiation of points with similar image coordinates based on their depth coordinates. Building upon this module, we introduce a neural network named CGFormer to achieve semantic scene completion. In addition, CGFormer leverages multiple 3D representations (i.e., voxel and TPV) to boost the semantic and geometric representation abilities of the transformed 3D volume from both local and global perspectives. Experimental results demonstrate that CGFormer achieves state-of-the-art performance on the SemanticKITTI and SSCBench-KITTI-360 benchmarks, attaining a mIoU of 16.87 and 20.05, as well as an IoU of 45.99 and 48.07, respectively. Remarkably, CGFormer even outperforms approaches employing temporal images as inputs or much larger image backbone networks.
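The abstract describes two ideas: deriving queries from each input image's own features rather than from shared learned queries, and lifting 2D feature sampling into a depth-aware 3D pixel space so that points with the same image coordinates but different depths aggregate different features. The following NumPy sketch illustrates both ideas under hypothetical shapes and projections; it is an illustration of the concepts, not the paper's actual CGFormer implementation (which uses deformable attention over learned offsets).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: one image feature map (C, H, W) and a per-pixel
# depth distribution (D, H, W), mirroring a 2D-to-3D lifting setup.
C, H, W, D = 8, 4, 6, 5
feat = rng.standard_normal((C, H, W))
depth_prob = rng.random((D, H, W))
depth_prob /= depth_prob.sum(axis=0, keepdims=True)  # normalize per pixel

# Context-aware query generation (sketch): instead of shared learned
# queries, derive queries from the input's own features, e.g. via
# global average pooling, so each image yields its own queries.
num_queries = 3
W_q = rng.standard_normal((num_queries, C))  # hypothetical projection
context = feat.mean(axis=(1, 2))             # (C,) image-level context
queries = W_q @ context                      # (num_queries,) context-dependent

# Depth-aware lifting (sketch): weight the 2D features by the depth
# distribution to form a 3D "pixel space" volume, so two points that
# project to the same (u, v) but lie at different depths d sample
# different features.
feat3d = depth_prob[None] * feat[:, None]    # (C, D, H, W)

def sample_3d(u, v, d):
    """Nearest-neighbour sampling at a 3D pixel-space point (u, v, d)."""
    return feat3d[:, d, v, u]

# Same image coordinates, different depths -> distinct features,
# which is how the depth ambiguity described above is resolved.
near = sample_3d(2, 1, 0)
far = sample_3d(2, 1, 4)
```

In the actual method, the depth-weighted volume would be queried with learned deformable offsets in (u, v, d) rather than nearest-neighbour indexing; the sketch keeps only the depth-disambiguation idea.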
