
Improving Bird's Eye View Semantic Segmentation by Task Decomposition (2404.01925v1)

Published 2 Apr 2024 in cs.CV and cs.AI

Abstract: Semantic segmentation in bird's eye view (BEV) plays a crucial role in autonomous driving. Previous methods usually follow an end-to-end pipeline, directly predicting the BEV segmentation map from monocular RGB inputs. However, a challenge arises because the RGB inputs and BEV targets come from distinct perspectives, making direct point-to-point prediction hard to optimize. In this paper, we decompose the original BEV segmentation task into two stages, namely BEV map reconstruction and RGB-BEV feature alignment. In the first stage, we train a BEV autoencoder to reconstruct BEV segmentation maps from corrupted, noisy latent representations, which encourages the decoder to learn fundamental knowledge of typical BEV patterns. The second stage maps RGB input images into the BEV latent space of the first stage, directly optimizing the correlations between the two views at the feature level. Our approach separates perception and generation into distinct steps, equipping the model to handle intricate and challenging scenes effectively. In addition, we propose transforming the BEV segmentation map from the Cartesian to the polar coordinate system to establish a column-wise correspondence between RGB images and BEV maps. Moreover, our method requires neither multi-scale features nor camera intrinsic parameters for depth estimation, saving computational overhead. Extensive experiments on nuScenes and Argoverse show the effectiveness and efficiency of our method. Code is available at https://github.com/happytianhao/TaDe.
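The Cartesian-to-polar transform described in the abstract can be sketched as a simple resampling step. The snippet below is a minimal illustration, not the authors' implementation: it assumes the ego vehicle sits at the bottom-center of the Cartesian BEV class map and resamples it onto a (radius, angle) grid with nearest-neighbor lookup, so that each polar column corresponds to one viewing ray — the column-wise correspondence with the RGB image that the paper exploits. Function name, grid resolution, and ego placement are all illustrative assumptions.

```python
import numpy as np

def bev_cartesian_to_polar(bev, n_rays=128, n_bins=128, max_range=None):
    """Resample a Cartesian BEV class map (H, W) onto a polar (radius, angle) grid.

    Illustrative sketch: assumes the ego vehicle is at the bottom-center of the
    map, with the camera looking "up" the image. Nearest-neighbor sampling is
    used so discrete class labels are preserved.
    """
    h, w = bev.shape
    cx, cy = (w - 1) / 2.0, float(h - 1)       # assumed ego position: bottom-center
    if max_range is None:
        max_range = h - 1                      # farthest radius to sample, in pixels
    # Rays sweep the forward 180-degree field of view, left to right;
    # bins step outward from the ego vehicle.
    angles = np.linspace(np.pi, 0.0, n_rays)
    radii = np.linspace(0.0, max_range, n_bins)
    rr, aa = np.meshgrid(radii, angles, indexing="ij")
    xs = np.clip(np.round(cx + rr * np.cos(aa)).astype(int), 0, w - 1)
    ys = np.clip(np.round(cy - rr * np.sin(aa)).astype(int), 0, h - 1)
    return bev[ys, xs]                          # shape: (n_bins, n_rays)
```

In this polar layout, row 0 holds the labels nearest the ego vehicle and each column is one bearing, which is what makes a per-column alignment with image columns possible.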
