SG-BEV: Satellite-Guided BEV Fusion for Cross-View Semantic Segmentation (2404.02638v1)
Abstract: This paper targets fine-grained building attribute segmentation in a cross-view setting, i.e., using paired satellite and street-view images. The main challenge is the large perspective difference between street views and satellite views. We introduce SG-BEV, a satellite-guided BEV fusion approach for cross-view semantic segmentation. To overcome the limitations of existing cross-view projection methods in capturing complete building facade features, we incorporate a Bird's Eye View (BEV) method to establish a spatially explicit mapping of street-view features. Moreover, we leverage the advantages of multiple perspectives by introducing a satellite-guided reprojection module that mitigates the uneven feature distribution associated with traditional BEV methods. Our method demonstrates significant improvements on four cross-view datasets collected from multiple cities, including New York, San Francisco, and Boston. Averaged across these datasets, our method improves mIoU by 10.13% and 5.21% over the state-of-the-art satellite-based and cross-view methods, respectively. The code and datasets of this work will be released at https://github.com/yejy53/SG-BEV.
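The headline numbers above are reported as mean intersection-over-union. As a reminder of what that metric measures, here is a minimal sketch of one common way to compute it from predicted and ground-truth label maps. The function name `mean_iou` and the choice to skip classes absent from both maps are assumptions made for illustration; the abstract does not specify the evaluation protocol used for SG-BEV.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the ground truth.

    Hypothetical helper for illustration; not taken from the SG-BEV code base.
    """
    pred = np.asarray(pred, dtype=np.int64).ravel()
    target = np.asarray(target, dtype=np.int64).ravel()
    # Confusion matrix: rows index ground-truth classes, columns index predictions.
    cm = np.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)
    intersection = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - intersection
    # Assumption: classes absent from both maps are excluded from the mean.
    valid = union > 0
    return float((intersection[valid] / union[valid]).mean())

# Toy example with two classes and four pixels.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.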