Bird's-Eye View to Street-View: A Survey (2405.08961v1)
Abstract: In recent years, street-view imagery has become one of the most important sources of geospatial data for urban analytics, supporting insight generation and decision-making. Synthesizing a street-view image from its corresponding satellite image is a challenging task due to the significant differences in appearance and viewpoint between the two domains. In this study, we screened 20 recent research papers to provide a thorough review of the state of the art in synthesizing street-view images from satellite imagery. The main findings are: (i) novel deep learning techniques are required to synthesize more realistic and accurate street-view images; (ii) more publicly available datasets need to be collected; and (iii) more task-specific evaluation metrics need to be investigated to assess the generated images appropriately. We conclude that, because it relies on outdated deep learning techniques, the recent literature has failed to generate detailed and diverse street-view images.
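To make finding (iii) concrete, the sketch below computes two of the standard full-reference metrics commonly used to score a synthesized street-view image against its ground-truth counterpart: PSNR and a simplified single-window SSIM. This is a minimal illustration, not the survey's protocol; it assumes both images arrive as same-shape uint8 NumPy arrays, and the random arrays in the usage section merely stand in for a real generated/ground-truth pair. (The standard SSIM averages over local sliding windows, as implemented by scikit-image's `structural_similarity`; the global variant here keeps the sketch short.)

```python
# Minimal sketch: PSNR and a simplified global SSIM between a ground-truth
# street-view image and a synthesized one. Assumes same-shape uint8 arrays.
import numpy as np

def psnr(ref: np.ndarray, gen: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    mse = np.mean((ref.astype(np.float64) - gen.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

def global_ssim(ref: np.ndarray, gen: np.ndarray, max_val: float = 255.0) -> float:
    """SSIM computed over a single window spanning the whole image.
    The standard metric averages this quantity over local sliding windows."""
    x, y = ref.astype(np.float64), gen.astype(np.float64)
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2  # stability constants
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )

if __name__ == "__main__":
    # Illustrative stand-ins for a ground-truth panorama and a noisy synthesis.
    rng = np.random.default_rng(0)
    truth = rng.integers(0, 256, size=(256, 512, 3), dtype=np.uint8)
    noisy = np.clip(truth + rng.normal(0, 10, truth.shape), 0, 255).astype(np.uint8)
    print(f"PSNR: {psnr(truth, noisy):.2f} dB, SSIM: {global_ssim(truth, noisy):.4f}")
```

Note that both metrics reward pixel-wise fidelity rather than perceptual realism or diversity, which is precisely why the survey calls for more task-specific evaluation measures for cross-view synthesis.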