Convolutional Cross-View Pose Estimation (2303.05915v3)
Abstract: We propose a novel end-to-end method for cross-view pose estimation. Given a ground-level query image and an aerial image that covers the query's local neighborhood, the 3 Degrees-of-Freedom camera pose of the query is estimated by matching its image descriptor to descriptors of local regions within the aerial image. The orientation-aware descriptors are obtained by using a translationally equivariant convolutional ground image encoder and contrastive learning. The Localization Decoder produces a dense probability distribution in a coarse-to-fine manner with a novel Localization Matching Upsampling module. A smaller Orientation Decoder produces a vector field to condition the orientation estimate on the localization. Our method is validated on the VIGOR and KITTI datasets, where it surpasses the state-of-the-art baseline by 72% and 36% in median localization error at comparable orientation estimation accuracy. The predicted probability distribution can represent localization ambiguity, and enables rejecting possibly erroneous predictions. Without re-training, the model can run inference on ground images with different fields of view and can exploit orientation priors when available. On the Oxford RobotCar dataset, our method reliably estimates the ego-vehicle's pose over time, achieving a median localization error under 1 meter and a median orientation error of around 1 degree at 14 FPS.
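The core localization idea in the abstract, matching a ground-image descriptor against descriptors of local regions within the aerial image to obtain a dense probability distribution, can be sketched as follows. This is a minimal illustration, not the authors' architecture: the descriptor dimensions, the 8x8 aerial grid, and the plain inner-product similarity with a spatial softmax are all assumptions made for the example.

```python
import numpy as np

def localization_probability(ground_desc, aerial_desc_map):
    """Match a ground-image descriptor against dense per-cell aerial
    descriptors; a spatial softmax over the similarity scores yields a
    probability distribution over candidate locations.

    ground_desc:     (D,)     descriptor of the ground-level query
    aerial_desc_map: (H, W, D) descriptors of local aerial regions
    """
    # Inner-product similarity between the query and every aerial cell.
    scores = aerial_desc_map @ ground_desc          # shape (H, W)
    # Numerically stable spatial softmax turns scores into probabilities.
    scores = scores - scores.max()
    probs = np.exp(scores)
    probs /= probs.sum()
    return probs                                    # sums to 1 over H*W cells

# Toy usage: plant a known descriptor at one aerial cell and recover it.
rng = np.random.default_rng(0)
aerial = rng.normal(scale=0.1, size=(8, 8, 16))
query = rng.normal(size=16)
aerial[3, 5] = query                                # the matching region
p = localization_probability(query, aerial)
i, j = np.unravel_index(p.argmax(), p.shape)        # peak of the distribution
```

Because the output is a full distribution rather than a single point, a flat or multi-modal `p` can signal localization ambiguity, which is what enables rejecting uncertain predictions as described above.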