Estimating Extreme 3D Image Rotation with Transformer Cross-Attention (2303.02615v2)
Abstract: Estimating large and extreme image rotations plays a key role in multiple computer vision domains in which the image pairs share a limited or non-overlapping field of view. Contemporary approaches apply convolutional neural networks to compute a 4D correlation volume from which the relative rotation between an image pair is estimated. In this work, we propose a cross-attention-based approach that uses CNN feature maps and a Transformer-Encoder to compute the cross-attention between the activation maps of the image pair, which is shown to be an improved equivalent of the 4D correlation volume used in previous works. In the suggested approach, higher attention scores are associated with image regions that encode visual cues of rotation. Our approach is end-to-end trainable and optimizes a simple regression loss. It is experimentally shown to outperform contemporary state-of-the-art schemes on commonly used image rotation datasets and benchmarks, establishing a new state-of-the-art accuracy on them. We make our code publicly available.
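The relation between a 4D correlation volume and cross-attention scores stated in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names, the random stand-in feature maps, and the absence of learned query/key projections are all assumptions made for illustration. Under those assumptions, the attention logits are the correlation volume reshaped to a matrix and scaled by the square root of the channel count, with a softmax applied per query position.

```python
import numpy as np


def correlation_volume(f1, f2):
    """All-pairs dot products between two (H, W, C) feature maps.

    Returns an array of shape (H, W, H, W): entry [i, j, k, l] is the
    dot product of f1[i, j] with f2[k, l].
    """
    return np.einsum("ijc,klc->ijkl", f1, f2)


def cross_attention_scores(f1, f2):
    """Scaled dot-product attention weights, queries from f1 and keys from f2.

    Hypothetical simplification: identity query/key projections, one head.
    Returns an array of shape (H*W, H*W) whose rows sum to 1.
    """
    h, w, c = f1.shape
    q = f1.reshape(h * w, c)
    k = f2.reshape(h * w, c)
    logits = q @ k.T / np.sqrt(c)
    # Subtract the row maximum for numerical stability before exponentiating.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


# Random stand-ins for the CNN activation maps of an image pair (assumed shapes).
rng = np.random.default_rng(0)
f1 = rng.standard_normal((4, 5, 8))
f2 = rng.standard_normal((4, 5, 8))

volume = correlation_volume(f1, f2)      # shape (4, 5, 4, 5)
scores = cross_attention_scores(f1, f2)  # shape (20, 20)
```

In this simplified setting the two quantities carry the same pairwise similarities. A Transformer encoder would additionally apply learned projections, multiple heads, and stacked layers, none of which are modeled here.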