EffLoc: Lightweight Vision Transformer for Efficient 6-DOF Camera Relocalization (2402.13537v1)
Abstract: Camera relocalization is pivotal in computer vision, with applications in AR, drones, robotics, and autonomous driving. It estimates the 3D camera position and orientation (6-DoF) from images. Unlike traditional methods such as SLAM, recent approaches use deep learning for direct, end-to-end pose estimation. We propose EffLoc, a novel efficient Vision Transformer for single-image camera relocalization. EffLoc's hierarchical layout, memory-bound self-attention, and feed-forward layers boost memory efficiency and inter-channel communication. Our sequential group attention (SGA) module improves computational efficiency by diversifying input features, reducing redundancy, and expanding model capacity. EffLoc excels in both efficiency and accuracy, outperforming prior methods such as AtLoc and MapNet. It thrives in large-scale outdoor car-driving scenarios while remaining simple, end-to-end trainable, and free of handcrafted loss functions.
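
The abstract does not spell out how the sequential group attention (SGA) module is wired, so the following is only a minimal PyTorch sketch of one plausible reading: the channel dimension is split into groups, each group runs its own single-head attention, and each group's output is added to the next group's input so later groups see progressively refined features, which is one way to "diversify input features" and reduce redundancy. The class name `SequentialGroupAttention` and the parameters `num_groups` and `key_dim` are illustrative assumptions, not taken from the paper's code.

```python
import torch
import torch.nn as nn


class SequentialGroupAttention(nn.Module):
    """Minimal sketch of a sequential group attention block.

    Assumption: input channels are split into `num_groups` chunks; each chunk
    is processed by its own single-head attention, and the output of group i
    is added to the input of group i+1. This follows the abstract's
    description only loosely; the actual EffLoc implementation may differ.
    """

    def __init__(self, dim: int, num_groups: int = 4, key_dim: int = 16):
        super().__init__()
        assert dim % num_groups == 0, "dim must be divisible by num_groups"
        self.num_groups = num_groups
        self.group_dim = dim // num_groups
        self.key_dim = key_dim
        self.scale = key_dim ** -0.5
        # One q/k/v projection per group (hypothetical parameterization).
        self.qkvs = nn.ModuleList([
            nn.Linear(self.group_dim, 2 * key_dim + self.group_dim)
            for _ in range(num_groups)
        ])
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) -> split channels into per-group chunks
        chunks = x.chunk(self.num_groups, dim=-1)
        outputs = []
        feat = chunks[0]
        for i, qkv in enumerate(self.qkvs):
            if i > 0:
                # Feed the previous group's output into the next group's input.
                feat = chunks[i] + outputs[-1]
            q, k, v = qkv(feat).split(
                [self.key_dim, self.key_dim, self.group_dim], dim=-1)
            attn = (q @ k.transpose(-2, -1)) * self.scale
            attn = attn.softmax(dim=-1)
            outputs.append(attn @ v)
        # Re-assemble the groups and mix channels with a final projection.
        return self.proj(torch.cat(outputs, dim=-1))


if __name__ == "__main__":
    block = SequentialGroupAttention(dim=128, num_groups=4)
    tokens = torch.randn(2, 196, 128)   # e.g. 14x14 patch tokens
    print(block(tokens).shape)          # torch.Size([2, 196, 128])
```

Under this reading, each attention map is computed over `group_dim` channels rather than the full `dim`, which is where the memory and compute savings claimed in the abstract would plausibly come from.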