TextSLAM: Visual SLAM with Semantic Planar Text Features (2305.10029v2)
Abstract: We propose a novel visual SLAM method that tightly integrates text objects by treating them as semantic features, fully exploiting their geometric and semantic priors. Each text object is modeled as a texture-rich planar patch whose semantic meaning is extracted and updated on the fly for better data association. By fully exploiting the locally planar characteristics and semantic meaning of text objects, the SLAM system becomes more accurate and robust even under challenging conditions such as image blur, large viewpoint changes, and significant illumination variations (day and night). We tested our method in various scenes with ground-truth data. The results show that integrating text features yields a superior SLAM system that can match images across day and night. The reconstructed semantic 3D text map could be useful for navigation and scene understanding in robotic and mixed-reality applications. Our project page: https://github.com/SJTU-ViSYS/TextSLAM .
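The abstract's modeling of a text object as a planar patch rests on a standard multi-view relation: pixels lying on a 3D plane transfer between two calibrated views through a plane-induced homography, H = K (R + t nᵀ / d) K⁻¹. Below is a minimal NumPy sketch of that relation, not the paper's actual parameterization; the intrinsics K, relative pose (R, t), and plane (n, d) are illustrative values chosen for the example.

```python
import numpy as np

def plane_induced_homography(K, R, t, n, d):
    """Homography transferring pixels on the 3-D plane n.X = d
    (expressed in the reference camera frame) into a second view
    with relative pose (R, t): H = K (R + t n^T / d) K^{-1}."""
    return K @ (R + np.outer(t, n) / d) @ np.linalg.inv(K)

def warp_patch_points(H, pts):
    """Warp N x 2 pixel coordinates with homography H."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # to homogeneous
    warped = (H @ pts_h.T).T
    return warped[:, :2] / warped[:, 2:3]             # dehomogenize

# Toy example: a fronto-parallel text plane 2 m in front of camera 1,
# observed again after the camera translates 0.1 m to the right
# (so points move by t = -0.1 m along x in the camera-2 frame).
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)                    # no rotation between the two views
t = np.array([-0.1, 0.0, 0.0])   # translation, camera-2 frame
n = np.array([0.0, 0.0, 1.0])    # plane normal, facing the camera
d = 2.0                          # plane distance in metres

H = plane_induced_homography(K, R, t, n, d)
corners = np.array([[300.0, 220.0], [340.0, 220.0],
                    [340.0, 260.0], [300.0, 260.0]])
print(warp_patch_points(H, corners))  # text-box corners in view 2
```

In a pipeline like the one the abstract describes, such a warp lets an entire text patch be predicted and photometrically aligned across views once its plane parameters are estimated, rather than tracking isolated corner features, which is what makes the patch robust to blur and viewpoint change.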
Authors: Boying Li, Danping Zou, Yuan Huang, Xinghan Niu, Ling Pei, Wenxian Yu