
VimTS: A Unified Video and Image Text Spotter for Enhancing the Cross-domain Generalization (2404.19652v4)

Published 30 Apr 2024 in cs.CV and cs.AI

Abstract: Text spotting, a task involving the extraction of textual information from image or video sequences, faces challenges in cross-domain adaptation, such as image-to-image and image-to-video generalization. In this paper, we introduce a new method, termed VimTS, which enhances the generalization ability of the model by achieving better synergy among different tasks. Specifically, we propose a Prompt Queries Generation Module and a Tasks-aware Adapter to effectively convert the original single-task model into a multi-task model suitable for both image and video scenarios with minimal additional parameters. The Prompt Queries Generation Module facilitates explicit interaction between different tasks, while the Tasks-aware Adapter helps the model dynamically learn suitable features for each task. Additionally, to further enable the model to learn temporal information at a lower cost, we propose a synthetic video text dataset (VTD-368k) by leveraging the Content Deformation Fields (CoDeF) algorithm. Notably, our method outperforms the state-of-the-art method by an average of 2.6% in six cross-domain benchmarks such as TT-to-IC15, CTW1500-to-TT, and TT-to-CTW1500. For video-level cross-domain adaptation, our method even surpasses the previous end-to-end video spotting method in ICDAR2015 video and DSText v2 by an average of 5.5% on the MOTA metric, using only image-level data. We further demonstrate that existing Large Multimodal Models exhibit limitations in cross-domain scene text spotting, in contrast to our VimTS model which requires significantly fewer parameters and data. The code and datasets will be made available at https://VimTextSpotter.github.io.

Understanding VimTS: Enhancing Cross-Domain Generalization in Text Spotting

Introduction

In the evolving landscape of text spotting technologies, particularly for applications such as automated subtitling, reading road signs, and real-time translation, the challenge of effectively processing text across various domains remains significant. Traditional models often perform well within the domains they are trained on but falter when applied to new, unseen datasets or formats.

VimTS (Video and Image Text Spotter) is a recent approach that addresses these challenges by improving generalization across domains, such as transitioning from static images to dynamic video inputs.

Core Contributions of VimTS

The main advances of VimTS fall into three areas:

  1. Unified Multi-task Architecture: VimTS introduces a sophisticated architecture that integrates detection, recognition, and tracking into a single framework. This unification allows the model to leverage commonalities between these tasks, enhancing performance and efficiency.
  2. Prompt Queries Generation Module (PQGM) and Task-aware Adapter: These components give the model its adaptability, allowing it to switch between tasks such as word-level versus line-level text spotting and to move from static images to videos. The PQGM generates task-specific prompt queries that focus the model on the relevant task, while the Task-aware Adapter selects suitable features for each task with minimal parameter overhead (a minimal sketch of both ideas follows this list).
  3. Synthetic Video Text Dataset (VTD-368k): VimTS incorporates a novel dataset created using a technique called Content Deformation Fields (CoDeF). This dataset is specifically designed to train the model on video data without the extensive costs typically associated with video annotation.
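
To make the two components in item 2 concrete, the following is a minimal PyTorch-style sketch. The class names, dimensions, and wiring are illustrative assumptions rather than the authors' released implementation; the intent is only to show the residual bottleneck-adapter pattern and the prompt-query-prepending pattern that the paper describes.

```python
# Illustrative sketch only; names, shapes, and wiring are assumptions,
# not VimTS's actual code.
import torch
import torch.nn as nn


class TaskAwareAdapter(nn.Module):
    """Lightweight residual bottleneck adapter: each task gets its own
    down/up projection so it can re-weight shared features with few
    additional parameters."""

    def __init__(self, dim: int, bottleneck: int = 64, num_tasks: int = 3):
        super().__init__()
        # One small projection pair per task (e.g. detection, recognition,
        # tracking in the paper's setting).
        self.down = nn.ModuleList([nn.Linear(dim, bottleneck) for _ in range(num_tasks)])
        self.up = nn.ModuleList([nn.Linear(bottleneck, dim) for _ in range(num_tasks)])
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        # Residual form: shared features plus a task-specific correction.
        return x + self.up[task_id](self.act(self.down[task_id](x)))


class PromptQueryGenerator(nn.Module):
    """Produces task-conditioned prompt queries and prepends them to the
    decoder's object queries so that tasks can interact explicitly."""

    def __init__(self, dim: int, num_tasks: int = 3, queries_per_task: int = 10):
        super().__init__()
        self.task_prompts = nn.Embedding(num_tasks * queries_per_task, dim)
        self.mix = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, object_queries: torch.Tensor) -> torch.Tensor:
        # object_queries: (batch, num_queries, dim)
        batch = object_queries.size(0)
        prompts = self.task_prompts.weight.unsqueeze(0).expand(batch, -1, -1)
        # Prompt queries attend to the shared object queries, giving the
        # explicit cross-task interaction described above.
        mixed, _ = self.mix(prompts, object_queries, object_queries)
        return torch.cat([mixed, object_queries], dim=1)


if __name__ == "__main__":
    feats = torch.randn(2, 100, 256)                # (batch, queries, dim)
    queries = PromptQueryGenerator(dim=256)(feats)  # prepend 30 prompt queries
    out = TaskAwareAdapter(dim=256)(queries, task_id=0)
    print(out.shape)                                # torch.Size([2, 130, 256])
```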

Empirical Performance

VimTS has shown remarkable performance improvements over existing state-of-the-art models. Specifically:

  • On static images, it improves the H-mean by an average of 2.6% across six cross-domain benchmarks, including TT-to-IC15, CTW1500-to-TT, and TT-to-CTW1500.
  • For video-level adaptation, it surpasses prior end-to-end video text spotters on ICDAR2015 video and DSText v2 by an average of 5.5% MOTA, despite training only on image-level data (both metrics are recapped below).
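
For context, the two metrics quoted above are standard in text spotting and multi-object tracking evaluation; the definitions below are the conventional ones, not anything specific to VimTS.

```latex
% H-mean: harmonic mean of precision P and recall R of the spotted text.
\[
  \text{H-mean} = \frac{2PR}{P + R}
\]
% MOTA (Multiple Object Tracking Accuracy), summed over frames t, where
% FN, FP, and IDSW count missed detections, false positives, and identity
% switches, and GT counts ground-truth objects.
\[
  \text{MOTA} = 1 - \frac{\sum_t \left(\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t\right)}{\sum_t \mathrm{GT}_t}
\]
```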

These results point to both the model's robustness and its ability to generalize across diverse text spotting scenarios.

Practical Implications

The improvements VimTS brings are beneficial for a range of real-world applications:

  • Automotive and Navigation Systems: Enhanced text spotting can lead to better recognition of road signs and navigation aids in real-time.
  • Surveillance and Security: Accurate text spotting in video feeds can be crucial for security and monitoring applications.
  • Media and Entertainment: From automated subtitling to more immersive augmented reality experiences, VimTS could significantly enhance media consumption technologies.

Future Directions

While VimTS presents a significant step forward, several areas could be explored further:

  • Reduction in Computational Overhead: While the Task-aware Adapter reduces parameter needs, exploring more efficient architectures could further enhance deployment on edge devices.
  • Robustness to Environmental Variants: Text spotting in adverse weather conditions or in poorly lit environments remains challenging and could be an area of future enhancement.

Conclusion

VimTS sets a new benchmark for cross-domain text spotting with its innovative architecture and synthetic training dataset. By effectively bridging the gap between static image and video text spotting, and between different text formats, it opens new avenues for research and application in automated text recognition technologies. As with all AI models, continuous refinement and adaptation will be key to maintaining relevance as new challenges and datasets emerge.

Authors (9)
  1. Yuliang Liu
  2. Mingxin Huang
  3. Hao Yan
  4. Linger Deng
  5. Weijia Wu
  6. Hao Lu
  7. Chunhua Shen
  8. Lianwen Jin
  9. Xiang Bai