
Masked and Permuted Implicit Context Learning for Scene Text Recognition (2305.16172v2)

Published 25 May 2023 in cs.CV

Abstract: Scene Text Recognition (STR) is difficult because of the variations in text styles, shapes, and backgrounds. Though the integration of linguistic information enhances models' performance, existing methods based on either permuted language modeling (PLM) or masked language modeling (MLM) have their pitfalls. PLM's autoregressive decoding lacks foresight into subsequent characters, while MLM overlooks inter-character dependencies. Addressing these problems, we propose a masked and permuted implicit context learning network for STR, which unifies PLM and MLM within a single decoder, inheriting the advantages of both approaches. We adopt the training procedure of PLM, and to integrate MLM, we incorporate word length information into the decoding process and replace the undetermined characters with mask tokens. In addition, perturbation training is employed to make the model more robust against potential length prediction errors. Empirical evaluations demonstrate the effectiveness of our model: it not only achieves superior performance on the common benchmarks but also obtains a substantial improvement of $9.1\%$ on the more challenging Union14M-Benchmark.
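The core idea described above — decoding characters in a random permutation order (PLM-style) while representing every not-yet-decoded position with a mask token (MLM-style), which requires knowing the word length up front — can be sketched in plain Python. This is a minimal illustration of the decoding schedule only, not the authors' network; the `MASK` token and the `permuted_masked_contexts` helper are assumptions made for demonstration.

```python
import random

MASK = "<m>"  # hypothetical mask token standing in for undetermined characters


def permuted_masked_contexts(word, seed=0):
    """Enumerate the contexts seen when decoding `word` in a random order.

    PLM side: characters are predicted one at a time following a random
    permutation of positions. MLM side: because the word length is known,
    every position not yet decoded is filled with a mask token, so the
    model always conditions on a full-length, length-aware input.
    """
    rng = random.Random(seed)
    order = list(range(len(word)))
    rng.shuffle(order)                      # random decoding permutation
    context = [MASK] * len(word)            # length known up front -> all masks
    steps = []
    for pos in order:
        steps.append((pos, list(context)))  # model would predict word[pos] here
        context[pos] = word[pos]            # reveal the decoded character
    return steps


for pos, ctx in permuted_masked_contexts("text"):
    print(pos, ctx)
```

Averaging the training loss over many sampled permutations is what lets a single decoder inherit both behaviors: the identity permutation recovers left-to-right autoregressive decoding, while later steps of an arbitrary permutation resemble MLM, with most characters visible and a few masked.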

Authors (5)
  1. Xiaomeng Yang (21 papers)
  2. Zhi Qiao (30 papers)
  3. Jin Wei (16 papers)
  4. Dongbao Yang (16 papers)
  5. Yu Zhou (335 papers)
Citations (4)