HAAP: Vision-context Hierarchical Attention Autoregressive with Adaptive Permutation for Scene Text Recognition (2405.09125v1)
Abstract: Internal language model (LM)-based methods use permutation language modeling (PLM) to address the error-correction problems that the conditional independence assumption causes in external LM-based methods. However, the human-designed random permutations cause fitting oscillations during model training, and the Iterative Refinement (IR) operation used to improve multimodal information decoupling introduces additional overhead. To address these issues, this paper proposes the Hierarchical Attention Autoregressive Model with Adaptive Permutation (HAAP), which enhances location-context-image interaction and improves autoregressive generalization with an internal LM. First, we propose Implicit Permutation Neurons (IPN) that generate adaptive attention masks to dynamically exploit token dependencies. The adaptive masks increase the diversity of the training data and prevent the model from depending on a specific token order, reducing the training overhead of PLM while avoiding fitting oscillations. Second, we develop a Cross-modal Hierarchical Attention mechanism (CHA) to couple context and image features. This establishes rich positional-semantic dependencies between context and image while avoiding IR. Extensive experimental results show that the proposed HAAP achieves state-of-the-art (SOTA) performance in terms of accuracy, complexity, and latency on several datasets.
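The abstract names two mechanisms without implementation detail. The minimal PyTorch sketch below illustrates one plausible reading of each: a learned module that emits a per-sample soft attention mask over token positions (standing in for the Implicit Permutation Neurons, in place of explicit random permutation sampling), and a decoder step that attends first over context tokens and then over image features in a single pass (standing in for the Cross-modal Hierarchical Attention, avoiding an iterative-refinement loop). The module names, shapes, and the sigmoid mask relaxation are illustrative assumptions, not the authors' code.

```python
# Minimal, hypothetical sketch of the two ideas named in the abstract.
# NOT the authors' implementation; names, shapes, and the sigmoid mask
# relaxation are illustrative assumptions.
import torch
import torch.nn as nn

class ImplicitPermutationMask(nn.Module):
    """IPN-like module (assumed): predicts a soft, per-sample attention mask
    over token positions instead of sampling explicit random permutations."""
    def __init__(self, d_model: int, max_len: int):
        super().__init__()
        self.score = nn.Linear(d_model, max_len)  # one logit per attended position

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        T = tokens.size(1)
        logits = self.score(tokens)[:, :, :T]     # (B, T, T) mask logits
        return torch.sigmoid(logits)              # soft mask in (0, 1), differentiable

class CrossModalHierarchicalStep(nn.Module):
    """CHA-like decoder step (assumed): masked self-attention over context
    tokens, then cross-attention over image features, in a single pass
    (i.e., no iterative-refinement loop)."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ctx_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.img_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, ctx, img_feats, soft_mask):
        # Additive attention bias: ~0 where the mask keeps a position,
        # strongly negative where it blocks one.
        bias = soft_mask.clamp(min=1e-6).log()                     # (B, T, T)
        bias = bias.repeat_interleave(self.ctx_attn.num_heads, 0)  # (B*H, T, T)
        h, _ = self.ctx_attn(ctx, ctx, ctx, attn_mask=bias)
        h = self.norm1(ctx + h)                        # context level
        v, _ = self.img_attn(h, img_feats, img_feats)
        return self.norm2(h + v)                       # image level

if __name__ == "__main__":
    B, T, P, D = 2, 8, 196, 256                # batch, tokens, patches, width
    ipn = ImplicitPermutationMask(D, max_len=32)
    step = CrossModalHierarchicalStep(D, n_heads=8)
    ctx, img = torch.randn(B, T, D), torch.randn(B, P, D)
    out = step(ctx, img, ipn(ctx))
    print(out.shape)                           # torch.Size([2, 8, 256])
```

In this reading, the learned soft mask plays the role that explicit permutation sampling plays in PLM (each sample effectively trains under a different token-dependency structure), and the single context-then-image attention pass is what lets the decoder couple the two modalities without iterative refinement.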