DLoRA-TrOCR: Mixed Text Mode Optical Character Recognition Based On Transformer (2404.12734v3)

Published 19 Apr 2024 in cs.CV

Abstract: As Optical Character Recognition (OCR) continues to develop and its application fields expand, text recognition in complex scenes has become a key challenge. Factors such as multiple fonts, mixed scenes and complex layouts seriously degrade the recognition accuracy of traditional OCR models. Although deep-learning-based OCR models have performed well in specific fields or on similar datasets in recent years, their generalization ability and robustness remain a major challenge in complex environments with multiple scenes. Furthermore, training an OCR model from scratch or fine-tuning all of its parameters is very demanding on computing resources and inference time, which limits the flexibility of its application. In response to these challenges, this study focuses on a fundamental aspect of mixed text recognition: effectively fine-tuning a pre-trained base OCR model so that it performs well across various downstream tasks. To this end, we propose a parameter-efficient mixed text recognition method based on a pre-trained OCR Transformer, namely DLoRA-TrOCR. This method embeds DoRA into the image encoder and LoRA into the internal structure of the text decoder, enabling parameter-efficient fine-tuning for downstream tasks. Experiments show that, compared to similar parameter adjustment methods, our model DLoRA-TrOCR has the smallest number of parameters and performs better. It achieves state-of-the-art performance on complex scene datasets involving simultaneous recognition of mixed handwritten, printed and street view texts.
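The two adapter types named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it only shows, for a single frozen weight matrix, how a LoRA update (used in the text decoder) and a DoRA update (used in the image encoder) would form an effective weight, following the formulations in the LoRA and DoRA papers. All function names, the matrix sizes, the rank, and the scaling factor `alpha` are assumptions chosen for illustration, and library implementations may normalize along a different axis.

```python
import math
import random

# Hypothetical helpers for illustration only; plain lists stand in for tensors.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add_scaled(a, b, scale):
    """Return a + scale * b, element-wise."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def column_norms(w):
    """L2 norm of every column of w."""
    return [math.sqrt(sum(v * v for v in col)) for col in zip(*w)]

def lora_weight(w0, b, a, alpha, rank):
    """LoRA: frozen w0 plus a scaled low-rank update B @ A."""
    return add_scaled(w0, matmul(b, a), alpha / rank)

def dora_weight(w0, b, a, magnitude, alpha, rank):
    """DoRA: normalize the LoRA-updated weight column-wise (direction),
    then rescale each column by a learned magnitude."""
    direction = lora_weight(w0, b, a, alpha, rank)
    norms = column_norms(direction)
    return [[m * v / n for v, m, n in zip(row, magnitude, norms)] for row in direction]

def trainable_params(d_out, d_in, rank, dora=False):
    """Adapter parameters for one d_out x d_in layer: B (d_out x rank) and
    A (rank x d_in), plus one magnitude per column for DoRA."""
    count = rank * (d_out + d_in)
    return count + d_in if dora else count

random.seed(0)
d_out, d_in, rank, alpha = 4, 3, 2, 2.0           # assumed toy sizes
w0 = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
a = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(rank)]
b = [[0.0] * rank for _ in range(d_out)]          # B starts at zero
magnitude = column_norms(w0)                      # magnitude starts at ||w0|| per column

w_lora = lora_weight(w0, b, a, alpha, rank)
w_dora = dora_weight(w0, b, a, magnitude, alpha, rank)
```

With `B` initialized to zero, both adapters reproduce the frozen weight exactly at the start of fine-tuning, so training begins from the pre-trained model's behavior. The `trainable_params` helper shows why the approach is parameter-efficient: for an assumed 768 x 768 layer at rank 8, the adapter adds 12,288 parameters (13,056 with the DoRA magnitude vector) instead of updating all 589,824 entries.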

Authors (2)
  1. Da Chang (3 papers)
  2. Yu Li (377 papers)