ABINet++: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Spotting (2211.10578v2)

Published 19 Nov 2022 in cs.CV

Abstract: Scene text spotting is of great importance to the computer vision community due to its wide variety of applications. Recent methods attempt to introduce linguistic knowledge for challenging recognition rather than pure visual classification. However, how to effectively model linguistic rules in end-to-end deep networks remains a research challenge. In this paper, we argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) a language model with noisy input. Correspondingly, we propose the autonomous, bidirectional and iterative ABINet++ for scene text spotting. Firstly, the autonomous principle enforces explicit language modeling by decoupling the recognizer into a vision model and a language model and blocking gradient flow between the two. Secondly, a novel bidirectional cloze network (BCN), built on bidirectional feature representation, is proposed as the language model. Thirdly, we propose an iterative-correction execution manner for the language model, which effectively alleviates the impact of noisy input. Finally, to polish ABINet++ for long text recognition, we propose to aggregate horizontal features by embedding Transformer units inside a U-Net, and design a position and content attention module that integrates character order and content to attend to character features precisely. ABINet++ achieves state-of-the-art performance on both scene text recognition and scene text spotting benchmarks, consistently demonstrating the superiority of our method in various environments, especially on low-quality images. Besides, extensive experiments in both English and Chinese prove that a text spotter incorporating our language modeling method can significantly improve both accuracy and speed compared with commonly used attention-based recognizers.
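The three principles named in the abstract lend themselves to a compact illustration. Below is a minimal PyTorch sketch, not the paper's actual code: `cloze_attention_mask` shows the bidirectional cloze idea behind BCN (each position attends to its full context but never to itself), and `AutonomousRecognizer` shows gradient blocking between the vision and language models together with the iterative-correction loop. `vision_model` and `language_model` are hypothetical stand-in modules assumed to map to per-character class probabilities.

```python
import torch
import torch.nn as nn


def cloze_attention_mask(seq_len: int) -> torch.Tensor:
    """Cloze-style attention mask (BCN idea, sketched): position i may
    attend to every position except itself, so each character is
    predicted from bidirectional context without seeing its own input."""
    mask = torch.zeros(seq_len, seq_len)
    mask.fill_diagonal_(float("-inf"))  # block self-attention
    return mask


class AutonomousRecognizer(nn.Module):
    """Sketch of the autonomous + iterative principles. The two
    sub-models here are hypothetical placeholders, not the paper's
    modules."""

    def __init__(self, vision_model: nn.Module, language_model: nn.Module,
                 num_iters: int = 3):
        super().__init__()
        self.vision_model = vision_model
        self.language_model = language_model
        self.num_iters = num_iters  # iterative-correction rounds

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # Vision model produces an initial, possibly noisy prediction.
        probs = self.vision_model(image)  # (B, T, num_classes)
        for _ in range(self.num_iters):
            # Autonomous: detach() blocks gradient flow between the two
            # sub-models, so the language model must learn linguistic
            # rules explicitly rather than co-adapt with visual features.
            # Iterative correction: feeding the refined prediction back
            # in progressively cleans up the noisy input.
            probs = self.language_model(probs.detach())
        return probs
```

Gradient blocking via `detach()` is the key design choice here: it turns the language model's input into plain data rather than a differentiable feature, which is what makes the language modeling "explicit" in the paper's terminology.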

Authors (6)
  1. Shancheng Fang (11 papers)
  2. Zhendong Mao (55 papers)
  3. Hongtao Xie (48 papers)
  4. Yuxin Wang (132 papers)
  5. Chenggang Yan (54 papers)
  6. Yongdong Zhang (119 papers)
Citations (48)