Overview of the Visual Language Modeling Network for Scene Text Recognition
The paper presents a novel approach to scene text recognition (STR) using a single integrated model, which the authors term the Visual Language Modeling Network (VisionLAN). Unlike conventional approaches that treat visual and linguistic information as separate entities, VisionLAN integrates the two in a unified framework, giving the visual model an intrinsic ability to process linguistic cues directly. As a result, the method removes the need for a separate language model, reducing computational overhead and improving inference speed.
VisionLAN Architecture
VisionLAN’s architecture consists principally of three components: the backbone network, the Masked Language-aware Module (MLM), and the Visual Reasoning Module (VRM). Each plays a pivotal role in unifying visual-linguistic processing; a sketch of how the pieces fit together follows the list.
- Backbone Network: Responsible for extracting visual features from input images, providing a foundation for subsequent reasoning stages.
- Masked Language-aware Module (MLM): Introduces linguistic information into the visual pipeline by generating character-wise occluded feature maps during training. This simulates situations where visual cues are weak or ambiguous due to occlusion or noise, compelling the model to rely on linguistic context for accurate recognition.
- Visual Reasoning Module (VRM): Combines visual and linguistic information through transformer-based reasoning that models long-range dependencies, supplementing visual features with linguistic context without an additional language model.
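To make the three-component design concrete, the following is a minimal PyTorch-style sketch of how a backbone, an MLM-style occlusion module, and a transformer-based VRM could be wired together. All module structures, layer sizes, and names here are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class Backbone(nn.Module):
    """Extracts 2-D visual feature maps from the input image."""

    def __init__(self, out_channels: int = 512):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, out_channels, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, images):
        return self.cnn(images)  # (B, C, H, W)


class MaskedLanguageAwareModule(nn.Module):
    """Predicts a character-wise occlusion mask for a chosen character
    position and suppresses that character's visual cues (illustrative)."""

    def __init__(self, channels: int = 512, max_len: int = 25):
        super().__init__()
        self.pos_embed = nn.Embedding(max_len, channels)
        self.attn = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feats, char_idx):
        pos = self.pos_embed(char_idx)[:, :, None, None]     # (B, C, 1, 1)
        mask = torch.sigmoid(self.attn(feats * pos))          # (B, 1, H, W)
        return feats * (1.0 - mask)                           # occluded feature map


class VisualReasoningModule(nn.Module):
    """Transformer-based reasoning over (possibly occluded) visual features,
    followed by per-position character classification."""

    def __init__(self, channels: int = 512, num_classes: int = 37, max_len: int = 25):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=3)
        self.query = nn.Parameter(torch.randn(max_len, channels))  # one query per character slot
        self.cls = nn.Linear(channels, num_classes)

    def forward(self, feats):
        tokens = feats.flatten(2).transpose(1, 2)                        # (B, H*W, C)
        ctx = self.encoder(tokens)                                       # long-range context modeling
        attn = torch.softmax(self.query @ ctx.transpose(1, 2), dim=-1)   # (B, T, H*W)
        char_feats = attn @ ctx                                          # (B, T, C)
        return self.cls(char_feats)                                      # (B, T, num_classes)


class VisionLAN(nn.Module):
    """End-to-end recognizer: backbone -> (MLM during training) -> VRM."""

    def __init__(self):
        super().__init__()
        self.backbone = Backbone()
        self.mlm = MaskedLanguageAwareModule()
        self.vrm = VisualReasoningModule()

    def forward(self, images, char_idx=None):
        feats = self.backbone(images)
        if self.training and char_idx is not None:
            feats = self.mlm(feats, char_idx)  # weaken one character's visual cues
        return self.vrm(feats)                 # per-position character logits
```

Note that the MLM is active only during training; at inference time the occlusion step is skipped, so the recognizer incurs no extra cost for language modeling.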
Methodology and Results
A distinctive aspect of VisionLAN's training is weakly supervised complementary learning, which requires only word-level annotations. Guided by these annotations, the MLM occludes the visual cues of individual characters, pushing the network to strengthen its linguistic reasoning. The VRM then learns to exploit dependencies among characters, inferring the occluded portion of the text from context in a manner that loosely mirrors human reading.
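Building on the sketch above, a single training step under this weakly supervised scheme might look as follows: one character position is sampled per word, its visual cues are occluded via the MLM, and the VRM must still predict the full word, including the hidden character. This is a simplified single-branch view; the paper's full scheme also involves additional supervision on the occluded character, omitted here. The `encode_label` helper, the character set, and the sampling strategy are illustrative assumptions rather than the paper's exact procedure.

```python
import random

import torch
import torch.nn.functional as F

# Illustrative 36-character vocabulary; index 0 is reserved for padding.
CHARSET = "0123456789abcdefghijklmnopqrstuvwxyz"


def encode_label(words, max_len=25, pad_idx=0):
    """Assumed helper: map alphanumeric word strings to padded index tensors."""
    out = torch.full((len(words), max_len), pad_idx, dtype=torch.long)
    for i, w in enumerate(words):
        for j, ch in enumerate(w[:max_len].lower()):
            out[i, j] = CHARSET.index(ch) + 1
    return out


def train_step(model, images, labels, optimizer, max_len=25, pad_idx=0):
    """One optimization step given word-level labels only (illustrative)."""
    model.train()
    # Sample one character position per word whose visual cues will be occluded.
    char_idx = torch.tensor(
        [random.randrange(min(len(w), max_len)) for w in labels],
        device=images.device,
    )
    targets = encode_label(labels, max_len, pad_idx).to(images.device)  # (B, T)

    logits = model(images, char_idx=char_idx)      # (B, T, num_classes)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_idx,                       # skip padded positions
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```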
Experimentally, VisionLAN achieves state-of-the-art accuracy across standard benchmarks, including IIIT5K, ICDAR2013, SVT, ICDAR2015, SVTP, and CUTE80. It shows particular gains on images with confusing or degraded visual cues, demonstrating robust handling of difficult image-text scenarios.
Practical and Theoretical Implications
VisionLAN represents a significant advance in scene text recognition. By harnessing visual and linguistic information within a single network, it handles occluded and noisy inputs effectively, extending its applicability to real-world scenarios where traditional STR systems falter due to high computational cost or an inability to integrate linguistic cues dynamically.
The theoretical implications of VisionLAN suggest potential pathways for future AI research, particularly in integrating heterogeneous information types directly within neural networks. This approach could be extended beyond text recognition to more complex multimedia interpretations or other domains where cross-modal learning is required.
Future Directions
With the introduction of VisionLAN, several avenues for future exploration arise. Incorporating this unified approach into real-time applications like autonomous driving, augmented reality, or assistive technologies for visually impaired individuals may well be on the horizon. Furthermore, expanding the framework to process longer sequences or incorporating broader contextual cues from multi-language datasets could enhance its generalization capabilities, allowing for greater adaptability in various contexts.
Overall, VisionLAN represents a pragmatic step forward in AI's ability to embed and utilize the complex contextual relationships inherent in visual and linguistic data, underlining a forward-looking direction for STR and for other fields where integrated information processing is key.