Visual Text Processing: A Comprehensive Review and Unified Evaluation (2504.21682v2)

Published 30 Apr 2025 in cs.CV

Abstract: Visual text is a crucial component in both document and scene images, conveying rich semantic information and attracting significant attention in the computer vision community. Beyond traditional tasks such as text detection and recognition, visual text processing has witnessed rapid advancements driven by the emergence of foundation models, including text image reconstruction and text image manipulation. Despite significant progress, challenges remain due to the unique properties that differentiate text from general objects. Effectively capturing and leveraging these distinct textual characteristics is essential for developing robust visual text processing models. In this survey, we present a comprehensive, multi-perspective analysis of recent advancements in visual text processing, focusing on two key questions: (1) What textual features are most suitable for different visual text processing tasks? (2) How can these distinctive text features be effectively incorporated into processing frameworks? Furthermore, we introduce VTPBench, a new benchmark that encompasses a broad range of visual text processing datasets. Leveraging the advanced visual quality assessment capabilities of multimodal LLMs (MLLMs), we propose VTPScore, a novel evaluation metric designed to ensure fair and reliable evaluation. Our empirical study with more than 20 specific models reveals substantial room for improvement in the current techniques. Our aim is to establish this work as a fundamental resource that fosters future exploration and innovation in the dynamic field of visual text processing. The relevant repository is available at https://github.com/shuyansy/Visual-Text-Processing-survey.

Visual text, prevalent in both document and scene images, carries significant semantic information vital for applications like image retrieval, assistive technologies, and document AI. Traditional research focused on text spotting (detection and recognition), but recent advancements, particularly driven by foundation models, have expanded the field to visual text processing, encompassing tasks like text image reconstruction and manipulation. This paper (Shu et al., 30 Apr 2025) provides a comprehensive review of this dynamic field, addressing two key questions: what textual features are suitable for different tasks, and how are these features incorporated into processing frameworks? The authors also introduce VTPBench, a new benchmark, and VTPScore, an evaluation metric based on multimodal LLMs (MLLMs), to provide a unified evaluation framework.

Visual text processing tasks are broadly categorized based on their output:

  1. Text Image Reconstruction: Aims to restore or enhance the quality of low-fidelity text images. The output Y maintains semantic consistency with the input X but with a refined pixel distribution. Tasks include:
    • Text Image Super-resolution: Reconstructs high-resolution text images from low-resolution ones. Unlike general image SR, it is foreground-centric, prioritizing text clarity and semantic integrity, especially for complex characters.
    • Document Image Dewarping (DID): Converts distorted document images into flat versions by learning coordinate mappings. Distortions from camera angles or paper deformations impair readability. Challenges include reliance on synthetic data for ground truth and handling diverse deformations. A sketch of applying such a coordinate mapping is shown after this list.
    • Text Image Enhancement (TIE): Mitigates negative effects like shadows, stains, blur, and uneven illumination. Requires preserving text structure and content integrity. Sub-categories include illumination removal and impurity removal. Whether a single model can handle various degradations remains an open question.
  2. Text Image Manipulation: Involves modifying visual text while preserving or generating visual consistency. The output Y either maintains consistency with X (removal/editing) or complies with input conditions (generation), with text content eliminated, modified, or appended. Tasks include:
    • Scene Text Removal (STR): Deletes text from images and inpaints the background. Essential for privacy protection. Consists of text localization and background reconstruction. Auxiliary removal methods using text masks generally outperform direct methods.
    • Scene Text Editing: Modifies text attributes (style) or content while ensuring seamless integration. Can be style editing (altering appearance, color, background) or content editing (altering text while preserving style). Recent works attempt unified style and content editing.
    • Scene Text Generation: Synthesizes text images with diverse appearances. Crucial for generating training data and applications like design. Must account for rendering fidelity and overall image quality.
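
To make the coordinate-mapping formulation of document image dewarping (DID, above) concrete, here is a minimal sketch of the rectification step: given a backward map that tells each output pixel where to sample in the distorted photo, the image is resampled with OpenCV's `cv2.remap`. The map here is a dummy identity grid and the file paths are placeholders; in practice the map would come from a learned dewarping model.

```python
import cv2
import numpy as np

def dewarp(distorted: np.ndarray, backward_map: np.ndarray) -> np.ndarray:
    """Rectify a distorted document image with a backward coordinate map.

    backward_map has shape (H_out, W_out, 2): backward_map[i, j] is the (x, y)
    location in `distorted` to sample for output pixel (j, i). In a real
    pipeline this map is predicted by a dewarping network.
    """
    map_x = backward_map[..., 0].astype(np.float32)
    map_y = backward_map[..., 1].astype(np.float32)
    return cv2.remap(distorted, map_x, map_y, interpolation=cv2.INTER_LINEAR)

# Toy usage: an identity map leaves the image unchanged; a learned map would
# instead undo the paper deformation. "distorted_page.jpg" is a placeholder.
img = cv2.imread("distorted_page.jpg")
h, w = img.shape[:2]
xs, ys = np.meshgrid(np.arange(w), np.arange(h))
identity_map = np.stack([xs, ys], axis=-1).astype(np.float32)
cv2.imwrite("dewarped_page.jpg", dewarp(img, identity_map))
```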

The paper also covers related areas like Scene Text Segmentation (pixel-level mask localization for fine-grained processing) and Editing Detection (identifying tampered text, more challenging for subtle document edits).

Key textual features utilized across tasks include:

  • Structure: Layout, orientation, text lines, boundaries, 3D information, used significantly in Document Image Dewarping and Scene Text Generation (for geometry and layout control).
  • Stroke: Character glyphs and fine-grained details, crucial for Text Image Super-resolution, Text Image Enhancement, Scene Text Removal, and Scene Text Editing (for rendering guidance).
  • Semantics: Language information, used in Text Image Super-resolution (prior guidance, recognition supervision), Text Image Enhancement (recognition supervision), and Scene Text Editing (recognition supervision).
  • Style: Color, font, texture, used predominantly in Scene Text Editing and Scene Text Generation for appearance transfer and realistic synthesis.

Different learning paradigms are employed to incorporate these features:

  • Prior Guidance/Supervision: Using pre-trained models or auxiliary losses (e.g., recognition loss, stroke-focus loss, edge loss) to guide the network towards textual content (Text Image SR, TIE, STE); a minimal loss sketch follows this list.
  • Two-stage vs. End-to-end Learning: Breaking down complex tasks like DID into explicit feature extraction (e.g., boundary/text line segmentation) followed by dewarping (two-stage) or training a single network to predict the mapping directly (end-to-end).
  • Knowledge Transfer, Multi-task Learning, Progressive Learning: Strategies for leveraging text stroke information in STR by transferring knowledge from detection/segmentation models, training joint networks, or iteratively refining results.
  • Explicit vs. Implicit Transfer, Inpainting: Different ways to handle text style in STE, from explicitly transferring style via conversion modules to implicitly learning styles in latent space or leveraging diffusion models for inpainting.
  • Template/Prompt Representation: Using text image templates or fine-grained text embeddings as conditions in generative models for STE and STG.
  • Adversarial Learning: Using GANs for stylistic transfer in STG to mimic real-world text appearances.
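
To illustrate the prior guidance/supervision paradigm referenced in the first bullet above, the sketch below pairs a pixel reconstruction loss with a recognition loss from a frozen text recognizer, roughly the recipe used in recognition-supervised text image super-resolution. The recognizer interface, the distillation-style formulation, and the weight `w_rec` are assumptions for illustration, not a specific published method.

```python
import torch
import torch.nn.functional as F

def text_sr_loss(sr_img, hr_img, recognizer, w_rec=0.1):
    """Reconstruction loss plus recognition supervision for text image SR.

    sr_img, hr_img: (B, C, H, W) super-resolved output and ground-truth image.
    recognizer:     a frozen text recognizer returning per-timestep logits
                    of shape (B, T, num_classes); its interface is assumed.
    """
    # Content fidelity: match the high-resolution target pixel-wise.
    l_pix = F.l1_loss(sr_img, hr_img)

    # Recognition prior: the SR output should yield the same recognizer
    # predictions as the HR image, pushing the network to restore legible
    # character strokes rather than generic textures.
    with torch.no_grad():
        target = recognizer(hr_img).softmax(dim=-1)
    pred = recognizer(sr_img).log_softmax(dim=-1)
    l_rec = F.kl_div(pred, target, reduction="batchmean")

    return l_pix + w_rec * l_rec
```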

Evaluating visual text processing models is challenging due to task diversity and inconsistent benchmarks. The paper introduces VTPBench (Shu et al., 30 Apr 2025), a multi-task benchmark with 4,305 samples across six tasks, compiled from existing datasets (TextZoom, Real-CE, DocUNet, DIR300, DocReal, UVDoc, various TIE datasets, SCUT-Syn, SCUT-EnsText, PosterErase, Flickr-ST, Tamper, ScenePair, MARIO-Eval, DrawTextExt, AnyText, VisualParagraphy). To provide unified evaluation, the authors propose VTPScore (Shu et al., 30 Apr 2025), an MLLM-based metric (using GPT-4o) that assesses both visual quality and text readability using task-specific prompts and structured JSON output. An empirical study of more than 20 models on VTPBench and existing benchmarks reveals substantial room for improvement in current techniques and demonstrates high consistency between VTPScore and human evaluation.
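
As a sketch of how an MLLM-based metric like VTPScore can be queried in practice, the snippet below sends one processed image to GPT-4o with a task-specific prompt and asks for structured JSON scores. The prompt wording, the 0-10 scale, and the key names are illustrative assumptions; the paper's actual prompts and scoring protocol may differ.

```python
import base64
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Illustrative, task-specific prompt for scene text removal (not the paper's
# exact wording).
PROMPT = (
    "You are judging a scene-text-removal result. Rate the image from 0 to 10 "
    "on 'visual_quality' (plausible, artifact-free background inpainting) and "
    "'text_readability' (how legible any residual text is). "
    "Reply with a JSON object containing exactly these two keys."
)

def vtp_style_score(image_path: str) -> dict:
    """Score one processed image with an MLLM and parse the JSON verdict."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return json.loads(resp.choices[0].message.content)

# Example return value: {"visual_quality": 8, "text_readability": 9}
```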

Despite significant progress, several open challenges remain:

  • Training Data: Scarcity of high-quality, labeled real-world data, especially paired data and natural scene images for some tasks. The trade-off between dataset quantity and quality needs further exploration, along with advancements in self/semi-supervised learning and domain adaptation. The development of versatile, human-aligned evaluation metrics is also crucial.
  • Efficiency and Complexity: Many models, particularly those based on Transformers and Diffusion Models, suffer from high computational complexity and slow inference speeds, limiting practical deployment. Future work should focus on developing more streamlined architectures, exploring techniques like model distillation, and emphasizing end-to-end designs.
  • Extension to Videos: Processing visual text in videos presents challenges due to data annotation complexity (motion, temporal dependencies) and the need for sophisticated architectures capable of handling high-dimensional spatio-temporal data effectively.
  • Unified Framework: Current methods are often task-specific, while real-world applications require multi-faceted capabilities (removal, editing, generation) and understanding of both text and general objects. Developing a cohesive, adaptable multi-task framework, potentially leveraging MLLMs, is a promising direction.
  • MLLMs-based System: While MLLMs show potential for visual text processing, adapting language-centric architectures for vision-centric tasks, ensuring high fidelity for text-rich images, and scaling training data for MLLM fine-tuning are open research avenues.

In conclusion, this paper provides a valuable and timely review of the visual text processing field, outlining its evolution, key techniques, challenges, and future directions. The proposed VTPBench and VTPScore contribute to standardizing evaluation, fostering more comparable and reliable research progress.

Authors (12)
  1. Yan Shu (25 papers)
  2. Weichao Zeng (7 papers)
  3. Fangmin Zhao (2 papers)
  4. Zeyu Chen (48 papers)
  5. Zhenhang Li (6 papers)
  6. Xiaomeng Yang (21 papers)
  7. Yu Zhou (335 papers)
  8. Paolo Rota (29 papers)
  9. Xiang Bai (222 papers)
  10. Lianwen Jin (116 papers)
  11. Xu-Cheng Yin (35 papers)
  12. Nicu Sebe (270 papers)