Text-Rich Visual Question Answering
- Text-rich VQA is a cross-modal task that integrates OCR-derived scene text with visual and linguistic data to answer natural language queries.
- It employs modular architectures such as dual-branch networks and multimodal transformers to enhance the fusion of visual and text cues.
- Research focuses on improving OCR accuracy, multi-hop reasoning, and data augmentation to address challenges like text noise and multilingual scripts.
Text-rich Visual Question Answering (VQA) is a cross-modal reasoning task that requires both machine reading of textual content embedded within images (via scene text or documents) and integration of this text with visual and linguistic cues to answer natural language queries. Unlike conventional VQA, text-rich VQA tasks—also termed Text-VQA or Scene Text VQA—demand precise handling of optical character recognition (OCR) outputs, explicit modeling of scene-text–question–vision relationships, and the ability to operate over open or extremely large answer vocabularies. This paradigm has catalyzed research into new multimodal fusion architectures, data augmentation strategies, cross-modal alignment, and instruction-tuned large language and multimodal models.
1. Task Definition and Foundational Datasets
Text-rich VQA formally requires models to answer open-ended or extraction-style questions about visual-text content, most typically in natural images, documentary scans, or visually complex web layouts. Key datasets include:
- ST-VQA ("Scene Text Visual Question Answering") (Biten et al., 2019): 23,038 images and 31,791 QAs, requiring answers that can only be obtained by reading image text. Its three task variants—a per-image contextualized lexicon, a global lexicon, and a fully open-vocabulary generation mode—test the ability to ground answers in scene text, absent language-only shortcuts.
- TextVQA (a standard benchmark in Liu et al., 2023; Lu et al., 2021; Wang et al., 2022): Natural images with up to 50 OCR tokens per image, combining object detection and open-vocabulary answer targets.
- Document VQA (DocVQA, OCRVQA): Focus on documents, scanned forms, and tables; emphasize layout and text-structure reasoning (Liu et al., 2023).
- MTVQA ("Benchmarking Multilingual Text-Centric Visual Question Answering") (Tang et al., 2024): 8,794 images, 28,607 QAs, spanning nine languages/scripts, introducing cross-lingual and non-Latin OCR challenges.
Evaluation metrics are dataset-specific, ranging from exact-match accuracy to Average Normalized Levenshtein Similarity (ANLS) (Biten et al., 2019), as well as semantic scoring using LLM-generated judgments (Vu et al., 16 Jul 2025).
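ANLS scores a prediction by its normalized edit distance to the closest ground-truth answer, zeroing out scores past a threshold. A minimal sketch, assuming the usual threshold tau = 0.5 from the ST-VQA evaluation (the lowercasing/stripping normalization here is a common convention, not mandated by the metric):

```python
def levenshtein(a, b):
    """Edit distance via the standard two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls(predictions, gold_answer_lists, tau=0.5):
    """Average Normalized Levenshtein Similarity over a set of questions.
    Each question may have several acceptable gold answers; the best
    per-question score is kept, and scores with NL >= tau are zeroed."""
    total = 0.0
    for pred, golds in zip(predictions, gold_answer_lists):
        best = 0.0
        for gold in golds:
            p, g = pred.strip().lower(), gold.strip().lower()
            nl = levenshtein(p, g) / max(len(p), len(g), 1)
            if nl < tau:
                best = max(best, 1.0 - nl)
        total += best
    return total / len(predictions)
```

An exact match scores 1.0; a one-character OCR slip in a four-letter answer still earns 0.75, which is precisely why ANLS is preferred over exact match for noisy scene text.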
2. Core Technical Approaches and Architectures
2.1 Modular and Dual-Branch Models
Pipelines typically disentangle visual text reading (OCR) and linguistic reasoning, with many architectures emphasizing modularity:
- Training-Free OCR+LLM/MLLM Pipelines (Liu et al., 2023): An external OCR model (e.g., PaddleOCR) produces token/layout sequences, which are formatted (including few-shot/in-context exemplars) and ingested by an LLM (e.g., Vicuna) or a Multimodal LLM (MLLM). This strategy is entirely training-free, leveraging prompt engineering to operationalize language-module capabilities for downstream reasoning.
- Text-Aware Dual Routing Network (TDR) (Jiang et al., 2022): Employs a two-branch architecture—one for conventional VQA classification over frequent answers, another a pointer network for assembling arbitrary OCR token sequences—under control of a learned gating network to route questions based on predicted text reliance.
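The training-free OCR+LLM pipeline above reduces to formatting recognized text and exemplars into a prompt for a frozen LLM. A minimal sketch (the template wording and function name are illustrative, not the exact format from Liu et al., 2023):

```python
def build_prompt(question, ocr_tokens, exemplars=()):
    """Format OCR output plus optional few-shot exemplars into a text
    prompt for a frozen LLM. Each exemplar is (ocr_tokens, question,
    answer); the model is expected to complete after the final 'Answer:'."""
    lines = []
    for ex_ocr, ex_q, ex_a in exemplars:
        lines.append(f"OCR text: {', '.join(ex_ocr)}")
        lines.append(f"Question: {ex_q}")
        lines.append(f"Answer: {ex_a}")
        lines.append("")
    lines.append(f"OCR text: {', '.join(ocr_tokens)}")
    lines.append(f"Question: {question}")
    lines.append("Answer:")
    return "\n".join(lines)

prompt = build_prompt(
    "What does the sign say?",
    ["STOP", "4-WAY"],
    exemplars=[(["SALE", "50%"], "What discount is advertised?", "50%")],
)
```

Because all task adaptation lives in this string, swapping OCR engines or LLMs requires no retraining, only re-prompting.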
2.2 Multi-modal Fusion and Scene-Text Alignment
- Multimodal Transformer Backbones (e.g., M4C, TAP) (Hegde et al., 2023, Lu et al., 2021): Combine embeddings of OCR tokens, object features (from Faster R-CNN or equivalents), and question tokens, feeding into deep self-attention stacks for cross-modal co-representation.
- Cross-media Reasoning and Entity Alignment (KECMRN, VTQA) (Chen et al., 2023): Explicit, iterative cross-modal fusion, including key-entity extraction, self-attention over modalities, and pointer mechanisms for answer generation, designed for multi-hop grounding across text and image.
- Relational Attention Mechanisms (RUArt) (Jin et al., 2020): Incorporate semantic and positional attention to model relationships between OCR tokens and scene objects, facilitating answer formulation via semantic matching or external reasoning.
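At the core of these transformer backbones is self-attention over one joint sequence of question, object, and OCR embeddings, as in M4C. A toy single-head version, without the learned query/key/value projections or multi-head structure of a real model:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention. `tokens` is the
    joint sequence of equal-length embedding vectors (question words,
    object features, OCR tokens concatenated), each a list of floats."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        # similarity of this token to every token in the sequence
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        w = softmax(scores)
        # output is the attention-weighted mixture of all embeddings
        out.append([sum(wj * v[i] for wj, v in zip(w, tokens))
                    for i in range(d)])
    return out
```

Every output position mixes information from all three modalities, which is what lets a question token attend directly to the OCR token that answers it.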
2.3 Data Augmentation and Question Generation
To address the sparsity and limited exploration of scene text:
- TAG (Text-aware visual question-answer Generation) (Wang et al., 2022): Augments training data by generating new QA pairs using a multimodal pointer-generator transformer conditioned on unexploited OCR tokens as target answers, substantially improving model generalization and scene understanding upon retraining.
- Dataset Union and Mixed Training (Hegde et al., 2023): Merging TextVQA, ST-VQA, and VQA datasets (filtered for images containing scene text) enhances the interplay between visual and text cues, mitigating answer biases inherent to text-only supervision.
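TAG's first step is identifying OCR tokens that existing annotations leave unexploited; those become target answers for generated questions. A sketch of that selection step only (the helper name is hypothetical, and the actual question generation uses a multimodal pointer-generator transformer, not shown):

```python
def propose_augmentation_targets(ocr_tokens, existing_answers):
    """Return OCR tokens that no existing QA pair already uses as an
    answer; TAG-style augmentation conditions its question generator
    on these tokens as target answers."""
    used = {a.strip().lower() for a in existing_answers}
    return [t for t in ocr_tokens if t.strip().lower() not in used]
```

On an image whose annotated QA pairs only ever ask about one sign, every other readable token becomes a candidate for a fresh QA pair, which is the source of TAG's coverage gains.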
3. Quantitative Benchmarks and Performance
A consistent finding is that the OCR module, rather than the LLM or vision transformer, is the primary performance bottleneck in text-rich VQA:
| Model | DocVQA | OCRVQA | ST-VQA | TextVQA |
|---|---|---|---|---|
| LLaVA | 0.0514 | 0.2136 | 0.2485 | 0.3281 |
| MiniGPT-4 | 0.0406 | 0.1792 | 0.1682 | 0.2352 |
| PaddleOCR + LLaVA (13B) | 0.3647 | 0.2847 | 0.3516 | 0.4810 |
| PaddleOCR + Vicuna (13B) | **0.4528** | **0.4024** | 0.2881 | 0.4742 |

((Liu et al., 2023), Table 2, accuracy metric)
- Augmenting MLLMs with external OCR yields gains of 10–40+ points; LLM scale shows diminishing returns beyond ~13B parameters, as vision/OCR quality dominates.
- Replacing noisy OCR outputs with ground-truth text yields 30–40 point improvements on ST-VQA/TextVQA (Liu et al., 2023, Jin et al., 2020).
- Data augmentation with generated QA (TAG) confers +1–5 point gains across benchmarks (TextVQA, ST-VQA) (Wang et al., 2022).
- On the multilingual MTVQA benchmark, even top commercial MLLMs achieve per-language accuracies of only 3.4–40.6%, with non-Latin scripts notably harder (Tang et al., 2024).
4. Principal Bottlenecks and Model Analysis
The most significant challenges are:
- Vision/OCR bottleneck: OCR errors—especially low-contrast, stylized, or script-diverse text—cause cascading VQA failures. Even top-performing models gain more from improved text recognition than from scaling the language or fusion modules (Liu et al., 2023, Lu et al., 2021, Jin et al., 2020).
- Multi-hop and Reasoning: Standard fusion architectures show limited ability to perform multi-step reasoning or deeply integrate layout and textual relationships; explicitly designed CMR layers or reasoning modules substantially improve multi-hop tasks (Chen et al., 2023).
- Biases from text-only supervision: Models trained only on text-rich VQA data may overfit frequent answer strings independent of image context (e.g., "STOP" on signboards) (Hegde et al., 2023); mixing pure-vision VQA examples forces grounding.
- Multilingual/Script limitations: MTVQA demonstrates that state-of-the-art MLLMs perform poorly on non-Latin scripts and require explicit domain adaptation or specialized OCR capacity (Tang et al., 2024).
5. Best Practices: Prompting, Modularization, and Training
- Prompt Engineering (for LLMs/MLLMs): Few-shot, task-specific templates injecting recognized OCR text and well-chosen in-context exemplars allow training-free application of LLMs, achieving superior results relative to monolithic multimodal transformers (Liu et al., 2023).
- Modular Architecture: Decoupling OCR and language/vision reasoning models simplifies system upgrades, increases interpretability, and enables separate focused improvements in each module (Liu et al., 2023).
- Instruction-Tuning and Multi-source Fusion: Only instruction-tuned MLLMs capable of consuming and integrating OCR tokens benefit meaningfully from pipelined OCR input; others tend to ignore or degrade these signals (Liu et al., 2023).
- Scene Text Grouping and Multi-source Selection: Techniques such as spatial clustering of OCR tokens (LOGOS) and fusion of multi-engine OCR outputs further improve answer fidelity, especially in complex or noisy visual text environments (Lu et al., 2021).
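Spatial grouping of OCR tokens can be approximated with simple single-link clustering over box centers. This is a deliberate simplification (the distance threshold and greedy merge are illustrative; LOGOS uses richer layout and semantic cues):

```python
import math

def group_ocr_tokens(centers, max_gap=20.0):
    """Greedy single-link clustering of OCR boxes by their (x, y)
    center points: two tokens join a cluster if any pair of centers
    lies within max_gap pixels. Returns lists of token indices."""
    clusters = []
    for i, (x, y) in enumerate(centers):
        merged = None
        for c in clusters:
            near = any(math.hypot(x - centers[j][0], y - centers[j][1]) <= max_gap
                       for j in c)
            if near:
                if merged is None:
                    c.append(i)       # join the first nearby cluster
                    merged = c
                else:
                    merged.extend(c)  # token bridges two clusters: merge them
                    c.clear()
        clusters = [c for c in clusters if c]  # drop emptied clusters
        if merged is None:
            clusters.append([i])
    return clusters
```

Grouped tokens can then be serialized as single multi-word candidates ("MAIN" + "STREET" becomes "MAIN STREET"), which helps pointer decoders assemble multi-token answers.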
6. Evaluation, Ablations, and Future Research Directions
- Metric Adaptation: ANLS and similar string-similarity metrics are vital for principled evaluation given unstandardized or noisy text (Biten et al., 2019, Lu et al., 2021, Vu et al., 16 Jul 2025).
- Ablation Findings: The inclusion of object semantics, spatial features, answer-centric attention, and scene text clustering has measurable impacts (0.5–5+ points per architectural or training tweak) (Wang et al., 2022, Lu et al., 2021).
- Research Frontiers: Directions include:
- End-to-end trainable OCR–VQA pipelines.
- Multilingual and script-adaptive recognition and reasoning, as prompted by MTVQA's results (Tang et al., 2024).
- Integration of external knowledge and reasoning-capable modules for multi-hop and commonsense tasks (Chen et al., 2023, Jin et al., 2020).
- Data augmentation via adversarial QA generation and dynamic curriculum learning (Wang et al., 2022).
Text-rich VQA thus continues to define the intersection of vision, reading, and reasoning, with progress currently hinging on advances in robust, script-diverse text recognition, multimodal fusion architectures, cross-lingual adaptation, and data-driven augmentation strategies for improved generalization and semantic depth.