Robustness Evaluation of OCR-based Visual Document Understanding under Multi-Modal Adversarial Attacks
This work presents a comprehensive investigation of the robustness of Optical Character Recognition (OCR)-based Visual Document Understanding (VDU) models when subjected to multi-modal adversarial attacks. The authors establish a unified and controllable attack framework that manipulates document structure (bounding boxes), textual content, and visual features (pixels) simultaneously or in isolation, providing the first scenario-driven evaluation suite tailored to the specific vulnerabilities of VDU systems.
Context and Motivation
OCR-based VDU models, particularly LayoutLMv2, LayoutLMv3, DocFormer, ERNIE-Layout, and GeoLayoutLM, have become foundational in downstream document intelligence tasks such as information extraction and visual question answering. These models integrate text, layout, and image modalities, leveraging the spatial structure extracted via OCR (typically bounding boxes and text lines), and are prevalent in enterprise settings where auditability is critical. While prior work explored distributional shifts, natural image corruptions, or limited forms of adversarial text noise, this paper addresses the gap of jointly attacking all input modalities under explicit budget constraints and realistic document scenarios.
Methodology
The authors' attack pipeline formalizes a threat model in which adversarial perturbations are applied along three channels: layout, text, and pixel. Attacks are constrained by interpretable, per-channel budgets (a minimal budget-checking sketch follows the list):
- Layout budget: Minimum Intersection over Union (IoU) between original and perturbed bounding boxes (typically τ ∈ {0.9, 0.75, 0.6}), ensuring layout plausibility.
- Text budget: Maximum character replacement rate per span (edit_rate ∈ {0, 0.1}), preserving token positions and avoiding insertions or deletions.
- Pixel budget: Choice and magnitude of document-specific visual corruptions (e.g., blur, noise, shadow) from the RoDLA suite, applied only within shifted OCR boxes.
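For concreteness, each budget can be expressed as a simple, checkable predicate. The sketch below is a minimal illustration under assumed conventions (boxes as (x0, y0, x1, y1) tuples, character-level edit rate per span, a small allowed corruption set); the helper names are hypothetical and not taken from the paper's implementation.

```python
# Minimal budget checks for the three attack channels (illustrative, not the authors' code).

def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def within_layout_budget(orig_box, pert_box, tau=0.6):
    """Layout budget: the perturbed box must keep IoU >= tau with the original."""
    return box_iou(orig_box, pert_box) >= tau

def within_text_budget(orig_span, pert_span, edit_rate=0.1):
    """Text budget: same length (no insertions/deletions), bounded character replacements."""
    if len(orig_span) != len(pert_span):
        return False
    n_edits = sum(o != p for o, p in zip(orig_span, pert_span))
    return n_edits <= edit_rate * len(orig_span)

def within_pixel_budget(corruption, severity, allowed=("blur", "noise", "shadow"), max_severity=3):
    """Pixel budget: corruption type and severity drawn from an allowed document-specific set."""
    return corruption in allowed and severity <= max_severity
```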
A key technical contribution is the integration of a differentiable bounding box (BBox) predictor, allowing for gradient-based adversarial search (via Projected Gradient Descent, PGD) over bounding box embeddings, despite their discretized or encoded representation in popular VDU architectures.
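As a rough illustration of how a differentiable bounding-box pathway enables gradient-based search, the sketch below runs PGD on continuous box coordinates and projects each step back into the IoU budget by shrinking the shift toward the original box. The model interface, loss, and projection heuristic are simplifying assumptions, not the authors' exact procedure; `iou_fn` can be any pairwise IoU helper such as the one sketched above.

```python
import torch

def pgd_on_boxes(model, inputs, boxes, labels, loss_fn, iou_fn,
                 steps=10, alpha=2.0, tau=0.6):
    """Untargeted PGD over bounding-box coordinates (assumed shape [num_boxes, 4], in pixels).

    Assumes `model(inputs, boxes)` routes continuous coordinates through a
    differentiable bbox predictor/embedding, so gradients reach `boxes`.
    """
    adv_boxes = boxes.clone().detach().requires_grad_(True)
    for _ in range(steps):
        loss = loss_fn(model(inputs, adv_boxes), labels)   # task loss to maximize
        grad, = torch.autograd.grad(loss, adv_boxes)
        with torch.no_grad():
            adv_boxes += alpha * grad.sign()               # signed ascent step (in pixels)
            # Project each box back into the layout budget: shrink its shift
            # until the IoU with the original box is at least tau.
            for i in range(adv_boxes.shape[0]):
                delta = adv_boxes[i] - boxes[i]
                t = 1.0
                while t > 0.0 and iou_fn(boxes[i], boxes[i] + t * delta) < tau:
                    t -= 0.1
                adv_boxes[i] = boxes[i] + max(t, 0.0) * delta
        adv_boxes.requires_grad_(True)
    return adv_boxes.detach()
```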
Attack Scenarios
Six attack scenarios are defined, each exploring single or compound channel attacks at both word and line granularity (see the configuration sketch after the list):
- Bounding box shift only
- Bounding box shift + pixel translation
- Bounding box shift + pixel translation + augmentation
- Text mutation only
- Bounding box shift + text mutation
- Bounding box shift + pixel translation + text mutation
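The six scenarios can be encoded as simple configurations that toggle the attack channels; the field names below are illustrative and do not reflect the paper's actual configuration schema.

```python
from dataclasses import dataclass

@dataclass
class AttackScenario:
    """Illustrative scenario description: active channels and attack granularity."""
    name: str
    bbox_shift: bool
    pixel_translate: bool
    pixel_augment: bool
    text_mutate: bool
    granularity: str  # "word" or "line"

SCENARIOS = [
    AttackScenario("S1_bbox",            True,  False, False, False, "line"),
    AttackScenario("S2_bbox_pixel",      True,  True,  False, False, "line"),
    AttackScenario("S3_bbox_pixel_aug",  True,  True,  True,  False, "line"),
    AttackScenario("S4_text",            False, False, False, True,  "line"),
    AttackScenario("S5_bbox_text",       True,  False, False, True,  "line"),
    AttackScenario("S6_bbox_pixel_text", True,  True,  False, True,  "line"),
]
```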
The evaluation spans four benchmarks: FUNSD, CORD, SROIE, and DocVQA.
Experimental Results
Attack Effectiveness and Model Vulnerability
The analysis reveals several notable findings:
- Compound multi-modal attacks (Scenario 6: BBox + Pixel + Text) yield the greatest degradation, with F₁ drops of up to 29.18%.
- PGD-based layout attacks consistently surpass random-shift methods across all settings, even under tight IoU budgets (e.g., >13% F₁ drop at IoU ≥ 0.6, versus <8% for random).
- Line-level attacks consistently induce more severe errors than word-level attacks. Under compound attacks, the gap between line-level and word-level degradation reaches 21.4 percentage points, attributable to greater contextual disruption.
- Text attacks using Unicode diacritics are substantially more effective than random character replacements (up to 22.4% F₁ drop). This highlights the sensitivity of VDU models to visually confusable glyph alterations, a practical attack surface under real-world conditions (see the mutation sketch after this list).
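As an illustration of the diacritic-based text channel, the sketch below replaces a bounded fraction of characters with visually confusable variants by attaching Unicode combining marks; the particular marks and the selection strategy are assumptions, not the paper's exact mutation rules.

```python
import random
import unicodedata

# A few Unicode combining diacritics (illustrative; the paper's mark set may differ).
COMBINING_MARKS = ["\u0301", "\u0300", "\u0302", "\u0308", "\u0327"]  # acute, grave, circumflex, diaeresis, cedilla

def diacritic_mutate(span: str, edit_rate: float = 0.1, seed: int = 0) -> str:
    """Attach a combining mark to up to `edit_rate` of the alphabetic characters,
    keeping the glyphs visually similar while changing the underlying code points.
    NFC normalization prefers precomposed characters where they exist."""
    rng = random.Random(seed)
    chars = list(span)
    candidates = [i for i, c in enumerate(chars) if c.isalpha()]
    rng.shuffle(candidates)
    budget = int(edit_rate * len(chars))
    for i in candidates[:budget]:
        chars[i] = unicodedata.normalize("NFC", chars[i] + rng.choice(COMBINING_MARKS))
    return "".join(chars)

# Example usage: prints a visually similar string, e.g. something like "Invoíce Totàl".
print(diacritic_mutate("Invoice Total", edit_rate=0.2))
```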
Transferability
- PGD-generated adversarial perturbations are highly transferable: attacks crafted against LayoutLMv3 degrade the performance of other models (LayoutLMv2, GeoLayoutLM, ERNIE-Layout) even when the target models have distinct architectures or different input modalities (see the evaluation sketch after this list).
- Even models lacking an explicit visual modality (e.g., LayTextLLM, LLaMA3) exhibit non-trivial performance drops under bounding box PGD attacks, emphasizing the universality of the identified vulnerabilities.
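A rough sketch of the transfer protocol implied here: adversarial inputs are crafted against a white-box source model and then replayed, unchanged, against other target models. The function names and the metric call are placeholders, not the paper's evaluation harness.

```python
def evaluate_transfer(source_model, target_models, dataset, attack_fn, metric_fn):
    """Craft perturbations on `source_model`, then measure degradation on each target.

    `attack_fn(model, sample)` returns an adversarial version of `sample`;
    `metric_fn(model, samples)` returns e.g. an F1 score. Both are assumed helpers.
    """
    adv_samples = [attack_fn(source_model, s) for s in dataset]
    results = {}
    for name, target in target_models.items():
        clean = metric_fn(target, dataset)
        adv = metric_fn(target, adv_samples)
        results[name] = {"clean": clean, "adv": adv, "drop": clean - adv}
    return results
```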
Ablation Insights
- Tightening the IoU constraint reduces the impact of random attacks but only marginally impairs PGD effectiveness, confirming the suitability of the differentiable BBox predictor and adversarial search for bounded perturbation regimes.
- Ablations across text and pixel budgets confirm that the attack effectiveness primarily stems from layout and semantic confusion, with visual effects serving as strong amplifiers in compound scenarios.
Implications
The results conclusively demonstrate that current SOTA OCR-based VDU models are not robust to plausible, budget-constrained, localized adversarial attacks across modalities. The observed transferability of gradient-based attacks suggests that these vulnerabilities are architectural rather than model-specific, propagating across both multi-modal transformers and layout-aware LLMs.
Practical implications include:
- Deployment risk in regulated domains: Given the dominance of OCR-based pipelines in finance, law, and governance, these findings necessitate adversarial robustness evaluation before deployment, especially when audit or compliance is required.
- Benchmarking: The proposed unified framework sets a precedent for standardized robustness evaluation in the VDU domain, analogous to adversarial benchmarks in vision and NLP.
- Model defense directions: Results indicate that improving the encoding of layout information and fostering joint multi-modal regularization are promising defense avenues. Adversarial training on compound multi-modal perturbations should be explored (a minimal training-loop sketch follows this list).
- Broader impact for OCR-free models: While OCR-free architectures (e.g., Qwen-VL, LLaVA-NeXT) are not evaluated, the authors note that their proposed concept of layout budget and patch-level attacks could be extended, marking an open research direction toward spatially-grounded adversarial evaluation in patch-based VLMs.
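One possible shape of the suggested defense, adversarial training on compound perturbations, is sketched below; the perturbation generator, batch format, and loss mixing are assumptions rather than an evaluated method from the paper.

```python
def adversarial_training_step(model, optimizer, loss_fn, batch, compound_attack, mix=0.5):
    """One training step mixing clean and compound-adversarial examples.

    `batch` is assumed to be a dict of model inputs plus a "labels" entry, and
    `compound_attack(model, batch)` is a hypothetical helper that perturbs
    boxes, text tokens, and pixels within their respective budgets.
    """
    model.train()
    adv_batch = compound_attack(model, batch)      # craft perturbations against the current weights
    optimizer.zero_grad()

    def task_loss(b):
        inputs = {k: v for k, v in b.items() if k != "labels"}
        return loss_fn(model(**inputs), b["labels"])

    loss = (1.0 - mix) * task_loss(batch) + mix * task_loss(adv_batch)  # clean/adversarial mix
    loss.backward()
    optimizer.step()
    return loss.item()
```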
Future Directions
This work motivates several extensions:
- Expanding to black-box and query-limited settings: Current attacks require white-box access; extending to realistic service APIs and online learning scenarios is warranted.
- Robustness in OCR-free and hybrid VDU architectures: Evaluating whether patch-based or vision-LLMs are also susceptible to analogous multi-modal attacks is of both academic and practical interest.
- Long-tailed and multilingual document settings: The attack suite could reveal further vulnerabilities where encoding and layout conventions differ from primarily English benchmarks.
Limitations
The framework is tailored to OCR-based models and does not cover OCR-free approaches or scenarios where bounding boxes are unavailable. Attacks against real-world, black-box deployments remain to be evaluated; practical constraints may moderate some of the observed vulnerabilities but do not eliminate the potential for adversarial exploitation.
Conclusion
The unified evaluation framework and associated empirical findings establish a rigorous baseline for adversarial robustness in document AI. The demonstrated effectiveness of budget-constrained, compound, and transferable multi-modal attacks signals a need for the VDU community to systematically address robustness and generalization under adversarial conditions, both at the model and pre-processing pipeline levels. The proposed methodology provides both a research foundation and a practical testbed for future advancements in robust, layout-aware document understanding.