ICDAR 2025 DIMT Challenge
- The paper demonstrates that a unified LVLM backbone leveraging multi-task learning and perceptual chain-of-thought improves translation BLEU scores by up to 7 points.
- DIMT Challenge is a competition advancing end-to-end document image machine translation for complex layouts using innovative models and dual OCR pipelines.
- The system architecture uses a shared multimodal backbone for both OCR and direct image processing, ensuring robust and flexible translation performance.
The ICDAR 2025 DIMT Challenge, held as part of the 19th International Conference on Document Analysis and Recognition (ICDAR 2025), targets the task of end-to-end document image machine translation (DIMT) for documents with complex and diverse layouts. Huawei Translation Service Center (HW-TSC) presented a unified end-to-end solution that leverages a large vision-language model (LVLM), addressing both OCR-based and OCR-free DIMT tasks through a multi-task learning and perceptual chain-of-thought framework. The system architecture, training methodology, inference process, datasets, and results collectively represent an advanced approach to robust document image translation in real-world settings (Wu et al., 24 Apr 2025).
1. Unified System Architecture
The DIMT25@ICDAR2025 system is built around a single LVLM architecture, supporting both OCR-based and OCR-free translation pipelines. Both branches share the same multimodal backbone and diverge only at the text acquisition stage:
- OCR-based path: The input image undergoes OCR detection and recognition, producing extracted text and detected layout regions, which are then processed by a text-guided encoder and the LVLM decoder to generate translations.
- OCR-free path: The input image is processed directly by a visual encoder, which fuses vision features with query tokens via cross-modal attention. The LVLM decoder subsequently generates internal text representations and the final translation.
Both modalities employ the InternVL2.5-MPO (Mixed Preference Optimization) transformer as their shared encoder–decoder core. Training occurs under a unified multi-task learning plus perceptual chain-of-thought (PCOT) regime, enabling the system to fall back to direct image-to-translation generation when OCR fails.
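The dual-path design can be sketched as follows. This is a minimal illustration with hypothetical interfaces (`UnifiedDIMT`, the `backbone` and `ocr` callables are placeholders, not the paper's actual APIs): both branches share one backbone and diverge only at text acquisition.

```python
class UnifiedDIMT:
    """Toy sketch of the shared-backbone, dual-path DIMT system."""

    def __init__(self, backbone):
        self.backbone = backbone  # shared LVLM encoder-decoder (placeholder)

    def translate_ocr_based(self, image, ocr):
        # OCR-based path: extract text and layout regions first,
        # then let the shared backbone generate the translation.
        text, layout = ocr(image)
        return self.backbone(image=image, text=text, layout=layout)

    def translate_ocr_free(self, image):
        # OCR-free path: feed the image directly; the backbone's visual
        # encoder and query tokens stand in for the explicit OCR stage.
        return self.backbone(image=image, text=None, layout=None)
```

Because both methods route through the same `backbone`, a failed OCR stage simply means switching entry points rather than swapping models.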
2. LVLM Base Model and Architectural Details
The underlying backbone is the InternVL2.5-MPO model [Wang et al. 2024], consisting of:
- Vision Transformer encoder $E_v$: maps image patches to a sequence of embeddings $V = (v_1, \dots, v_n)$.
- Causal Transformer decoder $D$: at each step $t$, attends to its previous outputs $y_{<t}$ and the vision embeddings $V$.
Decoder layers contain:
- Self-Attention (SA) operating on text tokens:
$$\mathrm{SA}(Q, K, V_t) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V_t,$$
with $Q$, $K$, $V_t$ linearly projected from the previous layer's output.
- Cross-Modal Attention (CMA) fusing vision and text modalities:
$$\mathrm{CMA}(Q_t, K_v, V_v) = \mathrm{softmax}\!\left(\frac{Q_t K_v^\top}{\sqrt{d}}\right)V_v,$$
where $Q_t$ derives from text, and $K_v$, $V_v$ from vision outputs.
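The SA and CMA operations differ only in where the keys and values come from. A minimal NumPy sketch (toy dimensions, random inputs; not the paper's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

rng = np.random.default_rng(0)
t = rng.standard_normal((4, 8))   # 4 text-token states, dim 8
v = rng.standard_normal((16, 8))  # 16 vision patch embeddings

sa = attention(t, t, t)    # self-attention: Q, K, V all from text
cma = attention(t, v, v)   # cross-modal: Q from text, K/V from vision
```

In both cases the output has one row per text token; CMA simply lets each text position pool information from the vision embeddings instead of the other text positions.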
Modifications are limited to re-initializing the output linear layer for the translation vocabulary and extending positional encodings to support sequences up to 8,192 tokens.
3. Multi-Task and Hierarchical Learning Framework
The system adopts a multi-task learning approach integrating three principal objectives:
- OCR/text recognition: Sequence modeling for text extracted from the image.
- Layout understanding: Region classification and segmentation for document structure.
- Machine Translation (MT): Generation of target language output.
Each task is supervised with a cross-entropy loss ($\mathcal{L}_{\mathrm{OCR}}$, $\mathcal{L}_{\mathrm{layout}}$, $\mathcal{L}_{\mathrm{MT}}$), and the overall objective is a weighted sum:
$$\mathcal{L}_{\mathrm{MTL}} = \lambda_1 \mathcal{L}_{\mathrm{OCR}} + \lambda_2 \mathcal{L}_{\mathrm{layout}} + \lambda_3 \mathcal{L}_{\mathrm{MT}},$$
with the weights $\lambda_i$ tuned empirically.
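The weighted-sum objective can be sketched in a few lines. This is illustrative only: the task names and the mixing weights below are placeholders, not the tuned values from the paper.

```python
import numpy as np

def cross_entropy(logits, target):
    # Standard cross-entropy for one example: -log softmax(logits)[target].
    z = logits - logits.max()
    logp = z - np.log(np.exp(z).sum())
    return -logp[target]

def multitask_loss(logits, targets, weights):
    # Weighted sum of per-task cross-entropy losses, mirroring the
    # multi-task objective; `weights` plays the role of the lambda_i.
    return sum(w * cross_entropy(logits[task], targets[task])
               for task, w in weights.items())

# Hypothetical per-task logits/targets and illustrative weights:
loss = multitask_loss(
    {"ocr": np.array([2.0, 0.1]), "mt": np.array([0.5, 1.5])},
    {"ocr": 0, "mt": 1},
    {"ocr": 0.5, "mt": 1.0},
)
```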
A perceptual chain-of-thought (PCOT) protocol enforces a two-stage inference process:
- Stage I (Perception): an intermediate token sequence $T$ is generated for the recognized text under the guidance of the layout and OCR modules.
- Stage II (Translation Reasoning): the decoder, conditioned on $T$ and the visual context, produces the final translated sequence $Y$.
A consistency regularizer ensures a smooth transition between stages:
$$\mathcal{L}_{\mathrm{cons}} = \left\| g(h_T) - h_Y \right\|_2^2,$$
where $g$ is a shallow linear carry-over projector and $h_T$, $h_Y$ denote the hidden states of the two stages. The total loss is:
$$\mathcal{L} = \mathcal{L}_{\mathrm{MTL}} + \beta\, \mathcal{L}_{\mathrm{cons}}.$$
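The carry-over regularizer amounts to projecting Stage-I hidden states through a linear map and penalizing their distance to the Stage-II states. A toy sketch (an L2 penalty and the matrix form of the projector are assumptions for illustration):

```python
import numpy as np

def consistency_loss(h_stage1, h_stage2, W):
    # Shallow linear carry-over projector g(h) = h @ W, followed by a
    # mean squared distance between projected Stage-I states and the
    # Stage-II states (L2 form assumed here).
    return float(np.mean((h_stage1 @ W - h_stage2) ** 2))
```

With an identity projector and identical states the penalty vanishes, which is the behavior the regularizer encourages during training.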
4. Inference, Decoding, and Post-Processing
Inference leverages minimum Bayes-risk (MBR) decoding. For each input, the model samples a candidate set $\mathcal{C}$ containing one deterministic output (beam search) and ten stochastic outputs (temperature and nucleus sampling). The final output is:
$$y^{*} = \arg\max_{y \in \mathcal{C}} \frac{1}{|\mathcal{C}|} \sum_{y' \in \mathcal{C}} \mathrm{BLEU}(y, y'),$$
which minimizes the expected risk under BLEU similarity.
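MBR selection itself is a small loop over candidates. The sketch below uses a simple n-gram precision as a stand-in similarity (the actual system scores candidates with BLEU); `mbr_select` and `ngram_overlap` are illustrative names.

```python
from collections import Counter

def ngram_overlap(a, b, n=2):
    # Crude n-gram precision proxy for BLEU-style similarity:
    # fraction of a's n-grams that also appear in b.
    ca = Counter(zip(*[a[i:] for i in range(n)]))
    cb = Counter(zip(*[b[i:] for i in range(n)]))
    hits = sum((ca & cb).values())
    return hits / max(1, sum(ca.values()))

def mbr_select(candidates):
    # Return the candidate with the highest average similarity to all
    # candidates, i.e. the minimum Bayes-risk output under this metric.
    def expected_sim(y):
        return sum(ngram_overlap(y.split(), yp.split())
                   for yp in candidates) / len(candidates)
    return max(candidates, key=expected_sim)
```

Intuitively, MBR favors the "consensus" candidate: an output that many samples agree with, rather than the single highest-probability decode.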
Post-processing steps include:
- Truncation of repeated special symbols to a maximum length of 10.
- Removal of overly complex table transcriptions.
- For Chinese output, application of Jieba tokenization and collapse of redundant spaces.
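The symbol-truncation and space-collapsing steps can be expressed with two regex substitutions. The symbol set below is illustrative (the paper says "repeated special symbols" without enumerating them), and the Jieba retokenization step is omitted since it needs a non-stdlib dependency:

```python
import re

def postprocess(text, max_run=10):
    # Truncate any run of a repeated special symbol to at most
    # `max_run` occurrences (symbol set here is an assumption).
    text = re.sub(r'([—\-=_*.])\1{%d,}' % max_run,
                  lambda m: m.group(1) * max_run, text)
    # Collapse redundant whitespace into single spaces.
    return re.sub(r'\s{2,}', ' ', text).strip()
```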
5. Dataset Composition and Training Procedure
The system is trained on the official DIMT25 datasets:
| Track | Dataset | Train | Valid/Test |
|---|---|---|---|
| 1 | DIMT-WebDoc | 300 K | 1 K / 1 K |
| 2 | DIMT-arXiv | 124 K | 1 K / 1 K |
Training images are preprocessed to a fixed resolution and normalized per ViT standards; no synthetic augmentations are introduced. Hardware and hyperparameters:
- Models: InternVL2.5-MPO at 1B and 8B parameter scales
- Fine-tuning: 5 epochs on 8 × A100 GPUs using DeepSpeed ZeRO-3 offload
- Effective batch: 128
- Learning rate schedule: one-epoch linear warmup followed by cosine decay
- Maximum sequence: 8,192 tokens
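The warmup-plus-cosine schedule can be sketched as a function of the training step. The peak learning rate below is a placeholder, not the paper's value:

```python
import math

def lr_at(step, total_steps, warmup_steps, peak_lr):
    # Linear warmup to peak_lr over warmup_steps (one epoch here),
    # then cosine decay to zero over the remaining steps.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))
```

With 5 epochs total, roughly the first fifth of the steps ramp the rate up and the remainder anneal it back toward zero.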
6. Experimental Results and Analysis
Performance is reported via BLEU scores:
| Model | Test-OCR | Test-MT |
|---|---|---|
| InternVL2.5-1B SFT | – | 67.21 |
| + MTL-PCOT SFT | 94.63 | 62.16 |
| + MBR | 96.50 | 64.08 |
| + Post-processing | 97.00 | 66.16 |
| InternVL2.5-8B SFT | – | 72.74 |
| + MTL-PCOT SFT | 94.89 | 65.32 |
| + MBR | 97.16 | 68.26 |
| + Post-processing | 97.66 | 70.48 |
Multi-task learning with PCOT improves over single-task fine-tuning by 3–4 BLEU points; MBR and post-processing add a further 2–5 BLEU. The 8B-parameter model consistently outperforms 1B. Error analysis indicates that symbol over-translation and table cell merging are primary failure modes, addressed by truncation and skipping of complex tables, respectively.
A representative case demonstrates that SFT-only models produce excessive repetition for lines (e.g., “——————”), while the final MTL-PCOT solution correctly truncates, yielding accurate paragraph alignment.
7. Advances, Limitations, and Prospects
The system demonstrates that a single LVLM backbone, when trained under a unified multi-task and perceptual hierarchical regime and decoded with MBR and targeted post-processing, can handle both OCR-based and OCR-free machine translation tasks for documents with complex layouts.
Identified limitations include incomplete handling of intricate table structures and room for better cross-modal alignment. Future enhancements are directed toward integrating learned table-structure parsing, employing contrastive layout pre-training, and extending chain-of-thought modules with explicit saliency modeling to address challenges in highly cluttered document images (Wu et al., 24 Apr 2025).