ICDAR 2025 DIMT Challenge
- The paper demonstrates that a unified LVLM backbone leveraging multi-task learning and perceptual chain-of-thought improves translation BLEU scores by up to 7 points.
- DIMT Challenge is a competition advancing end-to-end document image machine translation for complex layouts using innovative models and dual OCR pipelines.
- The system architecture uses a shared multimodal backbone for both OCR and direct image processing, ensuring robust and flexible translation performance.
The ICDAR 2025 DIMT Challenge, held as part of the 19th International Conference on Document Analysis and Recognition (ICDAR 2025), targets the task of end-to-end document image machine translation (DIMT) for documents with complex and diverse layouts. Huawei Translation Service Center (HW-TSC) presented a unified end-to-end solution that leverages a large vision-language model (LVLM), addressing both OCR-based and OCR-free DIMT tasks through a multi-task learning and perceptual chain-of-thought framework. The system architecture, training methodology, inference process, datasets, and results collectively represent an advanced approach to robust document image translation in real-world settings (Wu et al., 24 Apr 2025).
1. Unified System Architecture
The DIMT25@ICDAR2025 system is built around a single LVLM architecture, supporting both OCR-based and OCR-free translation pipelines. Both branches share the same multimodal backbone and diverge only at the text acquisition stage:
- OCR-based path: The input image undergoes OCR detection and recognition, producing extracted text and detected layout regions, which are then processed by a text-guided encoder and the LVLM decoder to generate translations.
- OCR-free path: The input image is processed directly by a visual encoder, which fuses vision features with query tokens via cross-modal attention. The LVLM decoder subsequently generates internal text representations and the final translation.
Both modalities employ the InternVL2.5-MPO (Mixed Preference Optimization) transformer as their shared encoder–decoder core. Training occurs under a unified multi-task learning plus perceptual chain-of-thought (PCOT) regime, enabling the system to fall back to direct image-to-translation generation when OCR fails.
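The dual-path design can be sketched as follows. This is a minimal illustration with hypothetical interfaces (`UnifiedDIMT`, the `backbone` and `ocr` callables are placeholders, not the paper's actual APIs): both branches share one backbone and diverge only at text acquisition.

```python
class UnifiedDIMT:
    """Toy sketch of the shared-backbone, dual-path DIMT system."""

    def __init__(self, backbone):
        self.backbone = backbone  # shared LVLM encoder-decoder (placeholder)

    def translate_ocr_based(self, image, ocr):
        # OCR-based path: extract text and layout regions first,
        # then let the shared backbone generate the translation.
        text, layout = ocr(image)
        return self.backbone(image=image, text=text, layout=layout)

    def translate_ocr_free(self, image):
        # OCR-free path: feed the image directly; the backbone's visual
        # encoder and query tokens stand in for the explicit OCR stage.
        return self.backbone(image=image, text=None, layout=None)
```

Because both methods route through the same `backbone`, a failed OCR stage simply means switching entry points rather than swapping models.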
2. LVLM Base Model and Architectural Details
The underlying backbone is the InternVL2.5-MPO model [Wang et al. 2024], consisting of:
- Vision Transformer encoder $E_v$: maps image patches to a sequence of embeddings $V = (v_1, \dots, v_n)$.
- Causal Transformer decoder $D$: at each step $t$, attends to its previous outputs $y_{<t}$ and the vision embeddings $V$.
Decoder layers contain:
- Self-Attention (SA) operating on text tokens:
$$\mathrm{SA}(Q, K, V_t) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V_t,$$
with $Q$, $K$, $V_t$ linearly projected from the previous layer's output.
- Cross-Modal Attention (CMA) fusing vision and text modalities:
$$\mathrm{CMA}(Q_t, K_v, V_v) = \mathrm{softmax}\!\left(\frac{Q_t K_v^\top}{\sqrt{d}}\right)V_v,$$
where $Q_t$ derives from text, and $K_v$, $V_v$ from vision outputs.
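The SA and CMA operations differ only in where the keys and values come from. A minimal NumPy sketch (toy dimensions, random inputs; not the paper's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

rng = np.random.default_rng(0)
t = rng.standard_normal((4, 8))   # 4 text-token states, dim 8
v = rng.standard_normal((16, 8))  # 16 vision patch embeddings

sa = attention(t, t, t)    # self-attention: Q, K, V all from text
cma = attention(t, v, v)   # cross-modal: Q from text, K/V from vision
```

In both cases the output has one row per text token; CMA simply lets each text position pool information from the vision embeddings instead of the other text positions.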
Modifications are limited to re-initializing the output linear layer for the translation vocabulary and extending positional encodings to support sequences up to 8,192 tokens.
3. Multi-Task and Hierarchical Learning Framework
The system adopts a multi-task learning approach integrating three principal objectives:
- OCR/text recognition: Sequence modeling for text extracted from the image.
- Layout understanding: Region classification and segmentation for document structure.
- Machine Translation (MT): Generation of target language output.
Each task is supervised with a cross-entropy loss ($\mathcal{L}_{\mathrm{OCR}}$, $\mathcal{L}_{\mathrm{layout}}$, $\mathcal{L}_{\mathrm{MT}}$), and the overall objective is a weighted sum:
$$\mathcal{L}_{\mathrm{MTL}} = \lambda_1 \mathcal{L}_{\mathrm{OCR}} + \lambda_2 \mathcal{L}_{\mathrm{layout}} + \lambda_3 \mathcal{L}_{\mathrm{MT}},$$
with the weights $\lambda_i$ tuned empirically.
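The weighted-sum objective can be sketched in a few lines. This is illustrative only: the task names and the mixing weights below are placeholders, not the tuned values from the paper.

```python
import numpy as np

def cross_entropy(logits, target):
    # Standard cross-entropy for one example: -log softmax(logits)[target].
    z = logits - logits.max()
    logp = z - np.log(np.exp(z).sum())
    return -logp[target]

def multitask_loss(logits, targets, weights):
    # Weighted sum of per-task cross-entropy losses, mirroring the
    # multi-task objective; `weights` plays the role of the lambda_i.
    return sum(w * cross_entropy(logits[task], targets[task])
               for task, w in weights.items())

# Hypothetical per-task logits/targets and illustrative weights:
loss = multitask_loss(
    {"ocr": np.array([2.0, 0.1]), "mt": np.array([0.5, 1.5])},
    {"ocr": 0, "mt": 1},
    {"ocr": 0.5, "mt": 1.0},
)
```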
A perceptual chain-of-thought (PCOT) protocol enforces a two-stage inference process:
- Stage I (Perception): an intermediate token sequence $T$ is generated for the recognized text under the guidance of the layout and OCR modules.
- Stage II (Translation Reasoning): the decoder, conditioned on $T$ and the visual context, produces the final translated sequence $Y$.
A consistency regularizer ensures a smooth transition between stages:
$$\mathcal{L}_{\mathrm{cons}} = \left\| g(h_T) - h_Y \right\|_2^2,$$
where $g$ is a shallow linear carry-over projector and $h_T$, $h_Y$ denote the hidden states of the two stages. The total loss is:
$$\mathcal{L} = \mathcal{L}_{\mathrm{MTL}} + \beta\, \mathcal{L}_{\mathrm{cons}}.$$
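The carry-over regularizer amounts to projecting Stage-I hidden states through a linear map and penalizing their distance to the Stage-II states. A toy sketch (an L2 penalty and the matrix form of the projector are assumptions for illustration):

```python
import numpy as np

def consistency_loss(h_stage1, h_stage2, W):
    # Shallow linear carry-over projector g(h) = h @ W, followed by a
    # mean squared distance between projected Stage-I states and the
    # Stage-II states (L2 form assumed here).
    return float(np.mean((h_stage1 @ W - h_stage2) ** 2))
```

With an identity projector and identical states the penalty vanishes, which is the behavior the regularizer encourages during training.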
4. Inference, Decoding, and Post-Processing
Inference leverages minimum Bayes-risk (MBR) decoding. For each input, the model samples a candidate set $\mathcal{C}$ containing one deterministic output (beam search) and ten stochastic outputs (temperature and nucleus sampling). The final output is:
$$y^{*} = \arg\max_{y \in \mathcal{C}} \frac{1}{|\mathcal{C}|} \sum_{y' \in \mathcal{C}} \mathrm{BLEU}(y, y'),$$
which minimizes the expected risk under BLEU similarity.
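MBR selection itself is a small loop over candidates. The sketch below uses a simple n-gram precision as a stand-in similarity (the actual system scores candidates with BLEU); `mbr_select` and `ngram_overlap` are illustrative names.

```python
from collections import Counter

def ngram_overlap(a, b, n=2):
    # Crude n-gram precision proxy for BLEU-style similarity:
    # fraction of a's n-grams that also appear in b.
    ca = Counter(zip(*[a[i:] for i in range(n)]))
    cb = Counter(zip(*[b[i:] for i in range(n)]))
    hits = sum((ca & cb).values())
    return hits / max(1, sum(ca.values()))

def mbr_select(candidates):
    # Return the candidate with the highest average similarity to all
    # candidates, i.e. the minimum Bayes-risk output under this metric.
    def expected_sim(y):
        return sum(ngram_overlap(y.split(), yp.split())
                   for yp in candidates) / len(candidates)
    return max(candidates, key=expected_sim)
```

Intuitively, MBR favors the "consensus" candidate: an output that many samples agree with, rather than the single highest-probability decode.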
Post-processing steps include:
- Truncation of repeated special symbols to a maximum length of 10.
- Removal of overly complex table transcriptions.
- For Chinese output, application of Jieba tokenization and collapse of redundant spaces.
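The symbol-truncation and space-collapsing steps can be expressed with two regex substitutions. The symbol set below is illustrative (the paper says "repeated special symbols" without enumerating them), and the Jieba retokenization step is omitted since it needs a non-stdlib dependency:

```python
import re

def postprocess(text, max_run=10):
    # Truncate any run of a repeated special symbol to at most
    # `max_run` occurrences (symbol set here is an assumption).
    text = re.sub(r'([—\-=_*.])\1{%d,}' % max_run,
                  lambda m: m.group(1) * max_run, text)
    # Collapse redundant whitespace into single spaces.
    return re.sub(r'\s{2,}', ' ', text).strip()
```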
5. Dataset Composition and Training Procedure
The system is trained on the official DIMT25 datasets:
| Track | Dataset | Train | Valid/Test |
|---|---|---|---|
| 1 | DIMT-WebDoc | 300 K | 1 K / 1 K |
| 2 | DIMT-arXiv | 124 K | 1 K / 1 K |
Training images are preprocessed to a fixed resolution and normalized per ViT standards; no synthetic augmentations are introduced. Hardware and hyperparameters:
- Models: InternVL2.5-MPO at 1B and 8B parameter scales
- Fine-tuning: 5 epochs on 8 × A100 GPUs using DeepSpeed ZeRO-3 offload
- Effective batch: 128
- Learning rate schedule: one-epoch linear warmup followed by cosine decay
- Maximum sequence: 8,192 tokens
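The warmup-plus-cosine schedule can be sketched as a function of the training step. The peak learning rate below is a placeholder, not the paper's value:

```python
import math

def lr_at(step, total_steps, warmup_steps, peak_lr):
    # Linear warmup to peak_lr over warmup_steps (one epoch here),
    # then cosine decay to zero over the remaining steps.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))
```

With 5 epochs total, roughly the first fifth of the steps ramp the rate up and the remainder anneal it back toward zero.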
6. Experimental Results and Analysis
Performance is reported via BLEU scores:
| Model | Test-OCR | Test-MT |
|---|---|---|
| InternVL2.5-1B SFT | – | 67.21 |
| + MTL-PCOT SFT | 94.63 | 62.16 |
| + MBR | 96.50 | 64.08 |
| + Post-processing | 97.00 | 66.16 |
| InternVL2.5-8B SFT | – | 72.74 |
| + MTL-PCOT SFT | 94.89 | 65.32 |
| + MBR | 97.16 | 68.26 |
| + Post-processing | 97.66 | 70.48 |
Multi-task learning with PCOT improves over single-task fine-tuning by 3–4 BLEU points; MBR and post-processing add a further 2–5 BLEU. The 8B-parameter model consistently outperforms 1B. Error analysis indicates that symbol over-translation and table cell merging are primary failure modes, addressed by truncation and skipping of complex tables, respectively.
A representative case demonstrates that SFT-only models produce excessive repetition for lines (e.g., “——————”), while the final MTL-PCOT solution correctly truncates, yielding accurate paragraph alignment.
7. Advances, Limitations, and Prospects
The system demonstrates that a single LVLM backbone, when trained under a unified multi-task and perceptual hierarchical regime and decoded with MBR and targeted post-processing, can handle both OCR-based and OCR-free machine translation tasks for documents with complex layouts.
Identified limitations include incomplete handling of intricate table structures and room for better cross-modal alignment. Future enhancements are directed toward integrating learned table-structure parsing, employing contrastive layout pre-training, and extending chain-of-thought modules with explicit saliency modeling to address challenges in highly cluttered document images (Wu et al., 24 Apr 2025).