DTrOCR: Transformer-Based Optical Character Recognition

Updated 18 May 2026

DTrOCR is a unified OCR model that leverages a decoder-only Transformer pre-trained as a generative language model to convert image patches into text.
It employs a two-stage training process—unsupervised pre-training on synthetic data followed by supervised fine-tuning on diverse real-world benchmarks.
Empirical results show DTrOCR outperforms traditional encoder-decoder models in accuracy and efficiency, streamlining the OCR pipeline.

DTrOCR (Decoder-only Transformer for Optical Character Recognition) is an approach for recognizing printed, handwritten, and scene text that departs from the conventional encoder-decoder paradigm by using a decoder-only Transformer pre-trained as a generative language model. DTrOCR embeds an image as a sequence of visual patches, processes the patches and the text tokens in a single autoregressive sequence, and reports higher accuracy than prior methods on multilingual, multi-domain OCR benchmarks (Fujitake, 2023).

1. Architectural Framework

DTrOCR consists of two principal components: (1) a patch-embedding frontend and (2) a stack of GPT-style Transformer decoder blocks. Input images $I \in \mathbb{R}^{W \times H \times 3}$ are resized (e.g., to 128×32) and tiled into non-overlapping patches (e.g., 8×4); each patch is flattened and linearly projected to a $d$-dimensional embedding:

$x_{\text{patch}_j} = W_{\text{proj}} \cdot \mathrm{vec}(\text{Patch}_j) + b_{\text{proj}}, \quad j = 1, \ldots, N.$

A sinusoidal or learned position embedding $e_{\text{pos}_j} \in \mathbb{R}^d$ is added, yielding a sequence

$z_0 = [x_{\text{patch}_1} + e_{\text{pos}_1},\, \ldots,\, x_{\text{patch}_N} + e_{\text{pos}_N}].$
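As a rough sketch of the patch-embedding step, the pure-Python snippet below tiles a 128×32 image into 8×4 patches, flattens each patch, applies a linear projection, and adds a position embedding. The helper names `patchify` and `linear_project`, the row-by-row patch order, the reading of "8×4" as width 8 by height 4, the toy embedding size of 16 (the text uses $d = 768$), and the random weights are all assumptions made for illustration, not details taken from the source.

```python
import random

def patchify(image, patch_w, patch_h):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches.

    Patches are emitted row by row, left to right (an assumed ordering).
    """
    height, width = len(image), len(image[0])
    assert height % patch_h == 0 and width % patch_w == 0, "image must tile evenly"
    patches = []
    for top in range(0, height, patch_h):
        for left in range(0, width, patch_w):
            flat = []
            for y in range(top, top + patch_h):
                for x in range(left, left + patch_w):
                    flat.extend(image[y][x])  # append the C channel values
            patches.append(flat)
    return patches

def linear_project(vec, weight, bias):
    """Compute W_proj @ vec + b_proj for one flattened patch."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]

# Example sizes from the text: a 128x32 image tiled into 8x4 patches.
W, H, C = 128, 32, 3
PATCH_W, PATCH_H = 8, 4
D_MODEL = 16  # toy embedding size for this sketch only

rng = random.Random(0)
image = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
patches = patchify(image, PATCH_W, PATCH_H)

patch_dim = PATCH_W * PATCH_H * C
w_proj = [[rng.gauss(0.0, 0.02) for _ in range(patch_dim)] for _ in range(D_MODEL)]
b_proj = [0.0] * D_MODEL
# Random stand-in for the position embeddings e_pos_j.
e_pos = [[rng.gauss(0.0, 0.02) for _ in range(D_MODEL)] for _ in range(len(patches))]

# z_0: projected patch plus position embedding, one vector per patch.
z0 = [
    [x + p for x, p in zip(linear_project(patch, w_proj, b_proj), pos)]
    for patch, pos in zip(patches, e_pos)
]
print(len(patches), patch_dim, len(z0), len(z0[0]))  # prints: 128 96 128 16
```

With these example sizes the sequence has $N = (128/8) \times (32/4) = 128$ patches, each flattened to $8 \times 4 \times 3 = 96$ values before projection.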

The Transformer core is a decoder-only architecture based on GPT-2 small: 12 layers, hidden size $d=768$, and 12 attention heads ($d_k=64$ per head). Each layer incorporates:

Masked multi-head self-attention:

$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right)V$

$\mathrm{MultiHead}(X) = [\mathrm{head}_1;\ldots;\mathrm{head}_H] W^O$

Position-wise feed-forward module:

$\mathrm{FFN}(x) = \mathrm{GeLU}(xW_1 + b_1) W_2 + b_2$

Layer normalization and a residual connection around each sub-block (GPT-2 applies layer normalization before the sub-block and adds the residual after it).
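The masked attention formula above can be illustrated with a minimal single-head sketch in plain Python. The function name `causal_attention` and the toy inputs are hypothetical; a real layer would also apply learned query, key, value, and output projections and run several heads in parallel, which are omitted here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(q, k, v):
    """Single-head masked attention: position i attends only to positions <= i."""
    d_k = len(k[0])
    n = len(q)
    outputs, weights = [], []
    for i, q_i in enumerate(q):
        # Score only the visible positions. Leaving out later positions is
        # equivalent to setting their scores to -inf before the softmax.
        scores = [
            sum(a * b for a, b in zip(q_i, k[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        w = softmax(scores)
        out = [sum(w[j] * v[j][c] for j in range(i + 1)) for c in range(len(v[0]))]
        weights.append(w + [0.0] * (n - i - 1))  # pad masked positions with zeros
        outputs.append(out)
    return outputs, weights

# Toy sequence of three positions with d_k = 2.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, attn = causal_attention(q, k, v)
print(attn[0])  # the first position can only attend to itself
```

Because the first position sees only itself, its output equals its own value vector; every later position mixes the values of all positions up to and including its own.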

Only the patch embedding layer functions as an "encoder"; there is no distinct visual transformer or CNN encoder as in prior methods (e.g., TrOCR (Li et al., 2021)). The input to the Transformer decoder is the patch sequence plus a [SEP] delimiter, followed by autoregressive generation of text tokens from a shared byte-pair token vocabulary. This unified modeling avoids separate image-text cross-attention, leveraging a single self-attention space for image patches and text.

Relative positional encodings (Transformer-XL style) enable generalization to longer patch sequences, and beam search is applied during inference until an [EOS] is produced.
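The inference procedure (patch sequence, then [SEP], then beam search until [EOS]) can be sketched as follows. This is an illustrative outline only: `beam_search`, `toy_model`, the placeholder patch tokens, the beam width, and the absence of length normalization are all assumptions, and `toy_model` is a hypothetical stand-in for the decoder that simply prefers to spell "cat".

```python
import math

SEP, EOS = "[SEP]", "[EOS]"

def beam_search(step_log_probs, prefix, beam_width=2, max_len=10):
    """Generic beam search.

    `step_log_probs(sequence)` returns {token: log_prob} for the next token
    given the full sequence so far (prefix plus generated tokens).
    """
    beams = [(0.0, [])]  # (cumulative log-probability, generated tokens)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            for tok, lp in step_log_probs(prefix + tokens).items():
                candidates.append((score + lp, tokens + [tok]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:beam_width]:
            if tokens[-1] == EOS:
                finished.append((score, tokens))  # hypothesis is complete
            else:
                beams.append((score, tokens))
        if not beams:
            break
    finished.extend(beams)  # fall back to unfinished hypotheses if max_len was hit
    return max(finished, key=lambda c: c[0])[1]

def toy_model(sequence):
    """Hypothetical stand-in for the decoder: prefers 'c', 'a', 't', then [EOS]."""
    target = ["c", "a", "t", EOS]
    n_generated = len(sequence) - sequence.index(SEP) - 1
    best = target[min(n_generated, len(target) - 1)]
    vocab = ["c", "a", "t", EOS]
    return {tok: math.log(0.7 if tok == best else 0.1) for tok in vocab}

# The prefix stands for the embedded image patches followed by the delimiter.
prefix = ["<patch_0>", "<patch_1>", "<patch_2>", SEP]
result = beam_search(toy_model, prefix)
print(result)  # prints: ['c', 'a', 't', '[EOS]']
```

In the real model the prefix entries are patch embeddings rather than symbolic tokens, and the next-token distribution comes from the Transformer's output layer over the byte-pair vocabulary.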

2. Training Regime and Pre-training Strategy

DTrOCR adopts a two-stage training procedure: (1) unsupervised pre-training on synthetic image-text pairs and (2) supervised fine-tuning on real-world OCR benchmarks.

Pre-training:

Synthetic corpora: For English – PILE (800 GB), CC100; for Chinese – Open Chinese NLP corpus.
Synthetic image rendering:
- Scene text: SynthTIGER (6B images)
- Printed text: MJSynth, SynthText, TextRender (2B+ images)
- Handwriting: TRDG with thousands of fonts (2B images)
Objective: Next-token autoregressive prediction. For a patch sequence $z_0$ and ground-truth target tokens $y_1, \ldots, y_T$,

$\mathcal{L} = -\sum_{t=1}^{T} \log p(y_t \mid z_0, y_{<t}).$
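A small worked example of the next-token objective: the loss is the summed negative log-likelihood of each ground-truth text token given the image patches and the preceding tokens. The sketch assumes the loss is taken over text positions only (patch positions have no token target), and the function names and toy logits are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def next_token_nll(logits_per_step, target_ids):
    """Sum of -log p(y_t | patches, y_<t) over the text positions."""
    total = 0.0
    for logits, target in zip(logits_per_step, target_ids):
        total -= log_softmax(logits)[target]
    return total

# Toy vocabulary of 3 tokens and a target sequence of length T = 2.
targets = [0, 2]

# An uninformed model (uniform logits) pays log(3) per token.
uniform_loss = next_token_nll([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], targets)

# A model that puts most of its mass on the correct tokens pays much less.
confident_loss = next_token_nll([[5.0, 0.0, 0.0], [0.0, 0.0, 5.0]], targets)

print(uniform_loss, confident_loss)
```

With uniform predictions the loss equals $T \log V = 2 \log 3$; it approaches zero as the model concentrates probability on the correct tokens.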

Fine-tuning:

Dataset composition:
- Scene-text: Synthetic (MJSynth, SynthText) + real (COCO-Text, RCTW, Uber-Text, ArT, LSVT, MLT19, ReCTS)
- Printed receipts: SROIE Task 2
- Handwriting: IAM handwriting dataset
- Chinese Text Recognition (CTR): Four subsets totaling ∼1.4M images
Preprocessing and augmentation: normalization, resizing, RandAugment (excluding sharpness), Gaussian blur, Poisson noise, color inversion, and rotations
Loss: Standard cross-entropy, as in the pre-training phase

3. Benchmark Results and Comparative Evaluation

DTrOCR reports substantial improvements over prior state-of-the-art models across diverse document domains and languages.

Task / Dataset	DTrOCR Performance	Previous SOTA & Reference
IIIT5K (English text)	98.4% (synthetic)	PARSeq 97.0%, MaskOCR-large 96.5%
SVT	96.9% (synthetic)	PARSeq 93.6%
IC13	98.8% (synthetic)	PARSeq 97.0%
SVTP	95.0% (synthetic)	PARSeq 88.9%
CUTE	97.6% (synthetic)	PARSeq 92.2%
SROIE (Receipts)	F1 98.37	TrOCR-Large 96.58
IAM (Handwriting)	CER 2.38%	TrOCR-Large 2.89%
Chinese Scene Text	87.4%	MaskOCR-large 76.2%
Chinese Web Text	89.7%	MaskOCR-large 76.8%
Chinese Handwriting	81.4%	MaskOCR-large 67.9%

Fine-tuning on real plus synthetic data further improves results (IIIT5K: 99.6%, SVT: 98.9%, etc.), exceeding previously reported results on these benchmarks (Fujitake, 2023).
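For readers unfamiliar with the metrics in the table, the sketch below computes a character error rate (CER, as reported for IAM) and an exact-match accuracy. It assumes CER is the total Levenshtein edit distance divided by the total number of reference characters, and that the scene-text percentages are exact-match word accuracies; both are common conventions rather than details confirmed by the source, and the sample strings are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(references, hypotheses):
    """Corpus-level CER: total edits divided by total reference characters."""
    edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    chars = sum(len(r) for r in references)
    return edits / chars

def word_accuracy(references, hypotheses):
    """Fraction of predictions that match the reference exactly."""
    return sum(r == h for r, h in zip(references, hypotheses)) / len(references)

refs = ["hello", "world"]
hyps = ["hallo", "world"]
print(cer(refs, hyps))            # prints: 0.1
print(word_accuracy(refs, hyps))  # prints: 0.5
```

A single wrong character costs a whole word under exact-match accuracy but only one edit under CER, which is why the two metrics are not directly comparable across rows.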

Ablation studies highlight that the decoder-only GPT-2 architecture, pretrained on generative language modeling, achieves the highest performance for scene and close-to-best for Chinese text recognition, compared to encoder-decoder and vision-enhanced variants. Performance scales with unique pretraining samples rather than repeated epochs on less diverse data, and with increased decoder size (e.g., 97.7% for GPT-2 small, 98.3% for GPT-2 large on English scene text).

4. Analysis: Architectural and Practical Considerations

The decoder-only design introduces several architectural and operational benefits:

Unified sequence modeling: By treating image patches and text tokens as elements of a shared sequence processed by self-attention, DTrOCR circumvents encoder-decoder cross-attention, simplifying the computational graph.
Autoregressive language modeling: Directly benefits from large-scale generative pretraining, enabling robust handling of ambiguous or occluded visual inputs by leveraging long-range context from language modeling.
Pipeline simplicity: In contrast to encoder–decoder architectures (e.g., TrOCR (Li et al., 2021)), DTrOCR eliminates the need for a separate vision encoder, external language-model rescoring, or Connectionist Temporal Classification (CTC), reducing design and parameterization complexity.
Computational efficiency: Fine-tuning directly from public GPT checkpoints is feasible, lowering computational barriers.

Limitations and Future Directions:

The approach is data-intensive, necessitating billions of synthetic pairs for high accuracy, which may incur significant resource costs.
Pure patch-based input under-represents fine visual structure (e.g., thin strokes relevant for some scripts); hybrid embeddings or adaptive patch sizes are suggested as future improvements.
Inference throughput is moderate on modern hardware (e.g., ~98 FPS on an RTX 2080 Ti for CTR); sparsity-aware attention and quantization are proposed for acceleration.
Multi-modal pre-training, larger GPT derivatives, dynamic patch-token mapping, and real-time or mobile deployment are identified as research avenues.

5. Comparison to Encoder–Decoder Transformer OCR Models

TrOCR (Li et al., 2021) exemplifies the conventional encoder–decoder paradigm, employing a Vision Transformer (ViT) encoder for image understanding and a text Transformer (e.g., RoBERTa) decoder for sequence modeling. The primary difference is the presence of explicit encoder–decoder cross-attention, allowing dynamic interaction between vision features and partly decoded text via:

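$\mathrm{CrossAttention}(Q_{\text{dec}}, K_{\text{enc}}, V_{\text{enc}}) = \mathrm{Softmax}\left(\frac{Q_{\text{dec}} K_{\text{enc}}^T}{\sqrt{d_k}}\right) V_{\text{enc}}$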

This design allows the decoder to attend dynamically to the entire visual context at each step. TrOCR and its descendants achieve strong results but introduce more parameters and complexity. Ablation studies in DTrOCR demonstrate that, given sufficient pre-training, decoder-only architectures can match or surpass encoder–decoder frameworks in both accuracy metrics and architectural efficiency (Fujitake, 2023).
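To make the contrast concrete, here is a minimal single-head cross-attention sketch: queries come from the decoder's text states, while keys and values come from the encoder's visual features, and no causal mask is applied over the visual positions. The function name and toy vectors are hypothetical, and the learned projections are omitted.

```python
import math

def cross_attention(text_queries, visual_keys, visual_values):
    """Each text query attends over all visual positions (no causal mask)."""
    d_k = len(visual_keys[0])
    outputs = []
    for q in text_queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in visual_keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, visual_values))
            for c in range(len(visual_values[0]))
        ])
    return outputs

# Three visual positions (encoder side) and two text positions (decoder side).
visual_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
visual_values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
text_queries = [[0.0, 0.0], [4.0, 0.0]]

out = cross_attention(text_queries, visual_keys, visual_values)
print(len(out), len(out[0]))  # one output vector per text query
```

A zero query weights all visual positions equally, so its output is the mean of the value vectors. In the decoder-only design this separate attention module disappears, because patches and text tokens share one masked self-attention sequence.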

6. Operational Scope and Applications

DTrOCR is validated on a broad spectrum of text recognition scenarios:

Printed/scene text and non-Latin scripts (e.g., Chinese)
Handwritten text recognition (contemporary and variable-genre corpora)
Multi-line text, occluded/irregular documents, and complex layouts

Its end-to-end structure (direct image-to-text), lack of external language modeling or CTC, and extensibility across diverse visual-linguistic domains position it as a generalizable OCR backbone.

7. Prospective Directions and Extensions

DTrOCR’s framework enables further exploration:

Integration of multi-modal objectives (joint vision–language pretraining)
Diversifying input patch representations (hybrid or dynamic strategies)
Real-time deployment strategies (quantization, model distillation, lightweight variants)
Unified detection–recognition (e.g., DETR-style extensions) for end-to-end document analysis pipelines
Application to low-resource languages via self-supervised pretraining and domain-adaptive fine-tuning

A plausible implication is that the barrier between vision and language modeling in OCR can be further eroded, moving towards architectures that treat all input and output modalities as elements in a single, unified generative sequence (Fujitake, 2023; Li et al., 2021).

References

Fujitake (2023). DTrOCR: Decoder-only Transformer for Optical Character Recognition.

Li et al. (2021). TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models.
