MinerU-Diffusion: Vision-Conditioned OCR

Updated 4 July 2026

MinerU-Diffusion is a diffusion-based OCR framework that redefines document parsing as inverse rendering, jointly decoding text, layout, tables, and formulas.
It uses a block-wise diffusion decoder to partition the output sequence for parallel refinement, reducing latency and mitigating error propagation compared to autoregressive methods.
The framework achieves up to 3.2x faster decoding and enhanced robustness under semantic perturbations through an uncertainty-driven curriculum learning strategy.

MinerU-Diffusion is a diffusion-based, vision-conditioned document OCR framework that treats long-form document parsing as inverse rendering rather than left-to-right text generation. It is designed to recover unified structured sequences containing text, layout, tables, and formulas from page images or cropped regions, replacing autoregressive decoding with parallel diffusion denoising under visual conditioning. The central claim is that document OCR is not intrinsically causal: serialization into a 1D token stream is a modeling convenience, whereas the underlying dependencies are induced by 2D structure, local visual evidence, and region-level constraints. On this basis, MinerU-Diffusion employs a block-wise diffusion decoder and an uncertainty-driven curriculum learning strategy, and reports up to $3.2\times$ faster decoding than autoregressive baselines while improving robustness, particularly when semantic priors are perturbed in the proposed Semantic Shuffle benchmark (Dong et al., 23 Mar 2026).

1. Problem formulation and inverse-rendering perspective

MinerU-Diffusion is motivated by three failure modes of autoregressive document OCR. First, sequential latency scales linearly with output length, so token-by-token decoding is slow when a page serializes into thousands of tokens, as in multi-column text, nested tables, or lengthy formulas. Second, early recognition errors become conditioning context for subsequent generation, producing error propagation over long outputs. Third, autoregressive models entangle recognition with language modeling, which is particularly problematic in tables and formulas, where the correct output is visually grounded and highly local; under ambiguity or semantic perturbation, the model may hallucinate plausible text or incorrect structures (Dong et al., 23 Mar 2026).

The inverse-rendering formulation asserts that OCR should be understood as reconstruction of a structured token field from pixels. In this view, a document page is not a left-to-right causal process, and the reading order imposed by serialization is not equivalent to the dependency structure of the task. The framework therefore treats decoding as iterative global refinement under image conditioning, with many positions updated in parallel rather than committed sequentially. This perspective is especially consequential for structured documents, because layout boundaries, table cells, and mathematical symbols often require simultaneous resolution of local visual constraints rather than strong dependence on preceding linguistic context.

A related misconception is that long-document OCR necessarily benefits from stronger language completion. MinerU-Diffusion explicitly argues the opposite for many document elements: when semantic regularity is weak, disrupted, or misleading, stronger visual grounding is preferable to stronger autoregressive priors. The Semantic Shuffle benchmark is introduced precisely to probe this distinction.

2. Unified output representation and visual conditioning

The system uses a high-level pipeline with three stages. A vision encoder first produces native-resolution visual tokens from the page or cropped regions. A diffusion-based decoder then reconstructs a unified structured sequence under visual conditioning. Finally, confidence-guided parallel decoding confirms high-confidence predictions while retaining ambiguous positions in masked form for further refinement.

The output representation is a single sequence over a unified vocabulary $\mathcal{V}$ that encodes heterogeneous document elements. Layout is represented by tags such as <|box_start|> x1 y1 x2 y2 <|box_end|>, <|ref_start|> category <|ref_end|>, and orientation markers such as <|rotate_up|>. Tables use OTSL tokens including <fcel> for filled cells, <ecel> for empty cells, <ucel> for up-looking cells that merge with the cell above, and <nl> for row breaks. Formulas are emitted as plain LaTeX strings. Text is serialized as raw characters or words in reading order. This single-sequence interface allows the model to parse text, layout, table structure, and mathematical notation jointly rather than with separate specialized decoders.
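As a rough illustration of the table portion of this vocabulary, the following Python sketch splits an OTSL-style string into rows of cells. The function name `parse_otsl`, the sample string, and the convention that cell text directly follows its cell tag are assumptions made for the example; the exact serialization used by MinerU-Diffusion is not spelled out here.

```python
import re

# Hypothetical helper: the tag names come from the text above, but the exact
# serialization (cell text directly after its tag) is an assumption.
CELL_TAGS = ("<fcel>", "<ecel>", "<ucel>")
TOKEN_RE = re.compile(r"(<fcel>|<ecel>|<ucel>|<nl>)")

def parse_otsl(sequence):
    """Split an OTSL-style string into rows of (tag, text) cells."""
    rows, current = [], []
    parts = [p for p in TOKEN_RE.split(sequence) if p != ""]
    i = 0
    while i < len(parts):
        part = parts[i]
        if part == "<nl>":
            # A row break closes the current row.
            rows.append(current)
            current = []
        elif part in CELL_TAGS:
            text = ""
            # Text following a cell tag (if any) belongs to that cell.
            if i + 1 < len(parts) and not TOKEN_RE.fullmatch(parts[i + 1]):
                text = parts[i + 1].strip()
                i += 1
            current.append((part, text))
        i += 1
    if current:  # tolerate a missing trailing <nl>
        rows.append(current)
    return rows

table = parse_otsl("<fcel>Item<fcel>Price<nl><fcel>Pen<ecel><nl>")
for row in table:
    print(row)
```

The sketch only recovers the row and cell grid; resolving merged cells would need additional logic that is not shown.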

Visual conditioning is realized by multimodal attention. The decoder attends to image tokens and prompt tokens throughout decoding, grounding denoising in pixels rather than in linguistic priors alone. In implementation terms, textual query positions cross-attend to visual keys and values, and native-scale visual features are preserved rather than collapsed into aggressively downsampled representations. This design is consistent with the paper’s argument that document OCR is better modeled as structured reconstruction from visual evidence than as causal language generation (Dong et al., 23 Mar 2026).

3. Block-wise diffusion decoder

The core decoder partitions the output sequence $y \in \mathcal{V}^L$ into $B$ contiguous blocks of equal length $L'$, with $L = B \cdot L'$ and

$y = \bigl(y^{(1)}, \ldots, y^{(B)}\bigr), \qquad y^{(b)} \in \mathcal{V}^{L'}.$

Generation is factorized across blocks but refined in parallel within each block:

$p_\theta(y \mid x) = \prod_{b=1}^{B} p_\theta\bigl(y^{(b)} \mid y^{(<b)}, x\bigr),$

where each block conditional is realized by iterative reverse denoising steps of the form

$p_\theta\bigl(y^{(b)}_{t-1} \mid y^{(b)}_t, y^{(<b)}, x\bigr).$

The paper describes this as “self-regressive across blocks, diffusion within blocks.” The cross-block causal chain provides coarse positional anchoring and long-range alignment, while the within-block bidirectional denoising permits parallel refinement of many tokens at once (Dong et al., 23 Mar 2026).

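The decoder is a transformer equipped with a structured attention mask $\mathcal{M}_{ij}$. Writing $\mathrm{blk}(i) = \lceil i / L' \rceil$ for the block index of position $i$,

$\mathcal{M}_{ij} = \begin{cases} 1, & \mathrm{blk}(j) = \mathrm{blk}(i), \\ 1, & \mathrm{blk}(j) < \mathrm{blk}(i), \\ 0, & \mathrm{blk}(j) > \mathrm{blk}(i). \end{cases}$

This means full bidirectional attention within a block, causal attention to all previous blocks, and no attention to future blocks. Relative to full attention, whose cost is quadratic in the sequence length $L$, the block mask reduces attention cost, and the causal block chain enables KV-cache reuse at inference.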
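The masking rule described above can be sketched in a few lines of Python. This is a minimal illustration using nested lists rather than tensors; the function name `block_attention_mask` and the 0-indexed positions are choices made for the example, not the paper's implementation.

```python
def block_attention_mask(seq_len, block_len):
    """Return mask[i][j] = 1 if query position i may attend to key position j."""
    if seq_len % block_len != 0:
        raise ValueError("seq_len must be a multiple of block_len")
    # Block index of every position (0-indexed positions and blocks).
    block_of = [i // block_len for i in range(seq_len)]
    # Attend within the same block (bidirectional) and to earlier blocks
    # (causal across blocks); never to later blocks.
    return [
        [1 if block_of[j] <= block_of[i] else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]

mask = block_attention_mask(seq_len=6, block_len=2)
for row in mask:
    print(row)
```

With three blocks of two tokens, the first two rows see only the first block, the middle two rows see the first two blocks, and the last two rows see everything.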

Inference proceeds by masked categorical denoising. The target sequence is initialized with all positions masked. For each block $b$, the model performs repeated denoising steps. At each step it predicts distributions over masked positions, computes a confidence score $s_i$ for each masked position $i$, confirms positions whose confidence exceeds a threshold $\tau$, and leaves others masked for further passes. The default dynamic schedule fixes $\tau$ together with a sampling temperature and top-$k$ and top-$p$ settings. Static schedules, such as fixed step counts of 6 or 32, are also supported. Lower $\tau$ increases parallelism and throughput; very high $\tau$ approaches token-by-token decoding and reduces efficiency.
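The confirm-or-keep-masked loop for a single block can be sketched as follows. This is a toy under stated assumptions: `predict` is a hypothetical stand-in for the decoder, the confidence values are invented, and the fallback that commits the single most confident position when nothing clears the threshold is an assumption added so the loop always makes progress.

```python
MASK = None  # sentinel for a still-masked position

def decode_block(predict, block_len, threshold, max_steps=64):
    """Confidence-guided parallel unmasking of one block.

    `predict(tokens)` is a hypothetical model call that returns one
    (token, confidence) pair per position given the partial block.
    """
    tokens = [MASK] * block_len
    steps = 0
    while MASK in tokens and steps < max_steps:
        preds = predict(tokens)
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        accepted = [i for i in masked if preds[i][1] > threshold]
        if not accepted:
            # Assumption: commit the most confident position so that
            # every step confirms at least one token.
            accepted = [max(masked, key=lambda i: preds[i][1])]
        for i in accepted:
            tokens[i] = preds[i][0]
        steps += 1
    return tokens, steps

TARGET = list("DOCUMENT")
BASE_CONF = [99, 60, 97, 52, 98, 70, 96, 51]  # invented, in percent

def toy_predict(tokens):
    # Toy model: confidence grows as more context is confirmed.
    known = sum(t is not MASK for t in tokens)
    return [(TARGET[i], min(100, BASE_CONF[i] + 10 * known) / 100)
            for i in range(len(tokens))]

for tau in (0.95, 0.5):
    out, n_steps = decode_block(toy_predict, len(TARGET), tau)
    print(tau, "".join(out), n_steps)
```

In this toy setup the higher threshold needs several passes while the lower one confirms the whole block at once, which mirrors the throughput trade-off described above.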

The architectural choice is not merely computational. The paper reports that full-attention diffusion suffers from quadratic cost, positional instability, and repetition under length mismatch, whereas block attention reduces memory and latency, improves scalability and stability, and mitigates repetition. This makes the block structure both an efficiency mechanism and a regularizer for long-sequence OCR.

4. Discrete diffusion formulation and uncertainty-driven curriculum

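MinerU-Diffusion uses masked discrete diffusion rather than continuous Gaussian DDPM. The time-agnostic forward corruption of a clean sequence $y_0$ is

$q(y_t \mid y_0) = \prod_{i=1}^{L} \Bigl[(1 - t)\,\mathbb{1}\bigl[y_t^i = y_0^i\bigr] + t\,\mathbb{1}\bigl[y_t^i = \text{[MASK]}\bigr]\Bigr], \qquad t \in [0, 1].$

A discrete-time forward process is written as

$q\bigl(y_t^i = \text{[MASK]} \mid y_0^i\bigr) = \gamma_t, \qquad q\bigl(y_t^i = y_0^i \mid y_0^i\bigr) = 1 - \gamma_t,$

where $\gamma_t$ is a monotonically increasing mask schedule. Reverse denoising within a block predicts clean tokens for masked positions, conditioned on visual features and previously confirmed blocks: $p_\theta\bigl(y_0^{(b)} \mid y_t^{(b)}, y^{(<b)}, x\bigr).$ Training reduces to cross-entropy on masked positions with random $t$ or discrete steps, plus task prompts $c$. The simplified objective is

$\mathcal{L}(\theta) = -\,\mathbb{E}_{t,\,y_0,\,y_t}\Bigl[\sum_{i\,:\,y_t^i = \text{[MASK]}} \log p_\theta\bigl(y_0^i \mid y_t, x, c\bigr)\Bigr].$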
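The training signal described above, cross-entropy restricted to masked positions, can be sketched in plain Python. This is an illustration only: `MASK_ID`, the dictionary-based `probs` format, and the averaging over masked positions are assumptions made for the example, and no real model is involved.

```python
import math
import random

MASK_ID = 0  # hypothetical id reserved for the mask token

def corrupt(clean, mask_ratio, rng):
    """Independently replace each token with MASK_ID with probability mask_ratio."""
    return [MASK_ID if rng.random() < mask_ratio else tok for tok in clean]

def masked_cross_entropy(clean, noisy, probs):
    """Mean negative log-likelihood of the clean token at masked positions.

    probs[i] is a dict {token_id: probability} predicted for position i.
    Unmasked positions contribute nothing to the loss.
    """
    masked = [i for i, tok in enumerate(noisy) if tok == MASK_ID]
    if not masked:
        return 0.0
    return -sum(math.log(probs[i][clean[i]]) for i in masked) / len(masked)

clean = [5, 7, 9, 11]
noisy = [5, MASK_ID, 9, MASK_ID]  # a hand-picked corruption for the demo
uniform = [{5: 0.25, 7: 0.25, 9: 0.25, 11: 0.25} for _ in clean]
print(masked_cross_entropy(clean, noisy, uniform))

rng = random.Random(0)
print(corrupt(clean, 0.5, rng))
```

A uniform predictor over four tokens gives a loss of log 4 on the masked positions, which is a convenient sanity check for the loss function.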

A common misunderstanding is to map this directly onto image-style Gaussian diffusion. The paper explicitly states that no $\epsilon$-prediction is used, that the decoder predicts categorical logits for masked tokens, and that classifier-free guidance is not reported; conditioning is achieved through multimodal attention to visual tokens and prompts (Dong et al., 23 Mar 2026).

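Training uses a two-stage curriculum to address any-order optimization sensitivity and label noise. Stage I, termed diversity-driven foundational learning, trains on $\mathcal{D}_{\text{base}}$ sampled from a diverse high-entropy distribution $P_{\text{div}}$, covering many layouts, languages, and document styles. Stage II, termed uncertainty-driven boundary refinement, mines hard cases using inference consistency over $K$ stochastic inference passes: $C(x) = \frac{2}{K(K-1)} \sum_{1 \le i < j \le K} S\bigl(\hat{y}_i, \hat{y}_j\bigr),$ where $S$ is PageIoU for layout, CDM for formulas, or TEDS for tables. Hard samples are selected as

$\mathcal{D}_{\text{hard}} = \{\, x : C(x) < \delta \,\}$

for a consistency threshold $\delta$, then refined into high-precision labels $\tilde{y}$. The fine-tuning set is

$\mathcal{D}_{\text{ft}} = \{\, (x, \tilde{y}) : x \in \mathcal{D}_{\text{hard}} \,\} \cup \mathcal{D}_{\text{replay}}$

with weighted loss

$\mathcal{L}_{\text{ft}}(\theta) = \mathbb{E}_{(x, y) \sim \mathcal{D}_{\text{ft}}}\bigl[\, w(x)\, \mathcal{L}(\theta; x, y) \,\bigr].$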

The reported effect is smoother optimization, improved boundary precision, and better handling of residual long-tail errors.
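The hard-case mining step of Stage II can be illustrated with a small Python sketch. The mean pairwise aggregation, the threshold value, and the `token_overlap` similarity (a toy stand-in for PageIoU, CDM, or TEDS) are all assumptions made for the example rather than the paper's exact procedure.

```python
from itertools import combinations

def consistency(outputs, similarity):
    """Mean pairwise similarity among K stochastic decoding outputs."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        raise ValueError("need at least two outputs")
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)

def token_overlap(a, b):
    """Toy similarity: Jaccard overlap of whitespace-separated tokens."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def mine_hard(samples, threshold):
    """Return ids of samples whose outputs disagree across passes."""
    return [sid for sid, outs in samples.items()
            if consistency(outs, token_overlap) < threshold]

# Invented outputs from K = 3 stochastic passes per sample.
samples = {
    "page_a": ["x = 1", "x = 1", "x = 1"],
    "page_b": ["a b c", "a b d", "a e f"],
}
print(mine_hard(samples, threshold=0.8))
```

Samples whose repeated decodings agree are treated as already learned, while samples with unstable outputs are routed to label refinement.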

5. Training recipe, architecture, and implementation

The training corpus is based on the MinerU2.5 corpus, which consists mainly of Chinese and English document-parsing samples and includes layout-heavy, table-rich, and formula-rich pages. A Stage-0 modality-alignment phase uses LLaVA-Pretrain (550K) and LLaVA-NeXT (739K). Stage-0a trains only the MLP adaptor with sequence length 4096, batch size 256, warmup 0.03, and 1 epoch. Stage-0b makes all parameters trainable with sequence length 8192, batch size 64, warmup 0.03, and 1 epoch. Stage-1, on the Stage I foundational data, uses sequence length 12288, batch size 256, warmup 0.03, 9 epochs, and document augmentations. Stage-2, on the Stage II fine-tuning set, uses sequence length 16384, batch size 256, warmup 0.1, 4 epochs, augmentations, and hard samples mixed with random replay. A learning rate is specified separately for each stage (Dong et al., 23 Mar 2026).

The vision encoder is initialized from Qwen2-VL-7B. Native-scale handling is implemented with Patch-n-Pack (NaViT-like) tiling, and the visual token budget per image ranges from 4 to 2048 tokens. The diffusion decoder uses an SDAR-1.7B-Chat-b32 backbone with block size 32, for a total system size of approximately 2.5B parameters. Compared with MinerU2.5, M-RoPE is removed. The transformer uses the block attention mask described above, and multimodal cross-attention conditions decoding on the vision tokens.

Data augmentation includes geometric transformations such as scaling, grid distortion, and rotation; background perturbations such as textures, watermarking, shadows, and scanlines; photometric changes including brightness, contrast, illumination, and RGB shift; and degradation operators such as PSF blur, Gaussian blur, and light erosion or dilation. Layout samples avoid spatial augmentation that would corrupt geometry. Page-level layout detection and crop-level recognition are trained jointly in a unified model, and inputs are resized only if the visual token budget would otherwise be exceeded.
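The "resize only when the budget is exceeded" rule can be sketched as below. This is a hedged illustration: the 28-pixel tile per visual token is an assumption chosen for the example, the function names are hypothetical, and the lower bound of the token range is not handled.

```python
import math

def count_tokens(width, height, px_per_token=28):
    """Number of visual tokens if each token covers a px_per_token square tile."""
    return math.ceil(width / px_per_token) * math.ceil(height / px_per_token)

def fit_to_token_budget(width, height, max_tokens=2048, px_per_token=28):
    """Return (width, height), scaled down only if the budget is exceeded.

    px_per_token=28 is an assumption; the text above only states the
    upper budget of 2048 tokens.
    """
    if count_tokens(width, height, px_per_token) <= max_tokens:
        return width, height  # keep native resolution
    # Scale both sides by the same factor to preserve the aspect ratio.
    scale = math.sqrt(max_tokens / count_tokens(width, height, px_per_token))
    new_w, new_h = int(width * scale), int(height * scale)
    # Ceiling effects can leave the count slightly over budget; shrink until it fits.
    while count_tokens(new_w, new_h, px_per_token) > max_tokens:
        new_w, new_h = int(new_w * 0.99), int(new_h * 0.99)
    return max(new_w, 1), max(new_h, 1)

print(fit_to_token_budget(560, 280))    # small crop, left untouched
print(fit_to_token_budget(1240, 1754))  # full page, scaled down
```

Small crops pass through at native resolution, which matches the stated intent of preserving fine visual detail whenever the budget allows.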

The reported training recipe is intended to be reproducible. The paper provides code at https://github.com/opendatalab/MinerU-Diffusion and a model checkpoint at https://huggingface.co/opendatalab/MinerU-Diffusion-V1-0320-2.5B.

6. Empirical performance, robustness, and limitations

On OmniDocBench v1.5 without GT Layout, MinerU-Diffusion reports Overall 88.94, Text edit distance 0.061, Formula CDM 86.41, Table TEDS 86.50, Table TEDS-S 90.29, and Reading Order 0.059. With GT Layout, it reports Overall 93.37, Text 0.028, Formula 91.92, Table TEDS 91.00, and TEDS-S 94.86. The paper states that this is competitive with top autoregressive systems, citing MinerU2.5 at Overall 93.44 and PaddleOCR-VL at 93.91 under the recognition-focused setting with GT Layout. For page-type breakdown without GT Layout, text edit distance is reported as 0.0440 on Slides, 0.0244 on Academic Papers, 0.0444 on Book, 0.0677 on Textbook, 0.0580 on Exam Papers, 0.0474 on Magazine, 0.0863 on Newspaper, 0.1618 on Notes, and 0.0104 on Financial Report (Dong et al., 23 Mar 2026).

On element-specific tasks, table performance is reported as TEDS 73.77 and TEDS-S 82.06 on CC-OCR, and TEDS 81.18 and TEDS-S 88.66 on OCRBench v2. Formula recognition on UniMER-Test is reported as CPE 91.6, HWE 91.6, SCE 92.0, and SPE 96.8. These results support the claim that the unified sequence interface can handle heterogeneous document elements without a separate decoder per subtask.

The speed results are central to the method’s positioning. Against MinerU2.5, which decodes at approximately 52 tokens per second (TPS), MinerU-Diffusion achieves 108.9 TPS at 93%+ accuracy under a dynamic confidence threshold, corresponding to approximately $2.1\times$ speedup. Peak acceleration is reported at a more aggressive threshold setting with greater than 90% accuracy, and the front-panel figure reports up to $3.2\times$ speedup. The reported accuracy-efficiency trade-off curve is characterized at 99.9% and 98.8% relative accuracy. In decoding-strategy ablations on OmniDocBench v1.5 with GT Layout, static step 6 yields Overall 88.31, TPS 91.56, and 5.33 tokens per forward pass (TPF); static step 32 yields Overall 93.02, TPS 21.86, and TPF 1.00; the dynamic schedule yields Overall 93.34, TPS 98.32, and TPF 5.18. The autoregressive baselines are listed as MinerU2.5 at 51.46 TPS and PaddleOCR-VL at 40.77 TPS.
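The throughput figures quoted above can be turned into speedup ratios with a few lines of arithmetic. The sketch below uses only numbers from the text; reading TPF as block size divided by the number of denoising steps is an assumption, checked here against the static schedules with a block size of 32.

```python
# Throughput (tokens per second) as quoted in the text above.
BASELINE_TPS = {"MinerU2.5": 51.46, "PaddleOCR-VL": 40.77}
DIFFUSION_TPS = {"static-6": 91.56, "static-32": 21.86, "dynamic": 98.32}

def speedup(tps, baseline_tps):
    """Ratio of diffusion throughput to baseline throughput."""
    return tps / baseline_tps

def tokens_per_forward(block_size, steps):
    """Assumed reading of TPF: tokens confirmed per forward pass."""
    return block_size / steps

for name, tps in DIFFUSION_TPS.items():
    row = {b: round(speedup(tps, v), 2) for b, v in BASELINE_TPS.items()}
    print(name, row)

print(round(speedup(108.9, BASELINE_TPS["MinerU2.5"]), 2))
print(round(tokens_per_forward(32, 6), 2), round(tokens_per_forward(32, 32), 2))
```

With these numbers, the 108.9 TPS operating point comes out at roughly 2.1 times MinerU2.5, and a 32-token block decoded in 6 or 32 steps gives 5.33 and 1.00 tokens per forward pass, matching the reported static-schedule TPF values.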

Robustness is examined through the Semantic Shuffle benchmark, constructed by shuffling words in document pages from the FOX dataset while preserving visual formatting. The reported finding is that autoregressive decoders degrade sharply as distortion increases, whereas diffusion decoding remains nearly constant across distortions. The interpretation given in the paper is that diffusion decoding is more strongly visually grounded and less dependent on language completion.
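The text side of such a perturbation can be sketched as follows. This is a hypothetical illustration: the `ratio` parameter as a distortion level, the seeding, and the function name are assumptions, and the actual benchmark also involves page images with preserved visual formatting, which this sketch does not produce.

```python
import random

def semantic_shuffle(text, ratio, seed=0):
    """Shuffle a fraction `ratio` of word positions, leaving the rest fixed."""
    words = text.split()
    rng = random.Random(seed)
    k = round(len(words) * ratio)
    # Choose which positions take part in the shuffle.
    positions = sorted(rng.sample(range(len(words)), k))
    picked = [words[i] for i in positions]
    rng.shuffle(picked)
    # Write the shuffled words back into the chosen positions only.
    for i, w in zip(positions, picked):
        words[i] = w
    return " ".join(words)

sentence = "the quick brown fox jumps over the lazy dog"
for level in (0.0, 0.5, 1.0):
    print(level, semantic_shuffle(sentence, level))
```

Because the words themselves are unchanged, a purely visual recognizer faces the same glyphs at every distortion level, while a model leaning on language priors loses its contextual cues.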

The limitations are explicit. Layout prediction remains a bottleneck, as shown by the gap between results without GT Layout and with GT Layout. Extremely long sequences and extreme layouts still challenge block scheduling and memory. Overly conservative confidence thresholds reduce throughput, whereas overly aggressive thresholds can slightly reduce accuracy. Rare symbols and highly complex formulas or tables may require specialized tokenization or local structural constraints. Although block attention lowers attention complexity relative to full attention and enables KV caching, high-resolution visual tokens and long sequences still require substantial GPU memory. The paper identifies better layout priors, block-size adaptation, dynamic block partitioning, structure-aware losses for tables and formulas, hybrid SDAR-style interpolation between autoregression and diffusion, multilingual extension, and specialized symbol vocabularies as plausible future directions (Dong et al., 23 Mar 2026).

References (1)

Dong et al. (23 Mar 2026). MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding.