DeepEncoder V2: Causal Visual Encoding
- The paper introduces a causal, content-driven reordering mechanism that transforms 2D visual information into a semantically ordered 1D sequence for improved OCR and decoding.
- It employs a dual-stream Transformer architecture with flow queries and a custom block-structured attention mask to facilitate autoregressive downstream processing.
- Empirical results show notable gains in accuracy (+3.73%) and reduced edit distances, demonstrating the model’s effectiveness in parsing complex visual documents.
DeepEncoder V2 is an LLM-style visual encoder introduced in DeepSeek-OCR 2, designed to dynamically reorder visual tokens based on image semantics, specifically enabling a causal, content-driven sequence analogous to human visual fixations. Unlike conventional raster-order processing in VLMs, which constrains token order to spatial traversal (top-left to bottom-right), DeepEncoder V2 incorporates causal reasoning to facilitate a reordering process that better reflects the logical and semantic hierarchy of complex visual documents. This mechanism is central to the two-cascaded 1D causal reasoning architecture, which aims to make 2D image understanding tractable via sequential Transformer-based methods. It replaces prior CLIP/ViT-based extractors with a unified Transformer encoder that outputs a causally-ordered visual sequence for downstream LLM decoders, yielding significant improvements in reading-order and logical structure parsing compared to traditional methods (Wei et al., 28 Jan 2026).
1. Architectural Design and Relation to Prior Encoders
DeepEncoder V2 is situated within the encoder–decoder paradigm of DeepSeek-OCR 2, acting as the primary module for preprocessing visual tokens before LLM decoding. In a departure from prior two-stage encoders (e.g., the CLIP-ViT backbone plus Q-Former of DeepEncoder V1), DeepEncoder V2 employs a single, decoder-only Transformer backbone (Qwen2-0.5B) augmented with dual input streams:
- A vision tokenizer (SAM-Base combined with convolutional layers) that produces $m$ visual tokens $V \in \mathbb{R}^{m \times d}$.
- $n$ learnable "causal flow queries" $Q_0 \in \mathbb{R}^{n \times d}$ appended as a suffix.
These are concatenated and processed by a Transformer with $L$ layers and a custom block-structured attention mask $M$. Only the causal flow query outputs are routed to the downstream LLM decoder. Setting $n = m$ provides full capacity for the model to refixate each visual token, with the output sequence order determined by semantics rather than spatial layout.
Relative to CLIP, ViT, and DETR-style Q-Formers, DeepEncoder V2 differs via: (1) a single-stage Transformer, (2) dual-stream attention (visual tokens bidirectionally, flow tokens causally with cross-attention to visuals), and (3) cascaded 1D reasoning—reordering 2D input into a causally-ordered 1D stream for autoregressive downstream processing.
2. Causal Reasoning Modules: Structure and Mathematical Formalism
The causal reasoning module is designed to emulate human visual fixations, where each new "glimpse" is conditioned on previously attended semantic content rather than physical proximity. The construction is as follows:
- Let $V$ (visual embeddings, positions $1 \dots m$) participate in standard bidirectional self-attention.
- Let $Q$ (flow queries, positions $m+1 \dots m+n$) interact causally via a lower-triangular mask, with full cross-attention to all visual tokens.
The combined sequence $X^0 = [V; Q_0] \in \mathbb{R}^{(m+n) \times d}$ is processed by the Transformer with a block-causal attention mask $M \in \{0,1\}^{(m+n) \times (m+n)}$, given by:

$$M = \begin{pmatrix} \mathbf{1}_{m \times m} & \mathbf{0}_{m \times n} \\ \mathbf{1}_{n \times m} & T \end{pmatrix}$$

where $T_{ij} = 1$ if $i \ge j$, $0$ otherwise.
The visual tokens can freely attend to one another, and flow queries can attend to all visuals and to previous flow queries; attention from visual tokens to flow queries, and from any flow query to future flow queries, is masked out. After $L$ layers, the reordered embedding sequence is extracted as $Z = X^L[m+1 : m+n]$. Each causal flow query thus encodes a learned position in the sequence, transforming the spatial 2D layout into a semantically meaningful 1D causal order.
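As a concrete illustration, the block-causal mask described above can be built in a few lines of NumPy (a minimal sketch; the helper name `block_causal_mask` and the toy sizes are illustrative, not from the paper):

```python
import numpy as np

def block_causal_mask(m: int, n: int) -> np.ndarray:
    """Build the (m+n)x(m+n) block mask M: 1 = attend, 0 = masked.

    Layout (row = query position, column = key position):
      [ V -> V : all ones  | V -> Q : zeros            ]
      [ Q -> V : all ones  | Q -> Q : lower-triangular ]
    """
    M = np.zeros((m + n, m + n), dtype=np.int8)
    M[:m, :m] = 1                                         # visuals attend bidirectionally
    M[m:, :m] = 1                                         # flow queries see every visual
    M[m:, m:] = np.tril(np.ones((n, n), dtype=np.int8))   # causal among flow queries
    return M

mask = block_causal_mask(m=3, n=2)
```

For `m=3, n=2`, the bottom-right 2×2 block is lower-triangular, so the first flow query cannot attend to the second, while both attend to every visual token.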
3. Dynamic Visual-Token Reordering Algorithm
The algorithm for semantic-driven reordering is formalized as:
```
V  = ℰ(I)                          # V ∈ ℝ^{m×d}
X⁰ = concat(V, Q₀)                 # X⁰ ∈ ℝ^{(m+n)×d}
for ℓ in 1..L:
    X^ℓ = TransformerLayer(X^{ℓ-1}; mask=M)
Z = X^L[m+1 : m+n]                 # Z ∈ ℝ^{n×d}
return Z
```
Each flow query at position $t$ selects, via causal attention, which visual features to aggregate based on the semantic context accumulated up to step $t$, resulting in a flexible, data-driven reordering of the visual token sequence.
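To make this aggregation concrete, a single masked attention step can be simulated in NumPy (a toy sketch with hypothetical sizes and random weights, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 4, 2, 8                      # toy sizes: 4 visual tokens, 2 flow queries

X = rng.normal(size=(m + n, d))        # concatenated [V; Q] embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d)

# Block-causal mask: visuals see visuals; flow query t sees all visuals
# and flow queries up to t.
M = np.zeros((m + n, m + n))
M[:m, :m] = 1
M[m:, :m] = 1
M[m:, m:] = np.tril(np.ones((n, n)))

scores = np.where(M == 1, scores, -np.inf)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V                      # each row aggregates only permitted keys
```

Inspecting `weights` shows the masked structure directly: the first flow query places exactly zero weight on the second, and visual rows place zero weight on all flow queries.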
4. Positional Encoding and Sequence Adaptation
DeepEncoder V2 leverages rotary positional embeddings (RoPE) inherited from Qwen2-0.5B, applied to both queries and keys as:

$$q_i' = R(\theta, i)\, q_i, \qquad k_j' = R(\theta, j)\, k_j$$

where $R(\theta, i)$ denotes a block-diagonal rotation parameterized by the 1D index $i$.
Critical to the architecture, there is no introduction of 2D spatial positional encoding; instead, the effective 1D positions of the flow queries in the reordered output directly encode the learned scanpath. This preserves a coherent, strictly causal ordering aligned with the target autoregressive decoding trajectory.
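A minimal NumPy sketch of a 1D rotary embedding along these lines (the `rope` helper and its channel-pairing convention are assumptions for illustration; implementations differ in how channels are paired):

```python
import numpy as np

def rope(x: np.ndarray, pos: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embedding to x of shape (L, d) at 1D positions pos.

    Channel pairs (2i, 2i+1) are rotated by angle pos * base^{-i/(d/2)},
    i.e. a block-diagonal rotation parameterized solely by the 1D index.
    """
    L, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequency
    angles = pos[:, None] * freqs[None, :]      # (L, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin          # 2x2 rotation per channel pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because each block is a pure rotation, vector norms are preserved, and the query–key dot product after RoPE depends only on the relative offset of the 1D indices.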
5. Optimization Objectives and Training Procedures
DeepSeek-OCR 2, incorporating DeepEncoder V2, is optimized solely via next-token prediction (cross-entropy) over the combined image–text sequence:

$$\mathcal{L} = -\sum_{t=1}^{T} \log p_\theta\!\left(y_t \mid y_{<t},\, Z\right)$$

with $Z$ the causally reordered visual sequence produced by the encoder. No additional auxiliary (ordering-specific) losses are introduced. Optimization is performed with AdamW, using cosine or step-decay learning-rate schedules.
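The objective can be sketched as a plain next-token negative log-likelihood (a minimal reference implementation for clarity, not the training code):

```python
import numpy as np

def next_token_nll(logits: np.ndarray, targets: np.ndarray) -> float:
    """Mean NLL of targets under logits of shape (T, vocab).

    logits[t] predicts token targets[t], i.e.
    L = -(1/T) * sum_t log p(y_t | y_<t, Z).
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())
```

Sanity check: with uniform logits over a vocabulary of size V, the loss equals log V.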
6. Experimental Settings and Empirical Results
Key experimental parameters are as follows:
- Visual token budget ranging from 256 to 1120, employing multi-crop strategies (0–6 local crops plus a global view).
- Datasets: 80% OCR 1.0/2.0, 20% general vision, multi-crop sampling strategy.
- Benchmarked via OmniDocBench v1.5.
Performance, as presented in Table 1 of the source, is summarized below:
| Model/Setting | Overall (%) | R-order ED | Text ED | Formula/CDM (%) |
|---|---|---|---|---|
| DeepSeek-OCR (baseline) | 87.36 | 0.085 | 0.073 | 84.14 |
| DeepSeek-OCR 2 (DE V2) | 91.09 | 0.057 | 0.048 | 90.31 |
DeepEncoder V2 yields gains in overall accuracy (+3.73%), reduced reading-order (R-order) and text edit distances, and significant improvement on formula/CDM parsing (+6.17%). Experiments demonstrate that the causal encoder provides consistent improvements, especially in tasks sensitive to logical order.
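The edit-distance (ED) columns are normalized Levenshtein distances between predicted and reference text; a minimal reference implementation (assuming normalization by the longer string's length, a common convention, though variants normalize by reference length):

```python
def normalized_edit_distance(pred: str, ref: str) -> float:
    """Levenshtein distance between pred and ref, divided by the longer length."""
    m, n = len(pred), len(ref)
    prev = list(range(n + 1))                 # DP row for the empty prefix of pred
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == ref[j - 1] else 1
            cur[j] = min(prev[j] + 1,          # deletion
                         cur[j - 1] + 1,       # insertion
                         prev[j - 1] + cost)   # substitution / match
        prev = cur
    return prev[n] / max(m, n, 1)
```

Lower is better; the drop from 0.085 to 0.057 in reading-order ED thus corresponds to roughly a third fewer ordering errors per character.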
7. Insights, Ablation Studies, and Limitations
The two-cascaded 1D reasoning pipeline decomposes complex 2D document structure first into a causally-ordered visual sequence in the encoder and then into an autoregressive textual sequence in the decoder. This suggests a plausible path toward genuine 2D reasoning in Transformer-based architectures.
Ablation studies confirm that the dual-stream prefix plus causal-mask design is critical; substituting a standard cross-attention decoder (e.g., mBART-style) fails to converge, underscoring the necessity of the custom masking strategy.
Limitations are observed when the token budget (maximum 1120) is insufficient, particularly in text-dense newspaper layouts; this is partially mitigable by increasing local crops or additional training data. Additionally, the architecture’s LLM-style design hints at extensibility to omni-modal applications (e.g., text or audio tokens with modality-specific queries).
Potential future work includes exploring multi-hop reordering, where longer causal streams might better capture hierarchical structures, and extending the approach to encompass broader visual reasoning tasks beyond OCR/document parsing (Wei et al., 28 Jan 2026).