DeepEncoder V2: Causal Visual Encoding
- The paper introduces a causal, content-driven reordering mechanism that transforms 2D visual information into a semantically ordered 1D sequence for improved OCR and decoding.
- It employs a dual-stream Transformer architecture with flow queries and a custom block-structured attention mask to facilitate autoregressive downstream processing.
- Empirical results show notable gains in accuracy (+3.73%) and reduced edit distances, demonstrating the model’s effectiveness in parsing complex visual documents.
DeepEncoder V2 is an LLM-style visual encoder introduced in DeepSeek-OCR 2, designed to dynamically reorder visual tokens based on image semantics, specifically enabling a causal, content-driven sequence analogous to human visual fixations. Unlike conventional raster-order processing in VLMs, which constrains token order to spatial traversal (top-left to bottom-right), DeepEncoder V2 incorporates causal reasoning to facilitate a reordering process that better reflects the logical and semantic hierarchy of complex visual documents. This mechanism is central to the two-cascaded 1D causal reasoning architecture, which aims to make 2D image understanding tractable via sequential Transformer-based methods. It replaces prior CLIP/ViT-based extractors with a unified Transformer encoder that outputs a causally-ordered visual sequence for downstream LLM decoders, yielding significant improvements in reading-order and logical structure parsing compared to traditional methods (Wei et al., 28 Jan 2026).
1. Architectural Design and Relation to Prior Encoders
DeepEncoder V2 is situated within the encoder–decoder paradigm of DeepSeek-OCR 2, acting as the primary module for preprocessing visual tokens before LLM decoding. In a departure from prior two-stage encoders (e.g., the CLIP-ViT backbone plus Q-Former of DeepEncoder V1), DeepEncoder V2 employs a single, decoder-only Transformer backbone (Qwen2-0.5B) augmented with dual input streams:
- A vision tokenizer (SAM-Base combined with convolutional layers) that produces $m$ visual tokens $V \in \mathbb{R}^{m \times d}$.
- $n$ learnable "causal flow queries" $Q_0 \in \mathbb{R}^{n \times d}$ appended as a suffix.
These are concatenated and processed by a Transformer with $L$ layers and a custom block-structured attention mask $M$. Only the causal flow query outputs are routed to the downstream LLM decoder. Setting $n = m$ provides full capacity for the model to refixate each visual token, with the output sequence order determined by semantics rather than spatial layout.
Relative to CLIP, ViT, and DETR-style Q-Formers, DeepEncoder V2 differs via: (1) a single-stage Transformer, (2) dual-stream attention (visual tokens bidirectionally, flow tokens causally with cross-attention to visuals), and (3) cascaded 1D reasoning—reordering 2D input into a causally-ordered 1D stream for autoregressive downstream processing.
2. Causal Reasoning Modules: Structure and Mathematical Formalism
The causal reasoning module is designed to emulate human visual fixations, where each new "glimpse" is conditioned on previously attended semantic content rather than physical proximity. The construction is as follows:
- Let $V$ (visual embeddings, positions $1 \dots m$) participate in standard bidirectional self-attention.
- Let $Q$ (flow queries, positions $m+1 \dots m+n$) interact causally via a lower-triangular mask, with full cross-attention to all visual tokens.
The combined sequence $X^0 = [V; Q_0] \in \mathbb{R}^{(m+n) \times d}$ is processed by the Transformer with a block-causal attention mask $M \in \{0,1\}^{(m+n) \times (m+n)}$, given by:

$$M = \begin{pmatrix} \mathbf{1}_{m \times m} & \mathbf{0}_{m \times n} \\ \mathbf{1}_{n \times m} & T \end{pmatrix}$$

where $T_{ij} = 1$ if $i \ge j$, $0$ otherwise.
The visual tokens can freely attend to one another, and flow queries can attend to all visuals and to previous flow queries; attention from visual tokens to flow queries, and from any flow query to future flow queries, is masked out. After $L$ layers, the reordered embedding sequence is extracted as $Z = X^L[m+1 : m+n]$. Each causal flow query thus encodes a learned position in the sequence, transforming the spatial 2D layout into a semantically meaningful 1D causal order.
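As a concrete illustration, the block-causal mask described above can be built in a few lines of NumPy (a minimal sketch; the helper name `block_causal_mask` and the toy sizes are illustrative, not from the paper):

```python
import numpy as np

def block_causal_mask(m: int, n: int) -> np.ndarray:
    """Build the (m+n)x(m+n) block mask M: 1 = attend, 0 = masked.

    Layout (row = query position, column = key position):
      [ V -> V : all ones  | V -> Q : zeros            ]
      [ Q -> V : all ones  | Q -> Q : lower-triangular ]
    """
    M = np.zeros((m + n, m + n), dtype=np.int8)
    M[:m, :m] = 1                                         # visuals attend bidirectionally
    M[m:, :m] = 1                                         # flow queries see every visual
    M[m:, m:] = np.tril(np.ones((n, n), dtype=np.int8))   # causal among flow queries
    return M

mask = block_causal_mask(m=3, n=2)
```

For `m=3, n=2`, the bottom-right 2×2 block is lower-triangular, so the first flow query cannot attend to the second, while both attend to every visual token.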
3. Dynamic Visual-Token Reordering Algorithm
The algorithm for semantic-driven reordering is formalized as:
```
V  = ℰ(I)                          # V ∈ ℝ^{m×d}
X⁰ = concat(V, Q₀)                 # X⁰ ∈ ℝ^{(m+n)×d}
for ℓ in 1..L:
    X^ℓ = TransformerLayer(X^{ℓ-1}; mask=M)
Z = X^L[m+1 : m+n]                 # Z ∈ ℝ^{n×d}
return Z
```
Each flow query at position $t$ selects, via causal attention, which visual features to aggregate based on the semantic context accumulated up to step $t$, resulting in a flexible, data-driven reordering of the visual token sequence.
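To make this aggregation concrete, a single masked attention step can be simulated in NumPy (a toy sketch with hypothetical sizes and random weights, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 4, 2, 8                      # toy sizes: 4 visual tokens, 2 flow queries

X = rng.normal(size=(m + n, d))        # concatenated [V; Q] embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d)

# Block-causal mask: visuals see visuals; flow query t sees all visuals
# and flow queries up to t.
M = np.zeros((m + n, m + n))
M[:m, :m] = 1
M[m:, :m] = 1
M[m:, m:] = np.tril(np.ones((n, n)))

scores = np.where(M == 1, scores, -np.inf)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V                      # each row aggregates only permitted keys
```

Inspecting `weights` shows the masked structure directly: the first flow query places exactly zero weight on the second, and visual rows place zero weight on all flow queries.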
4. Positional Encoding and Sequence Adaptation
DeepEncoder V2 leverages rotary positional embeddings (RoPE) inherited from Qwen2-0.5B, applied to both queries and keys as:

$$q_i' = R(\theta, i)\, q_i, \qquad k_j' = R(\theta, j)\, k_j$$

where $R(\theta, i)$ denotes a block-diagonal rotation parameterized by the 1D index $i$.
Critical to the architecture, there is no introduction of 2D spatial positional encoding; instead, the effective 1D positions of the flow queries in the reordered output directly encode the learned scanpath. This preserves a coherent, strictly causal ordering aligned with the target autoregressive decoding trajectory.
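A minimal NumPy sketch of a 1D rotary embedding along these lines (the `rope` helper and its channel-pairing convention are assumptions for illustration; implementations differ in how channels are paired):

```python
import numpy as np

def rope(x: np.ndarray, pos: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embedding to x of shape (L, d) at 1D positions pos.

    Channel pairs (2i, 2i+1) are rotated by angle pos * base^{-i/(d/2)},
    i.e. a block-diagonal rotation parameterized solely by the 1D index.
    """
    L, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequency
    angles = pos[:, None] * freqs[None, :]      # (L, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin          # 2x2 rotation per channel pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because each block is a pure rotation, vector norms are preserved, and the query–key dot product after RoPE depends only on the relative offset of the 1D indices.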
5. Optimization Objectives and Training Procedures
DeepSeek-OCR 2, incorporating DeepEncoder V2, is optimized solely via next-token prediction (cross-entropy) over the combined image–text sequence:

$$\mathcal{L} = -\sum_{t=1}^{T} \log p_\theta\!\left(y_t \mid y_{<t},\, Z\right)$$

with $Z$ the causally reordered visual sequence produced by the encoder. No additional auxiliary (ordering-specific) losses are introduced. Optimization is performed with AdamW, using cosine or step-decay learning-rate schedules.
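The objective can be sketched as a plain next-token negative log-likelihood (a minimal reference implementation for clarity, not the training code):

```python
import numpy as np

def next_token_nll(logits: np.ndarray, targets: np.ndarray) -> float:
    """Mean NLL of targets under logits of shape (T, vocab).

    logits[t] predicts token targets[t], i.e.
    L = -(1/T) * sum_t log p(y_t | y_<t, Z).
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())
```

Sanity check: with uniform logits over a vocabulary of size V, the loss equals log V.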
6. Experimental Settings and Empirical Results
Key experimental parameters are as follows:
- Visual token budget ranging from 256 to 1120, employing multi-crop strategies (0–6 local crops plus a global view).
- Datasets: 80% OCR 1.0/2.0, 20% general vision, multi-crop sampling strategy.
- Benchmarked via OmniDocBench v1.5.
Performance, as presented in Table 1 of the source, is summarized below:
| Model/Setting | Overall (%) | R-order ED | Text ED | Formula/CDM (%) |
|---|---|---|---|---|
| DeepSeek-OCR (baseline) | 87.36 | 0.085 | 0.073 | 84.14 |
| DeepSeek-OCR 2 (DE V2) | 91.09 | 0.057 | 0.048 | 90.31 |
DeepEncoder V2 yields gains in overall accuracy (+3.73%), reduced reading-order (R-order) and text edit distances, and significant improvement on formula/CDM parsing (+6.17%). Experiments demonstrate that the causal encoder provides consistent improvements, especially in tasks sensitive to logical order.
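The edit-distance (ED) columns are normalized Levenshtein distances between predicted and reference text; a minimal reference implementation (assuming normalization by the longer string's length, a common convention, though variants normalize by reference length):

```python
def normalized_edit_distance(pred: str, ref: str) -> float:
    """Levenshtein distance between pred and ref, divided by the longer length."""
    m, n = len(pred), len(ref)
    prev = list(range(n + 1))                 # DP row for the empty prefix of pred
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == ref[j - 1] else 1
            cur[j] = min(prev[j] + 1,          # deletion
                         cur[j - 1] + 1,       # insertion
                         prev[j - 1] + cost)   # substitution / match
        prev = cur
    return prev[n] / max(m, n, 1)
```

Lower is better; the drop from 0.085 to 0.057 in reading-order ED thus corresponds to roughly a third fewer ordering errors per character.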
7. Insights, Ablation Studies, and Limitations
The two-cascaded 1D reasoning pipeline decomposes complex 2D document structure first into a causally-ordered visual sequence in the encoder and then into an autoregressive textual sequence in the decoder. This suggests a plausible path toward genuine 2D reasoning in Transformer-based architectures.
Ablation studies confirm that the dual-stream prefix plus causal-mask design is critical; substituting a standard cross-attention decoder (e.g., mBART-style) fails to converge, underscoring the necessity of the custom masking strategy.
Limitations are observed when the token budget (maximum 1120) is insufficient, particularly in text-dense newspaper layouts; this is partially mitigable by increasing local crops or additional training data. Additionally, the architecture’s LLM-style design hints at extensibility to omni-modal applications (e.g., text or audio tokens with modality-specific queries).
Potential future work includes exploring multi-hop reordering, where longer causal streams might better capture hierarchical structures, and extending the approach to encompass broader visual reasoning tasks beyond OCR/document parsing (Wei et al., 28 Jan 2026).