DeepSeek-OCR: Optical Compression for Text
- DeepSeek-OCR is a vision-language system that compresses long textual contexts into visual tokens using convolutional downsampling and attention-based encoding.
- It employs a two-stage pipeline with a DeepEncoder for visual feature extraction and a Mixture-of-Experts decoder, achieving roughly 97% OCR precision at compression ratios below about 10×.
- The framework supports scalable, memory-efficient document processing, multilingual parsing, and advances research in neural memory and controlled forgetting.
DeepSeek-OCR is a vision-language system built to study and operationalize the optical compression of long textual contexts by representing large amounts of textual information with a highly compressed set of visual tokens. It leverages a custom optical encoder ("DeepEncoder") and a mixture-of-experts language decoder ("DeepSeek3B-MoE-A570M") to decode textual information from high-resolution images under significant compression, supporting scenarios where memory efficiency and reduced activation cost are crucial. The framework demonstrates roughly 97% OCR precision at compression ratios below about 10× and provides mechanisms both for research into memory and forgetting in neural architectures and for practical deployment in large-scale document processing (Wei et al., 21 Oct 2025).
1. Architectural Design
The DeepSeek-OCR framework comprises two principal subsystems:
- DeepEncoder: This module ingests high-resolution document images (e.g., 1024×1024 pixels), partitions them into fine-grained patch tokens (16×16 pixel patches), and applies a two-stage processing pipeline:
- Visual perception feature extraction is performed with window attention layers using a SAM-base module (ca. 80M parameters), prioritizing local feature extraction and computational parsimony.
- Visual knowledge encoding uses a CLIP-large backbone (ca. 300M parameters), employing global dense attention on the compressed token representation.
- Convolutional compression is implemented between these stages with a two-layer 2D convolutional block (kernel: 3×3, stride: 2, padding: 1), effecting a 16× reduction in token count (e.g., 4096 patch tokens → 256).
- Multiple resolutions and modes (Tiny/Small/Base/Large/Gundam) are supported via dynamic tiling and interpolation, enabling the model to flexibly match the token budget to the input scale.
- DeepSeek3B-MoE-A570M (Decoder): The decoder is a 3B-parameter mixture-of-experts LLM with only 570M parameters activated per inference (via 6-of-64 expert routing plus 2 shared experts). It receives the compressed visual tokens and performs a nonlinear mapping to reconstruct sequences in text-token space.
The architecture is engineered to keep the number of vision tokens (activation size) tightly bounded, even with high input resolution, ensuring efficiency in both memory and compute cost.
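A minimal PyTorch-style sketch of this token-compression path is given below; the module name, channel dimensions, and activation choice are illustrative assumptions rather than the released implementation:

```python
import torch
import torch.nn as nn

class TokenCompressor(nn.Module):
    """Two 3x3, stride-2 convolutions placed between the window-attention (SAM-style)
    and global-attention (CLIP-style) stages: each conv halves the spatial grid, so
    the token count drops 4x per layer, i.e. 16x overall (e.g., 4096 -> 256)."""
    def __init__(self, in_dim=768, out_dim=1024):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_dim, out_dim, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(out_dim, out_dim, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, tokens, grid_hw):
        # tokens: (B, N, C) patch tokens from the local (window-attention) stage
        b, n, c = tokens.shape
        h, w = grid_hw
        x = tokens.transpose(1, 2).reshape(b, c, h, w)   # back to a 2D grid
        x = self.conv(x)                                 # (B, C', h/4, w/4)
        return x.flatten(2).transpose(1, 2)              # (B, N/16, C')

# For a 1024x1024 input with 16x16 patches: 64x64 = 4096 patch tokens in,
# 16x16 = 256 compressed vision tokens out, which then feed global attention.
patch_tokens = torch.randn(1, 4096, 768)
compressed = TokenCompressor()(patch_tokens, grid_hw=(64, 64))
print(compressed.shape)  # torch.Size([1, 256, 1024])
```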
2. Optical Compression Methodology
The key distinguishing attribute of DeepSeek-OCR is optical compression—the reduction of long text contexts into a compact set of vision tokens via convolutional and attention-based transformations. Central to this are:
- Patchification and Token Downsampling: The input image is partitioned into patch tokens (e.g., 4096 tokens for a 1024×1024 input with 16×16 patches) and compressed into vision tokens via convolutional downsampling (Base mode: 4096 → 256).
- Adaptive Token Budget: The system supports dynamic adjustment of the token budget and spatial resolution, with the number of valid vision tokens computed from the image width w and height h (tokens covering only the padding introduced by resizing to a native square resolution are discounted).
- Optical-to-Text Decoding: The decoder maps the compressed vision latent representation to textual outputs, effectively performing lossy-to-lossless recovery as a function of the compression ratio.
Empirical evaluation shows that, for a compression ratio (text tokens per vision token) below roughly 10×, DeepSeek-OCR achieves approximately 97% OCR decoding precision. Even at around 20× compression, accuracy remains near 60%, illustrating graceful degradation.
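The bookkeeping behind these figures can be made concrete with a small helper; the per-mode vision-token budgets below follow the resolutions quoted in this article and should be treated as assumptions to verify against the release:

```python
# Illustrative vision-token budgets per resolution mode (assumed from the reported
# resolutions: 16x16 patches followed by a 16x convolutional token reduction).
MODE_VISION_TOKENS = {
    "tiny": 64,     # 512 x 512
    "small": 100,   # 640 x 640
    "base": 256,    # 1024 x 1024
    "large": 400,   # 1280 x 1280
}

def compression_ratio(num_text_tokens: int, mode: str) -> float:
    """Compression ratio = ground-truth text tokens / vision tokens fed to the decoder."""
    return num_text_tokens / MODE_VISION_TOKENS[mode]

# A ~650-token page decoded in Tiny mode sits near the ~10x regime where precision
# stays around 97%; a ~1280-token page pushes it toward ~20x, where accuracy
# degrades to roughly 60%.
print(compression_ratio(650, "tiny"))    # ~10.2
print(compression_ratio(1280, "tiny"))   # 20.0
```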
3. Experimental Performance and Metrics
The DeepSeek-OCR system is quantitatively characterized through several benchmarks:
- Fox Benchmark (600–1300 text tokens/document):
- With 64 vision tokens ("Tiny" mode), decoding precision reaches roughly 96–98% at compression ratios near 10×.
- At a compression ratio around 20×, decoding accuracy remains close to 60%.
- OmniDocBench (real-world document parsing):
- At 100 vision tokens (640×640, "Small" mode), DeepSeek-OCR outperforms GOT-OCR2.0 (which uses 256 tokens/page).
- In higher resolution modes (fewer than 800 vision tokens), performance matches or exceeds MinerU2.0 (which processes over 6000 tokens/page), illustrating the efficacy of the deep optical compressor.
- Corpus Coverage:
- DeepSeek-OCR parses diverse document types (books, slides, newspapers), supports nearly 100 languages, and handles structured content such as charts, chemical formulas, and geometric figures.
- Throughput: The pipeline can generate training data for LLMs/VLMs at a rate exceeding 200k pages per day on an A100-40G GPU, supporting high-scale applications.
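Decoding precision in such benchmarks is scored against ground-truth text with edit-distance-style similarity; the snippet below is a generic stand-in for that kind of metric, not the paper's evaluation script:

```python
from difflib import SequenceMatcher

def decoding_precision(predicted: str, reference: str) -> float:
    """Generic similarity score in [0, 1] between decoded and ground-truth text.
    Benchmarks such as Fox and OmniDocBench use edit-distance-based scoring;
    SequenceMatcher's ratio is only a quick stand-in for sanity checks."""
    return SequenceMatcher(None, predicted, reference).ratio()

print(decoding_precision("DeepSeek-OCR compresses text",
                         "DeepSeek-OCR compresses long text"))
```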
4. Practical Implications and Applications
DeepSeek-OCR’s high-ratio optical compression yields direct and indirect benefits:
- Efficient Long-Context Archiving: The model enables storage and later retrieval of long document contexts in highly compressed format, suitable for archival and retrieval under tight resource constraints.
- Memory Systems in LLMs: By encoding past contexts at progressively higher compression, DeepSeek-OCR provides a mechanistic analogue for memory “forgetting,” conceptually similar to information decay in biological memory.
- Multilingual Structured Parsing: The architecture supports diverse content types (text, tables, charts) and numerous languages, making it applicable for multilingual document pipelines.
- LLM/VLM Training Data Generation: The throughput capabilities and robust deep parsing make DeepSeek-OCR a scalable tool for generating large instruction-tuning corpora, feeding downstream language and vision models.
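As a purely illustrative sketch of the "progressively higher compression for older context" idea from the memory-systems point above, the helper below maps context age to a rendering mode; the tier boundaries are arbitrary assumptions, not values from the paper:

```python
def mode_for_context_age(age_in_turns: int) -> str:
    """Map how old a stored context is to a rendering mode: recent context is kept
    at high resolution (many vision tokens), older context is 'blurred' into fewer
    tokens, mimicking gradual forgetting. Thresholds are arbitrary for illustration."""
    if age_in_turns < 5:
        return "large"   # ~400 vision tokens, near-lossless recall
    if age_in_turns < 20:
        return "base"    # ~256 vision tokens
    if age_in_turns < 50:
        return "small"   # ~100 vision tokens
    return "tiny"        # ~64 vision tokens, coarse gist only

for age in (1, 10, 30, 100):
    print(age, "->", mode_for_context_age(age))
```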
5. Technical Challenges and Resolutions
Key engineering and scientific challenges addressed include:
- High-Resolution Activation Management: DeepEncoder’s window attention and aggressive convolutional downsampling prevent quadratic scaling in memory and compute despite high input dimensions.
- Maintaining Fidelity at Extreme Compression: Below roughly 10× compression, loss is minimal; beyond this threshold information loss grows, but the system degrades gracefully, supporting selective fidelity based on downstream requirements.
- Robust Multi-Resolution Processing: Support for both fixed native modes and dynamic ("Gundam") modes with resolution/tiling adaptation required interpolation of position encodings and dynamic adjustment of vision token budgets.
- Training Pipeline Scalability: The training protocol is staged, with DeepEncoder trained first (using next-token prediction and a compact LLM), followed by pipeline-parallel joint optimization of both encoder and decoder components across 20 nodes, facilitating tractable large-scale training.
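One standard way to realize the position-encoding interpolation mentioned for multi-resolution support is bicubic resampling of learned 2D position embeddings; the sketch below is a generic implementation of that technique, not the project's exact code:

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, old_grid: int, new_grid: int) -> torch.Tensor:
    """Resample a (1, old_grid*old_grid, C) learned position embedding to a
    (1, new_grid*new_grid, C) grid so the encoder can run at other resolutions."""
    _, n, c = pos_embed.shape
    assert n == old_grid * old_grid
    grid = pos_embed.reshape(1, old_grid, old_grid, c).permute(0, 3, 1, 2)   # (1, C, H, W)
    grid = F.interpolate(grid, size=(new_grid, new_grid), mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, c)

# e.g. adapt embeddings trained on a 64x64 patch grid (1024 px / 16) to an
# 80x80 grid (1280 px / 16) for a larger-resolution mode.
pe = torch.randn(1, 64 * 64, 768)
print(interpolate_pos_embed(pe, 64, 80).shape)  # torch.Size([1, 6400, 768])
```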
6. Future Research Directions
The DeepSeek-OCR paper identifies several avenues for further investigation:
- Digital-Optical Interleaved Pretraining: Exploring joint training on sequences containing both text (digital) and image (optical) tokens could yield even more effective long-context representations and retrieval mechanisms.
- Memory/Forgetting Mechanisms in LLMs: The paradigm of downsampling (“blurring”) historical context for controlled forgetting can enable more human-like memory systems for long-horizon LLMs.
- Further Scale and Ultra-Long Contexts: With nearly lossless compression below roughly 10× and still-usable performance around 20×, future instantiations may extend to virtually unlimited context processing via dynamic resource allocation.
A plausible implication is that future VLM and LLM frameworks may blend DeepSeek-OCR’s optical compression for older contexts and high-fidelity digital contexts for recent information, enabling systems to manage memory and retrieval over vast context windows.
DeepSeek-OCR is accessible via open-sourced code and model weights, supporting adoption and extension within both research and industrial ecosystems. Its architecture and methodology introduce a new paradigm for the optical representation and compression of textual data, with robust empirical validation and clear prospects for both fundamental research and real-world deployment (Wei et al., 21 Oct 2025).