DeepSeek-OCR: Contexts Optical Compression (2510.18234v1)

Published 21 Oct 2025 in cs.CV

Abstract: We present DeepSeek-OCR as an initial investigation into the feasibility of compressing long contexts via optical 2D mapping. DeepSeek-OCR consists of two components: DeepEncoder and DeepSeek3B-MoE-A570M as the decoder. Specifically, DeepEncoder serves as the core engine, designed to maintain low activations under high-resolution input while achieving high compression ratios to ensure an optimal and manageable number of vision tokens. Experiments show that when the number of text tokens is within 10 times that of vision tokens (i.e., a compression ratio < 10x), the model can achieve decoding (OCR) precision of 97%. Even at a compression ratio of 20x, the OCR accuracy still remains at about 60%. This shows considerable promise for research areas such as historical long-context compression and memory forgetting mechanisms in LLMs. Beyond this, DeepSeek-OCR also demonstrates high practical value. On OmniDocBench, it surpasses GOT-OCR2.0 (256 tokens/page) using only 100 vision tokens, and outperforms MinerU2.0 (6000+ tokens per page on average) while utilizing fewer than 800 vision tokens. In production, DeepSeek-OCR can generate training data for LLMs/VLMs at a scale of 200k+ pages per day (a single A100-40G). Codes and model weights are publicly accessible at http://github.com/deepseek-ai/DeepSeek-OCR.

Summary

  • The paper introduces a vision-text compression method that reduces OCR token counts and achieves about 97% decoding precision at compression ratios below 10×.
  • It employs a two-component architecture, combining a DeepEncoder with a DeepSeek-3B-MoE decoder, to efficiently process large-scale, multi-language documents.
  • The methodology demonstrates robustness across diverse applications, including parsing charts, chemical formulas, and complex document layouts.

DeepSeek-OCR: Contexts Optical Compression

DeepSeek-OCR is a vision-text compression approach tailored to large-scale document OCR. The technique explores reducing the number of tokens required to process long text contexts, with implications for how vision-language models (VLMs) manage large inputs.

Introduction

Current LLMs face quadratic scaling of attention cost with sequence length, which makes efficient handling of long documents problematic. DeepSeek-OCR proposes leveraging vision as a medium for compressing text into visual tokens, reducing token count and offering a potential route around the computational limits associated with long sequences (Figure 1).

Figure 1: Compression on Fox benchmark.

This model capitalizes on the inherently dense representation capability of images, avoiding the explosion of token counts faced by text-only models. The key innovation lies in converting text directly into vision tokens through an architecture consisting of the DeepEncoder and the DeepSeek3B-MoE-A570M decoder. With a focus on real-world applications, DeepSeek-OCR achieves about 97% decoding precision at compression ratios below 10× and still retains roughly 60% accuracy at 20× compression.

Methodology

Architecture

DeepSeek-OCR comprises two main components: the DeepEncoder and the DeepSeek3B-MoE decoder. The DeepEncoder processes high-resolution inputs efficiently and then compresses the resulting tokens to reduce the memory footprint (Figure 2).

Figure 2: Typical vision encoders in popular VLMs. Three types of encoders are commonly used in current open-source VLMs, each with its own deficiencies.

The DeepEncoder integrates two primary constituents: a window-attention feature-extraction component (SAM-based) and a global-attention knowledge component (CLIP-based). A 16× convolutional down-sampler sits between them, enabling the transition from detailed visual processing to a compressed representation without overwhelming computational resources (Figure 3).

Figure 3: The architecture of DeepSeek-OCR. DeepSeek-OCR consists of a DeepEncoder and a DeepSeek-3B-MoE decoder. The DeepEncoder is the core of the system, comprising three components: a SAM-based perception component dominated by window attention, a CLIP-based knowledge component with dense global attention, and a 16× token compressor bridging the two.
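
As a rough sketch of the 16× token-compression step, the following PyTorch snippet assumes a 1024×1024 input, a 16-pixel patch embedding, and a two-layer stride-2 convolution as the compressor; the SAM and CLIP stages are replaced by stand-in tensors, and the module names and sizes are illustrative rather than the released implementation:

```python
import torch
import torch.nn as nn

class TokenCompressor(nn.Module):
    """16x token reduction: two stride-2 convolutions halve each spatial side
    twice, so the number of patch tokens drops by a factor of 4 * 4 = 16."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, x):  # x: (B, dim, H, W) feature map from the window-attention stage
        return self.net(x)

# Illustrative shapes for a 1024x1024 page with 16-pixel patches (assumed values):
batch, dim = 1, 768
local_feats = torch.randn(batch, dim, 64, 64)          # 64*64 = 4096 patch tokens from the SAM-style stage
compressed = TokenCompressor(dim)(local_feats)          # -> (1, 768, 16, 16)
vision_tokens = compressed.flatten(2).transpose(1, 2)   # 16*16 = 256 tokens passed to the CLIP-style global stage
print(vision_tokens.shape)                               # torch.Size([1, 256, 768])
```

The point of the down-sampling is that the memory-heavy global attention only ever sees the 256 compressed tokens, not the 4,096 raw patch tokens.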

Vision-Text Compression Capabilities

DeepSeek-OCR's compression performance is evaluated on the Fox benchmark, which shows that the model compresses text effectively while retaining high decoding precision across a range of compression ratios; for instance, it achieves over 97% precision at ratios below 10× (Figure 4).

Figure 4: To test model performance under different compression ratios (requiring different numbers of vision tokens) and enhance the practicality of DeepSeek-OCR, we configure it with multiple resolution modes.

In these tests, the model compressed text-rich documents with promising results and outperformed traditional OCR systems: it surpasses GOT-OCR2.0 (256 tokens per page) using only 100 vision tokens and beats MinerU2.0 (over 6,000 tokens per page on average) while using fewer than 800 vision tokens.
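
To make the reported ratios concrete: the compression ratio is just the number of ground-truth text tokens divided by the number of vision tokens handed to the decoder. The sketch below illustrates this bookkeeping; the edit-distance-style precision function is an assumption for demonstration, not necessarily the exact metric used on the Fox benchmark:

```python
from difflib import SequenceMatcher

def compression_ratio(n_text_tokens: int, n_vision_tokens: int) -> float:
    """Text tokens represented per vision token, e.g. 1000 / 100 = 10x."""
    return n_text_tokens / n_vision_tokens

def decoding_precision(prediction: str, ground_truth: str) -> float:
    """Crude stand-in for an edit-distance-based precision score in [0, 1]."""
    return SequenceMatcher(None, prediction, ground_truth).ratio()

# A page whose text tokenizes to ~1000 tokens, rendered as an image and
# encoded into 100 vision tokens, sits at a 10x compression ratio.
print(compression_ratio(1000, 100))                                      # 10.0
print(decoding_precision("optical compression", "optical compression"))  # 1.0
```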

Multimodal Training and Capabilities

The model's practical capabilities extend beyond traditional OCR tasks to parsing charts, chemical formulas, geometric figures, and natural images, enabled by training on an extensive range of document types and languages (Figure 5).

Figure 5: OCR 1.0 fine-annotation display. The ground truth is formatted in an interleaved layout-and-text format, where each paragraph of text is preceded by its coordinates and layout label in the original image. All coordinates are normalized into 1000 bins.
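
A hedged sketch of what such an interleaved ground-truth string could look like, assuming pixel coordinates are mapped onto integer bins in [0, 999]; the tag syntax and helper names are illustrative, not the paper's exact label format:

```python
def normalize_box(box, width, height, bins=1000):
    """Map pixel coordinates (x1, y1, x2, y2) onto integer bins in [0, bins - 1]."""
    x1, y1, x2, y2 = box
    clamp = lambda v, size: min(bins - 1, max(0, int(v / size * bins)))
    return clamp(x1, width), clamp(y1, height), clamp(x2, width), clamp(y2, height)

def interleaved_label(paragraphs, width, height):
    """Build an interleaved layout-and-text label: each paragraph of text is
    preceded by its layout tag and normalized bounding box."""
    parts = []
    for box, tag, text in paragraphs:
        x1, y1, x2, y2 = normalize_box(box, width, height)
        parts.append(f"<{tag}>[[{x1}, {y1}, {x2}, {y2}]]\n{text}")
    return "\n\n".join(parts)

page = [
    ((120, 80, 1680, 160), "title", "DeepSeek-OCR: Contexts Optical Compression"),
    ((120, 220, 1680, 900), "text", "We present DeepSeek-OCR as an initial investigation ..."),
]
print(interleaved_label(page, width=1800, height=2400))
```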

DeepSeek-OCR also supports multiple input resolutions and complex document layouts, and its training data spans nearly 100 languages, making it adaptable to varied applications across document types.

Evaluation and Practical Implications

Evaluation on the OmniDocBench benchmark confirms DeepSeek-OCR's robust performance on real-world document-parsing tasks, where it demonstrates competitive accuracy with markedly lower token usage.

Figure 6: For charts, the ground-truth labels use an HTML table format rather than OneChart's dictionary format, which saves tokens.
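
For example, a simple chart's underlying data might be written out as a compact HTML table instead of a nested dictionary; the snippet below is a hypothetical illustration of such a label, not an example drawn from the paper's training data:

```python
# Hypothetical chart ground truth: the plotted values as an HTML table, which
# typically tokenizes more compactly than a key-value dictionary format.
chart_label = (
    "<table>"
    "<tr><th>Year</th><th>Revenue</th><th>Profit</th></tr>"
    "<tr><td>2022</td><td>120</td><td>30</td></tr>"
    "<tr><td>2023</td><td>150</td><td>45</td></tr>"
    "</table>"
)
print(chart_label)
```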

Figure 7: In financial research reports, DeepSeek-OCR's deep-parsing mode can be used to obtain structured results from charts within documents.

These results establish DeepSeek-OCR as a capable end-to-end model in practical OCR settings: with as few as 100 vision tokens per page, and thus drastically reduced compute, it surpasses competing models that require significantly more tokens.

Conclusion

DeepSeek-OCR introduces vision-text compression as a viable means of handling extended contexts in LLMs. The method promises computational efficiency while retaining a reasonable degree of precision, making it suitable for large-scale, multilingual OCR applications. Further exploration of this approach could help close the gap between long-context processing demands and computational resource constraints, providing a solid foundation for future advances in VLMs and LLMs.

Explain it Like I'm 14

What is this paper about?

This paper introduces DeepSeek-OCR, a new system that turns pictures of documents into text very efficiently. The big idea is to use images as a smart way to “compress” long text so that AI models (like chatbots) can handle more information without using tons of computing power. The authors show that a small number of image pieces (called vision tokens) can hold the same information as many more text pieces (text tokens), and the AI can still read it back accurately.

What questions did the researchers ask?

To make the idea simple, think of a long document like a huge puzzle. Each small piece is a “token.” The researchers wanted to know:

  • How few image pieces (vision tokens) do we need to rebuild the original text pieces (text tokens) correctly?
  • Can we design an image reader (encoder) that handles big, high-resolution pages without using too much memory?
  • Will this approach work well on real-world document tasks (like PDFs with text, tables, charts, formulas), and can it be fast enough for practical use?

How does DeepSeek-OCR work?

Imagine a two-part team: a “camera brain” that breaks the page image into meaningful chunks, and a “reading brain” that turns those chunks back into clean text.

  • The camera brain is called DeepEncoder. It has two parts:
    • A “window attention” part (like scanning the page through many small windows to find details). This uses SAM.
    • A “global attention” part (like stepping back to see the whole page and connect the dots). This uses CLIP.
  • Between those two parts, there’s a “shrinker” (a 16× compressor) that reduces how many image tokens need to be handled globally. Think of it as shrinking a big photo while keeping the important details clear enough to read.
  • The reading brain is an LLM (DeepSeek-3B-MoE). “MoE” means “Mixture of Experts”: multiple small specialists inside the model each focus on what they’re best at, which makes reading faster and smarter.
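
To make the routing idea concrete, here is a toy Python sketch of the top-k expert selection behind a Mixture of Experts; it is a plain analogy with made-up expert names and scores, not the paper's actual decoder:

```python
# Toy "Mixture of Experts" routing: a router scores each specialist for a given
# token, and only the top-scoring experts (here 2 of 4) do any work for it.
experts = {
    "tables":   lambda tok: f"{tok} -> handled by the table expert",
    "formulas": lambda tok: f"{tok} -> handled by the formula expert",
    "layout":   lambda tok: f"{tok} -> handled by the layout expert",
    "plain":    lambda tok: f"{tok} -> handled by the plain-text expert",
}

def route(token, scores, k=2):
    """Pick the k experts the router scored highest and collect their outputs."""
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return [experts[name](token) for name in top]

# Hypothetical router scores for one token taken from a table cell.
print(route("12.4M", {"tables": 0.7, "formulas": 0.1, "layout": 0.15, "plain": 0.05}))
```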

To handle different kinds of pages, they trained the system with multiple resolution modes:

  • Tiny and Small for fewer tokens (quick scans),
  • Base and Large for more detail,
  • Gundam/Gundam-M for very big or dense pages (cutting the page into tiles plus a global view, like reading both zoomed-in pieces and the full page).

For training, they used several kinds of data:

  • OCR 1.0: regular documents and street/scene text.
  • OCR 2.0: tricky things like charts, chemical formulas, and geometry drawings.
  • General vision: tasks like captions and object detection to keep some broad image understanding.
  • Text-only: to keep the language skills strong.

They trained the encoder first, then the full system, using lots of pages and smart prompts (instructions) for different tasks.

What did they find, and why does it matter?

The results are impressive and easy to understand with a few key numbers:

  • Compression power:
    • About 97% accuracy at compression ratios up to about 10× (meaning the number of text tokens is at most about 10 times the number of vision tokens).
    • Around 60% accuracy even at 20× compression (much tighter packing).
  • Real-world performance on OmniDocBench, a document-parsing benchmark:
    • DeepSeek-OCR beat GOT-OCR2.0 (256 tokens per page) using only 100 vision tokens.
    • It also outperformed MinerU2.0 while using fewer than 800 vision tokens (MinerU used nearly 7,000).
  • Versatility:
    • It can handle not just plain text, but also charts, chemical formulas, simple geometry, and natural images.
    • It supports nearly 100 languages for PDF OCR.
  • Speed and scale:
    • It can generate training data at very large scale (200k+ pages per day on one GPU; tens of millions per day on a cluster), which is valuable for building better AI models.

Why it matters: If AI can compress and read long text using images, then big LLMs can remember more context (like conversation history or long documents) while using fewer resources. That means faster, cheaper, and more capable AI systems.

What could this change in the future?

The paper suggests a new way to handle long-term memory in AI:

  • Optical context compression: Turn earlier parts of a long conversation or document into images and read them with fewer tokens. Keep recent parts in high detail, and compress older parts more. This mimics “forgetting” like humans do: older memories become fuzzier and take less space, but important recent ones stay sharp.
  • Better long-context AI: With smart compression, AI could handle much longer histories without running out of memory.
  • Practical tools: DeepSeek-OCR is already useful—for creating training data, doing large-scale OCR, and parsing complex documents with charts and formulas—while using fewer resources.

In short, DeepSeek-OCR shows that “a picture is worth a thousand words” can be true for AI too. By packing text into images and reading it back efficiently, we can build faster, smarter systems that remember more without slowing down.
