LLMs’ natural integration of visual-to-text decoding via specialized pretraining

Determine whether large language models, when trained with specialized pretraining optimization, can more naturally learn the non-linear decoding mapping that reconstructs textual token sequences from compressed latent (vision) tokens, thereby enabling OCR-style recovery of text from compact visual representations at scale.

Background

DeepSeek-OCR defines a decoder that reconstructs text from compressed latent vision tokens produced by the DeepEncoder, formalized as a non-linear mapping from latent visual token space to text token space. The report demonstrates that compact LLMs can effectively learn this OCR-style decoding from visual tokens.
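For concreteness, a minimal LaTeX sketch of this formalization is given below. The symbols $f_{\mathrm{dec}}$, $\mathbf{Z}$, $\hat{\mathbf{X}}$, $n$, $N$, and the dimensions $d_{\mathrm{latent}}$, $d_{\mathrm{text}}$ are illustrative placeholders, not necessarily the paper's exact notation.

```latex
\documentclass{article}
\usepackage{amsmath, amssymb}
\begin{document}

% The decoder as a learned non-linear mapping from compressed latent
% vision tokens to text tokens (illustrative notation; n << N).
\[
  f_{\mathrm{dec}} : \mathbb{R}^{n \times d_{\mathrm{latent}}}
  \;\longrightarrow\;
  \mathbb{R}^{N \times d_{\mathrm{text}}},
  \qquad
  \hat{\mathbf{X}} = f_{\mathrm{dec}}(\mathbf{Z}),
\]
where $\mathbf{Z} = (\mathbf{z}_1, \dots, \mathbf{z}_n)$ are the compressed
latent (vision) tokens produced by the DeepEncoder,
$\hat{\mathbf{X}} = (\hat{x}_1, \dots, \hat{x}_N)$ is the reconstructed
text-token sequence, and $n \ll N$, so that $N/n$ measures the compression
ratio the decoder must invert.

\end{document}
```

Under this reading, the open question asks how well $f_{\mathrm{dec}}$ can be internalized by an LLM decoder when the mapping is learned through specialized pretraining rather than by a compact model.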

In discussing the decoder, the authors explicitly conjecture that LLMs, through specialized pretraining optimization, would integrate this decoding capability more naturally. This raises a concrete question: can LLMs, given appropriate pretraining, acquire and internalize such visual-to-text reconstruction functions more effectively than smaller models?

References

It is reasonable to conjecture that LLMs, through specialized pretraining optimization, would demonstrate more natural integration of such capabilities.

DeepSeek-OCR: Contexts Optical Compression (arXiv:2510.18234, Wei et al., 21 Oct 2025), Subsection "The MoE Decoder" (Methodology)