LLMs’ natural integration of visual-to-text decoding via specialized pretraining
Determine whether large language models, when trained with specialized pretraining optimization, can more naturally learn the non-linear decoding mapping from compressed latent (vision) tokens to textual token sequences, thereby enabling OCR-style recovery of text from compact visual representations at scale.
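The setup can be made concrete with a minimal sketch of the kind of decoder the question envisions: a decoder-only language model that receives a short prefix of compressed vision tokens and is trained to autoregressively reconstruct the text they encode. This is an illustrative toy, not DeepSeek-OCR's architecture; the class name, dimensions, and the linear projection from vision latents into the embedding space are all assumptions.

```python
import torch
import torch.nn as nn


class OpticalCompressionDecoder(nn.Module):
    """Decoder-only LM conditioned on a short prefix of compressed vision tokens."""

    def __init__(self, vocab_size=32000, d_model=512, n_heads=8,
                 n_layers=4, d_vision=1024, max_len=4096):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        # Project compressed vision latents into the LM embedding space
        # (assumed interface; the real model's projection may differ).
        self.vision_proj = nn.Linear(d_vision, d_model)
        block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, vision_tokens, text_ids):
        # vision_tokens: (B, N_v, d_vision) compressed latents from a vision encoder
        # text_ids:      (B, N_t) ground-truth text token ids (teacher forcing)
        v = self.vision_proj(vision_tokens)          # (B, N_v, d_model)
        t = self.tok_emb(text_ids)                   # (B, N_t, d_model)
        x = torch.cat([v, t], dim=1)                 # vision prefix + text tokens
        x = x + self.pos_emb(torch.arange(x.size(1), device=x.device))
        # Causal mask: every position attends only to earlier positions, so the
        # vision prefix is visible to all text positions but not vice versa.
        n = x.size(1)
        mask = torch.triu(torch.full((n, n), float("-inf"), device=x.device), 1)
        h = self.blocks(x, mask=mask)
        # Keep logits for the text positions only; the non-linear mapping from
        # compressed vision tokens to text lives in the transformer layers.
        return self.lm_head(h[:, vision_tokens.size(1):, :])


# Toy usage: 16 compressed vision tokens standing in for a 128-token text span.
model = OpticalCompressionDecoder()
vision = torch.randn(2, 16, 1024)                    # output of a vision encoder
text = torch.randint(0, 32000, (2, 128))
logits = model(vision, text)                         # (2, 128, 32000)
loss = nn.functional.cross_entropy(                  # next-token reconstruction loss
    logits[:, :-1].reshape(-1, 32000), text[:, 1:].reshape(-1))
```

The cross-entropy over the text positions is the reconstruction objective; whether LLM-style pretraining makes this mapping easy to learn at scale is exactly the open question above.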
References
It is reasonable to conjecture that LLMs, through specialized pretraining optimization, would demonstrate more natural integration of such capabilities.
— DeepSeek-OCR: Contexts Optical Compression
(arXiv:2510.18234, Wei et al., 21 Oct 2025), Subsection "The MoE Decoder" (Methodology)