LLMs’ natural integration of visual-to-text decoding via specialized pretraining
Determine whether large language models, when trained with specialized pretraining optimization, can more naturally learn the non-linear decoding mapping from compressed latent vision tokens back to textual token sequences, thereby enabling OCR-style recovery of text from compact visual representations at scale.
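To make the conjectured mapping concrete, below is a minimal, hypothetical PyTorch sketch of the general recipe, not the DeepSeek-OCR implementation: compressed vision tokens are projected into a decoder's embedding space, prepended to the text sequence, and the model is trained with next-token prediction to reconstruct the text. All class names, dimensions, and the linear projector are illustrative assumptions.

```python
# Toy sketch (assumed architecture, not DeepSeek-OCR's MoE decoder):
# an autoregressive decoder learns to reconstruct text tokens conditioned
# on a short sequence of compressed latent vision tokens.
import torch
import torch.nn as nn


class ToyOpticalDecoder(nn.Module):
    def __init__(self, vocab_size=32000, d_vision=1024, d_model=512,
                 n_heads=8, n_layers=4, max_len=1024):
        super().__init__()
        # Project compressed vision tokens into the decoder's embedding space.
        self.vision_proj = nn.Linear(d_vision, d_model)
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        # Causally masked self-attention stack stands in for a decoder-only LLM.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, vision_tokens, text_ids):
        # vision_tokens: (B, N_vis, d_vision)  compressed latent vision tokens
        # text_ids:      (B, T)                text tokens to be reconstructed
        n_vis = vision_tokens.size(1)
        vis = self.vision_proj(vision_tokens)            # (B, N_vis, d_model)
        txt = self.tok_emb(text_ids)                     # (B, T, d_model)
        x = torch.cat([vis, txt], dim=1)                 # prepend visual context
        pos = torch.arange(x.size(1), device=x.device)
        x = x + self.pos_emb(pos)
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        h = self.blocks(x, mask=causal)
        return self.lm_head(h[:, n_vis:])                # logits over text positions


if __name__ == "__main__":
    model = ToyOpticalDecoder()
    vis = torch.randn(2, 64, 1024)           # e.g. 64 vision tokens per page (assumed)
    txt = torch.randint(0, 32000, (2, 640))  # ~10x more text tokens to recover
    logits = model(vis, txt)                 # (2, 640, 32000)
    # Standard next-token loss over the text span: the decoder is trained to
    # "decompress" the original text from the compact visual representation.
    loss = nn.functional.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)), txt[:, 1:].reshape(-1))
    print(logits.shape, float(loss))
```

The sketch only illustrates the input/output contract of the conjectured decoding map; whether a large, specially pretrained LLM learns this mapping more naturally than a small decoder is exactly the open question stated above.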
References
It is reasonable to conjecture that LLMs, through specialized pretraining optimization, would demonstrate more natural integration of such capabilities.
— DeepSeek-OCR: Contexts Optical Compression
(arXiv:2510.18234, Wei et al., 21 Oct 2025), Subsection "The MoE Decoder" (Methodology)