Disentangling visual evidence versus linguistic priors in VLM-based OCR
Determine whether the performance gains of vision–language model (VLM) based optical character recognition systems, which pair a vision encoder with an autoregressive language decoder, arise primarily from faithful visual transcription or from the decoder's ability to infer plausible text using linguistic priors and global context. The aim is to attribute improvements to visual perception rather than to language-driven completion.
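One way to probe this question empirically, in the spirit of the semantic-shuffle analysis cited below, is to compare recognition error on natural text against error on text whose word order has been randomized: a decoder that relies on linguistic priors should degrade more on shuffled input, since the priors no longer match the visual evidence. The sketch below is a minimal, hypothetical harness for that comparison; the function names (`cer`, `shuffle_words`, `prior_reliance_gap`) and the gap metric are illustrative assumptions, not the cited paper's actual protocol.

```python
import random


def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein distance / length of reference."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost)  # substitution / match
        prev = cur
    return prev[n] / max(m, 1)


def shuffle_words(text: str, seed: int = 0) -> str:
    """Randomize word order while keeping each word's characters intact,
    so visual difficulty is unchanged but linguistic priors are broken."""
    words = text.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)


def prior_reliance_gap(pairs_natural, pairs_shuffled):
    """Mean CER on shuffled-text pages minus mean CER on natural pages.
    A large positive gap suggests the decoder leans on linguistic priors
    rather than transcribing what is visually present."""
    cer_nat = sum(cer(r, h) for r, h in pairs_natural) / len(pairs_natural)
    cer_shuf = sum(cer(r, h) for r, h in pairs_shuffled) / len(pairs_shuffled)
    return cer_shuf - cer_nat
```

For example, if a model reads a shuffled page (ground truth `"sat the cat"`) but outputs the fluent `"the cat sat"`, it has silently "repaired" the text, and the gap metric registers the discrepancy; (reference, prediction) pairs here would come from rendering natural and shuffled ground truth to images and running the OCR model on both.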
References
Although this paradigm achieves strong OCR scores, it is often unclear whether these gains come from faithful visual reading or from the decoder's ability to "fill in" plausible text using linguistic priors and global context.
— MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding
(2603.22458 - Dong et al., 23 Mar 2026) in Section 4.5, Semantic Shuffle Analysis