Disentangling visual evidence versus linguistic priors in VLM-based OCR

Determine whether the performance gains of vision–language model (VLM) based OCR systems, which pair a vision encoder with an autoregressive language decoder, arise primarily from faithful visual transcription or from the decoder's ability to infer plausible text using linguistic priors and global context, so that improvements can be attributed to visual perception rather than language-driven completion.

Background

Modern OCR systems built on vision–language models typically follow a pipeline of a vision encoder feeding an autoregressive language decoder. While this design achieves strong benchmark scores, causal left-to-right decoding can encourage reliance on language priors, risking hallucinations and error propagation.

The paper introduces the Semantic Shuffle benchmark to perturb semantic coherence while keeping visual appearance comparable, aiming to probe how much recognition depends on visual evidence versus linguistic plausibility. The authors explicitly note that the field often lacks clarity on the source of observed performance gains, motivating this open question.
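To make the idea concrete, here is a minimal sketch of one way such a perturbation could be constructed (hypothetical; the paper's exact Semantic Shuffle procedure is not specified here): permute the words on each line of a ground-truth transcript before re-rendering it, so that linguistic coherence is destroyed while the set of glyphs to be recognized, and hence the gross visual difficulty, stays comparable.

```python
import random


def semantic_shuffle(text: str, seed: int = 0) -> str:
    """Permute the words within each line of `text`.

    Illustrative sketch only: breaking word order removes the linguistic
    context a language decoder could exploit, while keeping the same
    characters on the page. A model that truly reads pixels should
    transcribe the shuffled rendering about as accurately as the original;
    a large accuracy drop suggests reliance on linguistic priors.
    """
    rng = random.Random(seed)
    shuffled_lines = []
    for line in text.splitlines():
        words = line.split()
        rng.shuffle(words)
        shuffled_lines.append(" ".join(words))
    return "\n".join(shuffled_lines)
```

The shuffled string would then be rendered back into a document image with the original font and layout, and OCR accuracy compared between the coherent and shuffled versions of the same page.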

References

Although this paradigm achieves strong OCR scores, it is often unclear whether these gains come from faithful visual reading or from the decoder's ability to "fill in" plausible text using linguistic priors and global context.

MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding  (2603.22458 - Dong et al., 23 Mar 2026) in Section 4.5, Semantic Shuffle Analysis