Optimality of Pixel-level Reconstruction for Improving MLLM Visual Understanding

Determine whether enforcing pixel-level visual reconstruction objectives during training of multimodal large language models (MLLMs) is the optimal strategy for enhancing their visual understanding capabilities.

Background

The paper contrasts reconstruction-based approaches—which require integrating additional visual generative components and objectives into understanding-focused MLLMs—with lightweight, verifiable, self-supervised alternatives such as the proposed Visual Jigsaw ordering task. While reconstruction has shown benefits for visual understanding, it increases architectural complexity and supervision requirements.

Motivated by these trade-offs, the authors explicitly question whether forcing models to achieve pixel-level fidelity is the best way to strengthen visual understanding in MLLMs. They position Visual Jigsaw as a simpler alternative whose answers can be checked automatically, so it aligns with RL from verifiable rewards without modifying model outputs.
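To make "verifiable" concrete, the sketch below shows how an ordering task can be scored with a rule-based check and no generative decoder. It is an illustration only, not the authors' implementation: the function names, the `partial_weight` value, and the exact partial-credit rule are assumptions chosen for this example.

```python
import random


def make_jigsaw(num_pieces, seed=None):
    """Return a shuffled list of piece indices (hypothetical helper).

    ordering[i] is the original position of the piece shown in slot i,
    so the list is also the ground-truth answer for the task.
    """
    rng = random.Random(seed)
    ordering = list(range(num_pieces))
    rng.shuffle(ordering)
    return ordering


def ordering_reward(predicted, target, partial_weight=0.2):
    """Score a predicted ordering against the ground truth.

    Assumed scoring rule for this sketch:
      - 0.0 if the prediction is not a valid permutation of the target,
      - 1.0 if it matches exactly,
      - otherwise partial_weight * (fraction of positions that match).
    """
    if sorted(predicted) != sorted(target):
        return 0.0  # wrong length, repeated or missing indices
    if list(predicted) == list(target):
        return 1.0
    correct = sum(p == t for p, t in zip(predicted, target))
    return partial_weight * correct / len(target)


if __name__ == "__main__":
    target = make_jigsaw(4, seed=0)
    print(target, ordering_reward(target, target))
```

Because the reward depends only on comparing two index lists, it can be computed from the model's text output alone, which is the property that makes such a task compatible with RL from verifiable rewards.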

References

Furthermore, it remains an open question whether forcing models to achieve pixel-level reconstruction is the optimal strategy for enhancing MLLMs' visual understanding.

— Visual Jigsaw Post-Training Improves MLLMs (2509.25190 - Wu et al., 29 Sep 2025) in Section 1 (Introduction)
