Optimality of Pixel-level Reconstruction for Improving MLLM Visual Understanding
Determine whether enforcing pixel-level visual reconstruction objectives during training of multimodal large language models (MLLMs) is the optimal strategy for enhancing their visual understanding capabilities.
References
Furthermore, it remains an open question whether forcing models to achieve pixel-level reconstruction is the optimal strategy for enhancing MLLMs' visual understanding.
— Visual Jigsaw Post-Training Improves MLLMs
(2509.25190 - Wu et al., 29 Sep 2025) in Section 1 (Introduction)