Does RL post-training in MLLMs truly leverage visual information?

Determine whether reinforcement learning–based post-training for Multimodal Large Language Models (such as Qwen2.5-VL) truly enables the models to learn from and utilize visual information in the training inputs, rather than primarily strengthening internal text-based reasoning patterns.

Background

The paper investigates reinforcement learning (RL) for post-training Multimodal LLMs (MLLMs), noting that many prior works report accuracy gains but do not clarify whether these gains arise from genuine visual understanding. Because common reward designs are based on final-answer correctness and are modality-agnostic, it is unclear if RL exploits visual grounding or merely enhances textual reasoning.
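To make "modality-agnostic" concrete, here is a minimal sketch of a final-answer-correctness reward (function name and answer format are illustrative assumptions, not the paper's implementation). Because the reward inspects only the generated text, a correct guess from textual priors scores exactly the same as a visually grounded answer:

```python
def final_answer_reward(model_output: str, gold_answer: str) -> float:
    """Hypothetical final-answer-correctness reward.

    The reward sees only the generated text, never the image,
    so the training signal is modality-agnostic by construction.
    """
    tokens = model_output.strip().split()
    if not tokens:
        return 0.0
    # Assume the model emits its final answer as the last token;
    # normalize both sides before the exact-match comparison.
    predicted = tokens[-1].lower().strip(".")
    return 1.0 if predicted == gold_answer.strip().lower() else 0.0
```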

To study this, the authors introduce a Hallucination-as-Cue framework with modality-specific corruptions that remove or replace essential visual or textual information. The framework examines whether and how RL-based training still makes progress under conditions that compel hallucinated reasoning, thereby probing the extent to which visual inputs actually contribute during training. Simple instances of such corruptions are sketched below.
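One way to picture the modality-specific corruptions (a sketch under assumed inputs; the paper's exact corruption operators may differ): replacing the image with random noise removes the visual information, while shuffling the question's words destroys the textual information, yet a final-answer reward like the one above can still be collected.

```python
import random

import numpy as np
from PIL import Image


def corrupt_visual(image: Image.Image) -> Image.Image:
    """Visual corruption: replace the image with uniform noise,
    so any reward earned afterwards cannot come from the pixels."""
    noise = np.random.randint(
        0, 256, (image.height, image.width, 3), dtype=np.uint8
    )
    return Image.fromarray(noise)


def corrupt_textual(question: str) -> str:
    """Textual corruption: shuffle the question's words, destroying
    the textual content while keeping the surface token distribution."""
    words = question.split()
    random.shuffle(words)
    return " ".join(words)
```

Under either corruption, any reward the policy still earns must come from the uncorrupted modality or from hallucinated reasoning, which is what makes the corrupted conditions a diagnostic cue.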

References

Although many studies have reported improved performance, it remains unclear whether RL training truly enables models to learn from visual information.

Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models (2604.03179, Zhang et al., 3 Apr 2026), Abstract (page 1)