Influence of reward design on latent visual reasoning in MLLMs

Investigate how different reinforcement learning reward designs influence latent visual reasoning in multimodal large language models that generate continuous latent visual embeddings as intermediate thoughts, and determine how alternative reward functions affect latent-embedding quality and overall task performance.

Background

The paper introduces Monet, a training framework enabling multimodal LLMs (MLLMs) to perform reasoning in a latent visual space by generating continuous embeddings as intermediate visual thoughts. Monet combines a three-stage supervised fine-tuning pipeline with a reinforcement learning algorithm (VLPO) that explicitly optimizes latent embeddings using reward signals.
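
For intuition, the latent-thought mechanism can be sketched as follows: at designated reasoning positions, instead of sampling a discrete token, the model feeds its own last hidden state back as the next input embedding, so the intermediate "thought" stays continuous. This is a minimal toy sketch, not Monet's actual architecture; the class name ToyLatentReasoner, the num_latent_thoughts parameter, and the model sizes are hypothetical.

```python
import torch
import torch.nn as nn

class ToyLatentReasoner(nn.Module):
    """Toy decoder that emits continuous latent 'thoughts' (hypothetical sketch)."""

    def __init__(self, vocab_size=100, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, prompt_ids, num_latent_thoughts=3):
        # Ordinary token embeddings for the prompt.
        seq = self.embed(prompt_ids)  # (batch, seq_len, d_model)
        latents = []
        for _ in range(num_latent_thoughts):
            h = self.backbone(seq)
            thought = h[:, -1:, :]  # last hidden state serves as a continuous thought
            latents.append(thought)
            # Feed the continuous embedding back in place of a sampled token.
            seq = torch.cat([seq, thought], dim=1)
        # Decode the final answer distribution after the latent thoughts.
        logits = self.lm_head(self.backbone(seq)[:, -1, :])
        return logits, latents

model = ToyLatentReasoner()
logits, latents = model(torch.randint(0, 100, (1, 8)))
print(logits.shape, len(latents))  # torch.Size([1, 100]) 3
```

The key design point the sketch illustrates is that the latent thoughts never pass through the vocabulary bottleneck, which is why an RL algorithm such as VLPO must supply gradient signal to the embeddings themselves rather than to token log-probabilities alone.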

In the reinforcement learning setup, the authors adopt a simple reward scheme (accuracy and format rewards) and deliberately avoid rewarding latent-reasoning behavior itself. In the limitations section, they note that the influence of alternative reward designs on latent visual reasoning has not been explored, highlighting a concrete gap in understanding how reward choice affects latent-space reasoning quality and task performance.
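
As a concrete illustration of such a simple reward scheme, the snippet below sketches an accuracy-plus-format reward of the kind commonly used in RL fine-tuning of reasoning models. The <think>/<answer> tag format, the equal weighting, and the function names are assumptions for illustration, not Monet's verified reward implementation.

```python
import re

def format_reward(completion: str) -> float:
    # Assumed tag layout; the paper's actual format specification may differ.
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, completion, re.DOTALL) else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    # Exact-match accuracy on the extracted answer span (illustrative).
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    pred = m.group(1).strip() if m else ""
    return 1.0 if pred == ground_truth.strip() else 0.0

def total_reward(completion: str, ground_truth: str) -> float:
    # Note: no term scores the latent embeddings themselves; whether adding
    # such a term would help is exactly the open question posed here.
    return accuracy_reward(completion, ground_truth) + format_reward(completion)

print(total_reward("<think>reason</think><answer>42</answer>", "42"))  # 2.0
```

Alternative designs the open question points toward would add terms over the latent trajectory itself (e.g., consistency or grounding scores on the embeddings), which this baseline scheme intentionally omits.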

References

Second, we have not yet explored how different reward designs might influence latent visual reasoning in MLLMs, leaving room for exploration and further enhancement.

Monet: Reasoning in Latent Visual Space Beyond Images and Language (arXiv:2511.21395, Wang et al., 26 Nov 2025), Section 6, Conclusion and Limitations.