
Scaling verifiable supervision for RL in LVLMs using ordinary images

Develop reinforcement learning with verifiable rewards (RLVR) for large vision-language models that preserves the optimization benefits of RL while scaling deterministically verifiable supervision to ordinary RGB or RGB-D images across diverse domains, without relying on manual labels, specialized assets, or costly tooling.


Background

The paper argues that spatial understanding is a persistent weakness of large vision-language models (LVLMs) and that existing pipelines for supervised fine-tuning and reinforcement learning with verifiable rewards often depend on costly supervision, specialized tools, or constrained environments. These dependencies limit scalability and generalization across domains.

The authors explicitly identify the need to retain the optimization benefits of RL while making the verifiable supervision scalable and lightweight, i.e., derivable directly from ordinary images without manual annotation or external toolchains. Spatial-SSRL is proposed as a step toward addressing this challenge by repurposing self-supervised learning tasks as verifiable rewards.
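To make the idea concrete, below is a minimal sketch (not the paper's actual pipeline) of how a self-supervised pretext task can turn an ordinary RGB image into deterministically verifiable supervision: the ground-truth label is known exactly because we apply the transformation ourselves, so no manual annotation or external toolchain is required. The rotation-prediction task, the query_lvlm helper, and all names here are illustrative assumptions.

```python
# Sketch: deriving a verifiable reward from an ordinary RGB image via a
# self-supervised pretext task (90-degree rotation prediction is assumed
# here purely for illustration; it is not necessarily the paper's task).

import random
from PIL import Image

ROTATIONS = [0, 90, 180, 270]  # candidate transformations with known labels

def make_verifiable_sample(image_path: str):
    """Build (transformed image, ground-truth label) without manual labels:
    the label is exact because we applied the transformation ourselves."""
    angle = random.choice(ROTATIONS)
    img = Image.open(image_path).convert("RGB")
    rotated = img.rotate(angle, expand=True)  # counter-clockwise rotation
    return rotated, angle

def verifiable_reward(predicted_angle: int, true_angle: int) -> float:
    """Deterministic, binary reward: 1.0 on exact recovery of the rotation."""
    return 1.0 if predicted_angle == true_angle else 0.0

# Usage sketch: during RL, the policy (an LVLM) answers the pretext question
# and the reward is computed by exact match against the self-generated label.
# rotated_img, true_angle = make_verifiable_sample("photo.jpg")
# predicted = query_lvlm(rotated_img, "By how many degrees was this image rotated?")  # hypothetical helper
# reward = verifiable_reward(int(predicted), true_angle)
```

Because the reward is checked by exact match against a label the data pipeline itself generated, this kind of supervision scales to any collection of ordinary images across domains, which is the property the open challenge asks for.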

References

As shown in \cref{fig:comp} (a), a key open challenge is to retain the optimization benefits of RL while scaling verifiable supervision to ordinary images across diverse domains without manual labels, specialized assets, or costly tooling.

Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning (2510.27606 - Liu et al., 31 Oct 2025) in Section 1 (Introduction)