Self-Rewarding Vision-Language Model via Reasoning Decomposition
Abstract: Vision-language models (VLMs) often suffer from visual hallucinations, describing things that are not actually in the image, and language shortcuts, where they skip the visual input and rely on text priors alone. These issues arise because most post-training methods for VLMs rely on simple verifiable answer matching and supervise only final outputs, leaving intermediate visual reasoning without explicit guidance. As a result, VLMs receive sparse visual signals and often learn to prioritize language-based reasoning over visual perception. To mitigate this, some existing methods add visual supervision using human annotations or labels distilled from external large models. However, human annotations are labor-intensive and costly, and because external signals cannot adapt to the evolving policy, they cause distributional shifts that can lead to reward hacking. In this paper, we introduce Vision-SR1, a self-rewarding method that improves visual reasoning through reinforcement learning without relying on external visual supervision. Vision-SR1 decomposes VLM reasoning into two stages: visual perception and language reasoning. The model is first prompted to produce self-contained visual perceptions that are sufficient to answer the question without referring back to the input image. To validate this self-containment, the same VLM is then re-prompted to perform language reasoning using only the generated perception as input, and its success is used to compute a reward. This self-reward is combined with supervision on final outputs, providing a balanced training signal that strengthens both visual perception and language reasoning. Our experiments demonstrate that Vision-SR1 improves visual reasoning, mitigates visual hallucinations, and reduces reliance on language shortcuts across diverse vision-language tasks.
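The decomposition described in the abstract lends itself to a compact reward loop. Below is a minimal sketch of the two-pass scheme, assuming a generic `vlm_generate` inference callable and a tagged output format; the tag names, the `alpha` weight, and the helper functions are illustrative assumptions, not the paper's exact implementation.

```python
import re

def extract_tag(text: str, tag: str) -> str:
    """Pull the content of <tag>...</tag>; the tag schema is an assumed format."""
    m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.S)
    return m.group(1).strip() if m else ""

def normalize(s: str) -> str:
    return re.sub(r"\s+", " ", s.lower()).strip()

def vision_sr1_reward(vlm_generate, image, question, gold_answer, alpha=0.5):
    # Pass 1: full rollout. The policy sees the image and must emit a
    # self-contained visual perception, then reasoning, then an answer
    # (the tagged format is assumed to come from the system prompt).
    rollout = vlm_generate(image=image, prompt=question)
    perception = extract_tag(rollout, "perception")
    r_answer = float(
        normalize(extract_tag(rollout, "answer")) == normalize(gold_answer)
    )

    # Pass 2: re-prompt the *same* policy with the perception only, no image.
    # If the perception alone suffices to recover the answer, it was
    # self-contained and visually grounded.
    text_only = vlm_generate(
        image=None,
        prompt=f"Context: {perception}\nQuestion: {question}\nAnswer:",
    )
    r_visual = float(
        normalize(extract_tag(text_only, "answer")) == normalize(gold_answer)
    )

    # Balance perception-level and outcome-level signals (format reward omitted).
    return alpha * r_visual + (1 - alpha) * r_answer
```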
Top Community Prompts
Knowledge Gaps
Below is a concise list of concrete knowledge gaps, limitations, and open questions that remain unresolved and can guide future research.
- Reliability of self-reward verification: The second-pass “caption-only” answer check is performed by the same policy model, risking circular validation, confirmation bias, and reward hacking. Quantify how often successes arise from language priors rather than faithful visual grounding via human audits and counterfactual tests (e.g., image perturbations, occlusions); see the occlusion-audit sketch after this list.
- Caption leakage and content constraints: The method does not prevent the “visual perception” from embedding computed answers, non-visual inferences, or shortcut hints. Develop automatic detectors and constrained-generation policies to ensure captions contain only visually observable facts (e.g., restricted vocabularies, regex checks, entailment filters).
- Sparse, binary visual reward: r_visual is a 0/1 signal tied to final correctness, offering no graded fidelity or partial credit. Explore denser rewards (entailment consistency, multi-QA probing of the same caption, cross-checks against region-wise evidence) to shape perception quality; a multi-QA probing sketch follows this list.
- Self-reward circularity mitigation: Using the same policy to score its own outputs can entrench biases. Test cross-model evaluators, ensembles, or adversarial judges; quantify calibration and robustness of self-reward under policy drift.
- Continued dependence on labeled answers: Despite “self-rewarding,” the pipeline requires ground-truth answers to compute both answer and visual rewards. Investigate weakly/unsupervised alternatives (e.g., multi-view consistency, cycle-consistency, synthetic counterfactuals, consensus across unlabeled variants); see the consensus-voting sketch after this list.
- Evaluator and dataset bias: SFT data are generated/filtered by Qwen-2.5-VL; correctness for open-ended tasks is judged by Gemini-2.5-flash. Measure evaluator-induced variance, transfer bias, and reliability via human evaluation and multiple independent judges; report agreement and sensitivity.
- Generality across backbones/encoders: Results are limited to Qwen2.5-VL-3B/7B. Replicate on diverse VLM families (e.g., LLaVA, InternVL, MiniCPM, Flamingo) to assess portability and dependence on architecture/encoder choices.
- Missing training details and variance: Key RL settings (group size K, KL β, seeds, compute budget, rollout lengths) and variance across seeds are not reported. Provide sensitivity analyses, confidence intervals, and stability metrics to ensure reproducibility and robustness claims.
- Compute and scalability costs: The two-pass rollout increases training/inference cost; no profiling or sample-efficiency analysis is provided. Quantify overhead, throughput, memory footprint, and trade-offs vs. single-pass baselines.
- Task coverage gaps: Evaluation omits video, multi-image reasoning, localization/referring expressions, dense captioning, segmentation/box pointing, and grounding-heavy tasks. Extend to spatially grounded benchmarks to test true perception alignment.
- Limited hallucination assessment: Hallucination evaluation relies primarily on HallusionBench. Incorporate object-level and broader hallucination suites (e.g., POPE, Object HalBench), and measure groundedness (pointing/localization success) beyond binary QA.
- LSR metric validation: The proposed Language Shortcut Rate depends on Gemini judgments without human verification or cross-evaluator reliability studies. Formalize decision criteria, test consistency across judges (see the reliability sketch after this list), and correlate LSR with human ratings of shortcutting.
- Unquantified information loss: Converting images to text captions likely loses fine-grained spatial/texture/color cues. Quantify loss and compare text-only perception against embedding-level, scene-graph, or region-grounded representations on tasks requiring precise visual detail.
- Prompt/format sensitivity: The method relies on a tagged output structure, but there is no ablation on prompt wording, tag schema, caption length limits, or decoding strategies. Study how formatting choices impact reward accuracy and downstream performance.
- Out-of-distribution robustness: No tests on OOD settings (adversarial images, OCR noise, typography-heavy documents, diagrams with unusual styles, non-English text). Build OOD suites to stress perception robustness.
- Penalizing “correct answer, wrong perception”: Currently, incorrect perceptions leading to correct answers receive 0 visual reward but no explicit penalty. Evaluate adding negative rewards or margin penalties to actively discourage shortcutting; a margin-penalty sketch follows this list.
- Theoretical guarantees: The mutual-information argument is qualitative. Provide formal analyses (e.g., bounds on I(A;I), convergence under decomposed rewards, bias under self-judging) and characterize failure modes of self-reward in multimodal RL; a candidate decomposition is sketched after this list.
- Baseline comparability: Perception-R1 and Visionary-R1 are not retrained on the same 47K dataset; results may be incomparable. Run matched-data baselines and report apples-to-apples comparisons.
- Train–test contamination auditing: The 47K RL corpus aggregates many public datasets; there is no formal audit of overlap with evaluation sets. Publish de-duplication scripts and hashes, and verify zero leakage (a perceptual-hash audit sketch follows this list).
- Multilingual generalization: Experiments appear English-centric. Evaluate multilingual datasets (questions and embedded text) to test caption sufficiency and visual grounding across languages/scripts.
- Safety and privacy: Encouraging detailed perception captions may increase PII leakage or sensitive content exposure. Incorporate safety filters and measure PII leakage rates under the proposed training.
- Reward balancing and schedules: The weighting of format, answer, and visual rewards (α) is tuned empirically without schedule analysis. Study annealing strategies and balancing to prevent language-dominant optimization and maintain perception fidelity; an annealing sketch follows this list.
- Step-wise reasoning supervision: The approach does not reward intermediate reasoning steps beyond final correctness. Investigate step-level visually grounded checks (e.g., verifying each step’s claims against the caption/image) to improve faithfulness.
- Spatial grounding verification: The current visual reward never checks that caption statements correspond to actual image regions. Add region-level pointing/selection tasks and image-conditioned verifiers for groundedness.
- Cross-modal consistency checks: No constraints ensure alignment between caption tokens and visual evidence. Explore attention-based or retrieval-based corroboration (e.g., CLIPScore with masked regions, contrastive checks) to substantiate claims; a CLIP-based sketch follows this list.
- Continual-learning stability: Self-reward may drift as the policy evolves, causing instability or forgetting. Monitor stability over longer training, propose regularization (e.g., EMA teachers, replay buffers), and report long-horizon behavior; an EMA-teacher sketch follows this list.
- Public reproducibility artifacts: The code link lacks detailed configs, prompts, evaluator templates, and data curation scripts. Release these artifacts to enable exact replication and fair benchmarking.
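For the counterfactual tests suggested in the reliability bullet, a simple occlusion audit could look like the following; `vlm_answer`, the gray-box occlusion, and the per-sample evidence boxes are all assumptions for illustration.

```python
from PIL import Image, ImageDraw

def occlude(image: Image.Image, box: tuple) -> Image.Image:
    """Gray out the region (x0, y0, x1, y1) containing the visual evidence."""
    out = image.copy()
    ImageDraw.Draw(out).rectangle(box, fill="gray")
    return out

def prior_reliance_rate(vlm_answer, samples) -> float:
    """samples: (image, question, gold, evidence_box) tuples."""
    shortcut = 0
    for image, question, gold, box in samples:
        full = vlm_answer(image, question)
        blocked = vlm_answer(occlude(image, box), question)
        if full == gold and blocked == gold:
            shortcut += 1  # still "correct" without the visual evidence
    return shortcut / max(len(samples), 1)
```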
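One concrete form of the graded reward proposed above is multi-QA probing of a single caption: award partial credit for the fraction of probe questions answerable from the caption alone. `llm_answer` is a stand-in for a text-only pass through the policy, and the probe pairs are assumed to be mined or generated offline.

```python
def dense_visual_reward(llm_answer, caption, probes):
    """probes: list of (question, gold) pairs about the same image."""
    if not probes:
        return 0.0
    hits = 0
    for question, gold in probes:
        pred = llm_answer(f"Context: {caption}\nQuestion: {question}\nAnswer:")
        hits += int(pred.strip().lower() == gold.strip().lower())
    return hits / len(probes)  # graded fidelity in [0, 1] instead of 0/1
```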
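As one label-free alternative, the consensus idea could be sketched as majority voting over rollouts on augmented views of the same image, rewarding agreement instead of gold-answer matching. The augmentation and voting scheme here are illustrative.

```python
from collections import Counter

def consensus_rewards(vlm_answer, views, question):
    """views: augmented variants of one image (crops, flips, color jitter)."""
    answers = [vlm_answer(view, question) for view in views]
    majority, count = Counter(answers).most_common(1)[0]
    confidence = count / len(answers)  # agreement ratio, not accuracy
    return [float(a == majority) for a in answers], confidence
```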
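A first step toward validating LSR is a cross-judge reliability check: raw agreement plus chance-corrected kappa over binary shortcut verdicts. This uses the real scikit-learn `cohen_kappa_score`; the verdict encoding is an assumption.

```python
from sklearn.metrics import cohen_kappa_score

def judge_reliability(verdicts_a: list[int], verdicts_b: list[int]) -> dict:
    """Binary shortcut verdicts (1 = shortcut) from two independent judges."""
    raw = sum(a == b for a, b in zip(verdicts_a, verdicts_b)) / len(verdicts_a)
    return {"raw_agreement": raw,
            "cohen_kappa": cohen_kappa_score(verdicts_a, verdicts_b)}
```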
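The explicit-penalty idea could be as simple as a negative margin when the answer is right but the caption-only pass fails; the margin value and the equal weighting are illustrative hyperparameters, not the paper's settings.

```python
def shortcut_penalized_reward(r_answer: float, r_visual: float,
                              margin: float = 0.5) -> float:
    if r_answer == 1.0 and r_visual == 0.0:
        return -margin  # answered via language priors, not perception
    return 0.5 * r_answer + 0.5 * r_visual
```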
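As a starting point for the theoretical analysis, note that when the perception is generated from the image (given the question), the chain rule yields a clean decomposition; this is a hedged sketch of a direction to prove, not a result from the paper.

```latex
% Let I be the image, C the generated perception, and A the answer.
% Since C is a function of I (and the question), I(A; I, C) = I(A; I), so
\[
  I(A; I) \;=\; I(A; C) \;+\; I(A; I \mid C).
\]
% A visual reward that enforces caption sufficiency drives the residual
% term toward zero, i.e. the answer depends on the image only through C:
\[
  I(A; I \mid C) \;\to\; 0
  \quad\Longrightarrow\quad
  I(A; I) \;\approx\; I(A; C).
\]
```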
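A minimal contamination audit along these lines could hash every image with a perceptual hash and flag near-collisions between the RL corpus and each benchmark. The `imagehash` and Pillow packages are real; the directory layout, file glob, and Hamming threshold are assumptions, and the quadratic scan would want an index (e.g., a BK-tree) at scale.

```python
from pathlib import Path
from PIL import Image
import imagehash

def phashes(image_dir: str) -> dict:
    return {p: imagehash.phash(Image.open(p))
            for p in Path(image_dir).glob("**/*.jpg")}

def find_overlap(train_dir: str, eval_dir: str, max_dist: int = 4):
    train, evald = phashes(train_dir), phashes(eval_dir)
    hits = []
    for tp, th in train.items():
        for ep, eh in evald.items():
            if th - eh <= max_dist:  # Hamming distance between phashes
                hits.append((str(tp), str(ep), th - eh))
    return hits
```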
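One schedule worth ablating is cosine annealing of the visual-reward weight: perception-heavy early, outcome-heavy late, so the model first learns to see before optimizing answers. The endpoints and cosine form are illustrative.

```python
import math

def alpha_schedule(step: int, total_steps: int,
                   alpha_start: float = 0.8, alpha_end: float = 0.3) -> float:
    """Cosine decay of the visual-reward weight over training."""
    t = min(step / max(total_steps, 1), 1.0)
    return alpha_end + 0.5 * (alpha_start - alpha_end) * (1 + math.cos(math.pi * t))

# r_total = alpha * r_visual + (1 - alpha) * r_answer (+ format reward)
```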
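A retrieval-style corroboration pass could score each caption sentence against the image with CLIP and flag low-similarity claims. The checkpoint name is a real Hugging Face model; the threshold is an assumption, and CLIP similarity is only a weak proxy for visual entailment.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def flag_ungrounded(image: Image.Image, sentences: list[str],
                    thresh: float = 0.2):
    """Return (sentence, similarity) pairs whose CLIP score falls below thresh."""
    inputs = processor(text=sentences, images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    # image_embeds / text_embeds are L2-normalized projected embeddings
    sims = (out.text_embeds @ out.image_embeds.T).squeeze(-1)
    return [(s, float(sim)) for s, sim in zip(sentences, sims) if sim < thresh]
```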
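One concrete regularizer against self-reward drift is an EMA teacher: keep a slow-moving copy of the policy and route the second-pass verification through it, so the scoring model lags the live policy. The decay value is illustrative.

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module,
               decay: float = 0.999) -> None:
    """Exponential moving average of student weights into the teacher."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(decay).add_(s_param, alpha=1.0 - decay)

# Usage: teacher = copy.deepcopy(policy); call ema_update(teacher, policy)
# each step and run the caption-only verification pass with `teacher`.
```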