Vision-SR1: Self-Rewarding VLM Framework

Updated 28 August 2025
  • Vision-SR1 is a vision-language model training framework that mitigates visual hallucinations and language shortcuts by decomposing reasoning into distinct visual perception and language reasoning stages.
  • The framework employs a novel self-rewarding mechanism with reinforcement learning, applying a composite reward function to supervise both intermediate and final outputs.
  • Experimental validations show improved visual reasoning accuracy and reduced false content generation compared to existing models on diverse multimodal benchmarks.

Vision-SR1 is a vision-language model (VLM) training framework designed to mitigate two principal failure modes, visual hallucination and reliance on language shortcuts, by introducing a self-rewarding, reasoning-decomposed learning paradigm. Unlike previous VLM post-training algorithms that supervise only the final output and frequently require external visual annotations, Vision-SR1 decomposes reasoning into visual perception and language reasoning stages and applies a novel self-reward mechanism based on reinforcement learning to adaptively supervise both components. This method is intended to encourage grounded image understanding and robust language generation by providing balanced training signals at intermediate and final steps.

1. Motivation and Problem Statement

Vision-language models (VLMs), including recent large multimodal transformers, commonly exhibit the following deficiencies:

  • Visual Hallucinations: The tendency to generate descriptions or answers containing content not present in the visual input, often exacerbated by strong language priors.
  • Language Shortcuts: The practice of sidestepping the intended visual reasoning process and instead exploiting linguistic regularities or patterns in the training data.

These problems are rooted in sparse or misdirected visual supervision—i.e., post-training methods primarily optimize for answer correctness and neglect explicit guidance at the intermediate stages of visual reasoning. Attempts to correct this have included adding auxiliary rewards based on human annotations or distilled labels from teacher models; however, these approaches are costly, lack adaptability, and can cause policy drift or reward hacking under evolving model policies (Li et al., 27 Aug 2025).

2. Reasoning Decomposition

Vision-SR1 enforces a strict separation between "visual perception" and "language reasoning" during the answer generation process. The protocol is:

  1. Visual Perception Stage: Given a question q and an image I, the model produces a text chunk c encapsulated within dedicated tags (e.g., <visual perception>c</visual perception>) containing all visual information necessary to answer q.
  2. Language Reasoning Stage: The model receives only the question q and the generated visual perception c, and proceeds to generate chain-of-thought reasoning t (<think>t</think>) and a final answer a (<answer>a</answer>), without referring back to I.

The output format is explicitly structured:

| Stage | Format | Description |
| --- | --- | --- |
| Visual Perception | <visual perception>c</visual perception> | All necessary visual details |
| Reasoning | <think>t</think> | Textual intermediate reasoning |
| Answer | <answer>a</answer> | Final answer |

This separation allows the system to assess whether the intermediate perception c is sufficient for correct downstream reasoning without implicit reference to the image.
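
As a concrete illustration, the following minimal sketch parses the structured output and checks that all three tags are present and in order. It assumes the exact tag strings shown above; the helper names (parse_rollout, format_reward) and the example text are illustrative, not part of any released implementation.

```python
import re

# Match the three structured segments in the order described above.
TAG_PATTERN = re.compile(
    r"<visual perception>(?P<perception>.*?)</visual perception>\s*"
    r"<think>(?P<reasoning>.*?)</think>\s*"
    r"<answer>(?P<answer>.*?)</answer>",
    re.DOTALL,
)

def parse_rollout(text: str):
    """Return (perception c, reasoning t, answer a) if the output is well formed, else None."""
    match = TAG_PATTERN.search(text)
    if match is None:
        return None
    return match.group("perception"), match.group("reasoning"), match.group("answer")

def format_reward(text: str) -> float:
    """Binary formatting check: 1.0 if all three tags appear in order, else 0.0."""
    return 1.0 if parse_rollout(text) is not None else 0.0

example = (
    "<visual perception>Two red apples on a wooden table.</visual perception>"
    "<think>The question asks how many apples are visible; the perception lists two.</think>"
    "<answer>2</answer>"
)
print(parse_rollout(example))  # ('Two red apples on a wooden table.', 'The question asks ...', '2')
print(format_reward(example))  # 1.0
```

A check of this kind is the natural basis for the formatting reward r_fmt used in the next section.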

3. Self-Rewarding Mechanism via Reinforcement Learning

Vision-SR1 utilizes reinforcement learning to supervise both intermediate and final outputs by defining a composite reward function (a computational sketch follows the definitions below):

r(Q, s) = r_{\mathrm{visual}}(Q, c) + r_{\mathrm{ans}}(Q, a) + \alpha\, r_{\mathrm{fmt}}(s)

  • Q = (q, I) is the original vision-language query.
  • s is the output sequence.
  • r_{\mathrm{ans}}(Q, a) is a binary reward for answer correctness (match against the ground truth).
  • r_{\mathrm{visual}}(Q, c) is a binary reward based on a secondary rollout: the model is re-prompted with q and c only, and if the derived answer matches the ground truth, c is judged "self-contained."
  • r_{\mathrm{fmt}}(s) is a formatting reward enforcing the structured output format.
  • \alpha is a tunable hyperparameter scaling the format reward.
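
The sketch below shows how this composite reward could be computed in practice. It assumes a generate callable that runs the policy on a text-only prompt for the secondary rollout, an exact-match criterion for answer correctness, and an illustrative prompt template and alpha value; none of these specifics are taken from the paper.

```python
from typing import Callable

def answer_reward(predicted: str, ground_truth: str) -> float:
    """Binary r_ans: 1.0 if the predicted answer matches the ground truth (exact match here)."""
    return 1.0 if predicted.strip().lower() == ground_truth.strip().lower() else 0.0

def visual_reward(question: str, perception: str, ground_truth: str,
                  generate: Callable[[str], str]) -> float:
    """Binary r_visual via the secondary rollout: the policy is re-prompted with only the
    question q and the perception c (no image); c counts as self-contained if the answer
    derived from it alone is still correct."""
    # The prompt template below is illustrative, not the paper's exact wording.
    prompt = f"{question}\nVisual information: {perception}\nAnswer the question."
    second_answer = generate(prompt)  # text-only rollout of the same policy
    return answer_reward(second_answer, ground_truth)

def composite_reward(question: str, perception: str, answer: str, ground_truth: str,
                     r_fmt: float, generate: Callable[[str], str],
                     alpha: float = 0.5) -> float:
    """r(Q, s) = r_visual(Q, c) + r_ans(Q, a) + alpha * r_fmt(s).
    r_fmt comes from a separate format check (see the parsing sketch in Section 2);
    the alpha default of 0.5 is an assumption, not a value reported in the paper."""
    r_vis = visual_reward(question, perception, ground_truth, generate)
    r_ans = answer_reward(answer, ground_truth)
    return r_vis + r_ans + alpha * r_fmt

# Toy usage with a stub policy that always answers "2"
reward = composite_reward("How many apples are on the table?",
                          "Two red apples on a wooden table.", "2", "2",
                          r_fmt=1.0, generate=lambda prompt: "2")
print(reward)  # 2.5 with alpha = 0.5
```

Because the secondary rollout reuses the same policy, the visual reward adapts automatically as the policy improves, which is what removes the need for external annotations or teacher models.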

The policy optimization is performed via Group Relative Policy Optimization (GRPO):

\hat{A}^{\mathrm{grp}}(Q, s_k) = r(Q, s_k) - \frac{1}{K} \sum_{j=1}^{K} r(Q, s_j)

\mathcal{L}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{Q}\left[ \sum_{k=1}^{K} \hat{A}^{\mathrm{grp}}(Q, s_k)\, \log \pi_{\theta}(s_k \mid Q) - \beta\, \mathrm{KL}\!\left( \pi_{\theta}(\cdot \mid Q) \,\|\, \pi_{\theta_0}(\cdot \mid Q) \right) \right]

Here, K is the number of rollouts sampled per query (the group size), \beta is the KL regularization coefficient, \pi_{\theta} is the current policy, and \pi_{\theta_0} is the reference policy.

This framework supplies targeted, variance-controlled gradients to both visual perception and answer-generation modules, encouraging the model to ground outputs in the image rather than purely in linguistic patterns.
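
To make the GRPO update concrete, here is a minimal PyTorch-style sketch of the group-relative advantage and the resulting per-query loss. It assumes that summed log-probabilities for each of the K rollouts and a scalar KL estimate against the reference policy are already available; the beta default and the toy numbers are assumptions rather than values from the paper.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """A_hat^grp(Q, s_k) = r(Q, s_k) - mean_j r(Q, s_j), computed over the K rollouts
    sampled for a single query Q. `rewards` has shape (K,)."""
    return rewards - rewards.mean()

def grpo_loss(log_probs: torch.Tensor, rewards: torch.Tensor,
              kl_to_ref: torch.Tensor, beta: float = 0.04) -> torch.Tensor:
    """Negative of the GRPO objective for one query: maximize
    sum_k A_hat_k * log pi_theta(s_k | Q) - beta * KL(pi_theta || pi_theta0).

    log_probs:  (K,) summed log-probabilities of each rollout under the current policy
    rewards:    (K,) composite rewards r(Q, s_k)
    kl_to_ref:  scalar KL estimate between the current and reference policies
    beta:       KL coefficient (the 0.04 default here is an assumption)
    """
    advantages = group_relative_advantages(rewards).detach()  # no gradient through rewards
    objective = (advantages * log_probs).sum() - beta * kl_to_ref
    return -objective  # minimize the negative objective

# Toy usage with K = 4 rollouts for one query
rewards = torch.tensor([2.5, 1.0, 0.5, 2.0])
log_probs = torch.randn(4, requires_grad=True)
loss = grpo_loss(log_probs, rewards, kl_to_ref=torch.tensor(0.1))
loss.backward()
```

Subtracting the group mean centers the rewards within each query's K rollouts, which is what yields the variance-controlled gradients described above.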

4. Experimental Validation

Vision-SR1 is tested across diverse multimodal tasks, including general image understanding, multimodal math, and hallucination detection. Empirical results indicate:

  • Visual Reasoning Accuracy: On benchmarks such as MMMU, MM-Vet, and RealWorldQA, Vision-SR1 achieves higher scores compared to established baselines. For example, a Qwen2.5-VL-7B backbone obtained an average score of 58.8 across eight benchmarks, surpassing Vision-R1 (≈57.4).
  • Reduction in Visual Hallucinations: The system demonstrates quantifiable improvements in reducing the rate of false or ungrounded content in responses.
  • Shortcut Avoidance: Language Shortcut Rate (LSR) metrics decrease, indicating improved reliance on visual grounding rather than linguistic guesswork.

These improvements stem from the explicit supervision at both visual and reasoning stages, which calibrates the learned representations away from shortcut-prone behavior.

5. Comparison with Existing Approaches

Prior methods address visual grounding mainly via final-answer supervision, optionally supplemented with rewards derived from external annotations or distilled from teacher models. Such external signals are inflexible, scale poorly, and may shift distributionally as the policy adapts, leading to reward hacking. Vision-SR1's self-contained, adaptive protocol:

  • Requires no external visual annotation or model distillation.
  • Produces supervisory signals dynamically by validating the sufficiency of visual perception via an internal dual rollout.
  • Ensures mutual reinforcement of visual detail extraction and language reasoning.
  • Results in enhanced model robustness and interpretability, as all answer rationales are explicitly represented in structured outputs.

6. Implications and Future Directions

The Vision-SR1 paradigm suggests new possibilities for scalable, annotation-free, multimodal VLM training. Notable implications include:

  • Increased trustworthiness and interpretability by requiring "explanatory" intermediate outputs before answers.
  • Scalability to new domains without dependence on annotated external visual rewards.
  • Foundations for more rigorous disentanglement of visual grounding from linguistic priors.

Future work is expected to address enhancements such as direct supervision on visual embeddings (to minimize the information lost when converting visual content to text) and systematic quantification of true visual reasoning gains. Comprehensive evaluation of RL-derived improvements in grounding and generalization remains an open challenge.

7. Summary Table: Vision-SR1 Workflow

| Step | Input | Output | Reward |
| --- | --- | --- | --- |
| Visual Perception | (image, question) | <visual perception>c</visual perception> | r_{\mathrm{visual}} (perception is self-contained) |
| Language Reasoning | (question, perception) | <think>t</think> <answer>a</answer> | r_{\mathrm{ans}} (answer correct) |
| All stages | (question, image) | Structured tags (<visual perception>, <think>, <answer>) | r_{\mathrm{fmt}} (format adherence) |

Vision-SR1 provides robust, decomposed RL supervision for vision-language models and demonstrates marked improvements in both visual grounding and answer fidelity on standard benchmarks (Li et al., 27 Aug 2025).
