RLVR: Vision–Language Rewards for VLMs

Updated 31 May 2026

RLVR is a framework that trains vision–language models using programmatically verifiable rewards to align visual perceptions with language outputs.
It incorporates both sparse and dense reward signals to effectively tackle multi-step reasoning and optimize policies in tasks like visual reasoning, planning, and robotic manipulation.
The methodology uses group-wise normalized policy gradients and curriculum strategies, leading to improved sample efficiency and performance across various benchmarks.

Reinforcement Learning from Vision–Language Rewards (RLVR) formalizes a class of training algorithms that supervise vision–language models (VLMs) using programmatically verifiable reward signals, typically derived from automatically checkable outputs rather than human annotation. RLVR subsumes a wide range of techniques across visual reasoning, autonomous planning, and embodied manipulation, and has motivated the development of new optimization frameworks, dense reward assignment methods, and domain-specific extensions for efficient, robust VLM alignment.

1. Foundations of RLVR: Formalism and Workflow

RLVR casts the VLM as a policy $\pi_\theta$ generating token sequences conditioned on paired visual and textual inputs. At each time step $t$, the model’s state is $s_t = (I, q, y_{<t})$, where $I$ is the visual context, $q$ the query, and $y_{<t}$ the partial output. Actions correspond to the emission of tokens $y_t$, leading to trajectories $y = (y_1, \ldots, y_T)$ that can be partitioned into visually grounded description segments, reasoning steps, and final answers (Zhang et al., 2 May 2026).

The objective is to maximize the expected (possibly discounted) return: $J(\pi_\theta) = \mathbb{E}_{y\sim\pi_\theta}\left[\sum_{t=1}^T \gamma^{t-1} r_t\right]$ with $\gamma \in [0,1]$, where $r_t$ is a reward signal, often sparse, with the terminal reward $r_T$ given by an external verifier (e.g., answer correctness, bounding box IoU, trajectory error). RLVR typically employs group-relative policy optimization (GRPO, DAPO, GSPO), updating $\theta$ by comparing group-wise normalized rewards and applying clipped policy gradients (Zhang et al., 2 May 2026, Han et al., 28 Mar 2026).
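As a concrete reading of the objective, the following minimal Python sketch computes the discounted return of a single trajectory from its per-step rewards. The function name `discounted_return` and the example reward sequence are illustrative assumptions, not taken from the cited works.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float = 1.0) -> float:
    """Compute sum_{t=1}^{T} gamma^(t-1) * r_t for one trajectory."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    discount = 1.0  # equals gamma^(t-1), starting at t = 1
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total


# Sparse case: only the final step carries a verifier reward.
sparse_rewards = [0.0, 0.0, 0.0, 1.0]
undiscounted = discounted_return(sparse_rewards, gamma=1.0)
discounted = discounted_return(sparse_rewards, gamma=0.5)
```

With a purely terminal reward and $\gamma = 1$, the return equals the verifier score itself, which is why sparse RLVR setups are often described in terms of a single trajectory-level reward.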

A fundamental distinction from purely text-based RL from verifiable rewards is the heterogeneity of skill types (visual perception, multi-step reasoning, format compliance) and the need for vision-grounded reward extraction pipelines.

2. Reward Structures: Sparse Signals, Dense Process Rewards, and Multi-Criteria Supervision

2.1 Sparse, Verifiable Rewards

Early RLVR methods adopt terminal rewards scored by executable functions on model outputs, e.g.:

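Exact answer match: $r = \mathbb{1}[\hat{a} = a^{*}]$ for question answering, where $\hat{a}$ is the predicted answer and $a^{*}$ the reference answer.
Spatial grounding tasks: $r = \mathrm{IoU}(\hat{b}, b^{*})$ between the predicted region $\hat{b}$ and the reference region $b^{*}$, sometimes quantized, for bounding box or segmentation alignment (Koksal et al., 29 Jul 2025, Song et al., 22 May 2025).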
Planning metrics: negative displacement errors (ADE, FDE) for trajectory outputs (Oh, 17 Jul 2025).
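The three reward types listed above can be sketched as small executable checkers. This is a minimal illustration under stated assumptions: the string normalization in `exact_match_reward`, the `(x_min, y_min, x_max, y_max)` box layout, and the equal weighting of ADE and FDE in `trajectory_reward` are hypothetical choices, not prescribed by the cited works.

```python
import math
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed layout
Point = Tuple[float, float]


def exact_match_reward(pred: str, target: str) -> float:
    """Binary reward: 1.0 if the normalized answers agree, else 0.0."""
    return 1.0 if pred.strip().lower() == target.strip().lower() else 0.0


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def trajectory_reward(pred: Sequence[Point], target: Sequence[Point]) -> float:
    """Negative sum of average (ADE) and final (FDE) displacement errors."""
    if not pred or len(pred) != len(target):
        raise ValueError("trajectories must be non-empty and of equal length")
    dists = [math.dist(p, q) for p, q in zip(pred, target)]
    ade = sum(dists) / len(dists)
    fde = dists[-1]
    return -(ade + fde)
```

Each function is deterministic and depends only on the model output and a reference, which is the property that makes these rewards programmatically verifiable.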

This approach enables strict programmatic checking, but provides sparse signals, exacerbating credit assignment problems—especially in compositional or multi-stage reasoning.

2.2 Dense and Decomposed Rewards

Recognizing the credit assignment bottleneck, novel methods introduce intermediate rewards:

Confidence Growth (PACR): Reward steps based on the increase in model confidence for the ground-truth answer at each token; limitations arise in multimodal settings due to reward scale mismatch between visual and textual steps (Yoon et al., 13 May 2026).
PDCR: Decomposes process rewards based on a model-internal Visual Dependence Score, clustering steps into “visual” and “textual” (perception vs. reasoning). Advantages are normalized within each cluster, restoring proper gradient flow for both sparse perception and dense reasoning steps, and significantly improving sample efficiency (Yoon et al., 13 May 2026).
MIRL: Utilizes mutual information between early description tokens and the image as a lightweight screening reward, guiding sampling towards visually grounded hypotheses and further decoupling visual/perception and reasoning/answer rewards in policy updates (Zhang et al., 2 May 2026).
RLR³: Decomposes multi-part tasks into rubric criteria, each routed to deterministic execution or LLM-judged semantic scoring, then hierarchically aggregated to a final reward (Yu et al., 28 May 2026).
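To illustrate two of the ideas above, the sketch below computes confidence-growth step rewards and then normalizes them separately within a "visual" and a "textual" cluster. It is only loosely modeled on the descriptions of PACR and PDCR: the log-probability difference, the fixed `threshold` on a per-step visual-dependence score, and all function names are assumptions made for illustration, not the published formulations.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def confidence_growth_rewards(answer_probs: Sequence[float]) -> List[float]:
    """Step reward = increase in log-probability of the ground-truth answer.

    answer_probs[0] is the confidence before any step; answer_probs[t] is the
    confidence after step t. All probabilities must be strictly positive.
    """
    return [
        math.log(answer_probs[t]) - math.log(answer_probs[t - 1])
        for t in range(1, len(answer_probs))
    ]


def clusterwise_advantages(
    rewards: Sequence[float],
    visual_scores: Sequence[float],
    threshold: float = 0.5,  # hypothetical cut-off between perception and reasoning
    eps: float = 1e-8,
) -> List[float]:
    """Standardize step rewards within the visual and textual clusters separately."""
    is_visual = [s >= threshold for s in visual_scores]
    advantages = [0.0] * len(rewards)
    for flag in (True, False):
        idx = [i for i, v in enumerate(is_visual) if v == flag]
        if not idx:
            continue
        vals = [rewards[i] for i in idx]
        mu, sigma = mean(vals), pstdev(vals)
        for i in idx:
            advantages[i] = (rewards[i] - mu) / (sigma + eps)
    return advantages


# Two low-magnitude perception steps and two high-magnitude reasoning steps.
adv = clusterwise_advantages([0.1, 0.3, 2.0, 4.0], [0.9, 0.8, 0.1, 0.2])
```

In this toy input the perception rewards are an order of magnitude smaller than the reasoning rewards, yet after per-cluster normalization both clusters yield advantages of comparable scale, which is the scale-mismatch issue the decomposition is meant to address.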

2.3 Curriculum, Hierarchical, and Control Strategies

Several works introduce staged or curriculum-style reward refinement, such as Vision-R1’s rule-tightening (IoU thresholds), as well as strategies for suppressing reward hacking via hierarchical aggregation, robust within-group normalization, and minimal exposure protocols (Zhan et al., 23 Mar 2025, Yu et al., 28 May 2026).
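A curriculum that tightens an IoU rule over training can be sketched as a threshold schedule. The linear schedule, the start and end thresholds, and the function names below are assumptions chosen for illustration; the cited works may use staged or otherwise different schedules.

```python
def iou_threshold(step: int, total_steps: int, start: float = 0.5, end: float = 0.75) -> float:
    """Linearly tighten the IoU acceptance threshold from `start` to `end`."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


def thresholded_reward(iou_value: float, step: int, total_steps: int) -> float:
    """Binary reward: 1.0 only if the IoU clears the current curriculum threshold."""
    return 1.0 if iou_value >= iou_threshold(step, total_steps) else 0.0


# The same prediction (IoU = 0.6) is rewarded early in training but not at the end.
early = thresholded_reward(0.6, step=0, total_steps=100)
late = thresholded_reward(0.6, step=100, total_steps=100)
```

The design intent is that loose early thresholds keep the reward signal from being almost always zero, while later thresholds demand more precise localization.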

3. RLVR Algorithms: Policy Optimization, Sampling, and Decoupled Updates

RLVR implementations nearly always modify policy gradients with groupwise normalization. In its canonical form:

Sample $G$ output trajectories per input using $\pi_{\theta_{\mathrm{old}}}$.
Evaluate each using the verifiable reward function, then compute the group mean and standard deviation to standardize rewards to advantages $A_i = \frac{r_i - \mathrm{mean}(r_1, \ldots, r_G)}{\mathrm{std}(r_1, \ldots, r_G)}$.
Optimize a PPO- or DPO-style surrogate loss, regularizing updates against a frozen reference policy ($\pi_{\mathrm{ref}}$), to stabilize behavior and prevent divergence from pretrained priors (Zhang et al., 2 May 2026, Han et al., 28 Mar 2026).
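The canonical procedure above can be sketched in a few lines of Python. This is a simplified, sequence-level illustration: the names `group_advantages` and `clipped_surrogate`, the `clip_eps` default, and the small `eps` stabilizer are assumptions, and the regularization term against the reference policy is omitted for brevity.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardize the rewards of one group of rollouts for the same input."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(
    logp_new: Sequence[float],
    logp_old: Sequence[float],
    advantages: Sequence[float],
    clip_eps: float = 0.2,
) -> float:
    """PPO-style clipped objective averaged over the group (to be maximized)."""
    terms = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # importance ratio of new vs. sampling policy
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(ratio * adv, clipped * adv))
    return mean(terms)


# Four rollouts for one prompt, two of which pass the verifier.
advs = group_advantages([1.0, 0.0, 0.0, 1.0])
objective = clipped_surrogate([-1.0, -2.0, -2.0, -1.0], [-1.0, -2.0, -2.0, -1.0], advs)
```

Note that a group in which every rollout receives the same reward yields all-zero advantages and therefore no gradient signal, which is one motivation for the sampling strategies discussed in this section.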

Recent strategies extend this by:

Splitting reward assignment across token segments (e.g. MIRL for description vs. reasoning).
Allocating sampling budgets via forking and MI-based pre-screeners (MIRL).
Integrating ground-truth labels within rollouts for joint SFT and RL (ViSurf) (Liu et al., 12 Oct 2025).
Assigning separate per-skill advantages in PDCR.

The following table summarizes key algorithmic enhancements:

Method	Reward Signal(s)	Sampling/Optimization
RLVR (basic)	Terminal verifiable reward (0/1, IoU, etc.)	Uniform rollout, GRPO/DAPO
MIRL	MI-based, task answer (decoupled)	MI pre-screen, forking
PDCR	Dense confidence (perception-reason split)	Step clustering, cluster norm.
ViSurf	SFT label and RL rollouts, with control utils	Joint-augmented advantage
RLR³	Rubric: multiple criteria, hierarchical control	Dual executor, within-group norm.

4. Domain Applications: Reasoning, Planning, Manipulation, and Data-Scarce Regimes

RLVR underpins alignment efforts in a wide variety of settings beyond standard visual question answering:

Vision–Language Reasoning: MIRL and PDCR have shown consistent gains on MathVista, MathVerse, We-Math, MMStar, and RealWorldQA, resolving limitations in sample efficiency and fine-grained optimization (Zhang et al., 2 May 2026, Yoon et al., 13 May 2026).
Autonomous Planning: LaViPlan applies RLVR to sequence prediction for autonomous driving, using ADE/FDE as verifiable reward and explicit KL-regularized GRPO for aligning high-level reasoning with trajectory outputs (Oh, 17 Jul 2025).
Robotic Manipulation: ManipLVM-R1 and Large Reward Models (LRM) generalize RLVR to instruction-conditioned affordance and trajectory prediction, integrating multiple spatial/logical reward components to surpass supervised methods, with strong generalization to out-of-distribution physical scenarios (Song et al., 22 May 2025, Wu et al., 17 Mar 2026).
Dense Visual Feedback: TeViR leverages pretrained text-to-video diffusion models to generate dense comparison metrics, improving sample efficiency and robustness over traditional VLM-based sparse reward RL (Chen et al., 26 May 2025).
Few-Shot and Data-Scarce Alignment: RLVR has proven effective in satellite imagery, where lightweight, programmatic binary and IoU-based verifiable rewards suffice to unlock double-digit gains over base models given as few as one training example (Koksal et al., 29 Jul 2025).

5. Reward Model Engineering: Rubrics, Multi-Criteria Execution, and Robustness Protocols

A marked trend in recent RLVR research is the explicit engineering of reward models and verifiers tailored to complex, partially checkable outputs:

Rubric-Based Rewards: RLR³ structures the reward as a sum over multiple task criteria, assigns criterion type (essential/additional), and deploys dual pipelines—deterministic extraction/verifier for checkable criteria, LLM-based credit assignment for fuzzy elements (Yu et al., 28 May 2026).
Score Normalization and Aggregation: Hierarchical aggregation ensures essential criteria must be satisfied before credit is transferred from additional ones, and within-group remapping prevents score saturation.
Minimal Exposure Strategy: Information masking in the reward pipelines blocks policies from exploiting knowledge of ground-truth reference objects or images needed only for checking, sharply reducing false-positive reward hacking in adversarial tests (Yu et al., 28 May 2026).
Trajectory and Path Rewards: For tasks with sequential outputs, e.g., robotic manipulation and planning, specialized reward functions based on trajectory similarity (Fréchet, Hausdorff, RMSE) and endpoint proximity are integrated for robust physical reasoning (Song et al., 22 May 2025, Oh, 17 Jul 2025).
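The hierarchical aggregation idea, in which additional criteria only contribute once essential criteria are satisfied, can be sketched as follows. The `CriterionScore` type, the `essential_weight` split, and the all-or-nothing gating rule are hypothetical simplifications for illustration and are not taken from RLR³.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CriterionScore:
    name: str
    score: float  # in [0, 1], from a deterministic check or an LLM judge
    essential: bool


def aggregate(scores: Sequence[CriterionScore], essential_weight: float = 0.7) -> float:
    """Combine criterion scores; additional credit is gated on essential criteria."""
    essential = [c.score for c in scores if c.essential]
    additional = [c.score for c in scores if not c.essential]
    ess = sum(essential) / len(essential) if essential else 1.0
    add = sum(additional) / len(additional) if additional else 0.0
    # Additional criteria count only when every essential criterion is fully met.
    gate_open = all(s >= 1.0 for s in essential)
    bonus = add if gate_open else 0.0
    return essential_weight * ess + (1.0 - essential_weight) * bonus


passing = aggregate([
    CriterionScore("answer_correct", 1.0, essential=True),
    CriterionScore("explanation_quality", 0.5, essential=False),
])
failing = aggregate([
    CriterionScore("answer_correct", 1.0, essential=True),
    CriterionScore("box_valid", 0.0, essential=True),
    CriterionScore("explanation_quality", 1.0, essential=False),
])
```

Gating of this kind prevents a policy from accumulating reward through stylistic or secondary criteria while failing the checkable core of the task.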

6. Sample Efficiency, Performance Trends, and Empirical Benchmarks

Across domains, RLVR and its extensions lead to consistent accuracy and sample efficiency gains:

MIRL achieves 70.22% average accuracy over six reasoning benchmarks at 25% reduced full rollout cost compared to DAPO baselines; ablations confirm MI-guided selection and decoupled reward assignment are both necessary (Zhang et al., 2 May 2026).
PDCR shows +0.7 to 1.4 points average accuracy over prior global-reward and sparse-reward baselines, with fastest convergence on perception-heavy benchmarks (Yoon et al., 13 May 2026).
LaViPlan yields 15–20% relative reduction in trajectory errors (ADE/FDE), with significant improvements over supervised fine-tuning under distributional shift (Oh, 17 Jul 2025).
RLR³ obtains a +4.7-point gain in macro-average accuracy on a diverse set of vision–language reasoning tasks, surpassing the instruct-to-thinking model gap, and exhibits enhanced robustness to reward exploitation (Yu et al., 28 May 2026).
ViSurf outperforms both SFT and RLVR, reaching 69.6% average accuracy on a difficult multi-task suite and eliminating catastrophic forgetting relative to single-stage or staged baselines (Liu et al., 12 Oct 2025).

7. Limitations, Open Challenges, and Prospective Directions

Despite substantial progress, key challenges remain:

Skill Heterogeneity and Credit Assignment: Sparse or naively uniform process rewards remain insufficient for multi-modal, multi-step reasoning; fine-grained intra-segment reward allocation is critical (Yoon et al., 13 May 2026).
Reward Hacking and Exploitability: Methods without minimal exposure or hierarchical aggregation are vulnerable to adversarial behaviors that trigger false positives; robust vetting is essential (Yu et al., 28 May 2026, Schroeder et al., 30 Mar 2026).
Generalization and Task Complexity: RLVR performance may decay in tasks with incomplete, subjective, or unscorable outputs, where programmatic verifiers are unavailable or semantic credit is ambiguous.
Compute Cost and Inference Latency: Techniques involving dense video modeling, structured reward routing, and spatial alignment incur additional training overhead, though typically within acceptable margins for large-model training (Zhang et al., 2 May 2026, Han et al., 28 Mar 2026).
Data Scarcity: Although RLVR has been shown to be sample-efficient in one-shot scenarios, extreme few-shot alignment can overfit to specific metrics, necessitating prompt/loss engineering and regularization (Koksal et al., 29 Jul 2025).

Research continues on hybrid symbolic–neural verifiers, co-training of policy and reward models, and principled unification of supervised and reinforcement objectives for efficient, robust, and generalizable VLM alignment under the RLVR paradigm.