
Reinforcement Learning in Vision: A Survey (2508.08189v2)

Published 11 Aug 2025 in cs.CV

Abstract: Recent advances at the intersection of reinforcement learning (RL) and visual intelligence have enabled agents that not only perceive complex visual scenes but also reason, generate, and act within them. This survey offers a critical and up-to-date synthesis of the field. We first formalize visual RL problems and trace the evolution of policy-optimization strategies from RLHF to verifiable reward paradigms, and from Proximal Policy Optimization to Group Relative Policy Optimization. We then organize more than 200 representative works into four thematic pillars: multi-modal LLMs, visual generation, unified model frameworks, and vision-language-action models. For each pillar we examine algorithmic design, reward engineering, benchmark progress, and we distill trends such as curriculum-driven training, preference-aligned diffusion, and unified reward modeling. Finally, we review evaluation protocols spanning set-level fidelity, sample-level preference, and state-level stability, and we identify open challenges that include sample efficiency, generalization, and safe deployment. Our goal is to provide researchers and practitioners with a coherent map of the rapidly expanding landscape of visual RL and to highlight promising directions for future inquiry. Resources are available at: https://github.com/weijiawu/Awesome-Visual-Reinforcement-Learning.


Summary

  • The paper formalizes visual RL challenges and categorizes over 200 studies into four thematic pillars based on architectures and reward paradigms.
  • It details three alignment paradigms—RLHF, DPO, and RLVR—highlighting their computational trade-offs and implications for training stability.
  • The survey outlines future challenges including sample efficiency, safe deployment, and refined reward model design to advance visual decision-making.

Reinforcement Learning in Vision: A Survey

Introduction and Scope

This survey provides a comprehensive synthesis of the intersection between reinforcement learning (RL) and visual intelligence, focusing on the rapid evolution of RL methodologies for multimodal large models, including vision-language models (VLMs), vision-language-action (VLA) agents, diffusion-based visual generation, and unified multimodal frameworks. The work formalizes visual RL problems, traces the development of policy optimization strategies, and organizes over 200 representative studies into four thematic pillars: multimodal LLMs, visual generation, unified models, and VLA agents. The survey critically examines algorithmic design, reward engineering, benchmark progress, and evaluation protocols, while identifying open challenges such as sample efficiency, generalization, and safe deployment (Figure 1).

Figure 1: Timeline of Representative Visual Reinforcement Learning Models, organized into Multimodal LLM, Visual Generation, Unified Models, and VLA Models from 2023 to 2025.

Formalization and Alignment Paradigms

The survey casts text and image generation as episodic Markov Decision Processes (MDPs), where the user prompt serves as the initial state and each generated token or pixel patch is an action sampled autoregressively from the policy. Three major alignment paradigms are delineated:

  • RL from Human Feedback (RLHF): Utilizes pairwise human preference data to train a scalar reward model, which is then used to fine-tune the policy via KL-regularized PPO. RLHF pipelines typically follow a three-stage recipe: supervised policy pre-training, reward model training, and RL fine-tuning (the shared KL-regularized objective is written out after this list).
  • Direct Preference Optimization (DPO): Removes the intermediate reward model and RL loop, directly optimizing a contrastive objective against a frozen reference policy. DPO is computationally efficient and avoids the need for value networks or importance sampling (a minimal code sketch of this loss appears after the list).
  • Reinforcement Learning with Verifiable Rewards (RLVR): Replaces subjective human preferences with deterministic, programmatically checkable reward signals (e.g., unit tests, IoU thresholds), enabling stable and scalable RL fine-tuning (Figure 2).

    Figure 2: Three Alignment Paradigms for Reinforcement Learning: RLHF, DPO, and RLVR, each with distinct reward sources and optimization strategies.
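
To make the formalization above concrete, the KL-regularized objective shared by these fine-tuning paradigms can be written compactly. The notation below (dataset D, reward model r_phi, frozen reference policy pi_ref, KL weight beta) is standard but not spelled out in this summary.

```latex
% Episodic MDP view: the prompt x is the initial state and the response
% y = (a_1, ..., a_T) is sampled autoregressively from the policy \pi_\theta.
\max_{\theta}\;
  \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
    \big[ r_\phi(x, y) \big]
  \;-\; \beta\,
  \mathbb{E}_{x \sim \mathcal{D}}
    \big[ \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \big]
```

RLHF optimizes this objective with a learned r_phi and PPO; RLVR replaces r_phi with a programmatic check; DPO folds the same KL-anchored preference objective into a single supervised-style loss, sketched next.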
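
Below is a minimal sketch of the DPO contrastive loss described in the second bullet, assuming summed per-completion log-probabilities have already been computed under the trainable policy and the frozen reference; the tensor names and beta value are illustrative, not taken from the survey.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss on a batch of preference pairs.

    Each argument holds log pi(y | x) summed over tokens for the preferred
    ("chosen") or dispreferred ("rejected") completion, under either the
    trainable policy or the frozen reference. beta trades off preference
    fit against staying close to the reference.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * (margin between chosen and rejected log-ratios))
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()

# Toy usage with random stand-ins for model log-probabilities.
batch = 4
loss = dpo_loss(torch.randn(batch), torch.randn(batch),
                torch.randn(batch), torch.randn(batch))
```

Because no reward model or value network is trained, the update reduces to a supervised-style gradient step on preference pairs, which is the efficiency advantage noted above.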

Policy Optimization Algorithms

Two representative policy optimization algorithms are discussed:

  • Proximal Policy Optimization (PPO): Employs a learned value model for advantage estimation and injects a KL penalty at each token to maintain proximity to a reference policy. PPO is widely used in both RLHF and verifiable-reward settings.
  • Group Relative Policy Optimization (GRPO): Eliminates the value network, instead computing group-normalized advantages across multiple continuations of the same prompt. GRPO offers a lower memory footprint, higher training stability, and transparent trade-offs between reward maximization and reference anchoring (Figure 3; a toy comparison of the two update rules follows the figure caption).

    Figure 3: PPO uses a value model for advantage estimation and token-wise KL regularization; GRPO computes group-normalized advantages and applies prompt-level KL penalties.
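
To make the PPO/GRPO contrast in Figure 3 concrete, the sketch below computes GRPO-style group-normalized advantages over several sampled continuations of one prompt and plugs them into a PPO-style clipped surrogate; the group size, clip range, and sequence-level log-probabilities are illustrative assumptions rather than details from the survey.

```python
import torch

def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized advantages for one prompt: no value network needed.

    rewards: tensor of shape (num_continuations,), one scalar per sample."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped objective, applied here at the sequence level with
    group-normalized advantages in place of value-model estimates."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy example: 8 continuations of the same prompt with verifiable 0/0.5/1 rewards.
rewards = torch.tensor([1.0, 0.0, 0.5, 1.0, 0.0, 0.0, 1.0, 0.5])
advantages = grpo_advantages(rewards)
logp_old = torch.randn(8)
logp_new = logp_old + 0.05 * torch.randn(8)
loss = clipped_surrogate(logp_new, logp_old, advantages)
```

In full GRPO a KL penalty against the reference policy is also applied at the prompt level, as the figure caption notes; it is omitted here for brevity.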

Taxonomy of Visual RL Research

The survey organizes visual RL research into four high-level domains:

  1. Multimodal LLMs (MLLMs): RL is applied to align vision-language backbones with verifiable or preference-based rewards, improving robustness and reducing annotation costs. Extensions include curriculum-driven training and consistency-aware normalization.
  2. Visual Generation: RL fine-tunes diffusion and autoregressive models for image, video, and 3D generation. Reward paradigms include human-centric preference optimization, multimodal reasoning-based evaluation, and metric-driven objective optimization (Figure 4; a toy verifiable-reward example follows this list).

    Figure 4: Three reward paradigms for RL-based image generation: human-centric preference, multimodal reasoning, and metric-driven objective optimization.

  3. Unified Models: Unified RL methods optimize a shared policy across heterogeneous multimodal tasks under a single reinforcement signal, promoting cross-modal generalization and reducing training cost.
  4. Vision-Language-Action Agents: RL is used for GUI automation, visual navigation, and manipulation, leveraging rule-based or preference rewards for robust, long-horizon decision-making.
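
As an illustration of the verifiable, metric-driven rewards referenced above (e.g., the IoU-threshold example from the RLVR paradigm, as used in grounding-style tasks), the sketch below turns box overlap into a deterministic 0/1 reward; the box format and threshold are assumptions made for this example.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def verifiable_reward(pred_box, gt_box, threshold=0.5):
    """Deterministic, programmatically checkable reward: 1.0 if the predicted
    box overlaps the annotation above the IoU threshold, else 0.0."""
    return 1.0 if iou(pred_box, gt_box) >= threshold else 0.0

# Example: a prediction close to the annotation earns the binary reward.
reward = verifiable_reward(pred_box=(10, 10, 50, 50), gt_box=(12, 8, 52, 48))
```

Rewards of this kind need no human labeler in the loop, which is why RLVR-style pipelines scale more cheaply than preference-based ones.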

Evaluation Protocols and Metric Granularity

The survey formalizes evaluation metrics at three granularities:

  • Set-level metrics: Evaluate the generative policy over the entire prompt set (e.g., FID, CLIPScore).
  • Sample-level metrics: Provide per-output rewards for policy optimization (e.g., RLHF, DPO).
  • State-level metrics: Monitor training-time signals such as KL divergence or output-length drift for stability diagnostics (Figure 5; a small KL-monitoring example follows the figure caption).

    Figure 5: Metric Granularity in Visual RL: set-level, sample-level, and state-level metrics for evaluation and training diagnostics.
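
A small example of a state-level diagnostic in the sense just described: the average per-token KL divergence between the current policy and the reference, which can be logged alongside output-length drift during training. The function and tensor names are illustrative, not taken from the survey.

```python
import torch
import torch.nn.functional as F

def mean_token_kl(policy_logits, ref_logits, mask):
    """Average KL(policy || reference) over non-padding tokens.

    policy_logits, ref_logits: (batch, seq_len, vocab_size) tensors.
    mask: (batch, seq_len), 1.0 for real tokens and 0.0 for padding.
    """
    logp = F.log_softmax(policy_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    # Per-token KL: sum over the vocabulary of p * (log p - log q).
    kl = (logp.exp() * (logp - ref_logp)).sum(dim=-1)
    return (kl * mask).sum() / mask.sum()

# Toy usage: a steadily rising value signals drift away from the reference
# policy and, often, impending training instability.
batch, seq_len, vocab = 2, 5, 11
policy_logits = torch.randn(batch, seq_len, vocab)
ref_logits = policy_logits + 0.1 * torch.randn(batch, seq_len, vocab)
mask = torch.ones(batch, seq_len)
kl_value = mean_token_kl(policy_logits, ref_logits, mask).item()
```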

Benchmarks are discussed for each domain, with RL-focused datasets providing human preference data, verifiable success criteria, and step-wise reasoning annotations. The survey highlights the need for evaluation standards that capture real-world utility, ethical alignment, and energy footprint.

Challenges and Future Directions

Key challenges identified include:

  • Reasoning Calibration: Balancing depth and efficiency in visual reasoning, with adaptive horizon policies and meta-reasoning evaluators.
  • Long-Horizon RL in VLA: Addressing sparse rewards and credit assignment through intrinsic sub-goal discovery, affordance critics, and hierarchical RL.
  • RL for Visual Planning: Designing action spaces and credit assignment for "thinking with images," with structured visual skills and cross-modal reward shaping.
  • Reward Model Design for Visual Generation: Integrating low-level signals with high-level human preferences, generalizing across modalities, and mitigating reward hacking.

Conclusion

Visual reinforcement learning has evolved into a robust research area bridging vision, language, and action. The survey demonstrates that progress is driven by scalable reward supervision, unified architectures, and rich benchmarks. Persistent challenges include data and compute efficiency, robust generalization, principled reward design, and comprehensive evaluation. Future developments will likely involve model-based planning, self-supervised pre-training, adaptive curricula, and safety-aware optimization. The survey provides a structured reference for advancing sample-efficient, reliable, and socially aligned visual decision-making agents.
