
SimpleVLA-RL: Scalable VLA Reinforcement Learning

Updated 13 September 2025
  • SimpleVLA-RL is an online reinforcement learning framework that integrates vision, language, and action for scalable robotic manipulation.
  • It introduces VLA-specific innovations such as dynamic trajectory sampling, parallelized environment interaction, and outcome-based reward assignment to enhance learning efficiency.
  • Empirical results show significant improvements over supervised fine-tuning, driving emergent behaviors and robust sim-to-real transfer in complex tasks.

SimpleVLA-RL is an efficient online reinforcement learning framework for scaling Vision-Language-Action (VLA) model training, addressing longstanding challenges in data efficiency and generalization for robotic manipulation. Building upon advances in large-scale vision-language models and the veRL platform for LLM reinforcement learning, SimpleVLA-RL introduces VLA-specific innovations in trajectory sampling, parallelized environment interaction, and optimized policy updates. The framework is motivated by the need to circumvent the high cost and limited robustness associated with supervised fine-tuning (SFT) on large-scale human demonstration datasets, and it leverages reinforcement learning to enhance step-by-step policy reasoning, exploration, and generalization.

1. Framework Architecture and Interactive Training

SimpleVLA-RL extends veRL by integrating components specialized for interaction between VLA models and multi-environment robotic simulators. The system’s core workflow involves:

  • Interactive Rollout Algorithm: The policy, parameterized by a VLA transformer backbone (e.g., OpenVLA-OFT), generates action token distributions at each time step, conditioned on the current environment state (visual, proprioceptive, and language inputs). Actions are drawn via temperature-based sampling, and the resulting commands are executed by the simulated robot, altering the environment state for subsequent policy inputs.
  • Parallelized Multi-Environment Rendering: Efficient data throughput is achieved by executing rollouts in parallel across a pool of environments, capturing multimodal feedback and enabling rapid accumulation of diverse experience trajectories.
  • Outcome-Based Reward Assignment: Instead of process- or step-level shaping, a simple terminal outcome reward is used: $R(a_{i,t} \mid s_{i,t}) = 1$ if the entire trajectory $i$ achieves success, and $0$ otherwise. This reward is retroactively assigned to all tokens along a successful trajectory, aligning learning with overall task completion (a minimal rollout sketch follows this list).

This architectural approach decouples environment-specific complexities from the reinforcement learning loop and ensures scalable throughput for large-batch policy optimization.

2. Objectives, Policy Optimization, and RL Enhancements

SimpleVLA-RL adopts Group Relative Policy Optimization (GRPO) as its primary policy update mechanism. The GRPO objective is expressed as:

$$\mathcal{J}(\theta) = \mathbb{E}_{s_0,\, \{\tau_i\}_{i=1}^G} \left[ \frac{1}{G} \sum_{i=1}^G \frac{1}{|\tau_i|}\sum_{t=1}^{|\tau_i|} \min \left( r_{i,t}(\theta)\, \hat{A}_i,\; \mathrm{clip}\!\left(r_{i,t}(\theta),\, 1-\varepsilon_\mathrm{low},\, 1+\varepsilon_\mathrm{high}\right) \hat{A}_i \right) \right]$$

where $r_{i,t}(\theta)$ is the importance sampling ratio for the current policy and $\hat{A}_i$ is a standardized advantage computed across the group of $G$ trajectories. Stability is enforced by requiring that each group contains both successes and failures, ensuring non-trivial policy gradients.
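
The following is a minimal PyTorch sketch of this objective, assuming per-token log-probabilities have already been gathered during rollout; the function name, tensor layout, and default $\varepsilon$ values are illustrative rather than the framework's actual implementation.

```python
import torch

def grpo_loss(logp_new, logp_old, successes, traj_lengths,
              eps_low=0.2, eps_high=0.28):
    """Clipped GRPO surrogate for one group of G trajectories.

    logp_new / logp_old: 1-D tensors of per-token log-probabilities under the
    current and rollout policies, concatenated over the group.
    successes: float tensor of shape (G,) holding the 0/1 outcome rewards.
    traj_lengths: list of token counts |tau_i| per trajectory.
    Interfaces and epsilon defaults here are illustrative.
    """
    # Group-standardized advantage: one scalar per trajectory.
    adv = (successes - successes.mean()) / (successes.std() + 1e-8)
    # Broadcast each trajectory's advantage to all of its tokens.
    adv_tokens = torch.cat([a.expand(n) for a, n in zip(adv, traj_lengths)])

    ratio = torch.exp(logp_new - logp_old)          # importance ratio r_{i,t}(theta)
    unclipped = ratio * adv_tokens
    clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high) * adv_tokens
    per_token = torch.min(unclipped, clipped)

    # Length-normalized average per trajectory, then mean over the group.
    losses, start = [], 0
    for n in traj_lengths:
        losses.append(per_token[start:start + n].mean())
        start += n
    return -torch.stack(losses).mean()              # negate: we maximize the objective
```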

Reinforcement learning is further enhanced through:

  • Dynamic Sampling: Only trajectory groups containing at least one success and at least one failure are included for updates, preventing the vanishing gradients observed when all rewards in a group collapse to identical values.
  • Exploration Modulation: The action sampling temperature is raised (e.g., 1.0 → 1.6), and the upper PPO/GRPO importance-ratio clipping bound is relaxed (e.g., 1.2 → 1.28), encouraging exploration and preventing premature convergence to high-likelihood yet suboptimal trajectories. Both adjustments appear in the sketch after this list.
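
A compact illustration of both mechanisms, assuming the trajectory-group structure from the rollout sketch above; the function name and constants are illustrative, with the quoted hyperparameter values taken from the text.

```python
def dynamic_sampling(groups):
    """Keep only trajectory groups that mix successes and failures, so the
    group-standardized advantages (and hence the gradients) are non-zero.
    `groups` is a list of lists of trajectory dicts as produced by the
    rollout sketch above (illustrative structure)."""
    kept = []
    for group in groups:
        outcomes = [traj["success"] for traj in group]
        if any(outcomes) and not all(outcomes):
            kept.append(group)
    return kept

# Exploration settings used during rollout and the GRPO update
# (values quoted in the text; parameter names are illustrative):
SAMPLING_TEMPERATURE = 1.6   # raised from 1.0
EPS_HIGH = 0.28              # upper clipping bound relaxed from 0.2
```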

These adjustments distinguish SimpleVLA-RL from standard RL implementations that often optimize conservative behavior in the presence of sparse, binary rewards.

3. Tackling Data Scarcity and Distribution Shifts

A fundamental barrier to scaling VLA models lies in the scarcity of high-quality, diverse demonstration data. SFT methods require extensive manual collection of robot-human paired trajectories, leading to overfitting and poor resilience to unseen scene configurations or object variations. SimpleVLA-RL alleviates this data bottleneck by:

  • Relying on interactive reinforcement learning rather than behavior cloning, which enables effective learning from as little as a single demonstration per task.
  • Empowering the policy to actively explore the action space through outcome-driven objective shaping. The sparse reward, propagated across all action tokens, requires the policy to discover a robust solution rather than memorizing expert traces.
  • Demonstrating dramatic generalization improvements in out-of-distribution (OOD) tasks, e.g., spatial rearrangements, novel object placements, and previously unseen goal states.

Empirical evidence in the LIBERO and RoboTwin 1.0/2.0 benchmarks substantiates these gains, with SimpleVLA-RL outperforming SFT and prior state-of-the-art algorithms by significant success rate margins in both single- and dual-arm manipulation scenarios.

4. Emergent Policy Behaviors and “Pushcut” Phenomenon

A salient advantage of the SimpleVLA-RL outcome-based RL strategy is the emergence of novel, previously unseen behavioral routines. During RL training, the model is observed to:

  • Spontaneously discover the “pushcut” maneuver—directly pushing an object into the target area rather than executing the canonical grasp–transport–place sequence demonstrated in SFT data.
  • Adapt action selection and manipulation subroutines dynamically in response to stochastic environment feedback, thereby escaping local optima induced by imitation learning priors.

This emergent behavior underscores the potential of sparse, outcome-based RL frameworks to drive policy innovation, not merely fidelity to expert demonstrations, and suggests a pathway for further optimizing efficiency (e.g., time- or energy-minimizing manipulation).

5. Quantitative Performance and Real-World Transfer

SimpleVLA-RL delivers substantial quantitative improvements. For example:

  • On the LIBERO suite, applying SimpleVLA-RL to OpenVLA-OFT lifts the SFT baseline success rate from ~91% to 99%, with long-horizon tasks improving by ~12 percentage points (86.5% → 98.5%). The RL objective led to 7.8–13.3% gains over models such as $\pi_0$ and UniVLA.
  • For RoboTwin dual-arm tasks, average success rates climb from 39.8% to 70.4% (RoboTwin 1.0) and by 22–30 percentage points (RoboTwin 2.0) compared to SFT-only training.
  • In a stringent data efficiency test (one demonstration per task), performance improves from ~49% (SFT) to ~97% (SimpleVLA-RL).

Experiments with real-world AgileX Piper arms demonstrate that simulation-trained RL policies (e.g., for “Stack Bowls,” “Pick Bottle”) transfer robustly to physical systems, with baseline SFT models attaining ~17.5% and SimpleVLA-RL raising this to ~38.5%, a ~2× improvement. This suggests the framework’s architectural and algorithmic choices are effective for sim-to-real transfer in practical settings.

6. Future Directions and Implications

Potential advancements and open questions highlighted include:

  • Adaptive Exploration and Curriculum: While current exploration-promoting hyperparameters (high sampling temperature, relaxed clipping) are effective, a principled curriculum or adaptive adjustment could bolster learning, particularly for extremely long-horizon or high-dimensional tasks.
  • Reward Design: The binary outcome reward, while effective for generalization and emergent behavior, may be further refined with hybrid process-outcome signals to encode additional constraints (e.g., safety, efficiency, multi-objective trade-offs).
  • Scaling to Multi-modal and Heterogeneous Tasks: The modular nature of SimpleVLA-RL is compatible with environments incorporating vision, tactile, and language grounding; future experiments may extend its utility to more complex, multi-agent, or hierarchical tasks.
  • Integration with Automated Data Generation: Coupling RL with automated demonstration or augmentation strategies could further drive down the data requirements for robust VLA deployment.

By addressing the dual problems of data efficiency and generalization in VLA training, SimpleVLA-RL elevates RL from a supplementary finetuning tool to a central paradigm in vision-language-robotics integration. The framework’s interactive, scalable structure and robust performance metrics establish new baselines for outcome-driven action policy learning in both simulation and real-world robotic manipulation (Li et al., 11 Sep 2025).
