
Improving Vision-Language-Action Model with Online Reinforcement Learning (2501.16664v1)

Published 28 Jan 2025 in cs.RO, cs.CV, and cs.LG

Abstract: Recent studies have successfully integrated large vision-language models (VLMs) into low-level robotic control by supervised fine-tuning (SFT) with expert robotic datasets, resulting in what we term vision-language-action (VLA) models. Although the VLA models are powerful, how to improve these large models during interaction with environments remains an open question. In this paper, we explore how to further improve these VLA models via Reinforcement Learning (RL), a commonly used fine-tuning technique for large models. However, we find that directly applying online RL to large VLA models presents significant challenges, including training instability that severely impacts the performance of large models, and computing burdens that exceed the capabilities of most local machines. To address these challenges, we propose iRe-VLA framework, which iterates between Reinforcement Learning and Supervised Learning to effectively improve VLA models, leveraging the exploratory benefits of RL while maintaining the stability of supervised learning. Experiments in two simulated benchmarks and a real-world manipulation suite validate the effectiveness of our method.

Summary

  • The paper proposes the iRe-VLA framework to decouple online exploration from full-model updates, mitigating instability in RL fine-tuning of large VLA models.
  • By freezing the VLM backbone during RL and training a lightweight action head, the approach reduces computational costs and risks of catastrophic forgetting.
  • Supervised integration using PEFT techniques such as LoRA on combined expert and RL-generated datasets improves performance and generalization across robotic tasks.

Vision-Language-Action (VLA) models, typically initialized via Supervised Fine-Tuning (SFT) on large-scale expert demonstration datasets, represent a significant advancement in bridging multimodal understanding with robotic control. However, SFT has limitations: it relies heavily on the availability and quality of expert data, which is often expensive to collect, and may not fully capture the nuances of interaction within specific physical environments or adapt to novel scenarios not present in the training data. Online Reinforcement Learning (RL) offers a compelling alternative for refining these models through direct environmental interaction, enabling adaptation and improvement beyond the initial SFT phase. Nevertheless, applying online RL directly to large, multi-billion parameter VLA models presents substantial hurdles related to training stability and computational demands.

Challenges with Direct Online RL for VLAs

Direct application of standard online RL algorithms (e.g., Proximal Policy Optimization (PPO), Soft Actor-Critic (SAC)) to fine-tune the entirety of a large VLA model during environmental interaction encounters several critical issues:

  1. Training Instability: Large pre-trained models, particularly the Vision-Language Models (VLMs) forming the backbone of VLAs, are sensitive to the noisy gradients inherent in RL updates, especially in sparse-reward or long-horizon robotic tasks. This can lead to catastrophic forgetting of the knowledge acquired during pre-training and SFT, often resulting in significant performance degradation or even complete model collapse. The high variance of RL gradients can destabilize the carefully learned representations within the VLM.
  2. Computational Burden: Fine-tuning multi-billion parameter models requires substantial computational resources (memory and processing power). Performing full model updates at the frequency required by online RL algorithms is often infeasible on typical local compute setups (e.g., single GPU workstations) commonly used for robotic interaction experiments. This limits the practicality of online RL for researchers and practitioners without access to large-scale distributed training infrastructure.

The iRe-VLA Framework

To address these challenges, the iterative Reinforcement-VLA (iRe-VLA) framework is proposed (2501.16664). This framework decouples the exploratory aspect of RL from the large-scale model update process, iterating between two distinct stages to achieve stable and computationally feasible online improvement.

Stage 1: Stabilized Online RL Exploration

In the first stage, the VLA agent interacts with the target environment using an online RL algorithm. The key characteristic of this stage is that the parameters of the large VLM backbone are kept frozen. Only a lightweight action head (e.g., a small Multi-Layer Perceptron mapping VLM features to actions) and potentially an associated critic network are trained using the RL objective function.

  • RL Algorithms: Standard algorithms like PPO or SAC combined with demonstrations (SACfD) can be employed. For instance, using PPO involves collecting trajectories $(s_t, a_t, r_t, s_{t+1})$ and updating the policy (action head) $\pi_\theta$ and value function $V_\phi$ based on the policy gradient objective, often incorporating Generalized Advantage Estimation (GAE).
  • Objective: The primary goal is efficient exploration and discovery of successful task behaviors within the interactive environment. By freezing the VLM, the process is stabilized; the RL updates do not risk corrupting the rich, pre-trained representations.
  • Computational Benefit: Training only the small action/critic heads significantly reduces the memory footprint and computational cost, making this stage viable on standard hardware (e.g., a single consumer-grade GPU).
  • Data Collection: Successful trajectories generated during this phase, potentially filtered based on task success or cumulative reward, are collected into an online dataset, $D_{RL}$. A minimal sketch of this stage follows below.
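
This sketch illustrates the Stage 1 setup under simplifying assumptions: a toy `FrozenVLMEncoder` stands in for the real VLM backbone, actions are continuous, and a single PPO-clip update is shown with advantages assumed to be precomputed (e.g., via GAE). It is a minimal illustration of the frozen-backbone idea, not the paper's implementation.

```python
import torch
import torch.nn as nn


class FrozenVLMEncoder(nn.Module):
    """Stand-in for the pre-trained VLM backbone; its parameters stay frozen in Stage 1."""

    def __init__(self, obs_dim=64, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 256), nn.GELU(), nn.Linear(256, feat_dim))
        for p in self.parameters():
            p.requires_grad = False  # VLM weights receive no RL gradients

    def forward(self, obs):
        with torch.no_grad():
            return self.net(obs)


class ActionHead(nn.Module):
    """Lightweight trainable policy head: VLM features -> Gaussian over continuous actions."""

    def __init__(self, feat_dim=128, act_dim=7):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(feat_dim, 128), nn.Tanh(), nn.Linear(128, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, feats):
        return torch.distributions.Normal(self.mu(feats), self.log_std.exp())


class Critic(nn.Module):
    """Small value head trained alongside the actor during Stage 1."""

    def __init__(self, feat_dim=128):
        super().__init__()
        self.v = nn.Sequential(nn.Linear(feat_dim, 128), nn.Tanh(), nn.Linear(128, 1))

    def forward(self, feats):
        return self.v(feats).squeeze(-1)


def ppo_step(encoder, actor, critic, optimizer, batch, clip_eps=0.2):
    """One PPO-clip update; only the action head and critic receive gradients."""
    feats = encoder(batch["obs"])                  # frozen VLM features (no grad)
    dist = actor.dist(feats)
    logp = dist.log_prob(batch["act"]).sum(-1)
    ratio = (logp - batch["logp_old"]).exp()
    adv = batch["adv"]                             # advantages, e.g. precomputed with GAE
    policy_loss = -torch.min(ratio * adv,
                             ratio.clamp(1 - clip_eps, 1 + clip_eps) * adv).mean()
    value_loss = (critic(feats) - batch["ret"]).pow(2).mean()
    loss = policy_loss + 0.5 * value_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Toy usage with random stand-in data (obs_dim=64, act_dim=7).
enc, actor, critic = FrozenVLMEncoder(), ActionHead(), Critic()
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=3e-4)
batch = {"obs": torch.randn(32, 64), "act": torch.randn(32, 7), "logp_old": torch.randn(32),
         "adv": torch.randn(32), "ret": torch.randn(32)}
ppo_step(enc, actor, critic, opt, batch)
```

Because the backbone never enters the optimizer, memory and compute in this stage are dominated by VLM inference, which is what makes single-GPU online training tractable.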

Stage 2: Supervised Integration and Refinement

Following the online RL stage, the second stage focuses on integrating the newly acquired knowledge into the full VLA model using Supervised Learning (SL).

  • Model Update: In this stage, the entire VLA model is updated. To manage the computational aspect of updating the large VLM backbone, Parameter-Efficient Fine-Tuning (PEFT) techniques, such as Low-Rank Adaptation (LoRA), are employed. LoRA injects trainable low-rank matrices into the frozen VLM layers, allowing adaptation with significantly fewer trainable parameters compared to full fine-tuning. The action head is also updated during this SL phase.
  • Training Data: The model is trained on a combined dataset consisting of the original expert demonstration dataset ($D_e$) used for the initial SFT and the successful online trajectories ($D_{RL}$) collected during Stage 1. This mixing strategy helps prevent catastrophic forgetting of the behaviors learned from the expert data while incorporating the novel strategies discovered via RL.
  • Loss Function: A standard SL loss, typically Mean Squared Error (MSE) between the predicted actions and the actions recorded in the combined dataset ($D_e \cup D_{RL}$), is used:

    $$L_{MSE} = \frac{1}{|D_e \cup D_{RL}|} \sum_{(s, a) \in D_e \cup D_{RL}} \| \pi(s) - a \|^2$$

    where $\pi(s)$ is the action predicted by the VLA model given state observation $s$. A minimal training sketch for this stage follows the list below.

  • Computational Management: While more computationally intensive than Stage 1 due to the involvement of the VLM (even with LoRA), this SL stage is performed offline. It can be executed on more powerful compute resources or distributed systems if necessary, decoupled from the real-time constraints of online interaction.
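
The sketch below hand-rolls a `LoRALinear` wrapper in place of a library such as peft and applies the MSE objective above to toy stand-ins for $D_e$ and $D_{RL}$; all module sizes, dataset shapes, and hyperparameters are illustrative assumptions rather than values from the paper.

```python
import torch
import torch.nn as nn
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset


class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank (LoRA) update."""

    def __init__(self, base: nn.Linear, r=16, alpha=32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # original VLM weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T) @ self.lora_B.T * self.scaling


def add_lora(module: nn.Module, r=16, alpha=32):
    """Recursively wrap every nn.Linear inside the (toy) VLM with a LoRA adapter."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, LoRALinear(child, r, alpha))
        else:
            add_lora(child, r, alpha)


def supervised_stage(vlm, action_head, d_expert, d_rl, epochs=1, lr=1e-4):
    """Train LoRA adapters + action head with MSE on the combined dataset D_e ∪ D_RL."""
    add_lora(vlm)
    loader = DataLoader(ConcatDataset([d_expert, d_rl]), batch_size=64, shuffle=True)
    params = [p for p in vlm.parameters() if p.requires_grad] + list(action_head.parameters())
    opt = torch.optim.AdamW(params, lr=lr)
    for _ in range(epochs):
        for obs, act in loader:
            pred = action_head(vlm(obs))                 # full VLA forward pass
            loss = nn.functional.mse_loss(pred, act)     # the L_MSE objective above
            opt.zero_grad()
            loss.backward()
            opt.step()
    return vlm, action_head


# Toy usage: random (observation, action) pairs standing in for D_e and D_RL.
vlm = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 128))
head = nn.Sequential(nn.Linear(128, 64), nn.Tanh(), nn.Linear(64, 7))
d_e = TensorDataset(torch.randn(256, 64), torch.randn(256, 7))
d_rl = TensorDataset(torch.randn(64, 64), torch.randn(64, 7))
supervised_stage(vlm, head, d_e, d_rl)
```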

The framework iterates between Stage 1 and Stage 2, allowing the VLA model to progressively explore, learn, and integrate new behaviors from online interaction in a stable and resource-efficient manner.
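
Structurally, the outer loop can be summarized as follows; `stage1_explore`, `stage2_integrate`, and `filter_success` are hypothetical callables standing in for the two stages sketched above and a task-specific success criterion.

```python
def ire_vla_loop(stage1_explore, stage2_integrate, filter_success, d_expert, n_iterations=5):
    """Alternate online RL exploration (Stage 1) with supervised integration (Stage 2)."""
    d_rl = []  # successful online trajectories accumulated across iterations
    for _ in range(n_iterations):
        # Stage 1: interact with the environment using the frozen-VLM policy,
        # training only the lightweight action head (and critic) via RL.
        trajectories = stage1_explore()
        d_rl.extend(t for t in trajectories if filter_success(t))
        # Stage 2: update LoRA adapters + action head with SL on D_e ∪ D_RL.
        stage2_integrate(list(d_expert) + d_rl)
    return d_rl
```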

Implementation Considerations

Model Architecture and Training Details

A typical VLA architecture consists of a pre-trained VLM backbone (e.g., based on Llama or ViT architectures) followed by an action head.

  • VLM Backbone: Remains frozen during Stage 1 RL. During Stage 2 SL, LoRA adapters are introduced and trained alongside the action head. Hyperparameters for LoRA (e.g., rank $r$, scaling factor $\alpha$, target modules) need careful tuning.
  • Action Head: Typically a small MLP, trained via RL (e.g., PPO actor loss) in Stage 1 and via SL (MSE loss) in Stage 2.
  • Critic Network (Optional): If using actor-critic RL methods like PPO, a critic head (also typically an MLP) is trained alongside the actor head in Stage 1 using the value loss objective. It might be discarded or re-initialized in subsequent iterations.
  • Optimizers and Learning Rates: Standard optimizers such as AdamW are suitable. Learning rates need careful scheduling, potentially using smaller rates for the LoRA adapters than for the action/critic heads (see the sketch after this list).
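
As a concrete illustration of the learning-rate split, the following sketch groups parameters so the LoRA adapters receive a smaller rate than the action head; the toy modules, the `lora_` naming convention, and the specific rates are assumptions for illustration, not settings reported in the paper. In practice, a library such as peft would typically handle adapter injection with the rank, scaling, and target-module hyperparameters noted above.

```python
import torch
import torch.nn as nn

# Toy stand-ins: one "VLM layer" carrying LoRA factors, plus a small action head.
vlm_layer = nn.Linear(128, 128)
for p in vlm_layer.parameters():
    p.requires_grad = False                             # frozen base weights
vlm_layer.lora_A = nn.Parameter(torch.zeros(8, 128))    # trainable adapter factors
vlm_layer.lora_B = nn.Parameter(torch.zeros(128, 8))
action_head = nn.Sequential(nn.Linear(128, 64), nn.Tanh(), nn.Linear(64, 7))

# AdamW with a smaller learning rate for the LoRA adapters than for the action head.
lora_params = [p for n, p in vlm_layer.named_parameters() if n.startswith("lora_")]
optimizer = torch.optim.AdamW(
    [
        {"params": lora_params, "lr": 1e-5},
        {"params": action_head.parameters(), "lr": 1e-4},
    ],
    weight_decay=0.01,
)
```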

Data Management and Iteration Strategy

  • Trajectory Filtering: Defining "success" to filter trajectories for $D_{RL}$ is crucial. This could be based on reaching a goal state, exceeding a reward threshold, or task-specific metrics.
  • Data Ratio: The mixing ratio between $D_e$ and $D_{RL}$ in Stage 2 can influence performance. An equal mix or weighting based on dataset size or quality might be considered (see the filtering and mixing sketch after this list).
  • Iteration Frequency: The number of environment steps in Stage 1 and the number of SL training epochs in Stage 2 per iteration is a key hyperparameter. Longer exploration phases might yield more diverse data, while more extensive SL training ensures better integration.
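
One plausible way to implement the filtering and mixing described above is sketched below; the trajectory format, the `success` flag in `info`, and the 50/50 sampling ratio are assumptions for illustration.

```python
import random


def is_successful(trajectory, reward_threshold=None):
    """Keep a trajectory if the env reports success, or if total reward clears a threshold."""
    obs, act, rew, info = zip(*trajectory)
    if info[-1].get("success", False):
        return True
    return reward_threshold is not None and sum(rew) >= reward_threshold


def build_stage2_dataset(d_expert, d_rl_raw, rl_fraction=0.5):
    """Filter online rollouts, then build a mixed set of (obs, action) pairs for Stage 2 SL."""
    d_rl = [t for t in d_rl_raw if is_successful(t)]
    expert_pairs = [(o, a) for t in d_expert for (o, a, r, i) in t]
    rl_pairs = [(o, a) for t in d_rl for (o, a, r, i) in t]
    # Re-balance so roughly `rl_fraction` of the samples come from the online data D_RL.
    n_expert = int(len(rl_pairs) * (1 - rl_fraction) / max(rl_fraction, 1e-8))
    mixed = rl_pairs + random.sample(expert_pairs, min(n_expert, len(expert_pairs)))
    random.shuffle(mixed)
    return mixed
```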

Computational Resource Allocation

The framework explicitly manages computational load:

  • Stage 1: Designed for single-GPU execution, compatible with typical robotics setups. Memory usage is dominated by the frozen VLM's inference cost plus the small trainable heads.
  • Stage 2: Higher memory and compute requirements due to gradient computation through LoRA adapters. Can be offloaded to more powerful machines or run less frequently.

Experimental Validation and Results

Experiments were conducted in simulation environments (MetaWorld MT10, MT50; Franka-Kitchen) and on a real-world Franka Emika Panda robot arm (2501.16664).

  • Baselines: Performance was compared against the initial SFT-trained VLA model and a standard online RL approach (PPO-Replay) that fine-tuned the entire model (using LoRA for the VLM) throughout the RL process, augmented with replay from the expert dataset $D_e$.
  • Stability: The PPO-Replay baseline often exhibited instability, with performance frequently degrading below the initial SFT level, validating the instability challenge of direct online RL fine-tuning.
  • Performance Gains: iRe-VLA consistently demonstrated stable learning and achieved significant improvements over the SFT baseline across various task categories:
    • Original Expert Tasks: Performance was maintained or slightly improved, indicating mitigation of catastrophic forgetting.
    • RL-Trained Tasks: New tasks introduced only during the online RL phase were successfully learned, showcasing the framework's ability to acquire novel skills.
    • Unseen Hold-out Tasks: Generalization to entirely new, unseen tasks also improved, suggesting that the iterative RL exploration and SL integration enhance the overall robustness and capability of the VLA model's representations.
  • Ablation Study: An ablation (iRe-VLA-freeze) where the VLM parameters (including LoRA adapters) were kept frozen even during Stage 2 showed significantly lower performance compared to the full iRe-VLA. This confirmed the necessity of updating the VLM representations via SL in Stage 2 to effectively integrate the knowledge gained during RL exploration and improve generalization.

The results empirically validate that the iRe-VLA framework provides a practical and effective method for improving large VLA models using online RL by successfully navigating the trade-offs between exploration, stability, and computational feasibility.

In conclusion, the iRe-VLA framework offers a structured approach to leveraging online RL for refining large VLA models. By alternating between stabilized RL exploration with a frozen VLM and supervised integration using PEFT on combined data, it mitigates the instability and computational issues associated with direct end-to-end RL fine-tuning, enabling practical and demonstrable performance improvements in complex robotic manipulation tasks.
