
Fine-tuning pre-trained vision-language-action policies with online rewards

Determine a robust and sample-efficient procedure to fine-tune pre-trained vision-language-action policy models using online reward feedback during robot interaction, so that these models can effectively learn new manipulation tasks in real-world settings without requiring task-specific demonstrations.


Background

The paper proposes ReWiND, a framework that learns a language-conditioned reward model and uses it to pre-train and then fine-tune robot policies for unseen tasks. While ReWiND demonstrates substantial gains in simulation and real-world experiments, the authors note that stronger policy architectures—such as pre-trained vision-language-action (VLA) models—could further improve performance.
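As a rough illustration of the framework's first ingredient, the sketch below shows how a learned language-conditioned reward model might relabel an online rollout with dense rewards, which is what removes the need for new task demonstrations. The `reward_model`, `frames`, and `instruction` names are placeholders chosen for illustration, not ReWiND's actual interface.

```python
import torch


def relabel_rollout(reward_model, frames, instruction):
    """Score each frame of a rollout against a natural-language instruction.

    Assumes `frames` is a (T, C, H, W) tensor of observations and that
    `reward_model(frames, instruction)` returns a (T,) tensor of per-step
    progress rewards; the real ReWiND interface may differ.
    """
    with torch.no_grad():  # the reward model stays frozen during policy learning
        rewards = reward_model(frames, instruction)
    return rewards
```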

However, transitioning these large pre-trained policies to online reinforcement learning settings is challenging. The authors explicitly state that the best method to fine-tune such models with online rewards is unresolved, citing practical bottlenecks like computational cost and training time during real-world interaction. This motivates an open problem focused on developing an effective methodology for online reward-based fine-tuning of VLA models.
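One commonly discussed way to keep such online fine-tuning tractable is to freeze the pre-trained VLA backbone and update only small adapter modules with a reward-weighted objective (an AWR-style update). The sketch below illustrates that idea under assumed interfaces (`vla_policy`, an `adapter` parameter-naming convention, and a `batch` of replayed transitions labeled by the learned reward model); it is a minimal sketch of one candidate recipe, not a procedure prescribed by the paper.

```python
import torch
import torch.nn as nn


def make_adapter_optimizer(vla_policy: nn.Module, adapter_keyword: str = "adapter", lr: float = 1e-4):
    """Freeze the VLA backbone and optimize only adapter parameters,
    one way to bound compute and training time during real-world interaction."""
    trainable = []
    for name, param in vla_policy.named_parameters():
        if adapter_keyword in name:      # assumes adapters are tagged in parameter names
            param.requires_grad_(True)
            trainable.append(param)
        else:
            param.requires_grad_(False)  # backbone stays frozen
    return torch.optim.AdamW(trainable, lr=lr)


def reward_weighted_update(vla_policy, optimizer, batch, beta: float = 1.0):
    """One advantage-weighted step on replayed online transitions.

    Assumes `batch` holds `obs`, `instruction`, `action`, and `reward`, with
    rewards produced by the language-conditioned reward model, and that
    `vla_policy(obs, instruction)` returns an action distribution.
    """
    dist = vla_policy(batch["obs"], batch["instruction"])
    log_prob = dist.log_prob(batch["action"]).sum(-1)
    advantage = batch["reward"] - batch["reward"].mean()
    weights = torch.exp(beta * advantage).clamp(max=20.0)  # exponential advantage weights
    loss = -(weights * log_prob).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Whether such parameter-efficient, reward-weighted updates suffice for large VLA policies interacting in the real world is exactly the open question raised above.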

References

However, the best way to fine-tune these models with online rewards remains an open challenge (Nakamoto et al., 2024; Guo et al., 2025).

ReWiND: Language-Guided Rewards Teach Robot Policies without New Demonstrations (2505.10911 - Zhang et al., 16 May 2025) in Section 6: Limitations