Conjecture on limited scene variability causing minimal gains from VLM fine-tuning on LIBERO

Determine whether the limited performance gains observed when the vision-language model (VLM) is fine-tuned via Low-Rank Adaptation (LoRA) alongside the flow-based action expert during reinforcement learning of the pi_0 model on the LIBERO-Long benchmark are primarily attributable to LIBERO's limited scene variability, which would render the pretrained VLM representations already sufficiently robust.

Background

The paper investigates whether jointly fine-tuning the VLM (using LoRA with rank 32 and alpha 32) together with the flow-based action expert improves performance during RL on LIBERO-Long. Empirically, the LoRA-II configuration produced a learning trajectory comparable to the frozen-VLM baseline, suggesting limited benefit from VLM fine-tuning in this setting.
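For concreteness, the following is a minimal sketch of how a LoRA configuration with rank 32 and alpha 32 could be instantiated with the Hugging Face PEFT library. The base checkpoint name and the target attention modules are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch, assuming a Hugging Face-compatible VLM backbone and PEFT.
# Checkpoint path and target_modules are hypothetical placeholders.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq

base_vlm = AutoModelForVision2Seq.from_pretrained("path/to/pretrained-vlm")  # hypothetical checkpoint

lora_config = LoraConfig(
    r=32,                                  # LoRA rank, as reported in the paper
    lora_alpha=32,                         # LoRA scaling factor, as reported in the paper
    target_modules=["q_proj", "v_proj"],   # assumed attention projections; not specified in the excerpt
    lora_dropout=0.0,
    bias="none",
)

# Wrap the VLM so that only the low-rank adapter weights are trainable;
# the flow-based action expert would then be optimized jointly during RL.
vlm_with_lora = get_peft_model(base_vlm, lora_config)
vlm_with_lora.print_trainable_parameters()
```

In this setup, freezing the adapters (or omitting them entirely) recovers the frozen-VLM baseline against which the LoRA-II configuration was compared.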

Based on these results, the authors conjecture that the limited performance gains stem from LIBERO’s limited scene variability, implying that pretrained VLM features are already robust enough. Verifying this conjecture would clarify when VLM fine-tuning is beneficial versus unnecessary in flow-based VLA RL on LIBERO.

References

We conjecture the limited performance gain attributable to the limited scene variability within LIBERO, for which the pretrained VLM representations are already sufficiently robust.

$\pi_{\texttt{RL}}$: Online RL Fine-tuning for Flow-based Vision-Language-Action Models (2510.25889 - Chen et al., 29 Oct 2025), Section: Extension: Fine-tune VLM and Action Expert Simultaneously