Conjecture on limited scene variability causing minimal gains from VLM fine-tuning on LIBERO
Determine whether the limited performance gains observed when fine-tuning the vision-language model (VLM) with Low-Rank Adaptation (LoRA), alongside the flow-based action expert, during reinforcement learning of the pi_0 model on the LIBERO-Long benchmark are primarily attributable to LIBERO's limited scene variability, which would leave the pretrained VLM representations already sufficiently robust for the benchmark.
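To make the setup under investigation concrete, the sketch below shows one standard way LoRA fine-tuning is structured: the pretrained VLM weights are frozen and only small low-rank adapter matrices train, while the action expert keeps all parameters trainable. This is a minimal illustration of the general LoRA technique, not the paper's implementation; the layer shapes, rank, and the `vlm_proj`/`action_expert` names are hypothetical.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base Linear plus a trainable low-rank update: y = Wx + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weight and bias
        # A is small-random, B is zero-initialized so the adapter starts as a no-op
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Hypothetical usage: LoRA on a "VLM" projection layer; the action expert
# (stand-in: a plain Linear head) remains fully trainable.
vlm_proj = LoRALinear(nn.Linear(64, 64))
action_expert = nn.Linear(64, 7)  # e.g. a 7-DoF action head
trainable = [p for p in list(vlm_proj.parameters()) + list(action_expert.parameters())
             if p.requires_grad]
```

Because `B` is zero-initialized, the adapted layer reproduces the frozen VLM exactly at the start of RL fine-tuning, so any eventual performance gap isolates what the low-rank update contributes.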
References
We conjecture that the limited performance gain is attributable to the limited scene variability within LIBERO, for which the pretrained VLM representations are already sufficiently robust.
— $π_\texttt{RL}$: Online RL Fine-tuning for Flow-based Vision-Language-Action Models
(2510.25889 - Chen et al., 29 Oct 2025) in Section: Extension: Fine-tune VLM and Action Expert Simultaneously