Balancing SFT and RLVR in training video-LLMs

Determine the appropriate balance between supervised fine-tuning and reinforcement learning with verifiable rewards when training video large language models to achieve robust spatio-temporal video understanding.

Background

The paper empirically compares training paradigms and finds that supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) have complementary strengths, with each paradigm holding advantages on different tasks.

Despite these observations, the authors highlight that the optimal balance between SFT and RLVR for video understanding remains unresolved, identifying it as an open question for future research.
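To make the trade-off concrete, here is a minimal toy sketch of what jointly weighting an SFT objective against an RLVR-style (REINFORCE) objective could look like on a bandit-sized policy. Everything here is an illustrative assumption, not the paper's method: `ALPHA` stands in for the unresolved SFT/RLVR balance, and `verifiable_reward` plays the role of an exact-match verifier.

```python
# Toy sketch (assumed, not from the paper): mix an SFT cross-entropy
# gradient with an RLVR REINFORCE gradient on a softmax policy over
# a few discrete "answers".
import math
import random

random.seed(0)

NUM_ACTIONS = 4
TARGET = 2    # gold answer, used by both the SFT label and the verifier
ALPHA = 0.5   # SFT vs. RLVR mixing weight -- the open question is how to set this
LR = 0.5

logits = [0.0] * NUM_ACTIONS

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def verifiable_reward(action):
    # Verifier: reward 1 iff the sampled answer matches the checkable ground truth.
    return 1.0 if action == TARGET else 0.0

for step in range(200):
    p = softmax(logits)
    # SFT: gradient of cross-entropy w.r.t. logits is p - onehot(TARGET).
    sft_grad = [p[i] - (1.0 if i == TARGET else 0.0) for i in range(NUM_ACTIONS)]
    # RLVR: sample an answer, score it with the verifier, apply REINFORCE.
    # Gradient of -r * log p(a) w.r.t. logits is -r * (onehot(a) - p).
    a = random.choices(range(NUM_ACTIONS), weights=p)[0]
    r = verifiable_reward(a)
    rl_grad = [-r * ((1.0 if i == a else 0.0) - p[i]) for i in range(NUM_ACTIONS)]
    # Mixed update; ALPHA interpolates between the two paradigms.
    for i in range(NUM_ACTIONS):
        logits[i] -= LR * (ALPHA * sft_grad[i] + (1 - ALPHA) * rl_grad[i])

print(softmax(logits)[TARGET])  # probability assigned to the verified answer
```

In this toy setting both gradients point toward the same answer, so any `ALPHA` converges; the open question the paper raises is precisely that, on real video-understanding tasks, the two signals are not redundant and the right balance is unknown.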

References

"We further identify an underexplored area in training paradigms, raising the balance between SFT and RLVR as an open question."

Video-Oasis: Rethinking Evaluation of Video Understanding (2603.29616 - Lim et al., 31 Mar 2026), in Conclusion