Balancing SFT and RLVR in training video-LLMs
Determine the appropriate balance between supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) when training video large language models (video-LLMs) to achieve robust spatio-temporal video understanding.
References
We further identify an underexplored area in training paradigms, raising the balance between SFT and RLVR as an open question.
— Video-Oasis: Rethinking Evaluation of Video Understanding
(2603.29616 - Lim et al., 31 Mar 2026) in Conclusion