Robustness of Video-R4 across diverse domains and larger model scales

Determine how robust the Video-R4 system is when applied to domains beyond M4-ViteVQA and related text-centric datasets, and when scaled to model backbones larger than 7B parameters.

Background

The paper introduces Video-R4, a video reasoning agent trained primarily on text-rich video datasets, notably M4-ViteVQA, and evaluated using a 7B backbone. While the model achieves state-of-the-art results and demonstrates transfer to several benchmarks, the authors explicitly note that the training data are limited in domain diversity and the experiments are constrained to a 7B model size.

The Limitations section flags this gap directly: robustness when Video-R4 is extended to more diverse domains, or scaled beyond the 7B backbone, has not been verified, and further evaluation or methodological work is needed to rule out performance or stability regressions under these conditions.
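One way to make this open question concrete is a per-domain evaluation harness: score a fixed predictor on each candidate domain and report the worst-case drop relative to the in-domain (text-centric) score. The sketch below is a minimal, hypothetical illustration of that protocol; the function names, dataset layout, and domains are assumptions, not part of the paper.

```python
def per_domain_accuracy(predict, datasets):
    """Score a predictor separately on each domain.

    datasets: {domain_name: [(question, answer), ...]}
    returns:  {domain_name: accuracy in [0, 1]}
    """
    scores = {}
    for domain, examples in datasets.items():
        correct = sum(1 for q, a in examples if predict(q) == a)
        scores[domain] = correct / len(examples) if examples else 0.0
    return scores


def robustness_gap(scores, reference_domain):
    """Worst-case accuracy drop relative to the in-domain reference score."""
    ref = scores[reference_domain]
    return max(ref - s for s in scores.values())


# Toy usage (hypothetical data): a predictor that is perfect on the
# text-rich in-domain split but misses half of an out-of-domain split.
datasets = {
    "text_rich": [("q1", "a"), ("q2", "b")],
    "sports":    [("q3", "a"), ("q4", "c")],
}
answers = {"q1": "a", "q2": "b", "q3": "a", "q4": "x"}
scores = per_domain_accuracy(answers.get, datasets)
gap = robustness_gap(scores, "text_rich")  # 1.0 - 0.5 = 0.5
```

Repeating this harness at several backbone sizes (e.g. 7B vs. larger) would turn the scale half of the question into a measurable curve of `robustness_gap` against parameter count.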

References

Third, our training data are primarily derived from M4-ViteVQA and a few related text-centric datasets, and experiments are conducted on a 7B backbone, leaving open questions about robustness under more diverse domains and larger model scales.

Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination  (2511.17490 - Tang et al., 21 Nov 2025) in Limitations (Supplementary), page 1