Scaling behavior of joint video-audio generation models
Determine whether joint video-audio generative models continue to improve as both training dataset size and model capacity increase, which would clarify whether synchronized audio-visual generation scales beyond small-scale settings.
References
It remains an open question whether video-audio models can sustain continuous performance improvements with larger datasets and model scales.
— MOVA: Towards Scalable and Synchronized Video-Audio Generation
(2602.08794 - Team et al., 9 Feb 2026) in Section 1 (Introduction)