Scaling behavior of joint video-audio generation models

Determine whether joint video-audio generative models can sustain continuous performance improvements as both training dataset size and model capacity grow, thereby clarifying whether synchronized audio-visual generation scales reliably beyond small-scale settings.

Background

The paper identifies three core challenges unique to video-audio generation: (1) building a high-quality audio-video captioning data pipeline, (2) achieving effective cross-modal fusion during simultaneous generation, and (3) verifying scalability. The third challenge is motivated by the observation that most open-source systems have only been evaluated with small models and limited data, leaving uncertainty about how performance evolves with scale.

Although MOVA is trained at substantial scale and shows improvements in lip synchronization and audio-visual alignment over the course of training, whether performance continues to improve monotonically with larger datasets and model sizes across joint audio-video models more broadly remains unresolved, and the paper explicitly highlights it as an open question.
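In practice, claims of this kind are often made concrete by fitting a power law to a loss-like metric measured at several scales and checking whether the fitted trend holds at larger scales. The sketch below is illustrative only and is not from the paper: the function name, the synthetic data, and the choice of a simple log-log least-squares fit are all assumptions, but they show one standard way such a scaling trend could be quantified.

```python
import numpy as np

def fit_power_law(x, y):
    """Fit y ~ a * x**(-b) by least squares in log-log space.

    Illustrative helper (not from the paper): x could be dataset size
    or parameter count, y a loss-like metric expected to fall with scale.
    Returns the fitted (a, b).
    """
    logx, logy = np.log(x), np.log(y)
    slope, intercept = np.polyfit(logx, logy, 1)  # degree-1 fit: [slope, intercept]
    return float(np.exp(intercept)), float(-slope)

# Synthetic scaling curve: loss decaying as 2.0 * N**-0.3
N = np.array([1e6, 1e7, 1e8, 1e9])
loss = 2.0 * N ** -0.3
a, b = fit_power_law(N, loss)  # recovers a ~= 2.0, b ~= 0.3
```

A sustained scaling trend would show up as new, larger-scale points continuing to fall on the fitted line; a plateau or reversal would appear as systematic deviation above it.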

References

It remains an open question whether video-audio models can sustain continuous performance improvements with larger datasets and model scales.

MOVA: Towards Scalable and Synchronized Video-Audio Generation (2602.08794 - Team et al., 9 Feb 2026) in Section 1 (Introduction)