Scaling video generation beyond short clips

Determine effective methodologies that enable video generation models to scale beyond short clips to long-duration videos while maintaining coherent structure and quality. The problem concerns contemporary text-to-video and related generative approaches, which perform well on short clips but do not yet scale robustly to longer sequences.

Background

The paper reviews progress in text-to-video generation, noting that recent diffusion transformer-based systems can create high-quality short clips. Despite these advances, reliably extending generation to longer durations remains challenging, motivating research into architectures, training strategies, and controls that preserve coherence and quality over extended sequences.
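To make the short-clip constraint concrete, the following is a minimal sketch of fixed-length latent video diffusion sampling, assuming a generic latent-diffusion setup. The function name, tensor shapes, and the toy scheduler update are illustrative assumptions, not the paper's method or any specific library's API; the point is that the temporal extent is fixed before sampling begins.

```python
# Minimal sketch of fixed-length latent video diffusion sampling.
# All names (denoiser, shapes, scheduler step) are illustrative
# assumptions, not from the paper or any particular library.
import torch

def sample_clip(denoiser, text_emb, num_frames=16, steps=50,
                latent_c=4, latent_h=32, latent_w=32):
    """Denoise a fixed-size video latent; clip length is fixed up front."""
    # The temporal extent (num_frames) is chosen before sampling starts,
    # which is one reason plain samplers top out at short clips.
    x = torch.randn(1, latent_c, num_frames, latent_h, latent_w)
    for t in reversed(range(steps)):
        # Hypothetical denoiser: predicts noise given latent, step, text.
        eps = denoiser(x, t, text_emb)
        x = x - eps / steps  # stand-in for a real scheduler update
    return x  # a full pipeline would decode this latent to pixels
```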

Within the broader context of single-shot and multi-shot generation, long-duration video introduces compounding difficulties: error accumulation as clips are extended, memory and compute constraints, and the need to maintain narrative consistency across shots. The authors highlight that bridging the gap from short clips to scalable long-video generation remains an unresolved problem in the field.
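One common workaround, chunk-wise autoregressive extension, also illustrates where error accumulation comes from: each new chunk is conditioned on frames the model itself generated. The sketch below assumes a hypothetical sample_chunk sampler and an overlap-conditioning scheme; neither is from the paper.

```python
# Sketch of chunk-wise autoregressive extension, a common workaround for
# longer videos; the names and overlap-conditioning scheme are assumptions.
import torch

def extend_video(sample_chunk, text_emb, num_chunks=8, overlap=4):
    """Generate a long video chunk by chunk, conditioning each chunk on
    the last `overlap` frames of the previous one (frames on dim 2)."""
    video = sample_chunk(text_emb, cond_frames=None)  # first short clip
    for _ in range(num_chunks - 1):
        cond = video[:, :, -overlap:]          # frames the model must match
        nxt = sample_chunk(text_emb, cond_frames=cond)
        video = torch.cat([video, nxt[:, :, overlap:]], dim=2)
        # Because the conditioning frames are themselves generated, drift
        # in color, identity, or layout feeds forward, so small errors
        # compound over chunks: the error-accumulation problem noted above.
    return video
```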

References

"However, scaling video generation beyond short clips remains an open problem."

MultiShotMaster: A Controllable Multi-Shot Video Generation Framework (Wang et al., 2 Dec 2025, arXiv:2512.03041), Section 2.1 (Text-to-Video Generation)