Synthesizing sustained long-duration video footage

Develop video-generation methods—particularly diffusion-based models—that can synthesize sustained footage spanning minutes or longer, overcoming the current limitation of generating only short clips of approximately 2–10 seconds.

Background

The paper observes that while diffusion models have achieved strong results in video generation, existing systems are typically constrained to short clips (roughly 2–10 seconds). This limitation prevents generation of videos with extended temporal coherence and consistent quality over minutes.

The authors introduce MALT Diffusion, which uses memory-augmented latent transformers to address long-term contextual understanding and stability. They nonetheless state that synthesizing sustained footage over minutes remains an open question, and this gap motivates their architectural and training contributions.

References

Synthesizing sustained footage (e.g., over minutes) still remains an open research question.

— MALT Diffusion: Memory-Augmented Latent Transformers for Any-Length Video Generation (2502.12632 - Yu et al., 18 Feb 2025) in Abstract, page 1
