Simultaneous real-time efficiency and high quality in purely autoregressive long video generation

Develop purely autoregressive (AR) long video generation models and training procedures that simultaneously achieve real-time inference efficiency and maintain high visual quality over long horizons in text-to-video generation.

Background

Autoregressive approaches to long video generation benefit from causal attention and key–value caching for fast inference, but they often degrade in quality over long horizons due to training on short clips and error accumulation during rollout.
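The key–value caching mentioned above can be illustrated with a minimal single-head sketch in plain Python. It is an illustration under stated assumptions, not the implementation of any particular video model: the names `attend`, `full_causal_attention`, and `KVCache` are hypothetical, and the query, key, and value vectors are toy numbers standing in for projected frame or token features. It shows that attending over cached keys and values one step at a time gives the same outputs as recomputing causal attention over the whole prefix.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    # Scaled dot-product attention of one query over the given keys/values.
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def full_causal_attention(qs, ks, vs):
    # Baseline: position t attends to positions 0..t, recomputed from scratch.
    return [attend(qs[t], ks[: t + 1], vs[: t + 1]) for t in range(len(qs))]


class KVCache:
    # Hypothetical helper: keeps keys and values of already-generated positions.
    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, q, k, v):
        # Append the new position, then attend over everything cached so far.
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)


# Toy inputs (assumed values, three positions, dimension 2).
qs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
ks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
vs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

cache = KVCache()
incremental = [cache.step(q, k, v) for q, k, v in zip(qs, ks, vs)]
full = full_causal_attention(qs, ks, vs)

# The cached, step-by-step outputs match the full causal computation.
matches = all(
    math.isclose(a, b, rel_tol=1e-12, abs_tol=1e-12)
    for row_a, row_b in zip(incremental, full)
    for a, b in zip(row_a, row_b)
)
```

In a real model the saving comes from not re-running the network on earlier positions to obtain their keys and values; the cache grows with the rollout length, which is one reason long-horizon generation needs care.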

Many diffusion and hybrid diffusion–autoregressive methods achieve strong visual quality, but they are generally computationally expensive and struggle to run in real time. The LongLive paper (Yang et al., 2025) explicitly notes that, despite progress, achieving real-time efficiency and high quality at the same time in purely AR models for long video generation remains unresolved.

References

Despite the promise of purely AR for long video generation, achieving real-time efficiency and maintaining high quality simultaneously remains an open challenge.

— LongLive: Real-time Interactive Long Video Generation (2509.22622 - Yang et al., 26 Sep 2025) in Appendix, Section "General Related Work", subsection "Autoregressive Long Video Generation"
