Achieving long-term consistent video generation

Determine effective methods for long-term, temporally consistent video generation in text-to-video systems, particularly those based on Diffusion Transformers (DiTs), so that generated videos remain coherent over extended durations.

Background

The CogVideoX paper situates the problem within recent progress in Diffusion Transformers (DiTs) for video generation, noting that despite rapid advances, maintaining temporal consistency over long durations remains unresolved. This challenge is central to producing coherent, high-motion videos that align with the semantics of the text prompt.

CogVideoX proposes a 3D causal VAE, expert transformer layers, and 3D full attention to improve temporal consistency and text-video alignment. The authors nonetheless explicitly acknowledge that how to achieve long-term consistency in general is still technically unclear, and this open question motivates their architectural and training choices.

References

Despite these rapid advancements in DiTs, it remains technically unclear how to achieve long-term consistent video generation.

— CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer (2408.06072 - Yang et al., 12 Aug 2024) in Section 1 (Introduction)
