
MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence (2407.16655v2)

Published 23 Jul 2024 in cs.CV

Abstract: Recent advancements in video generation have primarily leveraged diffusion models for short-duration content. However, these approaches often fall short in modeling complex narratives and maintaining character consistency over extended periods, which is essential for long-form video production like movies. We propose MovieDreamer, a novel hierarchical framework that integrates the strengths of autoregressive models with diffusion-based rendering to pioneer long-duration video generation with intricate plot progressions and high visual fidelity. Our approach utilizes autoregressive models for global narrative coherence, predicting sequences of visual tokens that are subsequently transformed into high-quality video frames through diffusion rendering. This method is akin to traditional movie production processes, where complex stories are factorized down into manageable scene capturing. Further, we employ a multimodal script that enriches scene descriptions with detailed character information and visual style, enhancing continuity and character identity across scenes. We present extensive experiments across various movie genres, demonstrating that our approach not only achieves superior visual and narrative quality but also effectively extends the duration of generated content significantly beyond current capabilities. Homepage: https://aim-uofa.github.io/MovieDreamer/.


Summary

  • The paper introduces MovieDreamer, a hierarchical framework that combines autoregressive narrative modeling with diffusion rendering to generate coherent long visual sequences.
  • The paper employs a multimodal autoregressive model and an ID-preserving diffusion decoder that together maintain character consistency and deliver high-fidelity visual rendering.
  • The paper demonstrates significant improvements over existing methods through extensive experiments, with gains on CLIP, IS, AS, and FID metrics for both narrative coherence and image quality.

MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence

In the paper "MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence," the authors propose a novel hierarchical framework named MovieDreamer. This approach integrates autoregressive models with diffusion-based rendering to tackle challenges in generating coherent, high-fidelity long-duration videos. The paper addresses fundamental limitations in existing video generation techniques, particularly in modeling complex narratives and maintaining character consistency over extended periods.

Key Contributions

The authors present several noteworthy contributions:

  1. Hierarchical Framework: MovieDreamer combines the strengths of autoregressive models for global narrative coherence and diffusion models for high-quality visual rendering. This hybrid approach effectively extends the duration of generated video content to thousands of keyframes, balancing long-term narrative coherence with short-term visual fidelity.
  2. Multimodal Autoregressive Model: The autoregressive model generates visual token sequences conditioned on a novel multimodal script. This script enriches scene descriptions with detailed character information and visual style, thereby enhancing continuity and identity preservation.
  3. ID-preserving Diffusion Rendering: A diffusion-based decoder is fine-tuned to preserve character identities, which mitigates errors in vision token prediction and enhances the overall visual continuity.
  4. Keyframe-based Video Generation: The method employs keyframes to generate longer video clips using an image-to-video diffusion model, ensuring that the generated content retains high aesthetic quality and coherence across extended sequences.
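The three-stage hierarchy above can be sketched end-to-end. Every function below is a hypothetical stub standing in for the paper's actual models (an autoregressive transformer, an ID-preserving diffusion decoder, and an image-to-video model), so only the data flow between stages is meaningful:

```python
import numpy as np

rng = np.random.default_rng(0)

def autoregressive_keyframes(script, n_frames, token_dim=8):
    # Stage 1 (stub): predict one visual token per keyframe,
    # conditioned on the multimodal script.
    return rng.normal(size=(n_frames, token_dim))

def diffusion_render(token, resolution=(4, 4)):
    # Stage 2 (stub): decode a visual token into a keyframe image.
    h, w = resolution
    return np.tanh(rng.normal(size=(h, w, 3)) + token.mean())

def image_to_video(keyframe, clip_len=3):
    # Stage 3 (stub): expand each keyframe into a short video clip.
    return np.stack([keyframe] * clip_len)

# Multimodal script: plot text plus character and style metadata.
script = {"plot": "two scenes", "characters": ["A"], "style": "noir"}
tokens = autoregressive_keyframes(script, n_frames=2)
clips = [image_to_video(diffusion_render(t)) for t in tokens]
video = np.concatenate(clips)  # frames from all clips, in story order
print(video.shape)             # 2 keyframes x 3 frames each
```

The design point the sketch illustrates is the factorization itself: the autoregressive stage only has to stay coherent at the level of compact keyframe tokens, while per-pixel fidelity is delegated entirely to the diffusion stages.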

Technical Details

The core innovation in MovieDreamer is its hierarchical structure. The approach involves the following key components:

  1. Keyframe Tokenization: A diffusion autoencoder is used to represent keyframes as compact visual tokens. The encoder, leveraging a pretrained CLIP vision model, ensures that the latent representations capture essential semantics with minimal information loss.
  2. Autoregressive Token Generation: Initialized from a pretrained LLaMA2-7B model, the autoregressive model G predicts sequences of visual tokens. Because image tokens are continuous, the model is trained with a Gaussian Mixture Model (GMM) likelihood rather than the cross-entropy loss traditionally used in LLMs.
  3. Anti-overfitting Mechanisms: The limited availability of high-quality long video datasets necessitated robust anti-overfitting strategies, including aggressive data augmentation, face embedding randomization, high dropout rates, and token masking.
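Since the visual tokens are continuous, the autoregressive head is trained against a Gaussian-mixture likelihood instead of cross-entropy. A minimal NumPy sketch of a 1-D GMM negative log-likelihood follows; the per-dimension factorization, component count, and parameter values here are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

def gmm_nll(x, weights, means, stds):
    """Negative log-likelihood of scalar targets x under a 1-D Gaussian mixture.

    weights: (K,) mixture weights summing to 1
    means, stds: (K,) per-component parameters
    """
    x = np.asarray(x, dtype=float)[..., None]  # broadcast targets over K components
    log_comp = (
        np.log(weights)
        - np.log(stds)
        - 0.5 * np.log(2.0 * np.pi)
        - 0.5 * ((x - means) / stds) ** 2
    )
    # Numerically stable log-sum-exp over the K components.
    m = log_comp.max(axis=-1, keepdims=True)
    log_mix = m.squeeze(-1) + np.log(np.exp(log_comp - m).sum(axis=-1))
    return -log_mix.mean()

# A target near a component mode scores a lower (better) NLL than one far away.
weights = np.array([0.5, 0.5])
means = np.array([-1.0, 1.0])
stds = np.array([0.2, 0.2])
near = gmm_nll([1.0], weights, means, stds)
far = gmm_nll([5.0], weights, means, stds)
print(near < far)  # True
```

Minimizing this quantity over the predicted mixture parameters plays the role that cross-entropy plays for discrete vocabularies.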

Experimental Results

The authors evaluated MovieDreamer on a dataset comprising 50,000 long movies, augmented to create 5 million keyframes. The framework was rigorously tested to ensure it maintained both short-term (ST) and long-term (LT) character consistency, evaluated using metrics such as CLIP score, Inception Score (IS), Aesthetic Score (AS), and Fréchet Inception Distance (FID).

The results demonstrate that MovieDreamer significantly outperforms existing methods such as StoryDiffusion and StoryGen in terms of visual quality and narrative coherence. Specifically, higher CLIP scores indicate better alignment with the storyline, while improved IS, AS, and FID scores reflect superior image quality.
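As a rough illustration of what the CLIP score measures, the metric reduces to mean cosine similarity between paired image and text embeddings. The embeddings below are synthetic stand-ins, not outputs of a real CLIP model:

```python
import numpy as np

def clip_style_score(image_emb, text_emb):
    """Mean cosine similarity between row-paired image and text embeddings."""
    a = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    b = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    return float((a * b).sum(axis=-1).mean())

rng = np.random.default_rng(1)
text = rng.normal(size=(4, 16))                   # 4 scene descriptions
aligned = text + 0.1 * rng.normal(size=(4, 16))   # frames close to their captions
mismatched = rng.normal(size=(4, 16))             # unrelated frames

aligned_score = clip_style_score(aligned, text)
mismatched_score = clip_style_score(mismatched, text)
print(aligned_score, mismatched_score)
```

Frames that track the script embed close to their captions and push the score toward 1, which is why the metric serves as a proxy for storyline alignment.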

Implications and Future Work

Practical Implications: MovieDreamer's hierarchical approach enables the generation of high-quality, long-duration videos, opening new possibilities for automated content creation in the film and entertainment industry. This technology can potentially streamline the production process, reduce costs, and allow for greater creative flexibility.

Theoretical Implications: By effectively combining autoregressive and diffusion models, the paper pushes the boundaries of generative modeling techniques. It provides a blueprint for future research on integrating different generative paradigms to address complex synthesis tasks.

Future Developments: The authors suggest several avenues for future work, including exploring more efficient sentence representations for multimodal scripts, enhancing keyframe-based video generation models to condition on multiple keyframes, and improving the controllability of generated content beyond character identity. Additionally, advancements in long sequence generation techniques could further extend the capabilities of MovieDreamer.

Conclusion

The work presented in "MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence" is a significant contribution to the field of generative AI. By successfully integrating autoregressive models with diffusion-based rendering, the authors have developed a framework capable of generating rich, coherent, and high-fidelity long-duration videos. This research not only addresses current limitations but also opens up exciting possibilities for future advancements in automated video production.

References

The paper leverages an extensive body of existing research, including foundational works on diffusion models, autoregressive models, and multimodal LLMs. Some key references include:

  • Rombach et al., 2022 for text-to-image diffusion models.
  • Touvron et al., 2023 for LLaMA LLMs.
  • Zheng et al., 2022 for face embedding techniques.
  • Ho et al., 2022 for video generation using diffusion models.

For more detailed information and additional resources, readers are encouraged to visit the project homepage: https://aim-uofa.github.io/MovieDreamer/.
