- The paper introduces a segmented cross-attention mechanism to maintain long-range coherence in diffusion-based video generation.
- The paper curates the LongTake-HD dataset with 261,000 video-text pairs to enhance narrative consistency in long video sequences.
- Experimental results on the VBench benchmark show a 78.5% Semantic Score and a 100% Dynamic Degree, outperforming state-of-the-art models in content richness and dynamic transitions.
Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation
The paper under review introduces Presto, a novel approach to long video generation within the framework of diffusion models. The core innovation lies in the Segmented Cross-Attention (SCA) mechanism, designed to maintain long-range coherence and content richness over extended durations of video. Additionally, the authors present the LongTake-HD dataset, meticulously curated to support the generation of prolonged, coherent video narratives enriched with textual annotations.
Key Technical Contributions
- Segmented Cross-Attention (SCA): Presto builds on a modified diffusion transformer that divides the video latent into temporal segments, each of which attends to its own progressive sub-caption. This segmentation strategy enables detailed and coherent video narratives, addressing the limitations of traditional single-caption conditioning (a minimal sketch follows this list).
- LongTake-HD dataset: Recognizing the scarcity of high-quality datasets for long-form video generation, the authors curate LongTake-HD, 261,000 video-text pairs that exhibit long-range scenario coherence. The dataset pairs diverse visual content with structured, progressive sub-captions, which are essential for training the model to generate coherent long video sequences.
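To make the SCA idea concrete, the following PyTorch snippet is a minimal, hypothetical sketch of segmented cross-attention as described above: the video tokens are split along the temporal axis into segments, and each segment cross-attends only to the embedding of its own sub-caption. The class name, fixed segment count, and tensor layout are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SegmentedCrossAttention(nn.Module):
    """Sketch: each temporal segment of video tokens cross-attends
    only to the text embedding of its own (progressive) sub-caption."""

    def __init__(self, dim: int, num_heads: int = 8, num_segments: int = 4):
        super().__init__()
        self.num_segments = num_segments
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video_tokens: torch.Tensor, sub_captions: torch.Tensor) -> torch.Tensor:
        # video_tokens: (batch, frames * tokens_per_frame, dim), ordered by time
        # sub_captions: (batch, num_segments, caption_len, dim), one sub-caption per segment
        b, n, d = video_tokens.shape
        seg_len = n // self.num_segments
        outputs = []
        for s in range(self.num_segments):
            q = video_tokens[:, s * seg_len:(s + 1) * seg_len]  # tokens of segment s
            kv = sub_captions[:, s]                              # text tokens of sub-caption s
            out, _ = self.attn(q, kv, kv)                        # cross-attention within the segment
            outputs.append(out)
        return torch.cat(outputs, dim=1)  # reassemble segments along the temporal axis


if __name__ == "__main__":
    sca = SegmentedCrossAttention(dim=64, num_heads=4, num_segments=4)
    video = torch.randn(2, 4 * 16, 64)    # 4 segments of 16 video tokens each
    captions = torch.randn(2, 4, 12, 64)  # 4 sub-captions of 12 text tokens each
    print(sca(video, captions).shape)     # torch.Size([2, 64, 64])
```

Because each query block only sees one caption's keys and values, the attention cost stays comparable to single-caption conditioning while the text signal changes progressively across the clip.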
Experimental Evaluation
Quantitative evaluation is conducted on the VBench benchmark, where Presto achieves a 78.5% Semantic Score and a 100% Dynamic Degree. These results indicate superior performance in both content richness and dynamic transitions compared to state-of-the-art models, including Allegro and the commercial Gen-3 system. Qualitatively, user studies highlight Presto's ability to maintain scenario diversity and coherence, outperforming competitive baselines.
Theoretical and Practical Implications
The introduction of SCA into the diffusion model architecture provides finer-grained, time-aligned exchange between text and video features, which supports the generation of extended yet coherent video sequences. The methodology is extensible to other multimodal generation tasks that require maintaining long-term contextual consistency; a sketch of how sub-captions could be routed to temporal segments follows below.
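The sketch below shows one plausible way a long-take clip and its structured sub-captions could be packaged and routed to temporal segments at conditioning time. The record layout, field names, and even frame split are illustrative assumptions, not the published LongTake-HD schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LongTakeSample:
    """Hypothetical record: one long-take clip paired with progressive sub-captions."""
    video_path: str
    num_frames: int
    sub_captions: List[str]  # ordered, one per temporal segment

def assign_frames_to_sub_captions(sample: LongTakeSample) -> List[range]:
    """Split the clip's frame indices evenly so segment i is conditioned on sub_captions[i]."""
    k = len(sample.sub_captions)
    seg = sample.num_frames // k
    return [range(i * seg, (i + 1) * seg if i < k - 1 else sample.num_frames)
            for i in range(k)]

sample = LongTakeSample(
    video_path="clip_0001.mp4",
    num_frames=120,
    sub_captions=[
        "A hiker starts up a forest trail at dawn.",
        "The trail opens onto a rocky ridge.",
        "The hiker reaches the summit as the sun rises.",
        "A wide shot shows the valley below in full light.",
    ],
)
for caption, frames in zip(sample.sub_captions, assign_frames_to_sub_captions(sample)):
    print(f"frames {frames.start:3d}-{frames.stop - 1:3d} <- {caption}")
```

Keeping the sub-captions ordered and evenly mapped onto the timeline is what lets the segmented attention above consume them without any extra alignment machinery.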
Practically, Presto's ability to generate long videos with rich narratives addresses the needs of content creators and industries engaged in automated media production, enhancing creative workflows with minimal human intervention.
Future Directions
While Presto is a significant step forward, exploring variable-length segmentation strategies and adaptive attention mechanisms could further improve the model's flexibility and performance. Integrating techniques for automatic sub-caption generation across diverse languages could also broaden the model's applicability in global contexts.
In conclusion, the paper presents a substantial advancement in long-video generation by innovatively combining segmentation strategies with curated datasets to achieve high-quality, coherent video narratives. These methodological contributions and empirical results position Presto as a powerful tool for multimedia content creation in evolving digital environments.