
History-Guided Video Diffusion (2502.06764v1)

Published 10 Feb 2025 in cs.LG and cs.CV

Abstract: Classifier-free guidance (CFG) is a key technique for improving conditional generation in diffusion models, enabling more accurate control while enhancing sample quality. It is natural to extend this technique to video diffusion, which generates video conditioned on a variable number of context frames, collectively referred to as history. However, we find two key challenges to guiding with variable-length history: architectures that only support fixed-size conditioning, and the empirical observation that CFG-style history dropout performs poorly. To address this, we propose the Diffusion Forcing Transformer (DFoT), a video diffusion architecture and theoretically grounded training objective that jointly enable conditioning on a flexible number of history frames. We then introduce History Guidance, a family of guidance methods uniquely enabled by DFoT. We show that its simplest form, vanilla history guidance, already significantly improves video generation quality and temporal consistency. A more advanced method, history guidance across time and frequency further enhances motion dynamics, enables compositional generalization to out-of-distribution history, and can stably roll out extremely long videos. Website: https://boyuan.space/history-guidance

Authors (6)
  1. Kiwhan Song (2 papers)
  2. Boyuan Chen (75 papers)
  3. Max Simchowitz (59 papers)
  4. Yilun Du (113 papers)
  5. Russ Tedrake (91 papers)
  6. Vincent Sitzmann (38 papers)

Summary

  • The paper introduces the Diffusion Forcing Transformer (DFoT) and History Guidance (HG) for flexible, variable-length conditioning in video diffusion models.
  • DFoT employs a novel "noise-as-masking" paradigm with frame-specific noise sampling to improve performance and handle diverse history lengths during inference.
  • Experimental results show DFoT generates high-quality, temporally consistent videos and can be efficiently fine-tuned, suggesting utility for scalable video generation applications.

History-Guided Video Diffusion: An Analysis

This paper, authored by researchers from MIT, introduces the Diffusion Forcing Transformer (DFoT), a novel approach to video generation with diffusion models. The primary aim of this work is to condition video diffusion models on flexible, variable-length histories, improving both control and sample quality.

Background and Challenges

Diffusion models have proven effective across generative tasks in domains such as images and video, largely because they can trade off sample quality against diversity. A common technique, classifier-free guidance (CFG), facilitates this balance by jointly training conditional and unconditional versions of the model (typically via conditioning dropout). However, applying CFG to video diffusion raises new challenges, particularly when conditioning on variable-length sequences of past video frames, termed 'history.'

Two primary challenges are identified: first, existing architectures predominantly accommodate fixed-size conditioning, limiting flexibility; second, naively extending CFG-style history dropout to variable-length history results in significant performance degradation. This paper seeks to overcome these limitations with the proposed DFoT.

Diffusion Forcing Transformer and History Guidance

DFoT introduces a framework that allows flexible conditioning on different sequences of history frames. It leverages the "noise-as-masking" paradigm, in which each frame in a sequence carries its own noise level: history frames are kept noise-free (fully observed), while frames to be generated carry noise. Because any subset of frames can be "unmasked" this way, the model supports dynamic conditioning lengths at inference time.
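The noise-as-masking idea can be sketched as follows. This is a minimal illustration: `make_noise_levels` is a hypothetical helper, and the discrete level scale (0 to 1000) is an assumption for exposition, not the paper's exact parameterization.

```python
def make_noise_levels(num_frames, num_history, max_level=1000):
    """Per-frame noise levels under the noise-as-masking view:
    history frames are 'unmasked' (noise level 0, i.e. fully observed),
    while frames to be generated start fully 'masked' (maximum noise)."""
    levels = [max_level] * num_frames
    for i in range(num_history):
        levels[i] = 0  # treat history frames as noise-free
    return levels

# Conditioning an 8-frame clip on 3 history frames yields
# [0, 0, 0, 1000, 1000, 1000, 1000, 1000]
```

Because the history length is simply the number of zero-noise frames, any history length can be chosen at inference without architectural changes.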

The training objective of DFoT independently samples a noise level for each frame and optimizes a noise prediction objective. This contrasts with standard diffusion training, where a single noise level is shared across all frames. By optimizing a variational lower bound on the expected log-likelihood, DFoT achieves both improved performance and flexibility.
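The contrast between shared and per-frame noise levels can be sketched as below; the uniform sampling over discrete levels is an illustrative assumption, not the paper's exact noise schedule.

```python
import random

def shared_noise_levels(num_frames, num_levels=1000, rng=random):
    """Standard diffusion training: one noise level for the whole clip."""
    t = rng.randrange(num_levels)
    return [t] * num_frames

def independent_noise_levels(num_frames, num_levels=1000, rng=random):
    """DFoT-style training: each frame draws its own noise level, so the
    model learns to denoise under every mixture of clean and noisy frames."""
    return [rng.randrange(num_levels) for _ in range(num_frames)]
```

Training across all such mixtures is what lets a single model serve as both the conditional and "unconditional" network needed for guidance at sampling time.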

Complementing DFoT, the paper introduces History Guidance (HG), a suite of methods for history-conditioned video generation. The simplest form, Vanilla History Guidance (HG-v), uses CFG directly with a chosen history length. More advanced methods, such as Temporal History Guidance (HG-t) and Fractional History Guidance (HG-f), further refine the guidance process by considering different historical subsequences and fractional masking levels, respectively.
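The CFG-style combination underlying Vanilla History Guidance can be sketched as follows. The lists of floats standing in for noise predictions and the `composite_history_guidance` interface are hypothetical simplifications of the method described above.

```python
def vanilla_history_guidance(eps_cond, eps_uncond, w):
    """CFG-style extrapolation: move from the history-free prediction
    toward the history-conditioned one with guidance weight w."""
    return [u + w * (c - u) for c, u in zip(eps_cond, eps_uncond)]

def composite_history_guidance(eps_uncond, terms):
    """HG-t-style composition: accumulate guidance from predictions
    conditioned on different history subsequences, each with its own weight."""
    out = list(eps_uncond)
    for eps_cond, w in terms:
        out = [o + w * (c - u) for o, c, u in zip(out, eps_cond, eps_uncond)]
    return out
```

With w = 1 the combination reduces to the plain conditional prediction; w > 1 strengthens adherence to the conditioning history.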

Experimental Validation

The paper validates DFoT and its associated history guidance methods across several datasets, including Kinetics-600 and RealEstate10K. The results indicate that DFoT outperforms existing models by generating high-quality, temporally consistent videos. Importantly, DFoT's flexible architecture allows it to be fine-tuned efficiently from pre-trained models, demonstrating practical applicability.

Key metrics such as Fréchet Video Distance (FVD) and individual component scores from VBench were used to evaluate performance in terms of video quality, consistency, and dynamics. DFoT consistently achieved superior results, particularly in scenarios demanding long temporal consistency and robust handling of out-of-distribution histories.

Implications and Future Directions

The introduction of DFoT and History Guidance has significant implications for practical applications of video generation, such as autonomous navigation and interactive media. By enabling flexible, history-conditioned sampling, these methods provide a scalable way to generate longer, more detailed, and more coherent video sequences.

The paper points to future directions, such as integrating this framework with large foundation models and exploring more nuanced guidance strategies, including attention-based mechanisms. Moreover, DFoT's ability to be fine-tuned efficiently from existing models suggests a role in settings with limited computational resources or where rapid model development is necessary.

In summary, History-Guided Video Diffusion, through the Diffusion Forcing Transformer, offers a robust pathway to enhance conditional video generation. By overcoming architectural and methodological constraints of existing frameworks, this work sets a foundation for more adaptable and dynamic generative systems in the evolving landscape of artificial intelligence.
