Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion
The paper "Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion" introduces a novel sequence generative model termed Diffusion Forcing (DF). This new paradigm synergizes the core benefits of next-token prediction models with those of full-sequence diffusion models to deliver enhanced capabilities in sequence modeling. The authors deploy DF across various domains, highlighting improvements in video prediction, decision-making, imitation learning, and time series forecasting.
Overview
Diffusion Forcing trains a diffusion model to denoise a set of tokens, where each token carries its own independent noise level. This bridges the gap between next-token prediction and full-sequence diffusion models: DF generates tokens sequentially, offers flexible sequence lengths, and remains robust over long-horizon generation. It also admits new sampling and guidance schemes that exploit its variable-horizon, causal architecture.
Technical Contributions
Next-Token Prediction and Full-Sequence Diffusion
Next-token prediction models, commonly trained via teacher forcing, predict the next token from the ground-truth history of previous tokens. They enable variable-length sequence generation, support efficient tree search, and suit online feedback control. However, they offer no mechanism for guiding sampling toward sequences that optimize a downstream objective, and they often become unstable over long horizons on continuous data.
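For concreteness, here is a minimal sketch of teacher-forced next-token training; the model, data, and hyperparameters are placeholders rather than anything used in the paper.

```python
import torch
import torch.nn as nn

# Minimal teacher-forcing sketch: a causal model is trained to predict token
# t+1 from the ground-truth tokens 1..t. All names and sizes are placeholders.
class ToyCausalModel(nn.Module):
    def __init__(self, dim=16, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, dim)

    def forward(self, x):
        h, _ = self.rnn(x)      # causal: h[:, t] depends only on x[:, :t+1]
        return self.head(h)     # next-token prediction at every position

model = ToyCausalModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

seq = torch.randn(8, 32, 16)                      # (batch, time, dim) stand-in data
pred = model(seq[:, :-1])                         # condition on ground-truth history
loss = nn.functional.mse_loss(pred, seq[:, 1:])   # regress the next token
loss.backward()
opt.step()
```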
Conversely, full-sequence diffusion models excel at generating continuous signals, guiding sampling toward desirable sequences, and planning in decision-making applications. However, they typically rely on non-causal, unmasked architectures and denoise an entire fixed-length sequence at a single shared noise level, which limits variable-length generation and subsequence generation.
Diffusion Forcing (DF)
DF addresses the limitations of both approaches. It is trained with a scheme in which each token receives a randomly sampled, independent noise level, and a shared model learns to denoise tokens under arbitrary per-token schedules. The instantiation in the paper uses a causal architecture (Causal Diffusion Forcing, CDF), so each future token depends only on past, possibly noisy, tokens.
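The per-token noising idea can be sketched as follows, assuming a standard DDPM-style forward process; the architecture, noise schedule, and loss weighting here are illustrative placeholders, not the paper's exact choices.

```python
import torch
import torch.nn as nn

T_STEPS = 100                                   # number of diffusion levels (assumed)
betas = torch.linspace(1e-4, 0.02, T_STEPS)     # standard DDPM-style schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

class CausalDenoiser(nn.Module):
    """Placeholder causal denoiser: consumes noisy tokens plus their noise levels."""
    def __init__(self, dim=16, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(dim + 1, hidden, batch_first=True)
        self.head = nn.Linear(hidden, dim)

    def forward(self, noisy, k):
        # k is the per-token noise level, fed here as an extra scalar feature
        inp = torch.cat([noisy, k.unsqueeze(-1).float() / T_STEPS], dim=-1)
        h, _ = self.rnn(inp)
        return self.head(h)                      # predicted noise for every token

model = CausalDenoiser()
x = torch.randn(8, 32, 16)                       # clean sequence (batch, time, dim)

# Key idea: every token gets its own randomly sampled, independent noise level.
k = torch.randint(0, T_STEPS, (8, 32))           # independent level per token
ab = alpha_bars[k].unsqueeze(-1)                 # (batch, time, 1)
eps = torch.randn_like(x)
noisy = ab.sqrt() * x + (1 - ab).sqrt() * eps    # noise each token independently

loss = nn.functional.mse_loss(model(noisy, k), eps)
loss.backward()
```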
DF's training objective optimizes a variational lower bound on the expected log-likelihood of sequences drawn from the true joint distribution, giving the method a rigorous theoretical foundation. The resulting model supports flexible horizon control, stochastic rollouts, robustness to noisy observations, and stable long-horizon generation. In particular, DF stays stable over long sequences by keeping a small amount of noise on previously generated tokens during causal rollouts, which prevents errors from compounding over time.
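One way to picture this stabilization trick is the rollout sketch below, which reuses the CausalDenoiser, T_STEPS, and alpha_bars from the previous snippet: rather than feeding back fully denoised predictions, the rollout keeps each generated token at a small nonzero noise level, so the model only ever conditions on inputs resembling its (noisy) training distribution. The sampler itself is deliberately simplified and is not the paper's exact procedure.

```python
import torch

@torch.no_grad()
def rollout(model, context, horizon, stabilize_k=5):
    """Sketch of a stabilized autoregressive rollout (illustrative only).
    `model`, `T_STEPS`, and `alpha_bars` come from the previous sketch;
    `stabilize_k` is a small noise level kept on already-generated tokens."""
    tokens = list(context.unbind(dim=1))                  # list of (batch, dim)
    for _ in range(horizon):
        hist = torch.stack(tokens, dim=1)
        new = torch.randn(hist.size(0), 1, hist.size(-1))  # newest token: pure noise
        seq = torch.cat([hist, new], dim=1)
        k = torch.full(seq.shape[:2], stabilize_k, dtype=torch.long)
        k[:, -1] = T_STEPS - 1                             # newest token at max noise
        # Denoise only the newest token (one-shot x0 estimate here for brevity;
        # a real sampler would iterate over decreasing noise levels).
        eps_hat = model(seq, k)[:, -1]
        ab = alpha_bars[T_STEPS - 1]
        x0_hat = (seq[:, -1] - (1 - ab).sqrt() * eps_hat) / ab.sqrt()
        # Re-noise the finished token slightly before appending it to history.
        ab_s = alpha_bars[stabilize_k]
        tokens.append(ab_s.sqrt() * x0_hat + (1 - ab_s).sqrt() * torch.randn_like(x0_hat))
    return torch.stack(tokens, dim=1)

# Example usage: generated = rollout(model, x[:, :8], horizon=24)
```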
Empirical Results
The empirical validation spans several high-impact domains:
- Video Prediction: DF stabilizes long-horizon generation markedly better than next-token diffusion and full-sequence diffusion, producing temporally consistent video without diverging beyond the training horizon. The authors attribute this to the stabilizing effect of keeping past generated frames slightly noisy during rollout.
- Sequential Decision Making and Planning: On D4RL maze environments, DF achieves higher rewards than state-of-the-art offline RL methods and Diffuser. Monte Carlo Tree Guidance (MCTG) and a "zig-zag" sampling schedule, which keeps far-future tokens noisier than near-future ones, let DF outperform non-causal full-sequence diffusion at generating high-reward trajectories (an illustrative schedule is sketched after this list).
- Robust Imitation Learning: DF's ability to maintain a memory of past observations improves long-horizon tasks such as robotic manipulation, outperforming memoryless diffusion policies. It also remains robust to visually noisy or occluded observations.
- Time Series Forecasting: DF performs competitively on standard benchmarks, confirming that the new paradigm does not trade off general performance for its specialized capabilities.
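As an illustration of the flexible sampling schedules mentioned above, the sketch below constructs a pyramid-style noise-level grid in which far-future tokens stay noisier during early sampling steps; it is a plausible construction of the idea, not necessarily the exact matrix used in the paper.

```python
import torch

def pyramid_schedule(n_tokens: int, max_level: int) -> torch.Tensor:
    """Illustrative scheduling grid: entry [s, t] is the noise level of token t
    at sampling step s. Later tokens remain noisier for longer, so near-future
    tokens are resolved first. Not the paper's exact schedule."""
    n_steps = max_level + n_tokens            # enough steps to fully denoise all tokens
    grid = torch.empty(n_steps, n_tokens, dtype=torch.long)
    for s in range(n_steps):
        for t in range(n_tokens):
            # Noise decreases as sampling proceeds (s) and grows with how far
            # in the future the token sits (t).
            grid[s, t] = max(0, min(max_level, max_level - s + t))
    return grid

print(pyramid_schedule(n_tokens=5, max_level=4))
```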
Practical and Theoretical Implications
DF offers substantial practical benefits for sequential modeling: it enables robust, flexible-horizon sequence modeling for video generation, decision-making, robotic control, and other domains that require stable long-term prediction. Theoretically, it extends the formulation of sequence models by making per-token noise an explicit modeling choice, encouraging further work on noise-robust sequence models.
Future Directions
The current work focuses on RNN-based causal architectures, so scaling DF to transformers or other modern architectures is a promising avenue. Exploring alternative noise schedules and guidance schemes could further improve denoising efficiency and generation quality. Finally, applying DF to larger and more diverse datasets could yield more general models suited to real-world applications.
In conclusion, Diffusion Forcing innovatively bridges next-token prediction with full-sequence diffusion, offering a robust, flexible, and theoretically grounded approach to sequence generative modeling. Its capabilities hold potential for significant advancements in AI-driven sequence manipulation and generation across various complex domains.