Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching (2412.17153v2)

Published 22 Dec 2024 in cs.CV and cs.LG

Abstract: Autoregressive (AR) models have achieved state-of-the-art performance in text and image generation but suffer from slow generation due to the token-by-token process. We ask an ambitious question: can a pre-trained AR model be adapted to generate outputs in just one or two steps? If successful, this would significantly advance the development and deployment of AR models. We notice that existing works that try to speed up AR generation by generating multiple tokens at once fundamentally cannot capture the output distribution due to the conditional dependencies between tokens, limiting their effectiveness for few-step generation. To address this, we propose Distilled Decoding (DD), which uses flow matching to create a deterministic mapping from Gaussian distribution to the output distribution of the pre-trained AR model. We then train a network to distill this mapping, enabling few-step generation. DD doesn't need the training data of the original AR model, making it more practical. We evaluate DD on state-of-the-art image AR models and present promising results on ImageNet-256. For VAR, which requires 10-step generation, DD enables one-step generation (6.3$\times$ speed-up), with an acceptable increase in FID from 4.19 to 9.96. For LlamaGen, DD reduces generation from 256 steps to 1, achieving an 217.8$\times$ speed-up with a comparable FID increase from 4.11 to 11.35. In both cases, baseline methods completely fail with FID>100. DD also excels on text-to-image generation, reducing the generation from 256 steps to 2 for LlamaGen with minimal FID increase from 25.70 to 28.95. As the first work to demonstrate the possibility of one-step generation for image AR models, DD challenges the prevailing notion that AR models are inherently slow, and opens up new opportunities for efficient AR generation. The project website is at https://imagination-research.github.io/distilled-decoding.

Summary

  • The paper presents Distilled Decoding (DD), a method that uses flow matching to build a deterministic noise-to-data mapping and distills it to accelerate image autoregressive sampling.
  • Experimental results show a 6.3× speed-up on VAR and 217.8× acceleration on LlamaGen with only modest increases in FID scores.
  • The approach eliminates the need for original training data and lays the groundwork for efficient teacher-free distillation in generative modeling.

An Examination of Distilled Decoding for Autoregressive Model Acceleration

The paper "Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching" introduces an approach for enhancing the generative speed of autoregressive (AR) models utilized in text and image synthesis tasks. The primary goal of this research is to address the inherent inefficiency of AR models due to their token-by-token sampling process, a limitation which significantly hampers the deployment and real-time application of such models. The research offers Distilled Decoding (DD) as a novel methodological advance that combines flow matching with deterministic mapping to facilitate faster generation without relying on the training data initially used for the AR model.

Methodological Contributions

The researchers propose leveraging flow matching to create a deterministic mapping from a Gaussian distribution to the AR model's output distribution. Rather than sampling token-by-token from conditional distributions as standard AR decoding does, DD trains a network to approximate this mapping directly, enabling few-step generation. The key advantages of this approach are that it does not require access to the original training data of the AR model and that it yields a substantial reduction in generation time, demonstrated on state-of-the-art image AR models such as VAR and LlamaGen.
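
To make the distillation idea concrete, the following is a minimal sketch under simplifying assumptions; it is not the authors' actual training recipe. The helper teacher_flow_map stands in for the deterministic noise-to-data mapping induced by the pre-trained AR model's flow-matching trajectory, and the student is a toy MLP trained to reproduce that mapping in a single forward pass.

```python
# Hypothetical sketch: distilling a deterministic noise-to-data mapping into a
# one-step student. `teacher_flow_map` is a placeholder, not the paper's model.
import torch
import torch.nn as nn

embed_dim, seq_len, batch = 64, 16, 8

student = nn.Sequential(                  # one-step generator to be trained
    nn.Linear(embed_dim, 4 * embed_dim),
    nn.GELU(),
    nn.Linear(4 * embed_dim, embed_dim),
)
opt = torch.optim.AdamW(student.parameters(), lr=1e-4)

def teacher_flow_map(noise: torch.Tensor) -> torch.Tensor:
    """Stand-in for the expensive, token-by-token deterministic mapping that the
    pre-trained AR model defines via flow matching."""
    return noise  # placeholder; in DD this would be the teacher's trajectory endpoint

for step in range(100):
    noise = torch.randn(batch, seq_len, embed_dim)        # sample Gaussian noise
    with torch.no_grad():
        target = teacher_flow_map(noise)                  # teacher's deterministic endpoint
    pred = student(noise)                                 # student's one-step prediction
    loss = nn.functional.mse_loss(pred, target)           # regress onto the teacher mapping
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In the actual method the targets live in the AR model's token space; the sketch only illustrates the overall structure of regressing a student onto a deterministic teacher-defined mapping.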

DD operates by constructing deterministic trajectories from random Gaussian noise using autoregressive flow matching. This amounts to an iterative, position-by-position conversion of noise into tokens, constructed so that the distribution of generated tokens matches the one defined by the AR model. In practice, this alignment means that reducing the number of generation steps does not precipitously degrade sample quality, a noted limitation of many existing speed-up techniques.
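
The following toy example illustrates the general shape of such a noise-to-token trajectory; the velocity field, codebook, and Euler integration here are illustrative stand-ins rather than the paper's actual networks or solver.

```python
# Illustrative sketch: converting per-position Gaussian noise into tokens along a
# deterministic flow. `velocity` and `codebook` are stand-ins for illustration.
import torch

torch.manual_seed(0)
vocab, embed_dim, seq_len, n_steps = 256, 32, 16, 10
codebook = torch.randn(vocab, embed_dim)                  # stand-in token embeddings

def velocity(x, t, prefix):
    """Stand-in for a prefix-conditioned velocity field; a real model would
    condition on the already-generated tokens and on the time t."""
    target = codebook.mean(dim=0) + 0.01 * len(prefix)    # dummy dependence on the prefix
    return target - x                                     # straight-line flow toward target

tokens = []
for pos in range(seq_len):
    x = torch.randn(embed_dim)                            # Gaussian noise for this position
    for s in range(n_steps):                              # deterministic Euler integration
        x = x + velocity(x, s / n_steps, tokens) / n_steps
    tokens.append(torch.cdist(x[None], codebook).argmin().item())  # snap endpoint to a token
print(tokens)
```

Distillation then aims to replace this iterative procedure with one or a few network evaluations.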

Empirical Results

Experimental validation indicates that DD achieves notable reductions in generation time with only modest increases in Fréchet Inception Distance (FID), a measure of sample quality. Specifically, DD achieves a 6.3× speed-up on the VAR model with an FID increase from 4.19 to 9.96, and a dramatic 217.8× acceleration on LlamaGen, raising the FID from 4.11 to 11.35. Notably, these FID scores are substantially lower than those of the baseline methods (FID > 100), which fail entirely at few-step generation on the same tasks.

Implications and Future Directions

The bridging of flow-based methods and autoregressive models represents a significant conceptual shift, opening pathways for efficient sampling while preserving high-quality outputs. The potential of DD extends beyond image models, inviting exploration within natural language processing and multi-modal generation frameworks, where the sequence lengths and computational demands are significantly higher.

For future exploration, the theoretical underpinnings of DD could be examined in more depth for diverse and larger-scale AR models. The AR flow-matching methodology sets a precedent for further research into deterministic trajectory constructions that bypass the sequential bottleneck of AR decoding. This could evolve into teacher-free distillation pipelines or hybrid designs that build the efficiency benefits of several generative paradigms directly into model architectures.

In conclusion, this paper presents a promising approach to tackling one of the fundamental limitations of autoregressive models—slow token-by-token generation—offering a clear path forward in accelerating AR models without significant detriment to quality. The implications for practical deployment across different domains suggest a transformative phase for generative modeling capabilities.
