Autoregressive Distillation of Diffusion Transformers (2504.11295v1)

Published 15 Apr 2025 in cs.CV

Abstract: Diffusion models with transformer architectures have demonstrated promising capabilities in generating high-fidelity images and scalability for high resolution. However, iterative sampling process required for synthesis is very resource-intensive. A line of work has focused on distilling solutions to probability flow ODEs into few-step student models. Nevertheless, existing methods have been limited by their reliance on the most recent denoised samples as input, rendering them susceptible to exposure bias. To address this limitation, we propose AutoRegressive Distillation (ARD), a novel approach that leverages the historical trajectory of the ODE to predict future steps. ARD offers two key benefits: 1) it mitigates exposure bias by utilizing a predicted historical trajectory that is less susceptible to accumulated errors, and 2) it leverages the previous history of the ODE trajectory as a more effective source of coarse-grained information. ARD modifies the teacher transformer architecture by adding token-wise time embedding to mark each input from the trajectory history and employs a block-wise causal attention mask for training. Furthermore, incorporating historical inputs only in lower transformer layers enhances performance and efficiency. We validate the effectiveness of ARD in a class-conditioned generation on ImageNet and T2I synthesis. Our model achieves a $5\times$ reduction in FID degradation compared to the baseline methods while requiring only 1.1\% extra FLOPs on ImageNet-256. Moreover, ARD reaches FID of 1.84 on ImageNet-256 in merely 4 steps and outperforms the publicly available 1024p text-to-image distilled models in prompt adherence score with a minimal drop in FID compared to the teacher. Project page: https://github.com/alsdudrla10/ARD.

Summary

Autoregressive Distillation of Diffusion Transformers

The paper "Autoregressive Distillation of Diffusion Transformers" introduces a novel approach to accelerating the sampling process of diffusion models. Diffusion models are revered for their ability to generate high-resolution images with impressive fidelity, yet their iterative sampling process often poses significant computational challenges. This research addresses this bottleneck by presenting AutoRegressive Distillation (ARD), a method that leverages the historical trajectory provided by the ordinary differential equation (ODE) governing diffusion processes, thus mitigating the issues of exposure bias commonly seen in traditional distillation methods.

Key Innovations

The primary innovation within ARD is its autoregressive approach to distillation. Traditional step distillation methods condition only on the most recent denoised sample, which leads to exposure bias as estimation errors accumulate. ARD instead conditions on a broader range of information from the ODE trajectory, providing historical context alongside the current estimate. This allows coarse-grained information from earlier integration stages to inform later predictions, making them less susceptible to error propagation.
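The sampling loop can be illustrated with a minimal sketch. This is not the authors' code; the `student` interface, tensor shapes, and `timesteps` schedule are assumptions made for illustration. It shows the key idea: the student receives the full predicted trajectory so far, not just the latest sample.

```python
# Minimal sketch of trajectory-conditioned few-step sampling (assumed interface).
import torch

@torch.no_grad()
def autoregressive_sample(student, x_T, timesteps):
    """Run a distilled student for len(timesteps) - 1 steps, feeding it the
    full history of its own predicted ODE states at every step."""
    history = [x_T]                       # predicted ODE trajectory so far
    for i in range(len(timesteps) - 1):
        t_hist = timesteps[: i + 1]       # timesteps matching the stored history
        t_next = timesteps[i + 1]
        # The student attends to all past predictions, not only history[-1],
        # which is what mitigates accumulated (exposure-bias) errors.
        x_next = student(torch.stack(history, dim=1), t_hist, t_next)
        history.append(x_next)
    return history[-1]                    # final clean sample
```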

The architectural design is another cornerstone of the paper. ARD modifies the diffusion transformer teacher architecture by introducing token-wise time embeddings that mark each input's position along the trajectory. It also applies a block-wise causal attention mask so the transformer can attend to multiple trajectory inputs at once, with each step attending only to itself and earlier steps. Accommodating multiple inputs in this way is what enables autoregressive processing of the historical sample trajectory and improves predictions throughout the denoising process.
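These two ingredients can be sketched as follows. This is an illustrative approximation rather than the paper's implementation: the block granularity, the `time_embed` module, and the tensor layout are assumptions.

```python
# Illustrative sketch: block-wise causal mask and token-wise time embeddings.
import torch

def blockwise_causal_mask(num_steps: int, tokens_per_step: int) -> torch.Tensor:
    """Boolean mask of shape (L, L) with L = num_steps * tokens_per_step.
    True means attention is allowed: tokens of trajectory step i may attend
    to all tokens of steps <= i, but not to later steps."""
    step_causal = torch.tril(torch.ones(num_steps, num_steps)).bool()
    # Expand each (i, j) entry into a tokens_per_step x tokens_per_step block.
    return step_causal.repeat_interleave(tokens_per_step, dim=0) \
                      .repeat_interleave(tokens_per_step, dim=1)

def add_tokenwise_time_embedding(tokens, step_times, time_embed):
    """tokens: (B, num_steps, tokens_per_step, D); step_times: (num_steps,).
    Adds an embedding of each step's timestep to every token of that step,
    so the transformer can tell which trajectory point a token came from.
    `time_embed` is an assumed module mapping (num_steps,) -> (num_steps, D)."""
    t_emb = time_embed(step_times)
    return tokens + t_emb[None, :, None, :]   # broadcast over batch and tokens
```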

Results and Implications

The empirical results demonstrate the effectiveness of ARD in class-conditioned generation on ImageNet and in text-to-image synthesis. On ImageNet-256, ARD achieves a 5× reduction in FID degradation (FID being a standard measure of generative quality) compared to step-distillation baselines while requiring only 1.1% extra FLOPs, and it reaches an FID of 1.84 in just four sampling steps. In 1024p text-to-image generation, the distilled model outperforms publicly available distilled models in prompt adherence with only a minimal drop in FID relative to the teacher.

These results underscore ARD's potential to improve efficiency in high-resolution image synthesis. By enabling generation in fewer steps through better error handling and the use of trajectory history, ARD offers substantial gains in computational scalability, a property increasingly valued given the demand for fast, high-quality image generation. The framework's adaptability suggests possible extensions to larger contexts and integration with other large-scale diffusion models.

Future Directions

The implications of ARD extend into broader realms of generative AI. Future developments may explore integrating this autoregressive technique with alternative model architectures or applying it across different domains requiring efficient generative processes. Additionally, investigating enhanced time embedding strategies or nuanced attention mechanisms could further refine performance and adaptability across various architectures.

Moreover, ARD's application to high-resolution generative tasks presents opportunities for revisiting foundational aspects of diffusion models. The approach points toward generative mechanisms that balance fidelity with computational efficiency, and suggests further study of how historical and current trajectory information can best be combined.

In summary, "Autoregressive Distillation of Diffusion Transformers" presents a compelling methodology for reconciling the computational demands of diffusion models with practical efficiency gains. The work marks a significant step in generative modeling, offering a framework that improves key performance metrics while opening avenues for future work in AI-driven content creation.
