
Exploring Diffusion Transformer Designs via Grafting

Published 5 Jun 2025 in cs.LG and cs.AI (arXiv:2506.05340v2)

Abstract: Designing model architectures requires decisions such as selecting operators (e.g., attention, convolution) and configurations (e.g., depth, width). However, evaluating the impact of these decisions on model quality requires costly pretraining, limiting architectural investigation. Inspired by how new software is built on existing code, we ask: can new architecture designs be studied using pretrained models? To this end, we present grafting, a simple approach for editing pretrained diffusion transformers (DiTs) to materialize new architectures under small compute budgets. Informed by our analysis of activation behavior and attention locality, we construct a testbed based on the DiT-XL/2 design to study the impact of grafting on model quality. Using this testbed, we develop a family of hybrid designs via grafting: replacing softmax attention with gated convolution, local attention, and linear attention, and replacing MLPs with variable expansion ratio and convolutional variants. Notably, many hybrid designs achieve good quality (FID: 2.38-2.64 vs. 2.27 for DiT-XL/2) using <2% pretraining compute. We then graft a text-to-image model (PixArt-Sigma), achieving a 1.43x speedup with less than a 2% drop in GenEval score. Finally, we present a case study that restructures DiT-XL/2 by converting every pair of sequential transformer blocks into parallel blocks via grafting. This reduces model depth by 2x and yields better quality (FID: 2.77) than other models of comparable depth. Together, we show that new diffusion model designs can be explored by grafting pretrained DiTs, with edits ranging from operator replacement to architecture restructuring. Code and grafted models: https://grafting.stanford.edu

Summary

  • The paper introduces grafting for diffusion transformers to modify pretrained models without expensive pretraining by replacing key operators.
  • It employs a two-stage method: activation distillation to initialize new operators and lightweight fine-tuning to mitigate propagation errors.
  • Empirical results demonstrate competitive FID scores and generation speedups, highlighting efficient architectural redesign for generative tasks.

Exploring Diffusion Transformer Designs via Grafting: A Technical Overview

The paper "Exploring Diffusion Transformer Designs via Grafting" presents an innovative approach to architectural exploration in the context of pretrained diffusion transformers (DiTs). Traditional methods for evaluating architectural designs necessitate expensive pretraining, thereby constraining the scope of investigation. This research introduces grafting, which allows for the modification of pretrained models to explore new designs using minimal computational resources. Below, I provide a detailed analysis of the methodology and its implications, with particular emphasis on the impact of grafting approaches on diffusion transformers employed for generative tasks.

Methodology

The grafting technique focuses on modifying diffusion transformers through the replacement of existing operators with alternative efficient ones, while preserving model quality. The authors segment the grafting process into two key stages:

  1. Activation Distillation: To initialize a new operator within the pretrained architecture, this stage uses a regression-based procedure, termed activation distillation, that trains the new operator to reproduce the activations of the operator it replaces.
  2. Lightweight Fine-tuning: Once the new operators are integrated, a short end-to-end fine-tuning pass on limited data corrects the errors that accumulate as an operator swap propagates through subsequent layers.

Distinct replacement strategies are examined: Multi-Head Attention (MHA) is swapped for gated convolution, local attention, and linear attention, while Multi-Layer Perceptron (MLP) blocks are replaced with variable expansion-ratio and convolutional variants.
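The first stage above can be illustrated with a toy regression. This is a minimal sketch, not the paper's code: the operators are stand-in linear maps, and the function names (`old_operator`, `fit_new_operator`) are illustrative. The idea is that the new operator is fit to reproduce activations cached from the frozen pretrained operator.

```python
import numpy as np

rng = np.random.default_rng(0)

def old_operator(x, W_old):
    # Stand-in for a pretrained operator (e.g., an attention projection).
    return x @ W_old

def fit_new_operator(inputs, targets):
    # Activation distillation as least-squares regression: choose W_new
    # minimizing ||X @ W_new - Y||^2, where Y are the old operator's
    # cached activations on the same inputs X.
    W_new, *_ = np.linalg.lstsq(inputs, targets, rcond=None)
    return W_new

d = 8
W_old = rng.standard_normal((d, d))
X = rng.standard_normal((256, d))   # inputs cached from the frozen model
Y = old_operator(X, W_old)          # target activations to distill
W_new = fit_new_operator(X, Y)

# With a linear target and enough samples, regression recovers the mapping.
print(np.allclose(W_new, W_old, atol=1e-6))
```

In the paper's setting the new operator is a different operator class (e.g., linear attention replacing softmax attention), so the regression only approximates the old activations; the residual mismatch is what the second, fine-tuning stage is meant to absorb.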

Numerical Results and Analysis

Empirical studies conducted on the replacement of operators in DiT-XL/2 highlight that many hybrid architectures achieve competitive FID scores (2.38-2.64 compared to 2.27 for the baseline), utilizing less than 2% of the compute required for pretraining. Specific numerical insights are drawn from comparisons of generative quality across different configurations, emphasizing performance stability and computational efficiency achieved by interleaved grafting designs.

In high-resolution text-to-image generation contexts, such as PixArt-Σ, grafting yielded a 1.43x speedup in generation latency with less than a 2% decline in GenEval score, demonstrating its efficacy in real-world settings with long sequence lengths and multimodal conditioning. A further case study halves model depth from 28 to 14 layers by converting every pair of sequential transformer blocks into parallel blocks, achieving an FID of 2.77 that surpasses other models of comparable depth and underscoring the potential of grafting for architectural restructuring aligned with modern hardware.
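The depth-halving restructure can be sketched as follows. This is an illustrative toy, not the paper's implementation: the blocks are simple residual maps, and the merge rule (summing both branches) is an assumption about how a parallel block combines its branches.

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_block(x, W):
    # Toy stand-in for a transformer block: residual + nonlinearity.
    return x + np.tanh(x @ W)

def sequential_pair(x, W1, W2):
    # Original topology: block 2 consumes block 1's output (depth 2).
    return residual_block(residual_block(x, W1), W2)

def parallel_block(x, W1, W2):
    # Restructured topology: both branches read the same input and
    # their contributions are summed (depth 1, wider per layer).
    return x + np.tanh(x @ W1) + np.tanh(x @ W2)

d = 8
x = rng.standard_normal((4, d))
W1 = rng.standard_normal((d, d))
W2 = rng.standard_normal((d, d))

seq = sequential_pair(x, W1, W2)
par = parallel_block(x, W1, W2)
print(seq.shape == par.shape == x.shape)  # same interface, half the depth
```

The two topologies compute different functions, which is why the paper fine-tunes after the rewiring; the point of the sketch is that the parallel block preserves the input/output interface while removing one layer of sequential dependency, which is what yields the depth reduction.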

Implications and Future Directions

The research proposes significant implications for the study and deployment of complex generative models. By illustrating a pathway for efficient architectural experimentation without incurring prohibitive computational costs, grafting facilitates the exploration of varied design spaces, potentially driving advancements in AI practicability in constrained environments.

Future work may extend grafting beyond diffusion transformers to a broader range of model architectures. Exploring the replacement of normalization layers and activation functions presents further optimization opportunities, and refining how data is selected and used during distillation and fine-tuning remains a pertinent avenue for improving the reliability of grafted models.

In conclusion, grafting presents a compelling methodology for architectural exploration in generative modeling, offering a reduction in computational barriers while preserving model integrity. Its application in diverse contexts—from image generation to text-to-video modeling—may catalyze more rapid innovations and enhanced model performance across AI research and industrial applications.
