
Remix-DiT: Expert Diffusion Model Framework

Updated 13 January 2026
  • Remix-DiT is a diffusion model framework that uses parameter mixing of transformer bases to construct specialized expert denoisers, reducing training costs.
  • It synthesizes experts for diverse noise regimes through learned mixing coefficients, enabling efficient capacity allocation without manual interval design.
  • Empirical results on ImageNet show improved FID and sample quality compared to baseline models, with minimal runtime and memory overhead.

Remix-DiT is a diffusion model framework that leverages parameter mixing of transformer-based diffusion model bases to construct a set of specialized expert denoisers, targeting efficiency and quality improvements in generative modeling tasks. Remix-DiT addresses the challenge of allocating model capacity across the diverse noise regimes encountered throughout the diffusion process, enabling fine-grained expert specialization without the high training cost traditionally associated with multi-expert approaches (Fang et al., 2024).

1. Conceptual Motivation and Background

Diffusion models, particularly transformer-based variants (DiT), have demonstrated state-of-the-art results on generative tasks but typically require large architectures to maintain output fidelity across all denoising steps. Each timestep in the diffusion process corresponds to a distinct denoising subtask; a single-network approach must distribute its capacity over the entire sequence, which becomes suboptimal for smaller models and high-complexity datasets.

Previous multi-expert methods allocate separate models to disjoint timestep intervals, improving denoising accuracy but incurring a linear increase in training cost and requiring manual interval design. Remix-DiT circumvents this by training only K basis diffusion transformers and synthesizing N experts for specific timestep intervals via learned mixing coefficients, substantially reducing both computational and memory requirements while enhancing expressivity.
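The interval-based specialization above relies on mapping each timestep to one of the N expert intervals. A minimal sketch of this mapping, assuming the standard T = 1000 DDPM schedule and the uniform partitioning that the paper evaluates (both values are assumptions for illustration):

```python
# Uniform partition of T diffusion timesteps into N expert intervals.
# T = 1000 is the typical DDPM setting; N = 20 matches the paper's config.
T = 1000
N = 20

def expert_index(t: int, T: int = T, N: int = N) -> int:
    """Map a timestep t in [0, T) to the index of its expert interval."""
    return min(t * N // T, N - 1)

# Each interval I_i covers T // N = 50 consecutive timesteps.
assert expert_index(0) == 0
assert expert_index(49) == 0
assert expert_index(50) == 1
assert expert_index(999) == N - 1
```

Each synthesized expert then serves only the timesteps inside its interval, which is what allows capacity to specialize per noise regime.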

2. Architecture and Mixing Procedure

The Remix-DiT framework defines $K$ basis DiT models, each parameterized by a vector $\theta^{\text{basis}}_k \in \mathbb{R}^P$, where $P$ is the number of parameters in a standard DiT. These bases are fused into a single wide architecture, compatible with the canonical DiT layers and block structures.

For each of the $N$ expert intervals, Remix-DiT learns an unnormalized logit vector $\ell_i \in \mathbb{R}^K$, corresponding to mixing coefficients:

$$\alpha_{i,k} = \frac{\exp(\ell_{i,k})}{\sum_{j=1}^K \exp(\ell_{i,j})}, \quad k = 1, \ldots, K$$

The parameters of expert $i$ are formed by

$$\theta^{\text{expert}}_i = \sum_{k=1}^K \alpha_{i,k}\, \theta^{\text{basis}}_k$$

The stacking and mixing operations are compatible with standard DiT architectures, and each instantiated expert shares the same computational graph as a plain diffusion transformer, allowing efficient inference and training.
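The mixing procedure above reduces to a softmax followed by a weighted sum over basis parameter vectors. A toy NumPy sketch (shapes are illustrative, not the real DiT parameter counts):

```python
import numpy as np

# Expert synthesis via parameter mixing: softmax over learned logits gives
# mixing coefficients alpha; each expert is a convex combination of the K
# basis parameter vectors. P here is a toy size, not a real DiT.
K, N, P = 4, 20, 1000

rng = np.random.default_rng(0)
theta_basis = rng.normal(size=(K, P))   # K basis parameter vectors
logits = rng.normal(size=(N, K))        # unnormalized mixing logits l_i

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

alpha = softmax(logits)                 # (N, K): each row sums to 1
theta_expert = alpha @ theta_basis      # (N, P): theta_i = sum_k alpha_ik * theta_k

assert np.allclose(alpha.sum(axis=1), 1.0)
assert theta_expert.shape == (N, P)
```

Because each synthesized expert has exactly the shape of a single DiT's parameter vector, it can be loaded into the unchanged DiT computational graph.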

3. Training Objective and Optimization Strategy

Remix-DiT applies the conventional DDPM denoising score matching loss:

$$\mathcal{L}(\theta) = \mathbb{E}_{t, x_0, \epsilon} \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2$$

where $x_t$ is constructed from the data sample $x_0$ and Gaussian noise $\epsilon$ using the standard variance schedule. Training proceeds by uniformly sampling an expert index $i$ and a timestep $t$ within the associated interval $I_i$. The model instantiates the expert parameters via the mixing coefficients and applies the denoising objective. To foster basis diversity during early stages, a regularization term $R(\ell)$ is imposed based on oracle prior coefficients $\alpha^*_{i,k}$ (one-hot assignments tied to intervals), with annealing weight $\gamma$:

$$R(\ell) = -\gamma \sum_{i=1}^N \sum_{k=1}^K \alpha^*_{i,k} \log \alpha_{i,k}$$

The full loss per iteration is thus $\mathcal{L}_{t,i} + R(\ell)$, and gradients flow into both the basis parameters and the selected mixing logits. The AdamW optimizer is employed.

4. Algorithmic Workflow

The training process, suitable for implementation, follows these steps:

repeat until convergence:
    i = UniformRandom({1, ..., N})           # sample an expert interval
    t = UniformRandom(I_i)                   # sample a timestep in that interval
    alpha_i = softmax(l_i)                   # mixing coefficients for expert i
    theta_expert = sum_{k=1}^K alpha_i[k] * theta_basis[k]
    x_t, epsilon = sample_batch()
    L_d = mean(||epsilon - eps_theta_expert(x_t, t)||^2)
    R = -gamma * sum_k alpha_star_i[k] * log(alpha_i[k])
    update theta_basis, l_i using AdamW with grad(L_d + R)

During inference, the mixing operation can be executed once per expert and cached ("precompute") or performed dynamically at runtime for each needed timestep, incurring only minor per-step computational overhead proportional to $K \cdot P$ for parameter blending.
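The two inference modes are numerically equivalent, since both compute the same convex combination of bases. A toy sketch contrasting the cached and runtime variants (shapes are illustrative):

```python
import numpy as np

# Precompute-vs-runtime mixing: caching all N experts costs N * K * P blend
# operations up front; runtime mixing costs K * P per instantiation. Both
# yield identical expert parameters. Toy shapes, not real DiT sizes.
K, N, P = 4, 20, 1000
rng = np.random.default_rng(2)
theta_basis = rng.normal(size=(K, P))
logits = rng.normal(size=(N, K))
alpha = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Mode 1: precompute and cache all N experts once.
cache = alpha @ theta_basis                 # (N, P)

# Mode 2: blend at runtime for whichever interval the current timestep hits.
i = 7                                       # example interval index
theta_runtime = alpha[i] @ theta_basis      # single K * P blend

assert np.allclose(cache[i], theta_runtime)
```

Caching trades $N \cdot K \cdot P$ extra memory for zero per-step blending; runtime mixing keeps memory at the $K$ bases and pays a small blend per expert switch.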

5. Computational Requirements and Efficiency

Remix-DiT maintains $K \cdot P$ basis parameters and $N$ sets of mixing logits. Although only one expert interval is active per step, all bases receive gradient updates according to the mixing profile. Relative to a single DiT, observed training slowdowns are approximately 20% (for $K = 4$), and GPU memory grows linearly with the number of bases.

For inference, precomputing all $N$ experts incurs $N \cdot (K \cdot P)$ element-wise blend operations (once per expert), after which inference cost matches standard multi-expert or baseline DiT workflows. When mixing at runtime, the overhead is a few milliseconds per step for the blending operation; transformer-layer FLOPs remain unchanged.

Efficiency benchmarks on a V100 32GB GPU reveal only modest differences in inference latency and throughput between Remix-DiT and baseline DiT models. For example, Remix-B (4 bases, 20 experts) yields 2.23 steps/sec at 18.65 ms latency with runtime mixing, compared to 2.93 steps/sec and 15.77 ms for DiT-B.

6. Empirical Performance

Remix-DiT was evaluated on ImageNet 256×256 with classifier-free guidance (cfg=1.5) and 100 sampling steps. Base models DiT-S, DiT-B, and DiT-L were fine-tuned with Remix-DiT for 100K steps using $K = 4$ bases and $N = 20$ experts.

Representative Results

| Model | #Eff. Params | IS | FID | Prec | Rec |
|---|---|---|---|---|---|
| DiT-L (1M steps) | 458M | 196.3 | 3.73 | 0.819 | 0.540 |
| +100K cont. train | 458M | 200.2 | 3.57 | 0.819 | 0.536 |
| +8× L experts (E-Diff) | 8×458M | 205.4 | 3.41 | 0.815 | 0.545 |
| Remix-L (4→20) | 4×458M | 207.5 | 3.22 | 0.818 | 0.545 |

Remix-DiT outperforms baseline DiT and multi-expert ensembles such as E-Diff and MEME, particularly with respect to FID and sample quality, at minimal extra runtime cost. Ablations indicate that softmax mixing yields the best FID, global mixing is slightly superior to layer-wise mixing, and $(K=4, N=20)$ is optimal for model capacity utilization; excessive $N$ is detrimental due to gradient sparsity. Early timesteps benefit from flat mixing, which supports ensemble capacity, while late timesteps prefer sharp single-basis selection, improving high-frequency detail preservation.

7. Insights, Limitations, and Future Directions

Qualitative assessment observes improved shape coherence and texture richness in outputs generated by Remix-DiT, with sharper edges relative to vanilla DiT models. Remix-DiT demonstrates the effectiveness of adaptive expert specialization through parameter mixing, especially at high noise regimes.

Limitations include the sparsity of gradient signals for very large $N$, which requires careful regularization of the mixing coefficients. Rapid regularization decay or poor priors can cause the bases to collapse toward similar functions. Memory overhead increases linearly with $K$, and only uniform timestep partitioning has been evaluated; more sophisticated assignments remain unexplored.

Potential research avenues include distributed training for scalability to $N \gg 20$, learned interval boundary mappings, extension to conditional (text-to-image), multi-modal, or video diffusion transformers, and exploration of richer mixing networks (such as an MLP $\alpha(t)$ in place of interval-based embeddings).

Remix-DiT demonstrates that a small set of basis transformers, via learnable mixing coefficients, can substitute for a large ensemble of fully trained experts, achieving enhanced sample quality with modest training and inference cost (Fang et al., 2024).
