
Decentralized Diffusion Models (DDMs)

Updated 3 July 2026
  • Decentralized Diffusion Models are a scalable, modular framework that partitions diffusion-based generative training across independent expert clusters.
  • They employ expert-specific flow-matching objectives and sparse Top-k routing to enhance expert-data alignment and improve generation quality.
  • DDMs reduce cross-GPU communication overhead by training experts independently, enabling efficient decentralized model deployment.

Decentralized Diffusion Models (DDMs) are a scalable, modular framework for distributing diffusion-based generative modeling across independent clusters or decentralized computational resources. In contrast to conventional centralized diffusion models that require monolithic high-bandwidth infrastructure, DDMs partition the training process, assigning independent data shards to distinct "expert" diffusion models. At inference, a lightweight router dynamically ensembles expert outputs. DDMs were introduced to address system-level constraints of large-scale model training and further investigated for their unique statistical, algorithmic, and coordination properties (McAllister et al., 9 Jan 2025, Villagra et al., 2 Feb 2026).

1. Formal Structure and Training Methodology

A Decentralized Diffusion Model consists of $K$ independently trained expert diffusion models $\{f_i(x, t)\}_{i=1}^K$, each fit exclusively on a disjoint data cluster $\mathcal{C}_i$. The forward diffusion process is the standard parameterized Gaussian noising, $q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I)$ for $t = 1, \ldots, T$, with $\beta_t$ a variance schedule. The reverse process is modeled as $p_\theta(x_{t-1} \mid x_t) \approx \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)$, typically reparameterized via a noise predictor $\epsilon_\theta(x_t, t)$.
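As a minimal sketch of the forward noising process described above, the NumPy snippet below applies one Gaussian step and, equivalently, samples $x_t$ directly from $x_0$ using the cumulative product of $(1 - \beta_s)$. The linear schedule and its endpoints are illustrative assumptions, not values taken from the cited papers, and the function names are hypothetical.

```python
import numpy as np


def linear_beta_schedule(T, beta_start=1e-4, beta_end=2e-2):
    """Hypothetical linear variance schedule beta_1..beta_T (an assumption)."""
    return np.linspace(beta_start, beta_end, T)


def forward_step(x_prev, beta_t, rng):
    """One step of q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) x_{t-1}, beta_t I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise


def forward_closed_form(x0, t, betas, rng):
    """Sample x_t from x_0 in one shot; t is 1-indexed as in the text."""
    alpha_bar = np.cumprod(1.0 - betas)[t - 1]  # prod_{s <= t} (1 - beta_s)
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return x_t, eps
```

The closed form follows from composing the Gaussian steps, so each expert can draw training pairs $(x_t, \epsilon)$ on its own data without simulating the whole chain.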

Each expert is trained in isolation, without inter-expert gradient sharing, using the same $\ell^2$ flow-matching or score-matching objective as monolithic diffusion models:

$$L_{\rm flow}^{(k)}(\theta) = \mathbb{E}_{x_0 \in \mathcal{C}_k,\, t,\, \epsilon} \left\| f_k(x_t, t) - u_t(x_t \mid x_0) \right\|^2$$

where $x_t$ is a noised version of $x_0$ and $u_t(x_t \mid x_0)$ is the conditional target flow.
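The per-expert objective can be estimated by Monte Carlo on a single cluster's data. The sketch below assumes a linear noising path $x_t = (1 - t)\,x_0 + t\,\epsilon$ with $t \in [0, 1]$, whose conditional target is $\epsilon - x_0$. That path is one common flow-matching choice and is an assumption here, as are the function and argument names.

```python
import numpy as np


def flow_matching_loss(expert_fn, x0_batch, rng):
    """Estimate E || f_k(x_t, t) - u_t(x_t | x_0) ||^2 on one cluster's batch.

    expert_fn(x_t, t) is a hypothetical callable standing in for expert f_k.
    """
    n = x0_batch.shape[0]
    t = rng.uniform(0.0, 1.0, size=(n, 1))
    eps = rng.standard_normal(x0_batch.shape)
    x_t = (1.0 - t) * x0_batch + t * eps   # assumed linear noising path
    target = eps - x0_batch                # conditional velocity of that path
    pred = expert_fn(x_t, t)
    return float(np.mean(np.sum((pred - target) ** 2, axis=1)))
```

Because the batch is drawn only from one cluster, no gradients or activations need to leave the machine that hosts that expert.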

Data partitioning is performed by embedding all samples (e.g., with DINOv2) and clustering into $K$ semantically coherent partitions. Empirically, random sharding degrades generation quality significantly.
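To illustrate the partitioning step, the following sketch runs plain k-means (Lloyd's algorithm) over precomputed embeddings and returns one cluster label per sample. It is a single-stage simplification and assumes the embeddings are already available as a NumPy array; the feature extractor itself (DINOv2 in the text) is not called here, and `kmeans_partition` is a hypothetical helper name.

```python
import numpy as np


def _assign(embeddings, centroids):
    """Index of the nearest centroid (squared Euclidean) for every sample."""
    d2 = ((embeddings[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def kmeans_partition(embeddings, K, n_iters=50, seed=0):
    """Split samples into K disjoint shards by clustering their embeddings."""
    rng = np.random.default_rng(seed)
    init = rng.choice(len(embeddings), size=K, replace=False)
    centroids = embeddings[init].astype(float)
    for _ in range(n_iters):
        labels = _assign(embeddings, centroids)
        for k in range(K):
            members = embeddings[labels == k]
            if len(members) > 0:           # keep the old centroid if empty
                centroids[k] = members.mean(axis=0)
    return _assign(embeddings, centroids), centroids
```

The returned centroids can be kept alongside the shards, since distance to a cluster centroid is also what alignment-based routing relies on at inference.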

The structure and training process for DDMs are summarized in the following table:

| Component | Role in DDMs | Implementation Detail |
|---|---|---|
| Expert Diffusion Model | Learns on cluster $\mathcal{C}_i$ | DiT architectures, U-Nets; on the order of 3B params per expert (McAllister et al., 9 Jan 2025) |
| Data Partitioning | Ensures disjoint, coherent training distributions | Two-stage clustering (fine to coarse), based on learned embeddings |
| Router | Predicts expert weights per inference step | Small DiT or CNN; trained to classify cluster label from noised input |
| Training Objective | Expert-specific flow/score-matching | Matches conditional marginal flow; ensembles match global model in expectation |

2. Inference-Time Routing and Ensembling

During sampling, the DDM system observes the current denoising state $x_t$ and computes a routing vector $w(x_t, t) \in \mathbb{R}^K$ with $w_i(x_t, t) \ge 0$ and $\sum_{i=1}^K w_i(x_t, t) = 1$. The ensemble noise prediction is:

$$\hat{\epsilon}(x_t, t) = \sum_{i=1}^K w_i(x_t, t)\, \epsilon_{\theta_i}(x_t, t)$$

where $\epsilon_{\theta_i}$ is the noise prediction of expert $i$.

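The deterministic sampling trajectory follows the probability-flow ODE, written here for the continuous-time limit of the variance-preserving forward process:

$$\frac{dx_t}{dt} = -\tfrac{1}{2}\,\beta(t)\left[x_t - \frac{\hat{\epsilon}(x_t, t)}{\sqrt{1-\bar{\alpha}_t}}\right]$$

where $\hat{\epsilon}(x_t, t)$ is the ensemble noise prediction and $\bar{\alpha}_t$ is the cumulative signal level, $\prod_{s \le t}(1-\beta_s)$ in the discrete schedule.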
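The ensembling step can be sketched as a routing-weighted sum of expert predictions. In the snippet below, `expert_fns` is a hypothetical list of callables standing in for trained experts, and `weights` is a routing vector that sums to one; both names are assumptions made for illustration.

```python
import numpy as np


def ensemble_prediction(expert_fns, weights, x_t, t):
    """Return sum_i w_i * expert_i(x_t, t) for a routing vector w."""
    weights = np.asarray(weights, dtype=float)
    if len(weights) != len(expert_fns):
        raise ValueError("need exactly one weight per expert")
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    out = np.zeros_like(x_t, dtype=float)
    for w, expert in zip(weights, expert_fns):
        if w > 0.0:                        # zero-weight experts are never run
            out += w * expert(x_t, t)
    return out
```

Skipping experts with zero weight is what makes sparse routing cheaper than a full ensemble: only the selected experts are evaluated at each denoising step.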

Several routing strategies have been evaluated:

  • Full-ensemble: All experts weighted equally ($w_i = 1/K$).
  • Sparse Top-$k$ routing: Selects the subset of experts most aligned with the denoising state by cluster proximity.
  • Top-1 expert: Chooses the single most probable expert per state.

Sparse Top-$k$ routing uses an alignment score $s_i$ based on the distance between the denoising state (embedded to $z_t$) and each expert's data centroid $c_i$, so that experts whose centroids lie closer to $z_t$ score higher.

The $k$ experts with largest $s_i$ receive nonzero weights, normalized over the selected subset (Villagra et al., 2 Feb 2026).
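One way to realize centroid-based sparse routing is sketched below. It assumes the alignment score is the negative squared Euclidean distance between the embedded state and each centroid, and that the selected scores are normalized with a softmax. Both choices, along with the `temperature` parameter and the function name, are illustrative assumptions rather than the cited paper's exact formulation.

```python
import numpy as np


def topk_routing_weights(z_t, centroids, k, temperature=1.0):
    """Sparse routing vector: nonzero only for the k best-aligned experts."""
    num_experts = len(centroids)
    if not 1 <= k <= num_experts:
        raise ValueError("k must satisfy 1 <= k <= number of experts")
    d2 = np.sum((centroids - z_t) ** 2, axis=1)
    scores = -d2 / temperature             # closer centroid -> higher score
    top = np.argsort(scores)[-k:]          # indices of the k largest scores
    shifted = scores[top] - scores[top].max()   # numerically stable softmax
    w_top = np.exp(shifted)
    w_top /= w_top.sum()
    weights = np.zeros(num_experts)
    weights[top] = w_top                   # normalized over the subset only
    return weights
```

With `k=1` this reduces to picking the single nearest expert, and the resulting vector can be passed directly to a weighted ensemble of expert predictions.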

3. Theoretical Foundations and Equivalence

Decentralized training is theoretically justified by showing that the ensemble flow field, with each expert weighted by the probability of its cluster given the current noisy sample, reconstructs the global data distribution's denoising flow. Given disjoint training sets $\mathcal{C}_1, \ldots, \mathcal{C}_K$,

$$u_t(x_t) = \sum_{k=1}^K p(k \mid x_t)\, u_t^{(k)}(x_t)$$

Each $u_t^{(k)}(x_t)$ is the expert's estimate conditioned on $\mathcal{C}_k$, and $p(k \mid x_t)$ the likelihood that $x_t$ originated from $\mathcal{C}_k$. Linearity of expectation guarantees that the ensemble of experts collectively optimizes the same flow-matching loss as the monolithic model (McAllister et al., 9 Jan 2025).
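The decomposition can be checked with the law of total expectation. The derivation below is a sketch under assumed notation: $\pi_k$ is the fraction of training data in cluster $k$, $p_t^{(k)}$ is the density of noised samples from that cluster, and the marginal flow is the posterior mean of the conditional flow.

```latex
\begin{align*}
u_t(x_t)
  &= \mathbb{E}\!\left[\, u_t(x_t \mid x_0) \,\middle|\, x_t \right] \\
  &= \sum_{k=1}^{K} p(k \mid x_t)\,
     \mathbb{E}\!\left[\, u_t(x_t \mid x_0) \,\middle|\, x_t,\; x_0 \in \mathcal{C}_k \right] \\
  &= \sum_{k=1}^{K} p(k \mid x_t)\, u_t^{(k)}(x_t),
\qquad
p(k \mid x_t) = \frac{\pi_k\, p_t^{(k)}(x_t)}{\sum_{j=1}^{K} \pi_j\, p_t^{(j)}(x_t)} .
\end{align*}
```

The second line splits the posterior over $x_0$ by cluster membership, which is valid because the clusters are disjoint and cover the dataset. The posterior weight $p(k \mid x_t)$ follows from Bayes' rule and is the quantity a router has to estimate.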

The router is trained to predict $p(k \mid x_t)$ from $(x_t, t)$, using cross-entropy loss against the ground-truth cluster ID.
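Router training amounts to a classification problem over noised inputs. The sketch below swaps the small DiT or CNN router for a linear softmax classifier trained by full-batch gradient descent, purely to keep the example short. The feature layout, learning rate, and function names are assumptions.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def router_features(x_t, t):
    """Concatenate the noised input, the timestep, and a bias column."""
    return np.concatenate([x_t, t, np.ones_like(t)], axis=1)


def router_cross_entropy(logits, cluster_ids):
    """Mean cross-entropy against ground-truth cluster IDs."""
    probs = softmax(logits)
    picked = probs[np.arange(len(cluster_ids)), cluster_ids]
    return float(-np.mean(np.log(picked + 1e-12)))


def train_linear_router(x_t, t, cluster_ids, K, lr=0.1, steps=200):
    """Fit a linear stand-in router that predicts p(k | x_t, t)."""
    feats = router_features(x_t, t)
    W = np.zeros((feats.shape[1], K))
    onehot = np.eye(K)[cluster_ids]
    for _ in range(steps):
        probs = softmax(feats @ W)
        grad = feats.T @ (probs - onehot) / len(feats)  # d(CE)/dW
        W -= lr * grad
    return W
```

At inference, `softmax(router_features(x_t, t) @ W)` gives the estimated cluster posterior, which can serve as the routing weights or be truncated to the top entries.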

4. Generation Quality, Stability, and Expert-Data Alignment

Contrary to prior assumptions, minimizing the numerical sensitivity of the denoising trajectory (i.e., the propagation of initial noise perturbations) does not correlate with perceptual sample quality in DDMs. Full-ensemble routing achieves the lowest denoising trajectory sensitivity yet yields the poorest sample quality: FID 47.9 versus FID 22.6 for Top-2 sparse routing (on Paris DDM, LAION-Aesthetics) (Villagra et al., 2 Feb 2026).

Alignment between the current state and the selected experts' data manifold is the dominant factor in generative quality. Empirical findings include:

  • Cluster-alignment scores: Sparse routing selects experts whose data clusters are closest in embedding space to the current denoising state.
  • Per-expert accuracy: Selected experts achieve a lower mean angular deviation than non-selected experts (29% better).
  • Expert disagreement: Higher pairwise disagreement correlates with higher LPIPS and degraded sample quality.
  • Validation on alternate datasets (e.g., MNIST): Selected experts exhibit 43% smaller angular errors.

The following table summarizes routing strategy outcomes (Villagra et al., 2 Feb 2026):

| Routing Strategy | FID (Paris DDM, LAION) | Stability (Sensitivity) | Expert Disagreement | Sample Quality |
|---|---|---|---|---|
| Top-1 | 30.6 | Moderate | Low | Moderate |
| Top-2 | 22.6 | Moderate | Very low | Best |
| Full | 47.9 | Lowest | Highest | Poorest |

5. System-Level Scale, Efficiency, and Practical Considerations

DDMs eliminate the need for cross-GPU gradient exchange, reducing inter-cluster bandwidth by over 90%. Every expert is trained independently on its “island.” The only post-training communication is the sharing of expert and router checkpoints (McAllister et al., 9 Jan 2025). Analysis shows:

  • Compute scaling: Linear with number of experts. Training time is parallelized across distributed hardware.
  • Inference FLOPs: Top-1 expert routing matches monolithic FLOPs per sample; ensemble inference can be amortized.
  • Storage: All experts must be retained for ensemble or sparse routing at deployment, unless compressed through student distillation.
  • Hyperparameters: Choice of $K$ is crucial. Too few experts result in under-specialization, while too many cause underfitting due to insufficient data per expert.
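The storage and inference trade-offs in the list above reduce to simple arithmetic. The helper below is a rough sketch that assumes all experts have the same parameter count and that per-step cost scales with the number of experts evaluated. The function name, the optional router size, and the figures in the usage example are hypothetical.

```python
def deployment_costs(num_experts, top_k, params_per_expert, router_params=0):
    """Rough parameter accounting for a DDM deployed with Top-k routing."""
    if not 1 <= top_k <= num_experts:
        raise ValueError("top_k must satisfy 1 <= top_k <= num_experts")
    return {
        # every expert must be kept on disk unless the ensemble is distilled
        "stored_params": num_experts * params_per_expert + router_params,
        # only the routed experts run at each denoising step
        "active_params_per_step": top_k * params_per_expert + router_params,
        "expert_evals_per_step": top_k,
    }


# Hypothetical configuration: 8 experts of 3e9 parameters each, Top-1 routing.
costs = deployment_costs(num_experts=8, top_k=1, params_per_expert=3_000_000_000)
```

Under these assumptions, Top-1 routing runs one expert per step, matching a monolithic model of the same size, while storage grows linearly with the number of experts.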

Router errors can route samples suboptimally, harming output diversity, but this is mitigated by robust router training and Top-$k$ selection.

6. Extensions to Decentralized Multi-Agent and Policy Diffusion

The MADiff framework (Zhu et al., 2023) applies related decentralized diffusion ideas to multi-agent reinforcement learning, modeling decentralized agent policies with a conditional diffusion generator over joint or per-agent trajectories. While not strictly the same as expert-partitioned image DDMs, MADiff illustrates decentralized generative modeling in cooperative settings:

  • Architecture: Cross-agent U-Net with attention fusion at every decoder layer enables decentralized agents to model peer behaviors for effective policy coordination.
  • Training: Offline, central training on joint trajectories; decentralized execution uses only per-agent observations.
  • Results: Decentralized MADiff outperforms independent diffusion policies, confirming that independent agent DMs without shared context degrade joint performance.

A key empirical finding across DDM design, even in multi-agent scenarios, is that decentralized diffusion performs best when model specialization (by data or agent role) is complemented by mechanisms for coordination, alignment, or selective ensembling (Zhu et al., 2023, Villagra et al., 2 Feb 2026).

For effective DDM deployment:

  • Routing algorithms should prioritize expert-data alignment by measuring the proximity of the current denoising state to expert centroids using learned embeddings. Sparse Top-$k$ routing is preferred over uniform averaging.
  • Avoid full-ensemble averaging unless all experts have been jointly trained or disagreement is minimal.
  • Distillation offers a route to collapse ensemble experts into a single model for resource efficiency.
  • Application domains: Beyond vision, DDMs present potential for privacy-preserving federated training, multimodal data, and decentralized policy learning in reinforcement learning.

Open research directions include methods for federated privacy (local training, router-only sharing), hybridization with communication-efficient distributed learning (e.g., FedAvg, Gossip), and extending DDM techniques to video, audio, or sequential data policies (McAllister et al., 9 Jan 2025).

In summary, Decentralized Diffusion Models establish a paradigm where statistical and computational efficiency can be decoupled from centralized infrastructure, with rigorous alignment-based routing as the central principle for high-quality generative modeling (McAllister et al., 9 Jan 2025, Villagra et al., 2 Feb 2026, Zhu et al., 2023).
