Decentralized Diffusion Models (DDMs)

Updated 3 July 2026

Decentralized Diffusion Models are a scalable, modular framework that partitions diffusion-based generative training across independent expert clusters.
They employ expert-specific flow-matching objectives and sparse Top-k routing to enhance expert-data alignment and improve generation quality.
DDMs reduce cross-GPU communication overhead by training experts independently, enabling efficient decentralized model deployment.

Decentralized Diffusion Models (DDMs) are a scalable, modular framework for distributing diffusion-based generative modeling across independent clusters or decentralized computational resources. In contrast to conventional centralized diffusion models that require monolithic high-bandwidth infrastructure, DDMs partition the training process, assigning independent data shards to distinct "expert" diffusion models. At inference, a lightweight router dynamically ensembles expert outputs. DDMs were introduced to address system-level constraints of large-scale model training and further investigated for their unique statistical, algorithmic, and coordination properties (McAllister et al., 9 Jan 2025, Villagra et al., 2 Feb 2026).

1. Formal Structure and Training Methodology

A Decentralized Diffusion Model consists of $K$ independently trained expert diffusion models $\{f_i(x, t)\}_{i=1}^K$, each fit exclusively on a disjoint data cluster $\mathcal{C}_i$. The forward diffusion process is the standard parameterized Gaussian noising, $q(x_t|x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t} x_{t-1}, \beta_t I)$ for $t = 1, \ldots, T$, with $\beta_t$ a variance schedule. The reverse process is modeled as $p_\theta(x_{t-1}|x_t) \approx \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)$, typically reparameterized via a noise predictor $\epsilon_\theta(x_t, t)$.
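The forward process above admits the well-known closed form $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$ with $\bar{\alpha}_t = \prod_{s \le t} (1-\beta_s)$. The following is a minimal NumPy sketch of that sampling step. The linear schedule, its endpoints, and the helper names `make_alpha_bar` and `q_sample` are assumptions made for illustration, not details taken from the cited papers.

```python
import numpy as np

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=2e-2):
    """Return alpha_bar[t-1] = prod_{s<=t} (1 - beta_s) for an assumed linear schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; t is a 1-indexed timestep."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t - 1]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.standard_normal((4, 8))      # toy batch standing in for data
x_t, eps = q_sample(x0, t=500, alpha_bar=alpha_bar, rng=rng)
print(x_t.shape)
```

Every expert can use the same noising routine. Only the data shard that supplies `x0` differs between experts.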

Each expert is trained in isolation, without inter-expert gradient sharing, using the same $\ell^2$ flow-matching or score-matching objective as monolithic diffusion models:

$L_{\rm flow}^{(k)}(\theta) = \mathbb{E}_{(x_0, t, \epsilon)\in\mathcal{C}_k} \left\| f_k(x_t, t) - u_t(x_t|x_0) \right\|^2$

where $x_t$ is a noised version of $x_0$.
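As a rough illustration of the per-expert objective, the sketch below estimates the loss by Monte Carlo on a batch drawn from a single cluster. It assumes a linear interpolation path $x_t = (1-t)\,x_0 + t\,\epsilon$, for which the conditional target is $u_t(x_t|x_0) = \epsilon - x_0$. The source does not specify the path, so this choice, the function name `flow_matching_loss`, and the zero-output stand-in expert are all hypothetical.

```python
import numpy as np

def flow_matching_loss(expert, x0_batch, rng):
    """Monte Carlo estimate of the l2 flow-matching loss on one cluster's batch."""
    n = x0_batch.shape[0]
    t = rng.uniform(0.0, 1.0, size=(n, 1))        # one time value per sample
    eps = rng.standard_normal(x0_batch.shape)
    x_t = (1.0 - t) * x0_batch + t * eps           # assumed linear noising path
    target = eps - x0_batch                        # u_t(x_t | x_0) for that path
    pred = expert(x_t, t)
    return float(np.mean(np.sum((pred - target) ** 2, axis=1)))

def zero_expert(x_t, t):
    """Hypothetical stand-in for f_k that always predicts zero velocity."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
cluster_batch = rng.standard_normal((256, 16))    # samples from one cluster C_k
loss = flow_matching_loss(zero_expert, cluster_batch, rng)
print(loss)
```

Because the batch comes from one cluster only, no gradient or activation ever needs to cross expert boundaries.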

Data partitioning is performed by embedding all samples (e.g., with DINOv2) and clustering into $K$ semantically coherent partitions. Empirically, random sharding degrades generation quality significantly.
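The partitioning step can be sketched with plain Lloyd's k-means over precomputed embeddings. This is only an illustration: the random blobs stand in for real DINOv2 features, the functions `assign` and `kmeans` are hypothetical helpers, and the single-stage clustering shown here is simpler than a two-stage scheme.

```python
import numpy as np

def assign(embeddings, centroids):
    """Index of the nearest centroid (squared Euclidean distance) per sample."""
    d2 = ((embeddings[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def kmeans(embeddings, K, n_iters=20, seed=0):
    """Plain Lloyd's k-means; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    init = rng.choice(len(embeddings), size=K, replace=False)
    centroids = embeddings[init].copy()
    for _ in range(n_iters):
        labels = assign(embeddings, centroids)
        for i in range(K):
            members = embeddings[labels == i]
            if len(members) > 0:          # keep the old centroid if a cluster empties
                centroids[i] = members.mean(axis=0)
    return assign(embeddings, centroids), centroids

# Toy stand-in for image embeddings: two separated blobs in 4 dimensions.
rng = np.random.default_rng(1)
emb = np.concatenate([rng.normal(-5, 1, (100, 4)), rng.normal(5, 1, (100, 4))])
labels, centroids = kmeans(emb, K=2)
shards = [np.flatnonzero(labels == i) for i in range(2)]   # disjoint index sets
```

Each index set in `shards` would then define the training data for one expert, and the centroids can be kept for later use in routing.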

The structure and training process for DDMs are summarized in the following table:

Component	Role in DDMs	Implementation Detail
Expert Diffusion Model	Learns on cluster $\mathcal{C}_i$	DiT architectures, U-Nets; on the order of 3B params per expert (McAllister et al., 9 Jan 2025)
Data Partitioning	Ensures disjoint, coherent training distributions	Two-stage clustering (fine $\to$ coarse), based on learned embeddings
Router	Predicts expert weights per inference step	Small DiT or CNN; trained to classify cluster label from noised input
Training Objective	Expert-specific flow/score-matching	Matches conditional marginal flow; ensembles match global model in expectation

2. Inference-Time Routing and Ensembling

During sampling, the DDM system observes the current denoising state $x_t$ and computes a routing vector $w(x_t, t) = (w_1, \ldots, w_K)$ with $w_i \geq 0$ and $\sum_{i=1}^K w_i = 1$. The ensemble noise prediction is:

$\hat{\epsilon}(x_t, t) = \sum_{i=1}^K w_i(x_t, t)\, f_i(x_t, t)$

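The deterministic sampling trajectory follows the probability-flow ODE, which for the variance-preserving process above takes the standard continuous-time form:

$\frac{dx_t}{dt} = -\tfrac{1}{2}\beta(t)\left[x_t - \frac{\hat{\epsilon}(x_t, t)}{\sqrt{1-\bar{\alpha}_t}}\right]$

where $\hat{\epsilon}(x_t, t)$ is the ensemble noise prediction, $\beta(t)$ is the continuous-time variance schedule, and $\bar{\alpha}_t = \exp\left(-\int_0^t \beta(s)\,ds\right)$.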

Several routing strategies have been evaluated:

Full-ensemble: All experts weighted equally ($w_i = 1/K$).
Sparse Top-$k$ routing: Selects the subset of experts most aligned with the denoising state by cluster proximity.
Top-1 expert: Chooses the single most probable expert per state.
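The three strategies differ only in how a per-expert score vector is turned into ensemble weights. The sketch below makes that concrete, assuming the scores are router probabilities. The function names `routing_weights` and `ensemble_prediction` are hypothetical, and renormalizing the kept probabilities is one plausible choice rather than a confirmed detail of the cited work.

```python
import numpy as np

def routing_weights(router_probs, strategy="topk", k=2):
    """Map per-expert router probabilities to ensemble weights that sum to 1."""
    p = np.asarray(router_probs, dtype=float)
    K = p.shape[0]
    if strategy == "full":
        return np.full(K, 1.0 / K)        # uniform average over all experts
    if strategy == "top1":
        k = 1
    elif strategy != "topk":
        raise ValueError(f"unknown strategy: {strategy}")
    keep = np.argsort(p)[-k:]             # indices of the k largest probabilities
    w = np.zeros(K)
    w[keep] = p[keep] / p[keep].sum()     # renormalize over the kept experts
    return w

def ensemble_prediction(expert_preds, weights):
    """Weighted sum over the leading expert axis of stacked predictions (K, ...)."""
    return np.tensordot(weights, expert_preds, axes=1)

probs = [0.1, 0.6, 0.3]
preds = np.array([[1.0, 1.0], [2.0, 0.0], [0.0, 3.0]])   # toy outputs of 3 experts
w_top2 = routing_weights(probs, "topk", k=2)
mixed = ensemble_prediction(preds, w_top2)
```

With sparse weights, experts whose weight is zero need not be evaluated at all, which is where the inference savings of Top-1 and Top-$k$ routing come from.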

Sparse Top-$k$ routing uses an alignment score $a_i(x_t)$ based on the distance between the denoising state (embedded to $z_t$) and each expert's data centroid $c_i$, so that experts whose centroids lie closer to $z_t$ score higher. The $k$ experts with largest $a_i(x_t)$ receive nonzero weights, normalized over the selected subset (Villagra et al., 2 Feb 2026).
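One way to instantiate alignment-based routing is sketched below. The exact score used in the cited work is not reproduced here. This version assumes the score is the negative Euclidean distance to each centroid and that the selected experts are weighted by a softmax over those scores. The function name `alignment_topk_weights` and the `temperature` parameter are hypothetical.

```python
import numpy as np

def alignment_topk_weights(z_t, centroids, k=2, temperature=1.0):
    """Weights over experts from distances between an embedded state and centroids."""
    d = np.linalg.norm(centroids - z_t, axis=1)      # distance to each centroid
    scores = -d / temperature                        # assumed: closer means larger score
    keep = np.argsort(scores)[-k:]                   # the k best-aligned experts
    e = np.exp(scores[keep] - scores[keep].max())    # stable softmax over the subset
    w = np.zeros(len(centroids))
    w[keep] = e / e.sum()
    return w

centroids = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]])
z_t = np.array([0.1, 0.0])                           # toy embedded denoising state
w = alignment_topk_weights(z_t, centroids, k=2)
```

In this toy case the distant third expert receives zero weight, and the nearest expert receives the largest share.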

3. Theoretical Foundations and Equivalence

Decentralized training is theoretically justified by showing that the ensemble flow field, weighted by each cluster's probability given the noised sample, reconstructs the global data distribution's denoising flow. Given disjoint training sets $\mathcal{C}_1, \ldots, \mathcal{C}_K$,

$u_t(x_t) = \sum_{i=1}^K p(\mathcal{C}_i \mid x_t)\, u_t^{(i)}(x_t)$

Each $u_t^{(i)}(x_t)$ is the flow conditioned on $\mathcal{C}_i$, which expert $f_i$ estimates, and $p(\mathcal{C}_i \mid x_t)$ is the likelihood that $x_t$ originated from $\mathcal{C}_i$. Linearity and expectation guarantee that the ensemble of experts collectively optimizes the same flow-matching loss as the monolithic model (McAllister et al., 9 Jan 2025).
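The equivalence can be checked with a short derivation. It is a sketch under the assumption that the clusters partition the training set. The cluster prior $\pi_i$, the cluster-conditional noised density $p_t(x_t \mid \mathcal{C}_i)$, and the cluster-conditional flow $u_t^{(i)}$ are notation introduced here for illustration.

$p_t(x_t) = \sum_{i=1}^K \pi_i\, p_t(x_t \mid \mathcal{C}_i), \qquad p(\mathcal{C}_i \mid x_t) = \frac{\pi_i\, p_t(x_t \mid \mathcal{C}_i)}{p_t(x_t)}$

$u_t(x_t) = \mathbb{E}\left[u_t(x_t|x_0) \mid x_t\right] = \sum_{i=1}^K p(\mathcal{C}_i \mid x_t)\, \mathbb{E}\left[u_t(x_t|x_0) \mid x_t,\, x_0 \in \mathcal{C}_i\right] = \sum_{i=1}^K p(\mathcal{C}_i \mid x_t)\, u_t^{(i)}(x_t)$

The first equality is the usual characterization of the minimizer of an $\ell^2$ regression objective as a conditional expectation. The second is the law of total expectation over cluster membership. Each inner expectation is the minimizer of the same objective restricted to $\mathcal{C}_i$, which is what an expert trained only on that cluster approximates.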

The router is trained to predict the cluster posterior $p(\mathcal{C}_i \mid x_t)$ from the noised input $x_t$, using cross-entropy loss against the ground-truth cluster ID.
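Router training is ordinary classification on noised inputs. The sketch below uses a linear softmax classifier on toy data as a stand-in for the small DiT or CNN router. The model, the fixed noise level, the omission of time conditioning, and all function names are simplifying assumptions.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)   # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(W, x_t, cluster_ids):
    """Mean negative log-probability of the true cluster under a linear router."""
    probs = softmax(x_t @ W)
    picked = probs[np.arange(len(cluster_ids)), cluster_ids]
    return float(-np.mean(np.log(picked + 1e-12)))

def gradient_step(W, x_t, cluster_ids, lr=0.1):
    """One full-batch gradient descent step on the cross-entropy loss."""
    onehot = np.eye(W.shape[1])[cluster_ids]
    grad = x_t.T @ (softmax(x_t @ W) - onehot) / len(cluster_ids)
    return W - lr * grad

rng = np.random.default_rng(0)
K, dim, n = 2, 4, 400
cluster_ids = rng.integers(0, K, size=n)             # ground-truth cluster labels
centers = np.array([[-1.0] * dim, [1.0] * dim])
x0 = centers[cluster_ids] + 0.3 * rng.standard_normal((n, dim))
x_t = np.sqrt(0.5) * x0 + np.sqrt(0.5) * rng.standard_normal((n, dim))  # noised input

W = np.zeros((dim, K))
loss_before = cross_entropy(W, x_t, cluster_ids)
for _ in range(200):
    W = gradient_step(W, x_t, cluster_ids)
loss_after = cross_entropy(W, x_t, cluster_ids)
```

The softmax output of the trained router plays the role of the cluster posterior used to weight the experts at inference.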

4. Generation Quality, Stability, and Expert-Data Alignment

Contrary to pre-existing assumptions, minimizing numerical sensitivity of the denoising trajectory (i.e., propagation of initial noise perturbations) does not correlate with perceptual sample quality in DDMs. Full-ensemble routing, which achieves the lowest denoising trajectory sensitivity, yields poor sample quality: FID 47.9 versus FID 22.6 for Top-2 sparse routing (on Paris DDM, LAION-Aesthetics) (Villagra et al., 2 Feb 2026).

Alignment between the current state and the selected experts' data manifold is the dominant factor in generative quality. Empirical findings include:

Cluster-alignment scores: Sparse routing selects experts whose data clusters are closest in embedding space to the current state $x_t$.
Per-expert accuracy: Selected experts achieve a lower mean angular deviation than non-selected experts (29% better).
Expert disagreement: Higher pairwise disagreement correlates with higher LPIPS and degraded sample quality.
Validation on alternate datasets (e.g., MNIST): Selected experts exhibit 43% smaller angular errors.
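The two diagnostics in the list above can be computed along the following lines. The precise definitions used in the cited work are not given here, so this sketch assumes angular deviation is the angle between a flattened prediction and its target, and disagreement is the mean pairwise Euclidean distance between expert predictions. Both function names are hypothetical.

```python
import numpy as np

def angular_deviation_deg(pred, target):
    """Angle in degrees between a prediction and its target, both flattened."""
    p, q = np.ravel(pred), np.ravel(target)
    cos = np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def mean_pairwise_disagreement(expert_preds):
    """Average Euclidean distance over all unordered pairs of expert predictions."""
    K = len(expert_preds)
    dists = [
        np.linalg.norm(np.ravel(expert_preds[i]) - np.ravel(expert_preds[j]))
        for i in range(K)
        for j in range(i + 1, K)
    ]
    return float(np.mean(dists))

target = np.array([1.0, 0.0])
aligned = angular_deviation_deg(np.array([1.0, 1.0]), target)       # 45 degrees
orthogonal = angular_deviation_deg(np.array([0.0, 2.0]), target)    # 90 degrees
spread = mean_pairwise_disagreement([np.array([0.0, 0.0]), np.array([3.0, 4.0])])
```

Comparing these quantities between selected and non-selected experts is one way to quantify expert-data alignment for a given routing rule.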

The following table summarizes routing strategy outcomes (Villagra et al., 2 Feb 2026):

Routing Strategy	FID (Paris DDM, LAION)	Stability (Sensitivity)	Expert Disagreement	Sample Quality
Top-1	30.6	Moderate	Low	Moderate
Top-2	22.6	Moderate	Very low	Best
Full	47.9	Lowest	Highest	Poorest

5. System-Level Scale, Efficiency, and Practical Considerations

DDMs eliminate the need for cross-GPU gradient exchange, reducing inter-cluster bandwidth by over 90%. Every expert is trained independently on its “island.” The only post-training communication is the sharing of expert and router checkpoints (McAllister et al., 9 Jan 2025). Analysis shows:

Compute scaling: Linear with number of experts. Training time is parallelized across distributed hardware.
Inference FLOPs: Top-1 expert routing matches monolithic FLOPs per sample; ensemble inference can be amortized.
Storage: All experts must be retained for ensemble or sparse routing at deployment, unless compressed through student distillation.
Hyperparameters: Choice of $K$ is crucial. Too few experts result in under-specialization; too many cause underfitting due to insufficient data per expert.

Router errors can route samples suboptimally, harming output diversity, but this is mitigated by robust router training and Top-$k$ selection.

6. Extensions to Decentralized Multi-Agent and Policy Diffusion

The MADiff framework (Zhu et al., 2023) extends DDM concepts to multi-agent reinforcement learning, where decentralized agent policies are modeled as a conditional diffusion generator over joint or per-agent trajectories. While not strictly the same as expert-partitioned image DDMs, MADiff illustrates decentralized generative modeling in cooperative settings:

Architecture: Cross-agent U-Net with attention fusion at every decoder layer enables decentralized agents to model peer behaviors for effective policy coordination.
Training: Offline, central training on joint trajectories; decentralized execution uses only per-agent observations.
Results: Decentralized MADiff outperforms independent diffusion policies, confirming that independent agent DMs without shared context degrade joint performance.

A key empirical finding across DDM design, even in multi-agent scenarios, is that decentralized diffusion performs best when model specialization (by data or agent role) is complemented by mechanisms for coordination, alignment, or selective ensembling (Zhu et al., 2023, Villagra et al., 2 Feb 2026).

7. Recommended Practices and Open Challenges

For effective DDM deployment:

Routing algorithms should prioritize expert-data alignment by measuring the proximity of $x_t$ to expert centroids using learned embeddings. Sparse Top-$k$ routing is preferred over uniform averaging.
Avoid full-ensemble averaging unless all experts have been jointly trained or disagreement is minimal.
Distillation offers a route to collapse ensemble experts into a single model for resource efficiency.
Application domains: Beyond vision, DDMs present potential for privacy-preserving federated training, multimodal data, and decentralized policy learning in reinforcement learning.

Open research directions include methods for federated privacy (local training, router-only sharing), hybridization with communication-efficient distributed learning (e.g., FedAvg, Gossip), and extending DDM techniques to video, audio, or sequential data policies (McAllister et al., 9 Jan 2025).

In summary, Decentralized Diffusion Models establish a paradigm where statistical and computational efficiency can be decoupled from centralized infrastructure, with rigorous alignment-based routing as the central principle for high-quality generative modeling (McAllister et al., 9 Jan 2025, Villagra et al., 2 Feb 2026, Zhu et al., 2023).

References (3)

McAllister et al. Decentralized Diffusion Models (2025)

Villagra et al. Expert-Data Alignment Governs Generation Quality in Decentralized Diffusion Models (2026)

Zhu et al. MADiff: Offline Multi-agent Learning with Diffusion Models (2023)
