Paris: A Decentralized Trained Open-Weight Diffusion Model

Published 3 Oct 2025 in cs.GR, cs.DC, and cs.LG | (2510.03434v1)

Abstract: We present Paris, the first publicly released diffusion model pre-trained entirely through decentralized computation. Paris demonstrates that high-quality text-to-image generation can be achieved without centrally coordinated infrastructure. Paris is open for research and commercial use. Paris required implementing our Distributed Diffusion Training framework from scratch. The model consists of 8 expert diffusion models (129M-605M parameters each) trained in complete isolation with no gradient, parameter, or intermediate activation synchronization. Rather than requiring synchronized gradient updates across thousands of GPUs, we partition data into semantically coherent clusters where each expert independently optimizes its subset while collectively approximating the full distribution. A lightweight transformer router dynamically selects appropriate experts at inference, achieving generation quality comparable to centrally coordinated baselines. Eliminating synchronization enables training on heterogeneous hardware without specialized interconnects. Empirical validation confirms that Paris's decentralized training maintains generation quality while removing the dedicated GPU cluster requirement for large-scale diffusion models. Paris achieves this using 14$\times$ less training data and 16$\times$ less compute than the prior decentralized baseline.


Authors (4)

Summary

The paper introduces a decentralized training method that partitions data into semantic clusters to train independent diffusion experts.
It employs a multi-expert architecture with Diffusion Transformers and a router that dynamically selects the most appropriate expert during inference.
The approach achieves competitive performance with reduced compute and data requirements, enabling asynchronous training on heterogeneous hardware.

Introduction

The Paris model introduces a fully decentralized approach to training large-scale text-to-image diffusion models, eliminating the need for synchronized gradient updates and specialized interconnects. By partitioning the training data into semantically coherent clusters and training independent expert models on each partition, Paris demonstrates that high-quality generative performance is achievable without centralized infrastructure. This paradigm shift enables training on heterogeneous, geographically distributed hardware, significantly lowering the barrier to entry for large-scale diffusion modeling.

Figure 1: Text conditioned image generation samples using Paris.

Distributed Diffusion Training Framework

Paris leverages a multi-expert architecture, where each expert is a Diffusion Transformer (DiT) trained in complete isolation on a distinct data cluster. The training pipeline consists of the following steps:

Latent Encoding: Images are encoded into a latent space using a pretrained VAE, reducing computational requirements.
Semantic Clustering: DINOv2 embeddings are used to partition the dataset into $K$ clusters, each representing a distinct semantic domain.
Expert Training: Each expert model is trained independently on its assigned cluster, optimizing a flow matching objective without any inter-expert communication.
Router Training: A lightweight transformer-based router is trained post-hoc to dynamically select the most appropriate expert(s) during inference.

This framework is grounded in decentralized flow matching theory, allowing the global generative distribution to be approximated by the ensemble of locally optimized experts.
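The semantic clustering step can be illustrated with a minimal sketch. It assumes plain Lloyd's k-means, which is a stand-in: this summary does not say which clustering algorithm is used. The toy 2-D points stand in for DINOv2 embeddings, and the names `kmeans`, `embeddings`, and `shards` are hypothetical.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign

# Toy stand-in for image embeddings: two well-separated 2-D blobs.
rng = random.Random(1)
embeddings = [(rng.gauss(0, 0.1), rng.gauss(0, 0.1)) for _ in range(50)]
embeddings += [(rng.gauss(5, 0.1), rng.gauss(5, 0.1)) for _ in range(50)]

centroids, assign = kmeans(embeddings, k=2)

# Each expert k would then see only the sample indices in its own shard S_k.
shards = {c: [i for i, a in enumerate(assign) if a == c] for c in range(2)}
```

Because the shards are disjoint, each one can be handed to a separate machine with no further coordination.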

Figure 2: Multi-expert training pipeline of Paris.

Model Architecture

DiTExpert

Each expert in Paris is based on the Diffusion Transformer architecture, adapted for decentralized training. Key architectural features include:

Latent Diffusion: Operates on $32 \times 32 \times 4$ latent tensors, following the latent diffusion paradigm.
Transformer Blocks: Incorporate Adaptive Layer Normalization (AdaLN) for timestep conditioning, with optional AdaLN-Single for parameter efficiency.
Text Conditioning: Cross-attention layers enable text-to-image synthesis, using CLIP embeddings projected to the model's hidden dimension.
Scalability: DiTExpert models are validated at two scales—DiT-B/2 (129M parameters per expert) and DiT-XL/2 (605M parameters per expert).
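As a quick check on the sizes involved, the sketch below computes the transformer sequence length for the latent shape above. It assumes the standard DiT patchify step, where the "/2" suffix denotes a patch size of 2; this summary does not spell out that detail, and `patchify_shape` is a hypothetical helper.

```python
def patchify_shape(height, width, channels, patch):
    """Return (number of tokens, values per token) for non-overlapping patches."""
    tokens = (height // patch) * (width // patch)
    token_dim = patch * patch * channels
    return tokens, token_dim

# 32 x 32 x 4 latent with patch size 2.
print(patchify_shape(32, 32, 4, 2))  # (256, 16)
```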

DiTRouter

The router is a smaller Diffusion Transformer variant, designed to process noisy latents and predict the most suitable expert(s) for each denoising step. It incorporates timestep-aware processing and is trained using cross-entropy loss against ground-truth cluster assignments.
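The router's training signal can be sketched as an ordinary classification loss. In the sketch below the logits are hypothetical hand-written values; in the actual model they would come from the router network applied to a noisy latent and its timestep. The helper names are assumptions for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax via the log-sum-exp trick."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def router_loss(logits, cluster_id):
    """Cross-entropy between router logits and the ground-truth cluster index."""
    return -log_softmax(logits)[cluster_id]

K = 8  # number of experts, as in the abstract

# A router with no preference pays log(K) nats.
uninformed = router_loss([0.0] * K, cluster_id=3)

# A router that strongly favors the correct cluster pays much less.
confident = router_loss([0.0, 0.0, 0.0, 6.0, 0.0, 0.0, 0.0, 0.0], cluster_id=3)
```

Exponentiating the log-softmax output gives the probabilities that stand in for $p_t(k|x_t)$ at inference.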

Figure 3: Multi-expert inference pipeline of Paris.

Decentralized Flow Matching Objective

The decentralized training objective decomposes the standard flow matching loss across $K$ experts, each optimizing:

$\mathcal{L}_{\text{expert}^{(k)}} = \mathbb{E}_{x_0 \in S_k, t} \left[ \|v_{\theta_k}(x_t, t) - (x_0 - x_t)\|^2 \right]$

where $S_k$ is the data cluster for expert $k$, $v_{\theta_k}$ is the predicted velocity field, and $x_t$ is the noisy latent at timestep $t$. The router learns to approximate the posterior $p_t(k|x_t)$, enabling dynamic expert selection during inference.
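One way to read the role of the router posterior is sketched below, under the assumption that the data distribution is treated as a mixture over the $K$ clusters. For such a mixture, the marginal velocity field is the posterior-weighted average of the per-cluster fields. The symbol $v^{(k)}$ for the ideal field of cluster $k$ is notation introduced here, not taken from the source.

```latex
v(x_t, t) \;=\; \sum_{k=1}^{K} p_t(k \mid x_t)\, v^{(k)}(x_t, t)
\;\approx\; \sum_{k=1}^{K} p_t(k \mid x_t)\, v_{\theta_k}(x_t, t)
```

The approximation holds to the extent that each trained expert $v_{\theta_k}$ matches its cluster's field and the router matches the true posterior.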

Inference Strategies

Paris supports several inference strategies:

Top-1 Expert Selection: Routes each denoising step to the single most confident expert, offering computational efficiency and strong empirical performance.
Top-K Weighted Ensemble: Combines predictions from the $K'$ most relevant experts, weighted by router probabilities, balancing quality and cost.
Full Ensemble Integration: Aggregates all expert predictions, weighted by router probabilities; however, this approach often yields inferior results due to interference from less-relevant experts.

Empirical results indicate that selective expert collaboration (Top-2) outperforms both monolithic and full ensemble approaches.
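The three strategies differ only in how many experts contribute at each denoising step, as the minimal sketch below shows. Renormalizing the router probabilities over the selected experts is an assumption of this sketch; the summary does not state how the weights are normalized. `combine` and the toy values are hypothetical.

```python
def combine(router_probs, expert_velocities, top_k=None):
    """Router-weighted average of expert predictions.

    top_k=1 gives Top-1 selection, top_k=None uses every expert (full ensemble).
    Weights are renormalized over the selected experts (an assumption).
    """
    n = len(router_probs)
    k = n if top_k is None else top_k
    # Indices of the k experts with the highest router probability.
    chosen = sorted(range(n), key=lambda i: router_probs[i], reverse=True)[:k]
    total = sum(router_probs[i] for i in chosen)
    dim = len(expert_velocities[0])
    return [
        sum(router_probs[i] / total * expert_velocities[i][d] for i in chosen)
        for d in range(dim)
    ]

# Toy example: three experts, one-dimensional "velocity" predictions.
probs = [0.6, 0.3, 0.1]
velocities = [[1.0], [4.0], [10.0]]

top1 = combine(probs, velocities, top_k=1)  # only expert 0 contributes
top2 = combine(probs, velocities, top_k=2)  # experts 0 and 1, reweighted to 2/3 and 1/3
full = combine(probs, velocities)           # all three experts
```

In this toy case the low-probability third expert pulls the full-ensemble output furthest from the Top-1 prediction, which is the kind of interference described above.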

Resource Efficiency and Parallelization

Paris's decentralized training eliminates all synchronization overhead, enabling asynchronous training on heterogeneous hardware. Unlike traditional data, model, or pipeline parallelism, Paris experts train independently, with no blocking or topology constraints. This allows for efficient utilization of commodity GPUs and fragmented compute resources.
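The absence of synchronization can be illustrated with a toy sketch in which each "expert" is a single parameter fitted to its own shard. The workers share no state and exchange no gradients. The thread pool stands in for independent machines, and every name and value here is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def train_expert(shard, steps=200, lr=0.1):
    """Fit a one-parameter 'expert' to its own shard by gradient descent.

    The loss is the mean squared error between theta and the shard's samples,
    so theta converges to the shard mean. No other shard is ever read.
    """
    theta = 0.0
    for _ in range(steps):
        grad = sum(2 * (theta - x) for x in shard) / len(shard)
        theta -= lr * grad
    return theta

# Three disjoint toy shards, one per expert.
shards = [[1.0, 1.2, 0.8], [5.0, 5.5, 4.5], [-3.0, -2.0]]

# Each worker runs to completion on its own; results are collected only at the end.
with ThreadPoolExecutor(max_workers=len(shards)) as pool:
    experts = list(pool.map(train_expert, shards))
```

Since no worker waits on another, a slow or late-starting machine delays only its own expert.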

Experimental Results

Paris achieves competitive generation quality with dramatically reduced resource requirements:

FID-50K (LAION-Art, DiT-B/2): Top-2 expert selection yields an FID of 22.60, a 7.04-point improvement over the monolithic baseline (29.64). Full ensemble integration underperforms (FID 47.89).
Comparison with DDM (DiT-XL/2, LAION-Aesthetic): Paris achieves an FID of 12.45 using 11M training images and 120 A40 GPU-days, compared to DDM's FID of 9.84 with 158M images and 1176 A100 GPU-days. This is a 14$\times$ reduction in data and a 16.3$\times$ reduction in compute, at the cost of a 1.27$\times$ higher FID.

Figure 4: Efficiency comparison of Paris and DDM.
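The headline ratios can be recomputed from the figures quoted above. Note that the raw GPU-day ratio comes out near 9.8, below the reported 16.3$\times$ compute reduction. This summary does not say how the 16.3 figure was derived; a plausible but unconfirmed explanation is a normalization for the performance difference between A100 and A40 GPUs.

```python
# Figures as quoted in the comparison above.
paris = {"images_millions": 11, "gpu_days": 120, "fid": 12.45}
ddm = {"images_millions": 158, "gpu_days": 1176, "fid": 9.84}

data_ratio = ddm["images_millions"] / paris["images_millions"]
raw_gpu_day_ratio = ddm["gpu_days"] / paris["gpu_days"]
fid_ratio = paris["fid"] / ddm["fid"]

# Per-GPU-day factor that would be needed to reconcile the raw ratio
# with the reported 16.3x (an inference, not a number from the source).
implied_hardware_factor = 16.3 / raw_gpu_day_ratio

print(f"data: {data_ratio:.1f}x, raw GPU-days: {raw_gpu_day_ratio:.1f}x, "
      f"FID: {fid_ratio:.2f}x, implied factor: {implied_hardware_factor:.2f}")
```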

Practical and Theoretical Implications

Paris demonstrates that decentralized training is a viable alternative to centralized approaches for large-scale generative modeling. The elimination of synchronization requirements enables broader participation in model development, democratizing access to high-quality diffusion models. The modular expert architecture facilitates specialization and extensibility, while the router enables dynamic, noise-aware expert selection.

The modest quality gap relative to centralized baselines suggests that further optimization—such as improved clustering, expert distillation, or advanced routing mechanisms—could close the performance gap while retaining resource efficiency. The framework is readily extensible to other modalities (e.g., video, audio) and supports future research in scalable, decentralized generative modeling.

Conclusion

Paris establishes a practical blueprint for fully decentralized diffusion model training, achieving competitive text-to-image synthesis with dramatically reduced data and compute requirements. The multi-expert architecture, combined with dynamic routing, enables efficient, high-quality generation on heterogeneous hardware. This work paves the way for scalable, democratized generative modeling and invites further exploration into decentralized architectures, expert specialization, and cross-modal extensions.
