Paris 2.0: A Decentralized Diffusion Model for Video Generation
Abstract: We present Paris 2.0, the first video generation model pre-trained through decentralized computation. Its training recipe builds upon Paris 1.0 (arXiv:2510.03434), the first ever open-weight Decentralized Diffusion Model (DDM), which showed that image generation can be trained without a monolithic GPU cluster. However, temporally coherent video generation had remained an open problem under decentralized training, and Paris 2.0 closes it. In low-resolution text-to-video training, against a monolithic model trained on the same data under a matched total compute budget, Paris 2.0 cuts Fréchet Video Distance (FVD) from 561.04 to 279.01, a ~2.0x improvement, and lifts CLIP text-video similarity and aesthetic score.
Explain it Like I'm 14
Overview: What this paper is about
This paper introduces Paris 2.0, a new way to train AI that makes videos from text. The big idea is to train several smaller “specialist” models on different computers spread around the world, instead of one giant model on a single expensive supercomputer. This decentralized approach was proven for images in Paris 1.0; Paris 2.0 shows it also works for videos, which are much harder because they have motion and many frames that must stay consistent over time.
What questions were the researchers asking?
- Can we train a high‑quality video generator without a giant, tightly connected GPU cluster?
- If we split training across several independent “expert” models (each trained on its own slice of data), can a smart “router” combine them to make better videos than a single big model?
- Will this work under a fair comparison, where both approaches use the same amount of training data and compute?
How they approached the problem
The big idea: a team of specialists instead of one generalist
Imagine making a movie with a crew of specialists: one is great at talking‑head shots, another at close-ups of hands crafting things, another at action scenes. Each specialist practices on their own. Then, during filming, a director decides who should handle each moment. In Paris 2.0:
- Each “expert” is a full video model trained separately on a different cluster of videos.
- A small “router” acts like the director. At every tiny step of video creation, it looks at the current progress and picks which expert(s) to use next.
This means:
- No constant back‑and‑forth between computers while training. Each expert can be trained on cheaper, scattered machines.
- At generation time, you use only a few experts per step, so it’s efficient while still benefiting from specialization.
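The routing step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names `route_top_k` and `mixed_velocity`, the softmax renormalization over the selected experts, and the toy lambda "experts" are all assumptions made for the example.

```python
import numpy as np

def route_top_k(router_logits, k):
    """Return (indices, weights) for the k highest-scoring experts."""
    logits = np.asarray(router_logits, dtype=np.float64)
    top = np.argsort(logits)[::-1][:k]           # indices of the k largest logits
    shifted = logits[top] - logits[top].max()    # stabilise the softmax
    weights = np.exp(shifted) / np.exp(shifted).sum()
    return top, weights

def mixed_velocity(experts, latent, router_logits, k=2):
    """Weighted sum of the velocities predicted by the selected experts only."""
    top, weights = route_top_k(router_logits, k)
    return sum(w * experts[i](latent) for i, w in zip(top, weights))

# Toy stand-ins for three experts (hypothetical; real experts are large transformers).
experts = [lambda x: -x, lambda x: np.ones_like(x), lambda x: 2.0 * x]
latent = np.array([1.0, 2.0])

# Experts 0 and 1 tie for the top score, so each receives weight 0.5;
# expert 2 is never evaluated.
v = mixed_velocity(experts, latent, router_logits=[2.0, 2.0, -5.0], k=2)
```

Because only the selected experts are called, the cost of a step grows with K rather than with the total number of experts in the pool.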
How the model is built (in simple terms)
- Videos are first “compressed” into a compact code called a latent, using a tool named HunyuanVAE. Think of this like zipping a video so it’s smaller and faster to process.
- Text prompts are turned into numbers the model understands using popular text encoders (T5 and CLIP).
- Each expert is a large transformer (about 11 billion parameters) that learns how to turn random noise into a video that matches the text.
- The router is much smaller (about 100 million parameters). It reads the current noisy video state plus the text, and decides which expert(s) to call at that moment.
How training works (with simple analogies)
- Diffusion/flow matching: Creating a video starts from pure “static” (noise), then the model repeatedly cleans it up into a clear video. Flow matching teaches the model the direction to move at each step—like giving someone turn‑by‑turn directions from “random” to “high‑quality video.”
- Experts are trained separately: Each expert only sees its portion of the data and learns to be great at that slice (like practicing a specific sport).
- The router is trained like a label predictor: It learns to recognize which data cluster a noisy video snippet belongs to, so later it can pick the right expert for similar situations.
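As a rough sketch of the two training signals above, the snippet below builds a flow-matching training pair for an expert and a labelled example for the router. The interpolation convention (`x_t = (1 - t) * x0 + t * noise`, target velocity `noise - x0`) is a common rectified-flow choice and is assumed here; the summary does not state which convention Paris 2.0 uses. All function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_pair(clean_latent, t, rng):
    """Noisy input and target velocity for one training example.

    Assumed convention: x_t = (1 - t) * x0 + t * noise, target = noise - x0.
    """
    noise = rng.standard_normal(clean_latent.shape)
    x_t = (1.0 - t) * clean_latent + t * noise
    target_velocity = noise - clean_latent
    return x_t, target_velocity

def expert_loss(predict_velocity, clean_latent, t, rng):
    """Mean squared error between predicted and target velocity."""
    x_t, target = flow_matching_pair(clean_latent, t, rng)
    return float(np.mean((predict_velocity(x_t, t) - target) ** 2))

def router_example(clean_latent, cluster_id, t, rng):
    """The router sees a noised latent and must predict its data cluster."""
    x_t, _ = flow_matching_pair(clean_latent, t, rng)
    return x_t, cluster_id

x0 = rng.standard_normal((4, 8))              # stand-in for a cached video latent
x_t, v = flow_matching_pair(x0, t=0.3, rng=rng)
```

Under this convention, stepping back along the velocity recovers the clean latent (`x_t - t * v` equals `x0`), which is what the sampler exploits at generation time.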
What did they find?
Under the same data and total compute budget as a traditional single model, Paris 2.0 with three experts:
- Made videos that looked more realistic, cutting a key quality score (FVD) roughly in half: from 561.04 down to 279.01. Lower is better here.
- Matched the text prompts better (higher CLIP text‑video similarity).
- Looked better overall (higher aesthetic score).
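The size of the FVD gain can be checked directly from the two reported numbers. The variable names below are illustrative only.

```python
fvd_monolithic = 561.04   # reported FVD of the monolithic baseline
fvd_paris2 = 279.01       # reported FVD of Paris 2.0

ratio = fvd_monolithic / fvd_paris2
relative_reduction = 1.0 - fvd_paris2 / fvd_monolithic

print(f"{ratio:.2f}x lower FVD, {relative_reduction:.1%} reduction")
# prints: 2.01x lower FVD, 50.3% reduction
```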
Why this matters:
- The results show the “team of specialists + router” approach doesn’t just keep up—it actually beats the single big model under fair conditions.
- Additional tests showed the gain comes from smart routing and specialization, not just from having more parameters or combining outputs blindly:
- “Switching schedule” tests (manually alternating experts during generation) still improved results, suggesting experts shine at different stages of the process (early noisy steps vs later fine‑detail steps).
- “Expert specialization” tests showed each expert truly learned its own data cluster, which is why the router can effectively assign them.
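A manual switching schedule of the kind used in that ablation might look like the sketch below: one expert handles the early, high-noise steps and another the later, low-noise steps, with no router involved. The hard switch point, the Euler update direction (integrating from t = 1 down to t = 0), and the constant-velocity toy experts are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def switching_schedule(num_steps, switch_at):
    """Expert 0 for the first `switch_at` steps (high noise), expert 1 afterwards."""
    return [0 if step < switch_at else 1 for step in range(num_steps)]

def sample_with_schedule(experts, schedule, latent):
    """Euler integration from t = 1 (noise) to t = 0, one expert per step."""
    num_steps = len(schedule)
    dt = 1.0 / num_steps
    for step, expert_id in enumerate(schedule):
        t = 1.0 - step * dt
        velocity = experts[expert_id](latent, t)
        latent = latent - dt * velocity          # move toward t = 0
    return latent

# Toy experts with constant velocities (hypothetical stand-ins).
experts = [lambda x, t: np.ones_like(x), lambda x, t: 2.0 * np.ones_like(x)]
schedule = switching_schedule(num_steps=50, switch_at=25)
out = sample_with_schedule(experts, schedule, np.zeros(3))
```

Comparing samples from such a schedule against single-expert runs is a cheap way to test whether experts specialize by denoising stage before investing in a trained router.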
Why this is important
- Better quality with flexible training: You can train powerful video generators using many separate, less expensive machines—even across different locations and clouds—without the strict, expensive setup big labs use.
- Scales naturally: Need more capacity? Add another expert trained on another slice of data. No need to rebuild one giant synchronized system.
- Useful for future “world models”: Robots and game AIs need to predict how the world changes. Since the real world is diverse, having specialized experts for different environments (and a router to pick among them) is a natural fit.
- More accessible AI: Decentralized training can help smaller teams and communities contribute to and benefit from high‑quality video models, pushing the field toward more open and collaborative development.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
The paper demonstrates promising Stage 1 results for decentralized video diffusion, but leaves multiple areas under-explored. The following list pinpoints concrete gaps and questions to guide follow-on work:
- Scaling beyond three experts:
- How do quality, temporal consistency, and router stability evolve with 8–64 experts?
- What are the compute–quality trade-offs as top-K increases and as expert overlap grows?
- Router design and training:
- The Stage 1 router is a simple cluster classifier over noisy latents; does joint/iterative training with experts, curriculum over timesteps, or learned timestep-aware objectives improve routing?
- How sensitive is routing to using only pooled CLIP text embeddings (omitting T5)? Does including richer text features reduce misrouting for complex semantics?
- Training-time/inference-time distribution shift: the router is trained on noised dataset latents, not on latents along expert-driven denoising trajectories. How large is this mismatch, and can it be reduced (e.g., by training with teacher-forced expert trajectories or self-play)?
- Router calibration and uncertainty: how to detect and handle out-of-cluster prompts or novel domains (e.g., fallback to top-1, adaptive K, or abstention)?
- Router update without full retraining when new experts are added (true horizontal scalability); can a plug-in calibration/adapter layer support continual addition?
- Expert specialization and cluster formation:
- The paper does not specify how data clusters are defined or discovered. Can unsupervised or semi-supervised clustering outperform manual/heuristic clusters, and how should clusters be refreshed over time (concept drift)?
- How to handle overlapping clusters, long-tail domains, and class imbalance while preventing expert collapse or mode dropping?
- What is the impact of noisy or ambiguous cluster labels on router accuracy and overall generation quality?
- Temporal quality and consistency metrics:
- The main evaluation reports FVD, CLIP similarity, aesthetics, and motion magnitude, but not temporal consistency (e.g., tLPIPS, tOF consistency, identity/structure preservation, flicker) or long-horizon coherence. How does DDM perform on these?
- The “Motion (px/frame)” increase is descriptive and not a consistency measure; does DDM induce desirable motion vs. instability or drift? Provide controlled studies and user preference tests.
- Resolution, duration, and modality coverage:
- Results are at 256×256 and short clips. How do gains carry to higher resolutions (e.g., 512–1024px), longer durations, and variable frame rates?
- Only text-to-video is evaluated, despite an optional first-frame path; what is performance on image-to-video (adherence to the first frame, identity preservation) and multi-modal conditioning?
- Baselines and ablations:
- Compare to monolithic Mixture-of-Experts (MoE) backbones with end-to-end learned gating and to classical ensembles (e.g., snapshot or data-silo ensembles) at matched compute.
- Study different expert-combination rules (top-1 vs. soft mixtures vs. hard switching) and per-step vs. per-interval routing.
- Evaluate alternative samplers/schedules (e.g., DPM-Solver++, fewer/more steps) and their interaction with routing decisions across timesteps.
- Training objectives and VAEs:
- Assess whether flow matching is uniquely synergistic with decentralized routing vs. ε/velocity prediction or consistency training.
- The system relies on cached HunyuanVAE latents; what is the effect of the VAE’s compression quality on downstream FVD and temporal coherence, and how robust is the pipeline to VAE changes? Is online encoding preferable in some settings?
- Generalization and robustness:
- Out-of-cluster generalization: how does DDM perform on domains absent from all experts or on novel prompt styles/languages?
- Robustness to adversarial or low-quality experts (relevant in decentralized, potentially untrusted compute): can the router detect and down-weight compromised experts?
- Sensitivity analyses for prompt length, rare concepts, multi-language prompts, and compositional instructions.
- Inference efficiency and deployment:
- Quantify per-sample latency and compute vs. monolithic models for different top-K and step counts; report compute–quality Pareto curves.
- If experts are hosted on different machines, what is the network/latency overhead of per-step cross-expert calls? Is practical low-latency distributed inference feasible without co-location?
- Memory/VRAM implications of loading multiple experts vs. on-demand paging or expert distillation.
- Cost, reliability, and environmental impact:
- While training removes inter-GPU synchronization, the paper lacks empirical cost, wall-clock, and energy comparisons for decentralized vs. monolithic runs (including spot/preemptible dynamics, failure recovery, and hardware heterogeneity).
- What scheduler or orchestration strategies best mitigate stragglers and preemption in decentralized training?
- Safety, ethics, and data governance:
- How do decentralized training and expert specialization affect content safety, bias, and fairness across demographics/domains? Are some clusters disproportionately degraded or amplified?
- Data provenance and privacy in distributed settings (e.g., federated or volunteer compute) and mechanisms for auditability and poisoning/backdoor defenses.
- World-model/agent applicability:
- Claims about applicability to physical AI/world models are not validated. Can expert specialization per environment/task improve closed-loop planning rollouts, and how does routing perform under action-conditional generation and compounding error?
- Reproducibility and transparency:
- Critical details are missing for replication: precise dataset composition, cluster definitions, training hyperparameters, and whether weights/checkpoints are released.
- Standardized evaluations on public benchmarks (e.g., MSR-VTT, UCF-101, Kinetics, VBench) are absent; reporting across multiple datasets is needed to assess generality.
- Stability of expert mixing:
- The weighted sum of velocity fields can, in principle, cause interference between experts with divergent dynamics. Under what conditions does mixing help vs. harm, and can constraints or normalization improve stability?
- Continual learning and maintenance:
- Procedures for adding/removing experts over time, handling data drift, and updating router/expert pairs without catastrophic forgetting are not specified.
- Can smaller distilled experts approximate specialized large experts to reduce inference cost while preserving routing benefits?
Practical Applications
Immediate Applications
The following applications can be deployed with today’s tooling and workflows, leveraging Paris 2.0’s decentralized training recipe, expert routing at inference, and the demonstrated quality gains (lower FVD, higher CLIP/aesthetic) under matched compute.
- Cost-optimized decentralized video model training for studios/startups
- Sector: Media/entertainment, advertising, software
- What to build: Train multiple 11B “expert” models on cluster-specific data (e.g., talking heads, product demos, sports) using spot instances or heterogeneous, geographically distributed GPUs; later route across them with a lightweight router at inference.
- Dependencies/assumptions: Access to legal, clusterable video datasets; orchestration to manage asynchronous, per-expert training; availability of base weights (e.g., FLUX.1-dev) and HunyuanVAE; tolerance for fault-prone spot instances.
- Router-augmented video generator that improves quality at monolithic compute
- Sector: Creative tools, SaaS video generation platforms
- What to build: Integrate a DiT-B router with a small top-K activation over specialized experts in an existing sampling stack (e.g., Euler-50), keeping per-sample compute ≈ single backbone while improving FVD and prompt alignment.
- Dependencies/assumptions: Trained expert pool; router trained as a cluster classifier over noisy latents; deployment GPU with enough memory to host selected experts and router.
- Domain-specific expert fine-tuning for high-value verticals
- Sector: E-commerce (product spins/demos), education (lab/how-to videos), marketing (talking heads), social (ASMR/texture interactions)
- What to build: Fine-tune or pre-train experts on clustered vertical datasets (e.g., “hands-on crafts,” “face-to-camera explainers”), then route per sample; reduce prompt-engineering burden by exploiting specialization.
- Dependencies/assumptions: Curated, rights-cleared vertical datasets; reproducible clustering pipeline (by motion/camera/scene dynamics); evaluation with CLIP/FVD/aesthetic metrics.
- “Switching schedule” sampler for quality gains without a trained router
- Sector: Open-source diffusion ecosystems, internal R&D
- What to build: Add manual alternating schedules (e.g., Expert A for high-noise steps, Expert B for low-noise steps) to existing samplers; A/B test against single-expert baselines as shown by the paper’s ablation.
- Dependencies/assumptions: At least two specialized experts; sampler that allows per-step expert selection; monitoring to detect directional specialization across denoising time.
- Latent caching and decoupled perception stack for faster training
- Sector: MLOps/infra
- What to build: Pipeline that pre-encodes videos with HunyuanVAE and prompts with T5-XXL/CLIP, caches these tensors, and feeds them directly into expert training to cut training wall-time and cost.
- Dependencies/assumptions: Reliable storage for cached latents; consistent VAE/noise schedule across experts; job schedulers that exploit cached inputs.
- Multi-cloud/consumer-GPU orchestration for expert training
- Sector: Cloud/DevOps
- What to build: An orchestrator that dispatches independent expert training jobs across heterogeneous hardware (cloud regions, on-prem, consumer GPUs) without inter-GPU synchronization.
- Dependencies/assumptions: Checkpoint robustness across interruptions; autoscaling for spot/preemptible nodes; secure model artifact exchange.
- Evaluation and dataset curation by specialization signals
- Sector: Research/ML platform teams
- What to build: Tooling to measure in-cluster vs out-of-cluster CLIP gaps per expert, FVD per cluster, and motion statistics; use signals to refine clusters and retrain experts where specialization is weak.
- Dependencies/assumptions: Deterministic, stratified evaluation sets; metric computation at scale; feedback loop into data clustering.
- Router- and expert-aware content workflows in creative suites
- Sector: Video editing and design software
- What to build: Plugins that expose expert selection and routing controls (e.g., “use motion-heavy expert for this segment”), optionally with image-to-video conditioning via DINOv2 for first frames.
- Dependencies/assumptions: API access to experts and router; UI/UX to surface routing weights and top-K controls; GPU acceleration for inference.
- Rapid prototyping of environment-specific video world models (offline)
- Sector: Robotics research, simulation
- What to build: Train experts on environment-specific video logs (lab benches, warehouses) to improve predictive video rollouts used for planning or data generation; route per environment at inference.
- Dependencies/assumptions: Environment-labeled datasets; offline usage (action-conditioning not yet addressed by Paris 2.0); careful evaluation of temporal coherence and error compounding.
- Procurement and policy shift toward distributed compute
- Sector: Policy/IT governance
- What to build: Update procurement to prioritize heterogeneous, low-cost compute for per-expert training (including spot instances), with governance for data provenance and contributor compliance.
- Dependencies/assumptions: Policies for IP/content rights; contributor agreements; observability for decentralized training contributions.
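For the specialization-signal tooling listed above (in-cluster vs out-of-cluster gaps per expert), one possible way to compute the gap is sketched here. It assumes a square score matrix in which expert `e` was trained on cluster `e` and higher scores are better (for example, mean CLIP similarity); the numbers and the 0.05 threshold are invented for illustration.

```python
import numpy as np

def specialization_gaps(scores):
    """In-cluster score minus mean out-of-cluster score, per expert.

    scores[e, c] is the mean metric of expert e on evaluation cluster c.
    Assumes expert e was trained on cluster e (square matrix).
    """
    scores = np.asarray(scores, dtype=np.float64)
    n = scores.shape[0]
    assert scores.shape == (n, n), "expected one evaluation cluster per expert"
    in_cluster = np.diag(scores)
    out_cluster = (scores.sum(axis=1) - in_cluster) / (n - 1)
    return in_cluster - out_cluster

# Made-up scores for three experts on three clusters.
scores = [
    [0.30, 0.20, 0.22],
    [0.21, 0.31, 0.23],
    [0.20, 0.24, 0.25],
]
gaps = specialization_gaps(scores)

# Experts whose gap falls below an arbitrary threshold are candidates for
# re-clustering or retraining.
weak_experts = [e for e, g in enumerate(gaps) if g < 0.05]
```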
Long-Term Applications
These applications require additional research, scaling to higher resolutions/longer durations, safety frameworks, or new ecosystem development.
- “Expert economy” marketplaces for decentralized training
- Sector: Software platforms, creator economy
- Product idea: Market where individuals/organizations train experts on niche clusters and publish them; routers learn to allocate traffic and revenue to high-performing experts.
- Dependencies/assumptions: Standardized checkpoint formats and router APIs; quality/safety vetting; attribution and payment rails; anti-poisoning and IP-compliance controls.
- Large-scale, action-conditioned world models via decentralized experts
- Sector: Robotics, autonomous systems, industrial automation
- Product idea: Per-environment or per-task experts (e.g., factory cell, kitchen, OR) integrated into action-conditioned, temporally coherent video world models; router composes predictions across contexts.
- Dependencies/assumptions: Extend from pure video diffusion to action-conditioned dynamics; long-horizon consistency; sim2real validation; safety and failure-mode auditing.
- Continual learning by adding new experts without forgetting
- Sector: ML infrastructure, research
- Product idea: Add experts for new domains (e.g., underwater, space, micro-scale) without retraining legacy experts; update router to exploit new capacity while preserving old behaviors.
- Dependencies/assumptions: Router re-training without catastrophic interference; governance for model sprawl; storage/versioning strategy.
- Energy-aware, grid-friendly decentralized training
- Sector: Energy/compute sustainability
- Product idea: Schedulers that train experts when renewable generation is abundant or prices low, shifting load across regions; reporting on carbon intensity per expert.
- Dependencies/assumptions: Access to real-time grid signals; migration-tolerant training jobs; carbon accounting standards for ML.
- On-device or edge-streaming video generation with sparse expert activation
- Sector: Mobile/AR/VR, consumer devices
- Product idea: Compress experts and router for edge GPUs; activate a minimal top-K pathway per step to approach real-time generation for AR effects or live content.
- Dependencies/assumptions: Model compression/quantization; memory-efficient VAEs; latency-optimized samplers; thermal/power constraints.
- Privacy-preserving personalized experts
- Sector: Consumer software, enterprise productivity
- Product idea: Per-user experts trained locally on private style/asset libraries; router composes public and private experts for personalized outputs without sharing raw data.
- Dependencies/assumptions: Federated or local-only training flows; secure aggregation of routing signals; differential privacy where needed.
- Healthcare simulation and training content via specialized experts
- Sector: Healthcare, medical education
- Product idea: Experts specialized on modality-specific videos (e.g., ultrasound, endoscopy) to generate realistic training scenarios and augment datasets.
- Dependencies/assumptions: Strict data governance and de-identification; clinical validation of realism and safety; bias monitoring; extension to higher resolutions and longer clips.
- Education-specific video generation (labs, demonstrations, multi-step processes)
- Sector: EdTech
- Product idea: Experts for common classroom experiments and demonstrations; router ensures appropriate motion patterns for multi-step explanations.
- Dependencies/assumptions: Rights-cleared educational datasets; content accuracy review pipelines; curriculum alignment.
- Robustness and safety routers
- Sector: Trust & safety, compliance
- Product idea: Routers that downweight or exclude experts likely to produce unsafe or off-policy content based on latent signals; provenance-aware routing decisions.
- Dependencies/assumptions: Safety classifiers integrated into routing loop; red-teaming and audit datasets; interpretability for routing choices.
- Generalized multimodal decentralized ensembles
- Sector: Multimodal AI (text–image–audio–video)
- Product idea: Extend the Paris 2.0 recipe to experts across modalities (e.g., audio, 3D), with routers selecting experts conditioned on combined latent states for richer media generation.
- Dependencies/assumptions: Cross-modal latent alignment; unified noise schedules; inference orchestration for heterogeneous expert types.
Notes on feasibility across applications:
- Scaling limits: The reported gains are for short, low-resolution (256×256) clips; high-resolution, long-duration video may require revalidation.
- Data: Quality and legality of clusterable datasets critically affect outcomes; bias and representativeness must be monitored per expert.
- Availability: Open release of Paris 2.0 weights/code and HunyuanVAE is assumed for turnkey adoption; otherwise, replication effort is needed.
- Compute: Inference remains bounded by “single-backbone-like” compute per sample with small top-K; hosting many experts concurrently still has memory/latency costs.
- Safety/compliance: Decentralized contributions introduce risks (data misuse, poisoning); governance and verification layers are essential before production use.
Glossary
- all-reduce: A collective communication operation that aggregates tensors (e.g., sums gradients) across devices and shares the result back to each participant. "all-reduce gradients, once per step"
- all-to-all: A communication pattern where every device exchanges data with every other device, often used in attention layers. "all-to-all, every attention layer"
- CLIP-ViT-L/14: A CLIP variant with a ViT-L/14 visual backbone; its paired text encoder is used here to produce pooled text embeddings for video generation. "T5-v1.1-XXL and CLIP-ViT-L/14."
- classifier-free guidance: A diffusion sampling technique that mixes conditional and unconditional predictions to control conditioning strength. "classifier-free guidance scale 7.5"
- context parallelism: A distributed training strategy that shards the input context/sequence across devices to scale context length. "Context Parallelism"
- data parallelism: A training approach that replicates the full model across devices, feeds each replica a different slice of the batch, and synchronizes gradients at every step. "Data Parallelism"
- Decentralized Diffusion Model (DDM): An ensemble of independently trained diffusion experts combined by a learned router at inference without cross-expert synchronization during training. "A Decentralized Diffusion Model (DDM) [6] removes that constraint."
- denoising step: A single iteration in diffusion sampling where noise is incrementally removed from the latent. "During inference, routing is performed at each denoising step,"
- DINOv2: A self-supervised vision model that produces general-purpose image features used for optional first-frame conditioning. "where DINOv2 is a self-supervised vision model that produces general-purpose image features,"
- DiT-B: A base-sized Diffusion Transformer architecture variant used here as the lightweight router. "The router is the lightweight counterpart to the expert pool, a DiT-B model of roughly 100M parameters."
- Euler-50 sampling: An ODE-based diffusion sampler using the Euler method with 50 steps. "using Euler-50 sampling,"
- FLUX.1-dev: A specific FLUX image-model checkpoint used to initialize video experts. "initialized only from FLUX.1-dev image weights [1]"
- flow matching: A generative training objective that learns a velocity field to transform noise into data by matching a target probability flow. "optimized with a flow-matching velocity objective that teaches it to turn noise into video"
- Fréchet Video Distance (FVD): A metric that measures distributional realism of generated videos relative to real videos. "Paris 2.0 cuts Fréchet Video Distance (FVD) from 561.04 to 279.01,"
- HunyuanVAE: A video autoencoder that compresses videos into latents and decodes latents back to video. "Videos are encoded into cached causal HunyuanVAE latents [4],"
- iso-FLOP: An experimental control where compared models use the same total floating-point operations (and often the same data). "comparison is iso-FLOP and iso-data,"
- latent: A compact internal representation of data used by generative models during encoding/decoding and diffusion. "the final latent is decoded to video by HunyuanVAE."
- MM-DiT: A multimodal Diffusion Transformer backbone for generating video conditioned on text (and optionally images). "Each expert is an 11B-parameter FLUX-style MM-DiT, the multimodal diffusion transformer backbone that generates the video,"
- monolithic model: A single synchronized model trained on a unified dataset using tightly coupled GPU infrastructure. "against a monolithic model trained on the same data"
- ODE solver: A numerical integrator used to advance the latent along a continuous-time diffusion trajectory. "drives an ODE solver to the next latent,"
- pipeline parallelism: A distributed training method that partitions model layers into stages across devices and pipelines minibatches through them. "Pipeline Parallelism"
- router: A lightweight model that assigns routing weights over experts during inference based on the current noisy latent and conditioning. "The router is trained independently of the experts, as a supervised cluster classifier over noisy latents."
- tensor parallelism: A strategy that shards tensors within layers across devices to increase model capacity without replication. "Tensor Parallelism"
- T5-v1.1-XXL: A very large T5 text encoder variant used to embed prompts for conditioning the video generator. "T5-v1.1-XXL and CLIP-ViT-L/14."
- Top-K: Selecting the K highest-scoring candidates (here, experts) to activate at each step. "Activated by router (Top-K experts)"
- velocity field: The model-predicted vector field over latents indicating how they should change to follow the generative flow. "selects one or more experts to evaluate the video velocity field."
- world models: Predictive models that simulate environment dynamics to forecast future states for agents. "world models that physical AI agents roll forward to predict how their actions reshape a scene."
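To make the classifier-free guidance and Euler-50 sampling entries above concrete, here is a minimal sketch that combines them. The guidance formula is the standard one; the model signature `model(latent, t, cond)`, the use of `None` for the unconditional branch, and the integration direction (t = 1 down to t = 0) are assumptions for this example rather than details confirmed by the paper.

```python
import numpy as np

def guided_velocity(model, latent, t, cond, scale=7.5):
    """Classifier-free guidance: push the unconditional prediction toward the conditional one."""
    v_uncond = model(latent, t, None)
    v_cond = model(latent, t, cond)
    return v_uncond + scale * (v_cond - v_uncond)

def euler_sample(model, latent, cond, num_steps=50, scale=7.5):
    """Fixed-step Euler integration of the guided velocity field."""
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = 1.0 - step * dt
        latent = latent - dt * guided_velocity(model, latent, t, cond, scale)
    return latent

# Toy model (hypothetical): velocity equals the conditioning vector, or zero without it.
def toy_model(latent, t, cond):
    return np.zeros_like(latent) if cond is None else np.asarray(cond, dtype=np.float64)

result = euler_sample(toy_model, np.zeros(2), cond=[1.0, -1.0])
```

Each guided step costs two model evaluations (conditional and unconditional), which matters when counting per-sample compute.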