Nucleus-Image: Sparse MoE for Image Generation

Published 14 Apr 2026 in cs.CV | (2604.12163v1)

Abstract: We present Nucleus-Image, a text-to-image generation model that establishes a new Pareto frontier in quality-versus-efficiency by matching or exceeding leading models on GenEval, DPG-Bench, and OneIG-Bench while activating only approximately 2B parameters per forward pass. Nucleus-Image employs a sparse mixture-of-experts (MoE) diffusion transformer architecture with Expert-Choice Routing that scales total model capacity to 17B parameters across 64 routed experts per layer. We adopt a streamlined architecture optimized for inference efficiency by excluding text tokens from the transformer backbone entirely and using joint attention that enables text KV sharing across timesteps. To improve routing stability when using timestep modulation, we introduce a decoupled routing design that separates timestep-aware expert assignment from timestep-conditioned expert computation. We construct a large-scale training corpus of 1.5B high-quality training pairs spanning 700M unique images through multi-stage filtering, deduplication, aesthetic tiering, and caption curation. Training follows a progressive resolution curriculum (256 to 512 to 1024) with multi-aspect-ratio bucketing at every stage, coupled with progressive sparsification of the expert capacity factor. We adopt the Muon optimizer and share our parameter grouping recipe tailored for diffusion models with timestep modulation. Nucleus-Image demonstrates that sparse MoE scaling is a highly effective path to high-quality image generation, reaching the performance of models with significantly larger active parameter budgets at a fraction of the inference cost. These results are achieved without post-training optimization of any kind: no reinforcement learning, no direct preference optimization, and no human preference tuning. We release the training recipe, making Nucleus-Image the first fully open-source MoE diffusion model at this quality.


Authors (5)

Summary

The paper introduces a sparse MoE diffusion transformer that activates only ~2B of 17B parameters per pass for efficient image generation.
It employs expert-choice routing for near-perfect load balancing, together with timestep-conditioned adaptive modulation and fused Triton kernels for memory-bandwidth efficiency.
The model matches or outperforms leading models on benchmarks of spatial relations, prompt fidelity, and text rendering, paving the way for scalable, cost-effective deployment.

Nucleus-Image: Sparse Mixture-of-Experts Diffusion Transformer for Efficient High-Quality Image Generation

Model and Corpus Design

Nucleus-Image introduces a sparse MoE diffusion transformer architecture aimed at maximizing image generation quality while minimizing inference cost. The model comprises 32 transformer layers with 17B parameters in total, with only ~2B parameters activated per forward pass via expert-choice routing across 64 experts per layer. This design leverages expert parallelism for scalability, enabling near-perfect load-balancing for distributed training.
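As a quick sanity check on the figures above, the share of parameters touched per forward pass can be worked out directly. This is only arithmetic on the two rounded numbers quoted in the text (17B total, ~2B active), not an exact accounting of the model.

```python
TOTAL_PARAMS = 17e9   # total parameters, as quoted in the text
ACTIVE_PARAMS = 2e9   # approximate active parameters per forward pass, as quoted

# Fraction of the model that participates in a single forward pass.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"active fraction per forward pass: {active_fraction:.1%}")
```

With these rounded inputs, roughly one parameter in eight or nine is active per pass.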

Text tokens are excluded from the transformer backbone, participating solely through joint attention as KV contributors. The architecture employs adaptive modulation conditioned on diffusion timestep, RMSNorm-based QK normalization, and custom fused Triton kernels for memory-bandwidth efficiency.

The dataset is constructed from 700M curated images with 1.5B multi-granularity caption pairs, incorporating multi-aspect-ratio bucketing, progressive aesthetic and quality scoring, and provenance-based textual supervision. Training follows a progressive curriculum over resolutions 256→512→1024, employing resolution-aware episodic bucketing and dynamic capacity factor schedules for expert sparsification.
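Multi-aspect-ratio bucketing can be illustrated with a minimal sketch: each image is assigned to the bucket whose aspect ratio is closest to its own, so that batches contain a single shape. The bucket list and the function name `nearest_bucket` are hypothetical; the actual bucket resolutions used for Nucleus-Image are not given in this summary.

```python
import math

# Hypothetical (width, height) buckets around a 1024x1024 pixel budget.
BUCKETS = [(1024, 1024), (1152, 896), (896, 1152), (1344, 768), (768, 1344)]


def nearest_bucket(width: int, height: int, buckets=BUCKETS) -> tuple[int, int]:
    """Return the bucket whose aspect ratio is closest to width/height.

    Distances are measured in log space so that 2:1 and 1:2 are treated
    symmetrically around the square bucket.
    """
    target = math.log(width / height)
    return min(buckets, key=lambda b: abs(math.log(b[0] / b[1]) - target))


print(nearest_bucket(4000, 3000))   # a 4:3 landscape photo
print(nearest_bucket(1080, 1920))   # a 9:16 portrait image
```

In a progressive curriculum, the same assignment logic would presumably be reused with a different bucket list at each resolution stage (256, 512, 1024).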

Sparse MoE Architecture and Routing Dynamics

Nucleus-Image replaces dense FFN layers with sparse MoE layers (from layer 3 onward), using 64 per-layer routed experts plus a shared expert. Expert-choice routing is utilized, wherein each expert selects its top-k tokens, enforcing uniform utilization and obviating the need for auxiliary load-balancing losses typical in token-choice MoE. Routing decisions are decoupled from computation—routers operate on unmodulated token states concatenated with timestep embeddings, while expert MLPs use modulated representations. This separation prevents timestep-scale collapse, yielding stable, semantically differentiated expert assignments.
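The routing described above can be sketched in a few lines of plain Python. This is a toy illustration, not the paper's implementation: the capacity formula (capacity factor times tokens divided by experts) follows the common expert-choice convention, and all sizes, the scalar `scale`/`shift` modulation, and the random router weights are assumptions made for the example.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def expert_choice_route(scores, capacity):
    """scores[e][t] is expert e's affinity for token t.

    Each expert keeps its `capacity` highest-scoring tokens, so every
    expert processes exactly the same number of tokens.
    """
    assignments = []
    for expert_scores in scores:
        ranked = sorted(range(len(expert_scores)),
                        key=lambda t: expert_scores[t], reverse=True)
        assignments.append(sorted(ranked[:capacity]))
    return assignments


random.seed(0)
n_tokens, n_experts, dim, t_dim = 16, 4, 8, 2      # toy sizes (assumed)
capacity_factor = 2.0                              # assumed value
capacity = int(capacity_factor * n_tokens / n_experts)

tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
t_emb = [random.gauss(0, 1) for _ in range(t_dim)]

# Decoupling: the router sees UNMODULATED tokens concatenated with the
# timestep embedding, while the expert MLPs see MODULATED tokens.
scale, shift = 1.5, 0.1                            # stand-in timestep modulation
modulated = [[scale * x + shift for x in tok] for tok in tokens]
router_in = [tok + t_emb for tok in tokens]

router_w = [[random.gauss(0, 1) for _ in range(dim + t_dim)]
            for _ in range(n_experts)]
scores = [[dot(w, x) for x in router_in] for w in router_w]

assignments = expert_choice_route(scores, capacity)
expert_inputs = [[modulated[t] for t in chosen] for chosen in assignments]
loads = [len(chosen) for chosen in assignments]
print(loads)
```

Because every expert selects a fixed number of tokens, the load is uniform by construction, which is why no auxiliary balancing loss is needed. A token may be picked by several experts or by none; in the architecture described above, the shared expert still processes every token.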

Expert allocation analysis reveals that the routing mechanism dynamically adjusts expert specialization and allocation based on spatial saliency and denoising phase. Foreground regions (object boundaries, fine textures, text) attract higher expert density and lower diversity (stable specialization), while backgrounds exhibit lower allocation and higher diversity (frequent reassignment). Temporal analysis demonstrates a transition from diffuse allocation at early noise levels to sharp, content-aligned specialization at late timesteps.

Training Protocol and Optimization

The optimization pipeline employs the Muon optimizer with RMS-norm learning rate scaling, AdamW for modulation and projection parameters, and no-decay AdamW for biases/norms/router gates. The Warmup-Stable-Merge learning rate schedule (WSM) eliminates the need for online EMA by post-hoc checkpoint merging, enhancing both scalability and training flexibility.
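The three-way parameter split can be sketched as a small classification function. The name patterns (`norm`, `router`, `modulation`, `proj_in`, `proj_out`) are hypothetical placeholders, since the model's actual parameter names are not given here; only the grouping rule itself follows the description above.

```python
def assign_group(name: str, ndim: int) -> str:
    """Map a parameter to an optimizer group.

    Grouping rule from the text; the name patterns are assumed:
      - biases, norms, router gates -> AdamW without weight decay
      - modulation and projection parameters -> AdamW
      - remaining weight matrices -> Muon
    """
    if ndim < 2 or "norm" in name or "router" in name or name.endswith(".bias"):
        return "adamw_no_decay"
    if "modulation" in name or "proj_in" in name or "proj_out" in name:
        return "adamw"
    return "muon"


examples = [
    ("blocks.5.attn.qkv.weight", 2),
    ("blocks.5.attn.q_norm.weight", 1),
    ("blocks.5.moe.router.weight", 2),
    ("blocks.5.modulation.linear.weight", 2),
    ("proj_out.bias", 1),
]
for name, ndim in examples:
    print(f"{name:40s} -> {assign_group(name, ndim)}")
```

The resulting groups would then be handed to the respective optimizers; the learning rates and decay values per group are not specified in this summary.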

Loss functions comprise rectified flow MSE (primary objective), wavelet-domain supervision (high-frequency detail), z-loss for router logit regularization, and orthogonal loss for expert diversity. Expert-choice routing guarantees balanced allocation, supported by z/orthogonal regularizers for optimal specialization.
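Two of these terms are simple enough to sketch directly. The example below assumes the common rectified-flow convention (t = 0 is data, t = 1 is noise, and the target is the constant velocity between them) and the usual router z-loss form (mean squared log-sum-exp of the router logits); the paper's exact conventions and loss weights are not given in this summary, and the wavelet and orthogonal terms are omitted.

```python
import math


def rectified_flow_pair(x0, noise, t):
    """Return (x_t, target velocity) for x_t = (1 - t) * x0 + t * noise."""
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, noise)]
    velocity = [b - a for a, b in zip(x0, noise)]
    return x_t, velocity


def mse(pred, target):
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def router_z_loss(logits_per_token):
    """Mean squared log-sum-exp of router logits; discourages large logits."""
    total = 0.0
    for logits in logits_per_token:
        m = max(logits)  # subtract the max for numerical stability
        lse = m + math.log(sum(math.exp(v - m) for v in logits))
        total += lse ** 2
    return total / len(logits_per_token)


x_t, target = rectified_flow_pair(x0=[0.0, 2.0], noise=[1.0, 0.0], t=0.25)
prediction = [0.5, -2.0]            # stand-in for the model's velocity output
print(x_t, target, mse(prediction, target))
print(router_z_loss([[0.0, 0.0], [1.0, -1.0]]))
```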

Parallelism is implemented via FSDP2 and expert parallelism, with expert weights sharded across devices and all-to-all communication for token routing. Custom Triton kernels for grouped matrix multiplication, activation, normalization, and permutation facilitate efficient computation under device-expert mapping.
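The permutation step can be illustrated without any GPU code. Routed (token, expert) slots are reordered so that each expert's inputs are contiguous, and the group boundaries are what a grouped matrix multiplication would consume. The function names and the `experts_per_device` mapping are illustrative assumptions, not the paper's kernels.

```python
def permute_by_expert(slot_experts, n_experts):
    """Group routed slots by expert.

    slot_experts[i] is the expert that slot i was routed to.
    Returns (order, offsets): slot indices sorted by expert, and offsets
    such that expert e owns order[offsets[e]:offsets[e + 1]].
    """
    order = sorted(range(len(slot_experts)), key=lambda i: slot_experts[i])
    counts = [0] * n_experts
    for e in slot_experts:
        counts[e] += 1
    offsets = [0]
    for c in counts:
        offsets.append(offsets[-1] + c)
    return order, offsets


def device_of(expert: int, experts_per_device: int) -> int:
    """Assumed contiguous expert-to-device mapping for expert parallelism."""
    return expert // experts_per_device


order, offsets = permute_by_expert([2, 0, 1, 0, 2, 2], n_experts=3)
print(order, offsets)
print([device_of(e, experts_per_device=16) for e in (0, 15, 16, 63)])
```

With 64 experts and a hypothetical 16 experts per device, the all-to-all exchange would send each slot to the device returned by `device_of` before the grouped matmul runs locally.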

Inference and Efficiency

Inference optimizations include text embedding caching (no repeated computation across denoising steps), GQA for reduced KV memory, exclusion of text tokens from FFN/MoE layers, and joint attention over image+text via concatenated KV—translating architectural design directly into low-latency inference.
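The caching idea can be sketched as follows: because text tokens are not updated by the backbone, their keys and values can be projected once per prompt and reused at every denoising step. `TextKVCache` and `toy_project_kv` are hypothetical names, and the projection is a stand-in; a real model would hold one such cache entry per attention layer.

```python
class TextKVCache:
    """Compute text keys/values once per prompt and reuse them across steps."""

    def __init__(self, project_kv):
        self._project_kv = project_kv
        self._cache = {}
        self.misses = 0  # number of times the projection actually ran

    def get(self, prompt_id, text_embeddings):
        if prompt_id not in self._cache:
            self.misses += 1
            self._cache[prompt_id] = self._project_kv(text_embeddings)
        return self._cache[prompt_id]


def toy_project_kv(text_embeddings):
    # Stand-in for the learned K/V projections of one attention layer.
    keys = [[2.0 * x for x in tok] for tok in text_embeddings]
    values = [[x + 1.0 for x in tok] for tok in text_embeddings]
    return keys, values


cache = TextKVCache(toy_project_kv)
text = [[0.1, 0.2], [0.3, 0.4]]
for step in range(50):  # 50 denoising steps, as in the evaluation setup
    text_k, text_v = cache.get("prompt-0", text)
    # Joint attention would concatenate [image_k; text_k] and
    # [image_v; text_v] here, with queries coming from image tokens only.

print(cache.misses)
```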

Evaluation and Results

Nucleus-Image is evaluated on GenEval, DPG-Bench, and OneIG-Bench at 1024×1024 resolution, 50 steps, CFG=8.0. All benchmarks are run solely on the base model, with no post-training tuning or RL/DPO.
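For readers unfamiliar with the CFG setting, the standard classifier-free guidance combination is sketched below. The summary does not state which guidance formulation the paper uses, so this is the textbook form with the reported scale of 8.0 plugged in.

```python
def cfg_combine(cond, uncond, scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


cond_pred = [2.0, 1.0]     # stand-in model output with the text prompt
uncond_pred = [1.0, 1.0]   # stand-in model output with an empty prompt
print(cfg_combine(cond_pred, uncond_pred, scale=8.0))
```

A scale of 1.0 recovers the conditional prediction unchanged; larger scales push the output further along the direction the prompt suggests. Under this formulation each denoising step needs both a conditional and an unconditional prediction.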

GenEval (object compositionality, spatial relations): Nucleus-Image achieves an overall score of 0.87, matching Qwen-Image and outperforming GPT Image 1 [High] (0.84) and Seedream 3.0 (0.84). It is the highest-scoring model for spatial position (0.85).
DPG-Bench (dense prompt fidelity): Highest overall score of 88.79, leading Qwen-Image (88.32) and Seedream 3.0 (88.27) and outperforming existing open models. It ranks first in entity, attribute, and "other" subcategories.
OneIG-Bench (alignment, text rendering, reasoning, style, diversity): Score of 0.522, ahead of Imagen4 (0.515) and Recraft V3 (0.502). Demonstrates strong text rendering (0.825) and style (0.430), with diversity as a remaining limitation.

Averaged across normalized benchmarks, Nucleus-Image reaches 76.00, competitive with several leading models with active parameter budgets several times larger.
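The 76.00 figure is consistent with a simple normalization in which the 0-to-1 scores are rescaled to 0-to-100 and the three benchmarks are averaged with equal weight. That normalization is an assumption here; it is shown because it reproduces the reported number from the scores listed above.

```python
# Overall scores as reported above; GenEval and OneIG-Bench are on a 0-1
# scale and are rescaled to 0-100 (assumed normalization).
scores = {
    "GenEval": 0.87 * 100,
    "DPG-Bench": 88.79,
    "OneIG-Bench": 0.522 * 100,
}
average = sum(scores.values()) / len(scores)
print(f"{average:.2f}")
```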

Implications and Future Directions

The Nucleus-Image study establishes sparse MoE scaling as a practical and effective axis for large-scale image generation. The decoupled routing strategy addresses unique failure modes in diffusion transformers, specifically preventing collapse under timestep modulation, and the joint attention design efficiently leverages textual supervision without incurring compute overhead. Training efficiency is enhanced via progressive sparsification and curriculum-driven data allocation, demonstrating that quality and efficiency trade-offs can be tightly managed.

Practically, the model’s ability to deliver near state-of-the-art performance with only ~2B active parameters per pass sketches a clear path toward scalable deployment and cost-effective serving. The open release of model weights, data, and code further fosters rapid community iteration and reproducibility. Theoretically, the demonstrated properties of content-aware sparsity, temporal routing coherence, and noise-adaptive allocation suggest promising directions for future research in sparse expert architectures for generative modeling and diffusion-based vision tasks.

Anticipated future directions involve further refinement of expert specialization techniques (e.g. hybrid spatial-expert routing), exploration of multi-modal expert allocations, and integration of advanced curriculum learning protocols for robust scaling across diverse visual domains. The modular nature of the architecture should facilitate rapid experimentation with new attention schemes and layer-wise sparsity patterns. Additionally, MoE diffusion strategies could be extended to other generative domains (e.g., video, 3D).

Conclusion

Nucleus-Image validates sparse mixture-of-experts as an efficient and high-fidelity scaling strategy for diffusion-based image generation. By activating only a small fraction of total parameters, it achieves state-of-the-art or near-state-of-the-art results on key benchmarks, all without post-training preference optimization. The architectural, optimization, and systems contributions harmonize to deliver both quality and efficiency, and the open release is positioned to accelerate research in scalable generative modeling and sparse architectures for vision-language tasks (2604.12163).

Explain it Like I'm 14

Overview

This paper introduces Nucleus-Image, a new AI model that turns text descriptions into images. Its big idea is to get top image quality while using much less computing power than similar systems. It does this by using a “team of specialists” design inside the model so only the right parts work at the right time, saving time and memory. The team also built a very clean, very large training dataset and shares the full model, code, and data openly.

What questions were they trying to answer?

Can a text-to-image model reach the quality of the best models while using far fewer computations per image?
Can “mixture-of-experts” (a team-of-specialists design) be made stable and efficient for image generation?
How do better data cleaning, better captions, and a careful training schedule improve results without extra tricks like human preference tuning?
Can we open-source a state-of-the-art model at this quality level?

How did they build and train the model?

A model built like a team of specialists

Think of the model as a big art studio with many specialist artists (called “experts”). Each time the model generates an image, a smart “router” picks only a few of those experts to work, based on what the current image pieces need. This design is called a sparse Mixture-of-Experts (MoE).

Total “studio size”: about 17 billion parameters (the knobs the model can adjust).
Active per step: only ~2 billion actually work at once, which is much cheaper than using all 17 billion every time.
The core engine is a diffusion transformer:
- Diffusion: the model starts from noisy static and steadily “denoises” it into a clear picture.
- Transformer: a pattern-recognizing system that learns how pieces of an image relate to each other and to the text.

A few key choices made this both accurate and fast:

Expert-Choice Routing: instead of each image piece choosing experts, each expert chooses which tokens (image pieces) it will work on. This keeps all experts busy and avoids complicated balancing tricks.
Decoupled routing: the “traffic cop” (router) decides who works on what using a version of the signal that isn’t overly affected by the current noise level. This keeps choices stable across the diffusion steps, but still lets the model adjust based on how noisy the image is.
Efficient text use: the model doesn’t run text tokens through the whole transformer. Instead, image tokens attend to the text through joint attention (the text supplies keys and values alongside the image), and those text keys and values are reused across steps to save compute.

The image “tokenizer”

Before training, images are compressed into a small grid of numbers (like a reduced puzzle) by a VAE (Variational Autoencoder), so the model can work faster. After generation, the VAE turns the grid back into a full image.
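To give a feel for how much the compression shrinks the problem, the sketch below counts the tokens the transformer would see at each training resolution. The 8x spatial downsampling and the 2x2 patch size are assumptions (typical for latent diffusion transformers) and are not stated in this summary.

```python
def latent_token_count(height: int, width: int,
                       vae_downsample: int = 8, patch_size: int = 2) -> int:
    """Number of image tokens after VAE compression and patchification.

    vae_downsample and patch_size are assumed values, not from the paper.
    """
    lat_h, lat_w = height // vae_downsample, width // vae_downsample
    return (lat_h // patch_size) * (lat_w // patch_size)


for side in (256, 512, 1024):
    print(side, latent_token_count(side, side))
```

Under these assumptions, each doubling of resolution quadruples the token count, which is one reason a curriculum that starts at low resolution is cheaper than training at 1024 from the start.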

Building a high‑quality training set

The team built a dataset of about 700 million unique images with ~1.5 billion image–caption pairs. They focused on quality and variety:

Careful cleaning: remove broken, tiny, blurry, watermarked, unsafe, or near-duplicate images.
Quality scoring: real photos get an “aesthetic” score (how appealing they look); synthetic images are ranked with other signals.
Smarter captions: each image can have several captions (short, medium, detailed). If existing web text matches the image well, they keep it; if it’s mediocre, they refine it; if it’s poor, they generate a new caption from the image.
Text-in-image practice: they add special synthetic data where text is rendered onto images in many fonts and languages, so the model learns to draw readable in-image text.
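The keep/refine/regenerate decision for captions can be sketched as a threshold rule on an image-text alignment score. The thresholds and the function name `route_caption` are hypothetical; the summary only says that well-matched web text is kept, mediocre text is refined, and poor text is replaced by a generated caption.

```python
def route_caption(alignment: float,
                  keep_at: float = 0.30, refine_at: float = 0.20) -> str:
    """Decide what to do with an existing caption.

    `alignment` is an image-text similarity score (for example from a
    CLIP-style model); both thresholds are illustrative assumptions.
    """
    if alignment >= keep_at:
        return "keep"
    if alignment >= refine_at:
        return "refine"
    return "regenerate"


for score in (0.35, 0.25, 0.10):
    print(score, route_caption(score))
```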

A training curriculum that grows in difficulty

Training progresses like school grades:

Start small (256px), then 512px, then 1024px, always including multiple aspect ratios (not just squares).
As resolution grows, the model gradually uses fewer experts per token (it’s already learned a lot by then).
Data is split into 8 curriculum “buckets” from easier/rougher to harder/higher quality, and the training shifts toward later, higher-quality buckets over time.
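One way to picture the shift toward later buckets is a sampling distribution whose peak slides from the first bucket to the last as training progresses. The bell-shaped weighting and the `sharpness` parameter below are purely illustrative assumptions; the actual schedule used for the 8 buckets is not described in this summary.

```python
import math


def bucket_weights(progress: float, n_buckets: int = 8,
                   sharpness: float = 4.0) -> list[float]:
    """Sampling weights over curriculum buckets (hypothetical schedule).

    progress is in [0, 1]; the peak of the distribution moves linearly
    from bucket 0 at the start of training to the last bucket at the end.
    """
    center = progress * (n_buckets - 1)
    raw = [math.exp(-((i - center) ** 2) / sharpness) for i in range(n_buckets)]
    total = sum(raw)
    return [r / total for r in raw]


for progress in (0.0, 0.5, 1.0):
    weights = bucket_weights(progress)
    print(progress, [round(w, 2) for w in weights])
```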

Stability and speed under the hood

The team made several engineering choices to keep training stable and fast:

First few layers are “dense” (no experts) to stabilize early learning.
Normalize attention to avoid exploding numbers.
Gated residuals (like a volume knob) keep updates balanced.
Custom high-speed GPU kernels and grouped operations pack work efficiently and reduce memory traffic.
An optimizer and learning-rate schedule (“Muon” + “Warmup‑Stable‑Merge”) that remove the need for extra weight copies, simplifying training.
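The "merge" part of the schedule replaces a running average of weights with an average taken over saved checkpoints after the fact. A minimal sketch, assuming a plain weighted average with uniform weights by default (the actual merge weights are not given in this summary) and using scalars in place of tensors:

```python
def merge_checkpoints(checkpoints, weights=None):
    """Weighted average of parameter dicts that share the same keys."""
    if weights is None:
        weights = [1.0 / len(checkpoints)] * len(checkpoints)
    merged = {}
    for name in checkpoints[0]:
        merged[name] = sum(w * ckpt[name] for w, ckpt in zip(weights, checkpoints))
    return merged


# Scalars stand in for weight tensors to keep the example dependency-free.
ckpt_a = {"layer.weight": 1.0, "layer.bias": 0.0}
ckpt_b = {"layer.weight": 3.0, "layer.bias": 1.0}
print(merge_checkpoints([ckpt_a, ckpt_b]))
```

Because the averaging happens offline, no second copy of the weights has to be kept in accelerator memory during training.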

What did they find?

High quality at lower cost: On public tests (GenEval, DPG‑Bench, OneIG‑Bench), Nucleus-Image matches or beats leading models while activating only ~2B parameters per step. That means better “quality vs compute” trade-offs (a new Pareto frontier: you get more quality for the same cost, or the same quality for less cost).
Stable MoE for diffusion works: Their routing design and training tricks keep the specialists balanced and effective across the many noise steps in diffusion.
Data matters a lot: Clean, diverse, well-captioned data and a staged curriculum significantly improve results—no extra preference fine-tuning or reinforcement learning required.
Practical inference: The architecture choices (like sharing text attention across steps and using grouped attention heads) reduce memory use and speed up image generation.

Why does this matter?

Faster, cheaper, high‑quality image generation: You can get top-tier images without needing massive compute, making powerful models more accessible.
Better handling of text in images: The extra synthetic text training helps the model draw readable words in many styles and languages.
Clear recipe for others: The paper gives a detailed, reproducible approach—data pipeline, training schedule, and engineering tricks—that others can build on.
Open ecosystem boost: They released model weights, code, and data, making this (at the time of writing) the first fully open‑source MoE diffusion model at this quality level. That accelerates research, education, and responsible innovation.

In short

Nucleus-Image shows that a smart “team-of-specialists” design, paired with clean data and a sensible training plan, can deliver excellent text-to-image results while using far fewer computations. It’s fast, high quality, and open—pushing the field forward in a practical, shareable way.


Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, concrete list of what remains missing, uncertain, or unexplored in the paper, phrased to guide follow-up research.

Benchmark transparency: Overall scores and basic settings (1024×1024, 50 steps, CFG=8.0) are given for GenEval, DPG-Bench, and OneIG-Bench, but confidence intervals and remaining test-time settings (sampler, seeds) are not reported, hindering reproducibility and fair comparisons.
Human evaluation: No human preference studies are provided to validate automated metrics or to assess subjective qualities (aesthetics, faithfulness, artifact rates).
Compute-quality Pareto: The claimed “new Pareto frontier” lacks quantitative latency/throughput, memory, and energy measurements across hardware and batch sizes, and comparisons at matched quality.
Inference settings: Evaluation uses 50 steps with CFG=8.0, but the sampling algorithm, alternative guidance strategies, and the impact of these choices on quality vs latency are unspecified.
Text KV sharing across timesteps: The paper claims joint attention with text KV reuse but provides no ablation on memory/latency savings, quality impacts, or degradation on very long or multilingual prompts.
Cross-attention–only text conditioning: Excluding text tokens from the backbone may limit compositional or long-context reasoning; no comparative ablation against dual-stream or text-in-backbone designs.
Expert-choice vs token-choice routing: No controlled ablation shows the quality, stability, and efficiency trade-offs between expert-choice and token-choice for diffusion transformers.
Decoupled routing efficacy: The benefits of routing on unmodulated states with separate timestep input are not quantified (e.g., collapse rate, expert entropy, quality at different $t$).
Expert specialization analysis: The allocation analysis relates routing to spatial saliency and denoising phase, but there is no further evidence (e.g., probes, clustering) of what individual experts specialize in semantically; specialization dynamics over training remain unknown.
Routing robustness: Sensitivity of routing to prompt distribution shifts, rare concepts, or extreme timesteps is unreported; failure modes (e.g., collapse under heavy conditioning) are not cataloged.
Capacity factor schedules: The per-layer and per-stage capacity schedules are not ablated; optimality across resolutions, prompt lengths, and datasets is unknown.
Shared expert design: The necessity and sizing of the shared expert are not studied; its contribution to stability, quality, and specialization remains unquantified.
Dense-initial-layers requirement: Using dense FFNs in the first 3 blocks is asserted necessary, but the trade-off with 0/1/2/4+ dense layers is not experimentally clarified.
QK-Norm and tanh-gating: Stabilization choices (frozen QK-RMSNorm, tanh-bounded residuals) lack ablations vs alternatives (e.g., learnable scaling, attention clipping), and their interaction with MoE is not studied.
Muon optimizer and learning-rate schedule: The “Warmup-Stable-Merge” schedule and optimizer tuning are not detailed or ablated; sensitivity to hyperparameters and removal of EMA is not validated against baselines.
Training stability metrics: There is no reporting of gradient norms, loss spikes, expert load statistics, or dropout/collapse rates over training to substantiate stability claims.
Scaling laws for MoE in diffusion: No study of how quality scales with number of experts, expert width, active parameter budget, or layer count for fixed compute.
Expert-parallel all-to-all costs: Network traffic, latency, overlap with compute, and scaling efficiency across nodes are not quantified or compared to alternative parallelization strategies.
Kernel-level gains: Triton, Liger, and FlashAttention-3 fusions are described, but their empirical speedups, memory savings, and kernel-level profiling (vs PyTorch baselines) are not reported.
VAE bottleneck and alternatives: Using Qwen-Image VAE (16ch) is fixed; no evaluation of VAE choice on text fidelity, color shifts, or high-frequency detail, nor of 8/32-channel variants.
Resolution limits: Training tops at 1024; quality/efficiency at 2K–4K, tiling strategies, and super-resolution integration remain unexplored.
Multiaspect training: While aspect-ratio bucketing is used throughout, its impact on generalization, artifacts at extreme ARs, and decoding consistency are not measured.
Dataset composition and bias: Demographic/cultural balance, domain mix, and geographic/language distributions are not characterized; downstream bias and fairness evaluations are absent.
Safety evaluation: Beyond filtering, no post-training red-teaming or safety benchmarks (e.g., harmful content, stereotypes, private data leakage) are reported.
Memorization risk: Despite deduplication, no experiments quantify memorization (near-duplicate regeneration, copyright risk) or the effect of dedup radius thresholds on overfitting.
Benchmark contamination: There is no analysis of potential train–test leakage with benchmarks (e.g., OneIG-Bench), nor procedures to ensure decontamination.
Caption policy at train time: With multi-granularity captions stored, the selection/mixing policy during training and its effect on compositionality and faithfulness are not described or ablated.
Multilingual text rendering: The synthetic text pipeline spans multiple scripts, but there is no quantitative evaluation (OCR accuracy, edit distance) by language, font, or layout complexity.
Text generation fidelity: No targeted metrics for in-image text quality (legibility, spelling, formatting) on standardized benchmarks to validate the synthetic text augmentation.
Auxiliary tasks: Although multiple tasks are supported (inpainting, colorization, zoom), there is no evaluation on these tasks, and no reporting of their sampling weights or utility.
Generalization to long/complex prompts: Robustness to long, nested, or compositional prompts and performance under rare concept combinations are not assessed.
Non-English prompts: End-to-end performance with non-English prompts is not evaluated, despite multilingual captioning and text rendering.
Effect of synthetic vs real data: The ratio and interaction between synthetic and real data are not ablated; impacts on photorealism vs stylization remain unclear.
Curriculum design: The 8 episodic buckets and their schedules lack ablations; sensitivity to weighting policies and their interaction with quality tiers is unknown.
Preference tuning absence: While no RLHF/DPO is used, the trade-offs (e.g., prompt adherence vs aesthetics vs safety) and how preference tuning would interact with MoE routing remain untested.
Reproducibility details: Total training tokens/FLOPs, wall-clock time, hardware profile, and checkpointing strategy are not provided for faithful reproduction.
Fine-tuning and adaptation: Procedures for efficient domain adaptation (e.g., LoRA with MoE gating, expert freezing/swap-in) and their effects on stability/quality are not explored.
Robustness under distribution shift: Performance on OOD domains (medical, scientific diagrams, satellite imagery) and degradation patterns are unreported.
Failure analysis: No qualitative/quantitative catalog of common failure modes (anatomy errors, hands, small object counts, spurious text, background artifacts) to target future improvements.


Practical Applications

Immediate Applications

Below are near-term, deployable uses that build directly on the model, data pipeline, and systems innovations reported in the paper.

High-quality, low-cost text-to-image generation for commercial creatives
- Sectors: media/entertainment, advertising, e-commerce, gaming, design
- What: Use Nucleus-Image as an API or plugin (e.g., Figma, Adobe) to generate product photos, ads, posters, concept art, storyboards, thumbnails, and mood boards at a lower inference cost due to ~2B active parameters per step.
- Why it’s enabled: Sparse MoE diffusion transformer with expert-choice routing delivers leading benchmark performance at a fraction of inference compute; joint attention with text KV sharing reduces repeated text compute.
- Dependencies/assumptions: Access to a GPU with sufficient memory (for the DiT + text encoder + VAE); integration of the released weights/code; licensing compliance for the released model and training data.
Better in-image typography and multilingual text rendering
- Sectors: marketing, publishing, SMBs, education
- What: Generate posters, banners, flyers, book/magazine covers, and social posts where text is legible and stylistically varied (fonts, layouts, languages).
- Why it’s enabled: Dedicated synthetic text-rendering data stream and curriculum improve accurate text generation.
- Dependencies/assumptions: Prompt engineering or UI affordances that expose typography control; reliance on the released VAE and tokenizer; multilingual font support in downstream tools.
Cost-efficient content pipelines for e-commerce and product marketing
- Sectors: e-commerce, retail, marketplaces
- What: Generate product hero shots, packaging concepts, and lifestyle scenes with coherent branding and copy.
- Why it’s enabled: High-quality generation with accurate text, progressive resolution training, and data curation for high-fidelity outputs.
- Dependencies/assumptions: Brand-specific fine-tuning or LoRA adapters (supported by open weights but requires additional training); internal QA to ensure brand safety.
Open, reproducible baseline for MoE diffusion research and teaching
- Sectors: academia, R&D labs
- What: Use the released weights, training code, and dataset to replicate results, run ablations, and teach large-scale vision model engineering (data filtering, MoE routing, Triton kernels, distributed training).
- Why it’s enabled: First fully open-source MoE diffusion model at this quality tier; detailed data schema and training curriculum.
- Dependencies/assumptions: Compute availability for re-training or fine-tuning; familiarity with PyTorch, Triton, and distributed training.
Dataset curation blueprint for web-scale image-text corpora
- Sectors: ML platforms, data engineering, regulated industries
- What: Adopt the Parquet-based metadata schema, multi-tier filtering (CPU/GPU), deduplication (hash + pHash), CLIP-based alignment routing, and multi-granularity captioning for internal dataset building.
- Why it’s enabled: Detailed, modular pipeline with deterministic manifests, provenance tracking, and curriculum-aware sampling.
- Dependencies/assumptions: Access to GPU inference for scoring/filters (DALI, NeMo Curator); organizational policies to handle licensing, safety, and provenance.
Performance-engineering components for MoE and diffusion systems
- Sectors: AI infrastructure, MLOps, cloud providers, framework devs
- What: Integrate fused Triton kernels (gated residual, LayerNorm-scale, fused sequences), grouped matrix multiplications, FlashAttention-3, and token-permutation kernels to speed up MoE and DiT workloads.
- Why it’s enabled: The paper provides specific kernels and patterns that reduce memory bandwidth and kernel launch overhead.
- Dependencies/assumptions: Triton/PyTorch compatibility; attention to numerical parity and stability in production.
Inference servers that share text KVs across diffusion timesteps
- Sectors: model serving, SaaS generators
- What: Build inference services that avoid recomputing cross-attention text KVs across timesteps to cut latency/cost.
- Why it’s enabled: Joint attention scheme with text KV sharing is designed into the architecture.
- Dependencies/assumptions: Custom serving logic to cache and reuse KVs; careful memory management; alignment with the chosen text encoder (Qwen3-VL-8B).
Responsible data governance workflows
- Sectors: policy, compliance, public sector, enterprise IT
- What: Adopt provenance-aware captions, safety classification filters, near-duplicate removal, and episodic curriculum tagging as auditable controls for dataset creation and updates.
- Why it’s enabled: Clear schema and invariants, immutable snapshots with per-run manifests, and explicit success/failure flags for observability.
- Dependencies/assumptions: Internal governance requirements; storage and catalog systems that can host immutable snapshots and logs.
Local creative tooling for prosumers and small businesses
- Sectors: SMBs, freelancers, content creators
- What: Run the open model on a single high-memory consumer GPU to produce marketing assets and product visuals without per-image API fees.
- Why it’s enabled: Reduced active parameter count compared to competing high-end models.
- Dependencies/assumptions: A 24–48 GB GPU may be required depending on batch size and precision; quantization or offloading may be needed for consumer hardware.

Long-Term Applications

These opportunities require further research, engineering, scaling, or cross-domain adaptation before broad deployment.

Real-time, on-device image generation on mobile/AR wearables
- Sectors: consumer devices, AR/VR, edge AI
- What: Interactive creative assistants generating imagery with legible text on-device.
- Why it’s plausible: Sparse MoE reduces active compute; fused kernels and KV sharing lower overhead.
- What’s needed: Quantization/distillation for NPUs and hybrid NPU+GPU devices, MoE-friendly runtimes on mobile, memory- and bandwidth-aware routing implementations.
Video generation with MoE diffusion transformers
- Sectors: film/TV, gaming, advertising
- What: Extend expert-choice routing and decoupled routing to video diffusion for high-quality, compute-efficient video synthesis.
- Why it’s plausible: The architectural stability and efficiency gains for images can transfer to temporal models.
- What’s needed: Temporal conditioning design, memory/KV reuse across frames, large curated video-text datasets, and expert parallelism tuned for sequence length.
Domain-specific generators for regulated fields (e.g., medical, scientific diagrams)
- Sectors: healthcare, scientific publishing, education
- What: Create accurate, domain-aligned visuals (e.g., anatomical diagrams, patient education leaflets with multilingual text).
- Why it’s plausible: Caption routing and multi-granularity supervision strengthen text-image grounding; improved text rendering.
- What’s needed: Curated, rights-cleared domain datasets; robust safety and factuality checks; alignment with domain experts; evaluation beyond aesthetics (e.g., diagnostic utility).
Large-scale synthetic data generation for perception models
- Sectors: robotics, autonomous systems, retail analytics
- What: Generate diverse, caption-rich images to augment training for detection and OCR-heavy tasks; leverage better typography for text-centric datasets.
- Why it’s plausible: Multi-granularity captions and synthetic text rendering yield controllable text and detailed descriptions.
- What’s needed: Consistent label generation (e.g., boxes/masks), bridging sim-to-real gap, task-specific conditioning and scoring.
Standardization of dataset governance schemas and audits
- Sectors: policy, standards bodies, enterprise governance
- What: Evolve the paper’s metadata schema (provenance, quality tiers, episodic buckets) into a sector-wide standard for dataset registries and audits.
- Why it’s plausible: The pipeline cleanly separates assets from metadata and supports immutable snapshots/manifests, aiding compliance and reproducibility.
- What’s needed: Cross-organization consensus, legal/ethical review, integration with content provenance (e.g., C2PA) and data licensing frameworks.
General-purpose MoE runtime libraries for heterogeneous accelerators
- Sectors: AI frameworks, cloud providers, chip vendors
- What: Production-grade libraries that implement expert-choice routing, grouped-mm, and all-to-all token routing across diverse hardware (GPUs, TPUs, custom NPUs).
- Why it’s plausible: The paper demonstrates throughput gains from specialized kernels and expert parallelism.
- What’s needed: Hardware-agnostic abstractions, compiler support for dynamic routing, scheduling that minimizes all-to-all overheads.
Interactive co-creation tools with controllable typography and layout
- Sectors: design, publishing, education technology
- What: Design assistants that let users specify text content, font families, layout grids, and multilingual constraints for precise outputs.
- Why it’s plausible: The training pipeline explicitly targets text-in-image fidelity; cross-attention design supports strong conditioning.
- What’s needed: UI/UX for fine-grained control, prompt-to-layout mappings, and fine-tuning for layout adherence.
Energy- and cost-aware AI content platforms
- Sectors: SaaS, cloud, sustainability initiatives
- What: Platforms that advertise lower carbon and cost footprints per image via sparse MoE and fused kernels, with dynamic provisioning based on capacity factors.
- Why it’s plausible: Lower active parameters and efficient kernels reduce energy per inference.
- What’s needed: Measurement frameworks for carbon-aware scheduling, capacity factor tuning at serve time, verifiable reporting.
Cross-lingual, culturally diverse visual assistants
- Sectors: global marketing, public sector communications, NGOs
- What: Generate culturally diverse, multilingual content for campaigns and public information.
- Why it’s plausible: Diverse training corpus and text rendering pipeline; curriculum emphasizes higher-quality, well-aligned captions in later stages.
- What’s needed: Bias auditing and mitigation, localized style controls, and governance to avoid harmful stereotypes.

Notes on Key Assumptions and Dependencies

Hardware/runtime: Although only ~2B parameters are active per forward pass, all 17B parameters must still be stored, so single-GPU inference may require substantial memory; in multi-GPU setups, MoE routing and expert parallelism benefit from high-bandwidth interconnects.
Software stack: Gains rely on Triton kernels, grouped-mm, FlashAttention-3, and custom routing; productionization requires robust kernels across drivers and framework versions.
Data/IP: Adoption of the dataset pipeline assumes availability of rights-cleared data, enforceable safety filters, and documented provenance; organizations must align with internal and regional data policies.
Model components: Reported quality depends on the specified VAE (Qwen-Image VAE) and text encoder (Qwen3-VL-8B-Instruct); substitutions may alter performance/latency.
Generalization: Applying the methods to other modalities (video, medical) requires new data curation, task-specific evaluation, and often additional safety alignment.
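
As a rough illustration of the hardware note above, the sketch below estimates weight storage from the parameter counts quoted in the abstract (17B total, about 2B active). The 2-bytes-per-parameter figure (bf16/fp16 storage) and the helper name `weight_memory_gib` are assumptions made for this example; activations, the text encoder, and the VAE are not counted.

```python
# Back-of-the-envelope weight memory for a sparse MoE model.
# Parameter counts come from the paper's abstract; the bytes-per-parameter
# value (2, i.e. bf16/fp16 storage) is an assumption for illustration.

def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Return the memory needed to hold `num_params` weights, in GiB."""
    return num_params * bytes_per_param / 2**30

TOTAL_PARAMS = 17e9   # total capacity across all experts (from the abstract)
ACTIVE_PARAMS = 2e9   # approximate active parameters per forward pass

resident = weight_memory_gib(TOTAL_PARAMS)   # what must be stored somewhere
active = weight_memory_gib(ACTIVE_PARAMS)    # active-parameter equivalent

print(f"all experts stored:          {resident:.1f} GiB")
print(f"active-parameter equivalent: {active:.1f} GiB")
```

Under these assumptions the active-parameter count governs compute per pass, while storage is set by the full expert pool unless experts are sharded across devices or offloaded.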

Glossary

All-to-all communication: A distributed communication pattern where every device exchanges data with all others, used here to move routed tokens between expert shards. Example: "Expert-parallel execution (described in Section~\ref{sec:parallelism}) distributes experts across devices, requiring all-to-all communication to route tokens to their assigned experts."
Aesthetic scoring: A model-based prediction of an image’s visual appeal used to rank and filter training data. Example: "Aesthetic scoring. For the real-image subset, a lightweight regression head on top of a frozen image encoder predicts a scalar aesthetic quality score, calibrated against human preference annotations in the spirit of prior work on aesthetic scoring for generative models \cite{taylor2026}."
Aspect-ratio bucketing: Grouping images by target crop shape so batches share spatial dimensions, improving training efficiency. Example: "Training follows a progressive resolution curriculum (256 to 512 to 1024) with multi-aspect-ratio bucketing at every stage"
Capacity factor: A control parameter for MoE layers that sets how many tokens each expert can process, affecting sparsity and load. Example: "The capacity factor controls how many tokens each expert may select and therefore governs the density of expert participation."
Caption granularity: The level of detail in captions (e.g., short, medium, detailed) used to vary supervision richness. Example: "The curated dataset exposes orthogonal axes of control over the training distribution: geometry (via aspect-ratio bucketing), quality (via static quality tiers and sampling weights), caption granularity (via provenance-based selection over short/medium/detailed captions), and supervision type (via a weighted task sampler across text-to-image and auxiliary tasks)."
Caption provenance: Metadata tags indicating the source or conditioning context of a caption to inform routing and sampling. Example: "Each granularity is produced by a dedicated captioning pass, and all variants are stored as separate entries in the captions list with corresponding caption_sources provenance tags."
CLIP-based image-text alignment score: A similarity score between image and text embeddings (from CLIP-like models) used to decide preserving, refining, or synthesizing captions. Example: "For images that arrive with existing web-scraped alt-text or metadata captions, an image-text alignment score $s$ (CLIP-based image-text alignment score) is computed and used to decide whether the original text is suitable for training."
Columnar representation: A table storage layout that stores data by columns to enable efficient filtering and analytics. Example: "We favor a columnar representation over a relational schema to maximize predicate pushdown and enable large-scale analytics."
Content-addressed key: An identifier derived from the content (e.g., a hash) used to store and reference media reliably. Example: "Each ingested image is assigned a stable content-addressed key and a corresponding metadata row that records its storage locator, source URL, original geometry, and a content hash."
Cross-attention: An attention mechanism where one sequence (e.g., image tokens) attends to another (e.g., text tokens) for conditioning. Example: "For text conditioning, we employ direct cross-attention from the text encoder's hidden states rather than processing text tokens through a separate stream."
Curriculum learning: A training strategy that schedules data from easier/broader to more difficult/high-fidelity over stages. Example: "We implement curriculum learning by assigning each example to one of $K=8$ episodic buckets derived from a composite curriculum score that combines image quality tier and resolution tier."
Decoupled routing: An MoE design where routing decisions use unmodulated features (plus timestep), while experts compute on fully modulated features. Example: "we introduce a decoupled routing design that separates timestep-aware expert assignment from timestep-conditioned expert computation."
Deduplication: Removing exact and near-duplicate images to increase data diversity and reduce overfitting. Example: "We apply a two-stage deduplication strategy."
Diffusion transformer: A transformer-based architecture that performs denoising steps in diffusion models for generation. Example: "Nucleus-Image is a diffusion transformer that employs a sparse mixture-of-experts (MoE) architecture to scale model capacity while maintaining computational efficiency."
EMA shadow copy: An Exponential Moving Average replica of model weights often used for stable evaluation or inference. Example: "Combined with a Warmup-Stable-Merge learning rate schedule, this eliminates the need for an EMA shadow copy of model weights."
Expert-choice routing: An MoE routing scheme where each expert selects its highest-scoring tokens, which keeps expert utilization balanced. Example: "Unlike conventional token-choice routing where each token selects its top-$k$ experts, we employ expert-choice routing~\cite{zhou2022expertchoice,sun2024ecdit} where each expert selects its top-$k$ tokens."
Expert parallelism: A distributed training strategy where different experts are sharded across devices. Example: "The MoE layers are parallelized across devices using expert parallelism, where each device holds a subset of experts."
Flash Attention 3: A highly optimized attention kernel/library for efficient and memory-friendly attention computation. Example: "and Flash Attention 3~\cite{flash3} for variable-length attention computation."
Frequency-domain metrics: Measures computed after transforming images to the frequency domain, used here for blur detection. Example: "The final tier runs GPU-accelerated filters for safety classification, watermark detection, blur detection via frequency-domain metrics, and embedding-based quality scoring."
Grouped Matrix Multiplication: A single-kernel approach to compute multiple independent GEMMs of varying sizes efficiently. Example: "We instead leverage grouped matrix multiplication, which performs multiple matrix multiplications of potentially different sizes in a single kernel launch."
Grouped Query Attention (GQA): An attention variant that groups many query heads with fewer shared key-value heads to reduce memory. Example: "We adopt Grouped Query Attention (GQA)~\cite{ainslie2023gqa} with a 4:1 ratio, using 16 query heads and 4 key-value heads."
Key-value cache: Stored key/value tensors reused across steps to speed up attention during inference. Example: "This reduces the key-value cache by 4$\times$ during inference with negligible impact on generation quality."
Liger Kernel: A library providing fused GPU kernels for common neural ops to improve performance. Example: "We additionally leverage the Liger Kernel library~\cite{hsu2024liger} for fused SwiGLU activations and RMSNorm operations"
Load-balancing losses: Auxiliary training losses that encourage uniform expert usage in MoE models. Example: "and uses expert-choice routing to keep expert utilization balanced without large auxiliary load-balancing losses."
Logit-normal timestep sampling distribution: A distribution used to sample diffusion timesteps, concentrating around certain noise levels. Example: "The design is particularly important given our logit-normal timestep sampling distribution, which concentrates samples near intermediate noise levels and would otherwise leave extreme timesteps with degraded routing diversity."
Mixture-of-Experts (MoE): An architecture where multiple expert networks process subsets of tokens, improving capacity at fixed compute. Example: "We replace the dense feed-forward network with a sparse mixture-of-experts layer in 29 of the 32 transformer blocks, following recent work on scaling diffusion transformers with MoE~\cite{fei2024ditmoe}."
Muon optimizer: An optimizer that orthogonalizes momentum-based updates for matrix-shaped parameters; the paper pairs it with a parameter grouping recipe tailored to the task. Example: "We adopt the Muon optimizer and share our parameter grouping recipe tailored for diffusion models with timestep modulation."
Multi-dimensional rotary position embedding (mRoPE): A positional encoding that extends rotary embeddings over multiple axes (e.g., time, height, width). Example: "we follow the multi-dimensional rotary position embedding (mRoPE) formulation from Qwen-VL~\cite{qwen-image}, which encodes separate frequency components for temporal, height, and width axes."
NVIDIA DALI: A GPU-accelerated data loading and preprocessing library. Example: "We orchestrate the GPU-resident stages of this pipeline using NVIDIA DALI \cite{dali} for accelerated image decode and batch transforms"
NVIDIA NeMo Curator: A toolkit for scalable, GPU-accelerated data curation and inference pipelines. Example: "and leverage NVIDIA NeMo Curator~\cite{nemocurator} for scalable GPU-accelerated inference across the heavy scoring passes like CLIP-based embedding, safety classification, and aesthetic quality heads."
Pareto frontier: The set of non-dominated trade-offs between two metrics, here quality and efficiency. Example: "We present Nucleus-Image, a text-to-image generation model that establishes a new Pareto frontier in quality-versus-efficiency"
Parquet: A columnar data storage format used for efficient analytics and filtering. Example: "The training dataloader consumes a flat Parquet metadata table with a consistent row schema (Table~\ref{tab:dataset-core-schema})."
Perceptual hashing (pHash): A hashing technique capturing image perceptual features to detect near-duplicates. Example: "Second, near-duplicates are identified using perceptual hashing (pHash) with a conservative Hamming radius; within each cluster, a single representative is retained based on resolution and quality score."
Predicate pushdown: Applying filters at the storage/scan layer to reduce data read and improve performance. Example: "We favor a columnar representation over a relational schema to maximize predicate pushdown and enable large-scale analytics."
Progressive resolution curriculum: A training schedule that increases image resolution over stages to stabilize and improve learning. Example: "Training follows a progressive resolution curriculum (256 $\rightarrow$ 512 $\rightarrow$ 1024; see Section~\ref{sec:training-stages})"
Progressive sparsification: Gradually lowering the expert capacity factor over training, so that expert participation becomes sparser, to manage compute and stability. Example: "coupled with progressive sparsification of the expert capacity factor."
QK-Norm: Applying normalization to query and key projections before attention score computation. Example: "Following recent work on attention stability~\cite{lumina-image-2}, we apply RMSNorm to queries and keys prior to computing attention scores (QK-Norm)."
Rectified flow supervision: A training approach for diffusion models using flow-matching/rectified flows as the supervision signal. Example: "The training stack pairs a large curated dataset with rectified flow supervision and a multi-stage curriculum that emphasizes higher-quality data later in training."
RMSNorm: Root Mean Square Layer Normalization, a normalization method that scales activations by their RMS. Example: "Following recent work on attention stability~\cite{lumina-image-2}, we apply RMSNorm to queries and keys prior to computing attention scores (QK-Norm)."
Shared expert: An expert applied to all tokens in an MoE layer, ensuring baseline computation and stability. Example: "Each MoE layer includes a shared expert that processes all tokens unconditionally."
SwiGLU: A gated activation function combining SiLU and linear transforms, often used in modern FFNs. Example: "Each MoE layer consists of 64 routed experts and one shared expert, where each expert is a SwiGLU~\cite{shazeer2020glu} network with hidden dimension 1344."
Tanh gating: Bounding residual gates with tanh to prevent exploding residuals. Example: "Tanh Gating. Residual connections use tanh-bounded gates: $\mathbf{x} \leftarrow \mathbf{x} + \tanh(\mathbf{g}) \odot \mathbf{r}$ ."
Text KV sharing: Reusing text key/value tensors across timesteps to reduce compute and memory during diffusion. Example: "using joint attention that enables text KV sharing across timesteps."
Token budget: A constraint on the number of tokens (e.g., from image crops) used for batching and model limits. Example: "At dataset initialization, the loader enumerates valid crop sizes for each base resolution under (i) a token budget constraint, (ii) an aspect-ratio ceiling, and (iii) VAE compatibility constraints (e.g., divisibility by the patchification stride)."
Token-choice routing: An MoE routing scheme where each token selects its top experts (contrasted with expert-choice routing). Example: "Unlike conventional token-choice routing where each token selects its top-$k$ experts, we employ expert-choice routing where each expert selects its top-$k$ tokens."
Triton kernels: Custom GPU kernels written in Triton to fuse operations and reduce memory traffic. Example: "We implement custom Triton kernels for several operations that are memory-bandwidth bound in the standard PyTorch implementation."
Variational Autoencoder (VAE): A generative model used here as an image tokenizer to map pixels to latents and back. Example: "The Variational Autoencoder (VAE) serves as the image tokenizer, compressing input images into compact latent representations for diffusion training and decoding them back to pixel space during inference."
Vision-language models (VLMs): Models that process images and text jointly, used for high-precision captioning domains. Example: "Higher-precision VLMs are reserved for domains where hallucination risk is elevated (diagrams, scientific figures, OCR-heavy content, and images containing embedded text)."
Warmup-Stable-Merge learning rate schedule: A learning rate policy that combines warmup and stabilization phases with a merge step. Example: "Combined with a Warmup-Stable-Merge learning rate schedule, this eliminates the need for an EMA shadow copy of model weights."
Watermark detection: Automated identification of visible watermarks to filter low-quality or restricted images. Example: "The final tier runs GPU-accelerated filters for safety classification, watermark detection, blur detection via frequency-domain metrics, and embedding-based quality scoring."
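
To make the expert-choice routing and capacity factor entries above concrete, here is a minimal pure-Python sketch. It assumes the common formulation in which each expert keeps its top k tokens with k = capacity_factor × num_tokens / num_experts, and a softmax over experts as the affinity score. The function names, the toy sizes, and the random logits are hypothetical; this is not the paper's implementation, whose exact scoring function is not described in this section.

```python
import math
import random

def softmax(row):
    """Numerically stable softmax over one token's expert logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def expert_choice_route(logits, capacity_factor):
    """Let each expert pick its top-k tokens.

    logits[t][e] is the router score of token t for expert e.
    Returns {expert_index: [(token_index, gate_weight), ...]}.
    """
    num_tokens = len(logits)
    num_experts = len(logits[0])
    # Assumed capacity rule: k tokens per expert (at least one).
    k = max(1, int(capacity_factor * num_tokens / num_experts))
    probs = [softmax(row) for row in logits]  # token-to-expert affinities
    assignment = {}
    for e in range(num_experts):
        # Rank tokens by their affinity to this expert, highest first.
        ranked = sorted(range(num_tokens), key=lambda t: probs[t][e], reverse=True)
        assignment[e] = [(t, probs[t][e]) for t in ranked[:k]]
    return assignment

random.seed(0)
NUM_TOKENS, NUM_EXPERTS = 16, 4  # toy sizes, not the paper's configuration
logits = [[random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
          for _ in range(NUM_TOKENS)]

routes = expert_choice_route(logits, capacity_factor=2.0)
for expert, picks in routes.items():
    print(expert, [token for token, _ in picks])
```

With these toy sizes every expert receives exactly 8 tokens at a capacity factor of 2.0 and 4 tokens at 1.0, so load is balanced by construction. A given token may be picked by several experts or by none, which is one reason a shared expert that processes all tokens is useful.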
