
DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers (2503.14487v1)

Published 18 Mar 2025 in cs.CV and cs.AI

Abstract: Diffusion models have demonstrated remarkable success in various image generation tasks, but their performance is often limited by the uniform processing of inputs across varying conditions and noise levels. To address this limitation, we propose a novel approach that leverages the inherent heterogeneity of the diffusion process. Our method, DiffMoE, introduces a batch-level global token pool that enables experts to access global token distributions during training, promoting specialized expert behavior. To unleash the full potential of the diffusion process, DiffMoE incorporates a capacity predictor that dynamically allocates computational resources based on noise levels and sample complexity. Through comprehensive evaluation, DiffMoE achieves state-of-the-art performance among diffusion models on ImageNet benchmark, substantially outperforming both dense architectures with 3x activated parameters and existing MoE approaches while maintaining 1x activated parameters. The effectiveness of our approach extends beyond class-conditional generation to more challenging tasks such as text-to-image generation, demonstrating its broad applicability across different diffusion model applications. Project Page: https://shiml20.github.io/DiffMoE/


Summary

  • The paper introduces DiffMoE, a Diffusion Transformer architecture that uses dynamic token selection and a Mixture-of-Experts model to adaptively allocate computation.
  • DiffMoE achieves state-of-the-art results on ImageNet image generation, outperforming much larger standard models with similar or lower average computational costs.
  • This dynamic approach offers a path to building highly scalable and efficient diffusion models by focusing computation where it is most needed during generation.

Here's a summary of the paper "DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers" (2503.14487):

Rationale and Problem Solved

  • Problem: Standard diffusion models apply the same amount of computation to every input, regardless of noise level or image content. This is inefficient, since some denoising steps and image regions are much harder than others and would benefit from more compute. Prior Mixture-of-Experts (MoE) approaches for diffusion models only partially address this: they do not fully exploit the heterogeneity of the diffusion process, which limits expert specialization and the flexibility of compute allocation.
  • Goal: Improve the efficiency and performance of Diffusion Transformers (DiTs) by making them dynamically allocate computational resources based on the complexity of the task at hand (e.g., noise level, image content).

Data Used

  • ImageNet: Primarily used for class-conditional image generation experiments (specifically 256x256 resolution).
  • Text-to-Image Datasets: Used to demonstrate the approach's effectiveness on text-to-image generation. The abstract does not name specific datasets; the full paper's experiments evaluate text-to-image performance on benchmarks such as GenEval.

Model Architecture

  • Backbone: Uses a Diffusion Transformer (DiT) as the base architecture.
  • Mixture-of-Experts (MoE): Replaces standard feed-forward network blocks in the Transformer with MoE layers. Each MoE layer has multiple "expert" networks.
  • Batch-level Global Token Pool (Training): During training, tokens (image patches) from all images in a batch are pooled together. Experts select tokens from this global pool, allowing them to specialize better by seeing a wider variety of inputs (different images, noise levels, conditions).
  • Capacity Predictor (Inference): A small neural network trained alongside the main model. During inference (image generation), it predicts which tokens are "harder" and need processing by more experts. This allows the model to dynamically decide how much computation to use for each token based on its relevance or difficulty.
  • Dynamic Threshold: An adaptive mechanism used during inference to control overall computational cost, keeping it comparable to a standard dense model on average while still allowing per-token flexibility. A minimal code sketch of how these components could fit together follows this list.
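The sketch below illustrates, in PyTorch, how the pieces described above could be wired together: an MoE feed-forward layer that routes over a batch-level global token pool during training, plus a small capacity predictor whose scores are compared against a threshold at inference. The class names (GlobalPoolMoE, ExpertFFN), the top-k routing scheme, the quantile-based threshold calibration, and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class ExpertFFN(nn.Module):
    """One expert: a standard Transformer feed-forward block."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x):
        return self.net(x)


class GlobalPoolMoE(nn.Module):
    """MoE feed-forward layer routing over a batch-level global token pool.

    Training: all tokens in the batch are flattened into one pool and each
    token is dispatched to its top-k experts, so experts see tokens from
    every image, noise level, and condition in the batch.
    Inference: a small capacity predictor scores each token; only tokens
    whose score exceeds the threshold receive expert computation, the rest
    pass through unchanged (a simplified stand-in for the paper's dynamic
    compute allocation).
    """

    def __init__(self, dim, hidden_dim, num_experts=8, top_k=1):
        super().__init__()
        self.experts = nn.ModuleList(
            [ExpertFFN(dim, hidden_dim) for _ in range(num_experts)]
        )
        self.router = nn.Linear(dim, num_experts)    # token -> expert logits
        self.capacity_predictor = nn.Linear(dim, 1)  # token -> difficulty score
        self.top_k = top_k

    def _route(self, tokens, probs):
        # Send each token to its top-k experts and mix outputs by routing weight.
        weights, expert_idx = probs.topk(self.top_k, dim=-1)
        out = torch.zeros_like(tokens)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = expert_idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k : k + 1] * expert(tokens[mask])
        return out

    def forward(self, x, threshold=None):
        # x: (batch, seq_len, dim) -> flatten into a global token pool (N, dim).
        B, T, D = x.shape
        tokens = x.reshape(B * T, D)
        probs = self.router(tokens).softmax(dim=-1)

        if self.training or threshold is None:
            # Training-style routing over the batch-level global pool.
            out = self._route(tokens, probs)
        else:
            # Inference: spend expert compute only on "hard" tokens.
            difficulty = self.capacity_predictor(tokens).sigmoid().squeeze(-1)
            selected = difficulty > threshold
            out = tokens.clone()
            if selected.any():
                out[selected] = self._route(tokens[selected], probs[selected])
        return out.reshape(B, T, D)


def calibrate_threshold(difficulty_scores, target_ratio):
    """Pick a threshold so that roughly `target_ratio` of tokens are selected,
    i.e. the average activated compute stays near a chosen budget (one
    possible way to realize a dynamic threshold)."""
    return torch.quantile(difficulty_scores, 1.0 - target_ratio)
```

As a hypothetical usage, one would build the layer (e.g. GlobalPoolMoE(dim=768, hidden_dim=3072)), call it on a (batch, tokens, dim) tensor during training, and at inference call layer.eval() and pass a threshold, for example one obtained from calibrate_threshold on a batch of difficulty scores, so that the average activated compute stays comparable to a dense model.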

Performance on Benchmarks

  • ImageNet Generation: DiffMoE achieves state-of-the-art results among diffusion models on the ImageNet 256x256 benchmark.
    • It significantly outperforms standard (dense) DiT models, even those using 3 times more activated parameters, while DiffMoE uses only 1x activated parameters on average.
    • It also surpasses previous MoE approaches for diffusion models.
  • Text-to-Image Generation: DiffMoE shows improved performance compared to a similarly sized dense DiT model on text-to-image benchmarks like GenEval.
  • Efficiency: Achieves better results with comparable or lower average computational cost compared to large dense models.

Implications and Possible Applications

  • More Efficient High-Resolution Image Generation: Enables generating high-quality images more efficiently by focusing computation where it's most needed.
  • Improved Text-to-Image Models: Can enhance the quality and efficiency of models that generate images from text descriptions.
  • Scalability: Offers a path to building larger, more capable diffusion models without a proportional increase in computational cost during inference.
  • Adaptive Computation: Demonstrates a general principle for making large generative models more adaptive and efficient by dynamically allocating resources based on input complexity.

In conclusion, DiffMoE introduces a more efficient and powerful way to build diffusion models by incorporating a specialized MoE approach with dynamic computation allocation, leading to state-of-the-art performance in image generation tasks while managing computational resources effectively.
