Lumina-DiMOO: Unified Multi-modal Diffusion
- Lumina-DiMOO is an open-source unified multi-modal model that uses a fully discrete diffusion approach to synthesize and understand text, image, and mixed inputs.
- It employs a joint tokenization scheme with parallel token refinement and advanced strategies like ML-Cache and Region-Adaptive Sampling to optimize speed and quality.
- Benchmark results show it outperforms autoregressive baselines, offering a robust and extensible platform for next-generation multi-modal diffusion research.
Lumina-DiMOO is an open-source unified foundation model designed for comprehensive multi-modal generation and understanding, introducing a fully discrete diffusion modeling paradigm that enables accelerated, high-fidelity synthesis and reasoning over text, image, and mixed inputs. Distinct from autoregressive (AR) or hybrid diffusion-AR models, Lumina-DiMOO utilizes a joint tokenization scheme for all supported modalities, efficiently orchestrates parallel token refinement, and applies advanced caching strategies and region-adaptive sampling to optimize both throughput and output quality. It consistently surpasses previous open-source multi-modal models, achieving state-of-the-art results across standard generation and understanding benchmarks. Released with code and model checkpoints, Lumina-DiMOO provides a foundation for advances in discrete diffusion model research and versatile multi-modal applications (Xin et al., 7 Oct 2025).
1. Discrete Diffusion Architecture and Tokenization
Lumina-DiMOO adopts a fully discrete diffusion modeling approach in which all inputs—text, images, and modality-specific boundary and layout tokens—are mapped to a joint vocabulary via modality-aware tokenizers. Images are spatially downsampled (e.g., by a factor of 16×16, so each 16×16 pixel patch becomes one discrete code) using a visual tokenizer such as aMUSEd-VQ, while text is encoded using a language tokenizer inherited from LLaDA. Control and structure-preserving special tokens (<IMAGE>, </IMAGE>, <end-of-line>, etc.) mark modality boundaries and facilitate arbitrary-resolution synthesis.
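To make the joint vocabulary concrete, the sketch below packs a prompt and a grid of VQ image codes into one flat sequence; all token IDs, vocabulary offsets, and the `pack_sequence` helper are illustrative assumptions, not the released tokenizer interface.

```python
# Illustrative sketch of the joint vocabulary: text tokens, shifted VQ image codes,
# and special boundary/layout tokens in one flat sequence. All IDs are hypothetical.

TEXT_VOCAB_SIZE = 32_000            # assumed language-tokenizer vocabulary size
IMG_OFFSET = TEXT_VOCAB_SIZE        # image codes shifted into a disjoint ID range
SPECIAL = {"<IMAGE>": 40_000, "</IMAGE>": 40_001, "<end-of-line>": 40_002}

def pack_sequence(text_ids, image_codes, tokens_per_row):
    """Interleave text tokens with a 2-D grid of VQ image codes; <end-of-line>
    markers after each row preserve spatial layout, enabling arbitrary resolutions."""
    seq = list(text_ids) + [SPECIAL["<IMAGE>"]]
    for i, code in enumerate(image_codes):
        seq.append(IMG_OFFSET + code)
        if (i + 1) % tokens_per_row == 0:
            seq.append(SPECIAL["<end-of-line>"])
    seq.append(SPECIAL["</IMAGE>"])
    return seq

# Example: a 3-token prompt followed by a 4x4 grid of image codes.
packed = pack_sequence([17, 923, 5], list(range(16)), tokens_per_row=4)
```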
Generation and understanding tasks are unified through masked token prediction: the input sequence $x_0$ is partially masked (with mask set $\mathcal{M} = \{\, i : x_t^{\,i} = [\mathrm{MASK}] \,\}$), and the transformer-based dLLM is trained to reconstruct the missing content. The parallel prediction process is formalized as

$$p_\theta\big(x_0 \mid x_t, c\big) = \prod_{i \in \mathcal{M}} p_\theta\big(x_0^{\,i} \mid x_t, c\big),$$

and the probability distribution over masked tokens is

$$p_\theta\big(x_0^{\,i} \mid x_t, c\big) = \operatorname{softmax}\!\big(f_\theta(x_t, c)_i\big),$$

with objective

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{t,\,x_0,\,x_t}\!\left[\frac{1}{t}\sum_{i \in \mathcal{M}} \log p_\theta\big(x_0^{\,i}\mid x_t, c\big)\right],$$

where $c$ is optional conditioning (e.g., a prompt) and $f_\theta$ denotes the transformer logits.
Compared to AR approaches, this discrete diffusion formulation enables batched, parallel token refinement, dramatically accelerating inference.
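As a rough illustration of the objective above, the PyTorch sketch below implements a generic masked discrete-diffusion loss; the `model` interface, the `MASK_ID` constant, and the exact weighting are assumptions rather than the released training code.

```python
import torch
import torch.nn.functional as F

MASK_ID = 40_003  # hypothetical mask-token ID in the joint vocabulary

def masked_diffusion_loss(model, x0, cond=None):
    """Generic masked discrete-diffusion loss: sample a mask ratio t per example,
    mask tokens independently, and score the parallel reconstruction of the
    masked positions with the 1/t weighting from the objective above."""
    B, L = x0.shape
    t = torch.rand(B, 1, device=x0.device).clamp(min=1e-3)   # per-example mask ratio
    mask = torch.rand(B, L, device=x0.device) < t             # mask set M
    xt = torch.where(mask, torch.full_like(x0, MASK_ID), x0)

    logits = model(xt, cond)                                   # (B, L, V), predicted in parallel
    nll = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")  # (B, L)
    return (nll * mask / t).sum() / mask.sum().clamp(min=1)
```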
2. Sampling Efficiency and ML-Cache Strategy
Lumina-DiMOO achieves up to a 32-fold acceleration in text-to-image generation over AR baselines through block-wise, semi-autoregressive parallel prediction. To further reduce redundant computation, the ML-Cache strategy monitors maximal logits per token across refinement steps: tokens with consistently high logit values are marked as stable and their representations are cached. At each step, these cached tokens bypass recomputation, regulated by a global cache_ratio and periodic warmup/refresh intervals to avoid drift. This cache-based acceleration can double generation speed with negligible quality degradation.
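A simplified sketch of this kind of logit-based caching decision is shown below; the function name, the `cache_ratio`/warmup/refresh parameters, and the top-k selection rule are assumptions used for illustration, not the exact ML-Cache implementation.

```python
import torch

def select_cached_tokens(max_logits, step, cache_ratio=0.5, warmup=4, refresh_every=8):
    """Simplified ML-Cache-style selection: after a warmup phase, mark the tokens with
    the highest maximal logits as stable so their cached states can be reused, and
    periodically refresh (recompute everything) to avoid drift.

    max_logits: (L,) maximal logit per position at the current refinement step.
    Returns a boolean mask over positions whose cached representations may be reused.
    """
    L = max_logits.numel()
    if step < warmup or step % refresh_every == 0:
        return torch.zeros(L, dtype=torch.bool)      # recompute every position
    k = int(cache_ratio * L)                          # global budget of cached positions
    reuse = torch.zeros(L, dtype=torch.bool)
    reuse[torch.topk(max_logits, k).indices] = True
    return reuse
```

In a decoding loop, positions flagged for reuse would skip recomputation and keep their cached states while the rest are refreshed; raising `cache_ratio` trades a small amount of quality for additional speed.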
The model also incorporates adaptive masking schedules for iterative refinement—masking rates and cache operations are dynamically chosen based on current sequence statistics, maximizing both global semantic fidelity and local feature detail.
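In the same spirit, a minimal confidence-based unmasking schedule might look like the following; the cosine shape and the helper names are assumptions meant to illustrate how the number of tokens committed per step can adapt over the refinement trajectory.

```python
import math
import torch

def tokens_to_commit(step, total_steps, num_masked):
    """Cosine schedule: commit few tokens early (global structure is still uncertain)
    and progressively more per step as local details stabilize."""
    keep_frac = math.cos(0.5 * math.pi * (step + 1) / total_steps)
    return max(1, num_masked - int(keep_frac * num_masked))

def most_confident_positions(logits, is_masked, n_commit):
    """Rank the still-masked positions by prediction confidence and return the
    n_commit most confident ones together with their predicted tokens."""
    conf, pred = logits.softmax(-1).max(-1)                 # (L,), (L,)
    conf = conf.masked_fill(~is_masked, float("-inf"))      # ignore committed positions
    idx = torch.topk(conf, n_commit).indices
    return idx, pred[idx]
```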
3. Region-Adaptive Sampling in Diffusion Transformers
For image synthesis and editing, Lumina-DiMOO leverages Region-Adaptive Sampling (RAS), exploiting the flexibility of diffusion transformers (DiTs) to differentially allocate computational effort. At each sampling step, a noise-derived metric assesses which image regions (tokens/patches) require refinement:
$$s_i = \operatorname{std}\!\big(\epsilon_\theta^{(i)}\big)\cdot \beta^{\,n_i},$$

where $\epsilon_\theta^{(i)}$ is the predicted noise for region $i$, $s_i$ signals refinement necessity, $n_i$ counts previous drop events, and $\beta$ is a tunable factor.
Regions with low noise variance, typically corresponding to foreground subjects, receive full updates; background or low-interest regions reuse cached noise from previous steps. Due to strong temporal consistency in DiT focus, redundant computation is efficiently suppressed without loss of structural coherence. Quantitative benchmarks (e.g., FID, sFID, CLIPScore) and user studies confirm that RAS achieves up to 2.51× acceleration on Lumina models with minimal perceptual impact (Liu et al., 14 Feb 2025).
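A simplified per-step sketch of this region-adaptive update is given below; the per-region statistic, the `update_frac` and `beta` parameters, and the top-k selection are assumptions reconstructed from the description above, not the exact RAS algorithm.

```python
import torch

def ras_step(model, x_t, t, cached_noise, drop_counts, update_frac=0.5, beta=1.5):
    """One simplified region-adaptive step: rank regions by a noise-derived score,
    re-run the model only on the highest-ranked regions, and reuse cached noise for
    the rest. The drop-count boost keeps skipped regions from being starved.

    x_t:          (R, D) latent tokens grouped into R regions
    cached_noise: (R, D) noise predictions carried over from the previous step
    drop_counts:  (R,)   consecutive steps each region has been skipped
    """
    # Per-region score; the choice of statistic (std vs. norm) and its direction
    # are simplifications of the metric described above.
    score = cached_noise.std(dim=-1) * beta ** drop_counts
    k = max(1, int(update_frac * x_t.shape[0]))
    active = torch.topk(score, k).indices

    noise = cached_noise.clone()
    noise[active] = model(x_t[active], t)        # forward pass restricted to active regions
    drop_counts = drop_counts + 1
    drop_counts[active] = 0
    return noise, drop_counts
```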
4. Ratio-Aware Adaptive Guidance (RAAG)
In text-to-image regimes, classifier-free guidance (CFG) typically applies a fixed guidance scale $w$ across reverse diffusion steps:

$$\tilde{v}_t = v_\theta(x_t) + w\,\big(v_\theta(x_t, c) - v_\theta(x_t)\big),$$

where $v_\theta(x_t)$ and $v_\theta(x_t, c)$ are the unconditional and conditional velocity predictions.

RAAG identifies instability from RATIO spikes in early steps:

$$\mathrm{RATIO}(t) = \frac{\big\|v_\theta(x_t, c) - v_\theta(x_t)\big\|}{\big\|v_\theta(x_t)\big\| + \epsilon},$$

and introduces an adaptive exponential decay for guidance:

$$w(t) = w_0\,\exp\!\big(-\lambda\,\mathrm{RATIO}(t)\big).$$

Dynamically decreasing $w(t)$ when $\mathrm{RATIO}(t)$ is high prevents semantic drift and error amplification in fast/low-step sampling. In Lumina-DiMOO, this yields up to 4× speed improvements and enhanced alignment and robustness, with extensive ablations confirming stability and generalization across scheduler types (Zhu et al., 5 Aug 2025).
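The following sketch shows one way such a ratio-aware scale could be computed inside a sampler; the exact definition of the ratio and the exponential-decay form are assumptions based on the description above.

```python
import torch

def raag_guidance(v_cond, v_uncond, w0=5.0, lam=1.0, eps=1e-6):
    """Ratio-aware adaptive guidance: measure how strongly conditioning perturbs the
    velocity prediction at this step, and exponentially decay the guidance scale
    when that ratio spikes (typically in the earliest, noisiest steps)."""
    ratio = (v_cond - v_uncond).norm() / (v_uncond.norm() + eps)
    w = w0 * torch.exp(-lam * ratio)                 # adaptive scale instead of a fixed w0
    return v_uncond + w * (v_cond - v_uncond), w, ratio
```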
5. Supported Multi-Modal Tasks and Benchmark Results
Lumina-DiMOO natively supports:
- Text-to-Image generation: Arbitrary resolution synthesis, object binding, color fidelity, spatial layout preservation (via <end-of-line>).
- Image-to-Image tasks: Editing, subject-driven generation, style transfer, multi-view synthesis, inpainting, extrapolation.
- Multi-modal understanding: Captioning, visual question answering, and reasoning tasks via block-wise semi-autoregressive inference (sketched after this list).
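As a rough sketch of block-wise semi-autoregressive decoding for understanding tasks, the loop below commits a growing fraction of the most confident tokens within each block before moving to the next; the block size, step count, and `mask_id` are illustrative assumptions.

```python
import torch

def blockwise_generate(model, prompt_ids, num_blocks=4, block_len=32,
                       steps_per_block=8, mask_id=40_003):
    """Block-wise semi-autoregressive decoding sketch: the answer is produced block by
    block; within a block, tokens start fully masked and are refined in parallel over a
    few steps, conditioned on the prompt and every previously committed block."""
    seq = torch.tensor(prompt_ids)
    for _ in range(num_blocks):
        block = torch.full((block_len,), mask_id)
        for step in range(steps_per_block):
            x = torch.cat([seq, block]).unsqueeze(0)         # (1, L)
            logits = model(x)[0, -block_len:]                 # (block_len, V)
            conf, pred = logits.softmax(-1).max(-1)
            still_masked = block == mask_id
            conf = conf.masked_fill(~still_masked, float("-inf"))
            # Commit a growing fraction of the most confident masked positions.
            target = (block_len * (step + 1)) // steps_per_block
            n_commit = min(max(1, target - int((~still_masked).sum())),
                           int(still_masked.sum()))
            idx = torch.topk(conf, n_commit).indices
            block[idx] = pred[idx]
        seq = torch.cat([seq, block])
    return seq.tolist()
```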
On standard benchmarks (GenEval, DPG, UniGenBench, OneIG-EN), Lumina-DiMOO achieves leading scores in object binding, color fidelity, and prompt alignment. A reinforcement learning stage (Self-GRPO) applies reward-weighted log-likelihood training, further improving both generation quality and understanding alignment. For example, the GenEval overall score reaches around 88% (Xin et al., 7 Oct 2025).
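The reward-weighted training idea can be illustrated with a generic group-relative advantage weighting, shown below; this is a minimal sketch of a GRPO-style objective, not the paper's exact Self-GRPO recipe, and the reward source is assumed.

```python
import torch

def group_relative_weighted_loss(log_probs, rewards):
    """Generic group-relative reward weighting: each candidate's masked-token
    log-likelihood is weighted by its reward advantage relative to the other
    candidates sampled for the same prompt."""
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    return -(advantages.detach() * log_probs).mean()

# Example: four candidates for one prompt, scored by an external reward model.
log_probs = torch.randn(4, requires_grad=True)     # per-candidate log-likelihoods
rewards = torch.tensor([0.2, 0.9, 0.4, 0.7])       # hypothetical scalar rewards
loss = group_relative_weighted_loss(log_probs, rewards)
loss.backward()
```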
6. Open-Source Release and Research Community Impact
The full codebase and model checkpoints are openly released, enabling reproducibility, extension to additional modalities (e.g., video, audio), and integration with domain-specific data. This facilitates independent validation, fosters collaborative research in discrete diffusion modeling, and supports development of versatile systems for artificial general intelligence (AGI) tasks, merging multi-modal reasoning and synthesis in a robust, scalable manner.
7. Context Within Discrete Diffusion and Multi-Modal Modeling
Lumina-DiMOO’s innovations in discrete tokenized diffusion, sampling optimization (RAS, ML-Cache, RAAG), and architecture generalize beyond text and image modalities. The masked modeling and joint token vocabulary approach provide a scalable foundation for multi-modal models, and its empirical benchmark results set a precedent for unified multi-modal frameworks in academic and industrial contexts. The open-source release addresses reproducibility and extensibility—key demands for the advancement of multi-modal generative AI.
A plausible implication is that further research on discrete diffusion, region-adaptive sampling, and dynamic guidance will lead to broader applicability and improved performance in next-generation multi-modal semantic systems.