
FUJI-LDDMs: Joint Latent Diffusion Models

Updated 27 October 2025
  • FUJI-LDDMs are fully joint latent discrete diffusion models that integrate masked token updates with a continuous latent channel to capture global context.
  • The model architecture interleaves masked discrete diffusion with Gaussian-scheduled latent diffusion, enabling robust joint denoising and efficient context propagation.
  • Empirical results demonstrate that FUJI-LDDMs achieve lower perplexity and enhanced fluency, leading to improved performance in language and reasoning tasks.

FUJI-LDDMs (Fully Joint Latent Discrete Diffusion Models) are a class of discrete diffusion models that couple masked discrete diffusion for language or categorical data with a continuous latent diffusion channel. Developed to address fundamental limitations in masked denoising diffusion models—specifically, the tendency for token-level updates to factorize independently across sequence positions—FUJI-LDDMs introduce a latent channel to propagate joint contextual information, thereby improving fidelity and sample efficiency, especially in parallel or few-step generation settings (Shariatian et al., 20 Oct 2025, Jo et al., 22 Oct 2025).

1. Motivation and Background

Masked discrete diffusion models (MDLMs), as applied to language and categorical data, rely on a sequence of noising and denoising operations wherein tokens are incrementally masked and then reconstructed. In these models, reverse (denoising) transitions are typically performed independently at each sequence position, which means that complex dependencies among tokens (such as global syntax or semantic structure) are not well captured, particularly when many tokens are unmasked per step. This results in a lack of coherence and degraded sample quality, especially in non-autoregressive, parallel generation.

FUJI-LDDMs are formulated to address these shortcomings by introducing a continuous latent channel that is diffused in tandem with the masked discrete variables. The latent embeddings carry cross-token dependencies, disambiguate generative ambiguities, and allow the model to maintain and propagate more nuanced context information across denoising steps.

2. Generative Mechanism and Model Architecture

In FUJI-LDDMs, the state at each diffusion timestep is a pair $Z_t = (X_t, Y_t)$, where $X_t$ is the (potentially masked) discrete sequence and $Y_t$ is a vector of continuous latent embeddings. The generative process consists of two interleaved Markov chains:

  • Forward (noising) process: Both the discrete tokens and the latent are progressively corrupted.
    • The data channel ($X_t$) follows a masked diffusion trajectory.
    • The latent channel ($Y_t$) is diffused according to a Gaussian schedule.
  • Reverse (denoising) process: At each reverse timestep, the joint model (typically a Transformer with multi-modal projections and shared attention) predicts both:
    • The distribution over the unmasked tokens in $X_{t-1}$, conditioned on the current tokens and latent $Z_t$.
    • The denoised latent $Y_{t-1}$, again jointly conditioned.

Critically, the reverse transitions factorize across positions in each channel but are parameterized jointly: the model updates all tokens and the latent embedding in a fully coupled manner, sharing contextual information through attention layers. This is in contrast to alternative architectures where discrete and continuous channels are resolved sequentially or in isolation.
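
To make the two-channel forward process concrete, the minimal PyTorch-style sketch below corrupts a clean pair $(X_0, Y_0)$ into $(X_t, Y_t)$: an absorbing-mask corruption on the token channel and a variance-preserving Gaussian corruption on the latent channel. The specific schedules (linear masking rate, cosine signal level) and the MASK_ID constant are illustrative assumptions, not the papers' exact parameterization.

```python
import torch

MASK_ID = 0  # assumed id of the [MASK] token in the vocabulary

def forward_noise(x0, y0, t, T):
    """Corrupt a clean pair (X_0, Y_0) into (X_t, Y_t): masked diffusion on the
    token channel, Gaussian diffusion on the latent channel. The schedules
    below are illustrative choices, not the papers' exact parameterization."""
    # Token channel: mask each position independently with probability t/T
    # (a linear masking schedule is assumed here).
    keep_prob = 1.0 - t / T
    mask = torch.rand(x0.shape) > keep_prob
    x_t = torch.where(mask, torch.full_like(x0, MASK_ID), x0)

    # Latent channel: variance-preserving Gaussian corruption with an assumed
    # cosine signal level gamma_t.
    gamma_t = torch.cos(torch.tensor(t / T) * torch.pi / 2) ** 2
    eps = torch.randn_like(y0)
    y_t = gamma_t.sqrt() * y0 + (1.0 - gamma_t).sqrt() * eps
    return x_t, y_t

# Example: a batch of 2 sequences of length 8 with a 16-dim latent per sequence
# (token ids start at 1 so they never collide with MASK_ID).
x0 = torch.randint(1, 100, (2, 8))
y0 = torch.randn(2, 16)
x_t, y_t = forward_noise(x0, y0, t=30, T=50)
```

Later sketches in this article reuse forward_noise and MASK_ID from this block.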

3. Mathematical Formulation and ELBO Objective

The learning objective is based on a variational Evidence Lower Bound (ELBO) over the joint diffusion process:

$$L(\theta, \phi) = \mathbb{E}\left[ \sum_{t=2}^{T} \lambda_t^x \cdot \log\langle x_\theta(Z_t, t), X_0 \rangle + \lambda_t^y \cdot \| y_\theta(Z_t, t) - Y_0 \|^2 \right]$$

where:

  • $x_\theta(Z_t, t)$: predicted categorical distribution over tokens at time $t$,
  • $y_\theta(Z_t, t)$: predicted denoised latent,
  • $X_0$, $Y_0$: ground-truth sequence and latent embedding (from an encoder $E$),
  • $\lambda_t^x$, $\lambda_t^y$: loss weighting coefficients derived from the ELBO/KL structure.

Continuous latent reconstruction employs a standard Gaussian loss with timestep-specific weighting. Discrete reconstruction loss arises from the masked diffusion dynamics; explicit KL and reconstruction formulas (see Table 1 in (Shariatian et al., 20 Oct 2025)) support training.

Initialization of $(X_T, Y_T)$ uses the all-mask token state and a Gaussian latent with fixed variance, ensuring the generative chain starts from maximum entropy.
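
As a rough illustration of how this objective can be estimated in practice, the sketch below (reusing forward_noise and MASK_ID from the Section 2 sketch) computes a single-timestep Monte Carlo estimate: a weighted cross-entropy term on the masked positions plus a weighted squared error on the latent. The joint model interface model(x_t, y_t, t) -> (token_logits, y_hat), the encoder call, and the simplified weights are assumptions for illustration rather than the exact ELBO coefficients.

```python
import torch
import torch.nn.functional as F

def fuji_lddm_loss(model, encoder, x0, t, T, lambda_y=1.0):
    """Single-timestep Monte Carlo estimate of the ELBO-style objective:
    a weighted cross-entropy on masked tokens (the negative of the
    log <x_theta, X_0> term) plus a weighted squared error on the latent.
    `model(x_t, y_t, t)` is assumed to return (token_logits, y_hat)."""
    y0 = F.normalize(encoder(x0), dim=-1)       # l2-normalized clean latent Y_0
    x_t, y_t = forward_noise(x0, y0, t, T)      # joint corruption (Section 2 sketch)

    token_logits, y_hat = model(x_t, y_t, t)    # joint reverse prediction

    # Discrete term, evaluated on masked positions only.
    masked = (x_t == MASK_ID)
    ce = F.cross_entropy(token_logits[masked], x0[masked])
    lambda_x = T / max(t, 1)                    # stand-in for the ELBO weight lambda_t^x

    # Continuous term: Gaussian reconstruction of the clean latent Y_0.
    latent_mse = F.mse_loss(y_hat, y0)          # stand-in for the lambda_t^y-weighted term

    return lambda_x * ce + lambda_y * latent_mse
```

Setting lambda_y to zero recovers the token-only warm-up phase of the two-stage curriculum described in Section 4.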

4. Design Principles and Training Considerations

Fundamental to FUJI-LDDM performance are several design choices:

  • Joint Self-Attention: A model backbone that processes discretized input and continuous latent via shared multi-head self-attention, enabling cross-modal information flow.
  • Fixed Latent Variance and Normalization: To prevent pathologies where the encoder "cheats" via latent magnitude, latent vectors are $\ell_2$-normalized and the encoder variance is controlled (e.g., $10^{-4}$).
  • Two-Stage Curriculum: Early in training, the latent channel loss is downweighted ($\lambda_t^y = 0$), allowing the token channel to stabilize first before ramping up latent influence.
  • Choice of Encoder: Frozen pre-trained encoders (e.g., Qwen3-Embedding) may enhance performance by providing fixed semantic representations, while learned encoders can adapt to specific datasets.
  • Efficient Decoding: FUJI-LDDMs are particularly effective at low sampling budgets (i.e., when fewer denoising steps are taken), as the latent channel provides global guidance that counteracts the loss of context inherent in simultaneous token unmasking; see the sampling sketch after this list.
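
The sketch below illustrates such few-step joint sampling: generation starts from an all-mask sequence and a Gaussian latent, and at each step the joint model refreshes the latent while a subset of positions is unmasked. The confidence-based unmasking rule, step schedule, and model interface are illustrative assumptions, not necessarily the papers' exact sampler; MASK_ID is the mask-token id from the Section 2 sketch.

```python
import torch

MASK_ID = 0  # assumed mask-token id (as in the Section 2 sketch)

@torch.no_grad()
def sample(model, seq_len, latent_dim, num_steps=8):
    """Few-step joint sampling: start from an all-mask sequence and a Gaussian
    latent, then alternate joint predictions with partial unmasking. The
    confidence-based unmasking rule is an illustrative assumption."""
    x_t = torch.full((1, seq_len), MASK_ID, dtype=torch.long)
    y_t = torch.randn(1, latent_dim)

    for step in range(num_steps, 0, -1):
        token_logits, y_t = model(x_t, y_t, step)        # joint denoising prediction
        conf, pred = token_logits.softmax(dim=-1).max(dim=-1)

        # Unmask roughly 1/step of the still-masked positions, most confident first.
        still_masked = (x_t == MASK_ID)
        n_unmask = max(1, int(still_masked.sum().item()) // step)
        conf = conf.masked_fill(~still_masked, float("-inf"))
        idx = conf.topk(n_unmask, dim=-1).indices
        x_t.scatter_(1, idx, pred.gather(1, idx))

    return x_t
```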

5. Loopholing and Deterministic Latent Pathways

An extension of FUJI-LDDMs, described as "Loopholing" in (Jo et al., 22 Oct 2025), introduces an explicit deterministic latent pathway that bypasses the "sampling wall"—the collapse of distributional information at each categorical sample. In this framework:

  • The forward pass at each timestep computes both the projected one-hot tokens and a deterministic latent ($\mathbf{h}_t$).
  • At the next timestep, the normalized $\mathbf{h}_t$ is added to the embedded tokens before the sequence is reprocessed, so the latent propagates distributional context that would otherwise be lost upon sampling.
  • A self-conditioning strategy is employed: at each training step, the model first predicts with zero latent input, then conditions a second pass on the first pass's output (with gradients stopped), allowing efficient training at random timesteps without full unrolling.

This mechanism ensures that context is preserved across denoising steps, mitigating idle steps and oscillatory behavior that afflict prior masked diffusion models.
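
A minimal sketch of this training scheme is given below, assuming a model interface model(x_t, h, t) -> (token_logits, h_next) in which h is a per-token deterministic latent: the first pass runs with a zero latent, and the second pass conditions on the normalized, gradient-stopped latent from the first pass. The masking schedule, the latent_dim attribute, and the loss reduction are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # assumed mask-token id (as in the earlier sketches)

def loopholing_training_step(model, x0, t, T):
    """Two-pass self-conditioning sketch for the deterministic latent pathway.
    `model(x_t, h, t)` is assumed to return (token_logits, h_next), where
    h_next is the deterministic per-token latent carried across steps."""
    # Mask tokens with a timestep-dependent probability (illustrative schedule).
    mask = torch.rand(x0.shape) < t / T
    x_t = torch.where(mask, torch.full_like(x0, MASK_ID), x0)

    # Pass 1: predict with a zero latent, i.e. no propagated context.
    h_zero = torch.zeros(x0.size(0), x0.size(1), model.latent_dim)
    _, h_prev = model(x_t, h_zero, t)

    # Pass 2: condition on the normalized first-pass latent with gradients
    # stopped, so random timesteps can be trained without unrolling the chain.
    h_prev = F.normalize(h_prev.detach(), dim=-1)
    token_logits, _ = model(x_t, h_prev, t)

    # Token loss on the masked positions only.
    masked = (x_t == MASK_ID)
    return F.cross_entropy(token_logits[masked], x0[masked])
```

At sampling time the same pathway would instead carry h_next forward from one denoising step to the next, rather than recomputing it from a zero-latent pass.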

6. Experimental Performance and Empirical Findings

FUJI-LDDMs achieve consistent improvements on unconditional generation across benchmarks:

  • Language Modeling (LM1B, OpenWebText): Lower validation and generative perplexity relative to masked discrete diffusion baselines (MDLM). At reduced sampling steps, FUJI-LDDMs maintain lower perplexity and competitive entropy, reflecting robust parallel sample quality (Shariatian et al., 20 Oct 2025).
  • Reasoning Tasks (Countdown, Game of 24): In arithmetic reasoning, LDDMs with loopholing improve success rates (Countdown 4: 94.4% vs 86.5% for baseline; Game of 24: 63% vs 47%) (Jo et al., 22 Oct 2025).
  • Human and GPT-4.1 Aligned Evaluation: Higher fluency and coherence scores, with substantial reductions in generative perplexity (Gen PPL) over both discrete diffusion and autoregressive baselines.

Improvements are traced to both reduction in "idle" denoising steps and more faithful propagation of semantic context via the deterministic latent channel.

7. Applications and Implications

FUJI-LDDMs are broadly applicable to non-autoregressive generation tasks where joint structure and fast decoding are essential:

  • Parallel Language Generation: For applications such as translation, summarization, or story generation, FUJI-LDDMs’ parallel, joint-denoising architecture enables coherent outputs in reduced step budgets.
  • Categorical Data Modeling: Symbolic music, structured code, and other high-cardinality data types benefit from the model's capacity to model global joint dependencies and diverse outputs.
  • Multi-Modal and Hybrid Generative Models: The fusion of discrete and continuous diffusion channels positions FUJI-LDDMs as a template for hybrid and multi-modal generative tasks, where one may desire semantics to guide token-level synthesis.

The approach is efficient and scalable (with only moderate training overhead from self-conditioning), and it narrows the generative quality gap with autoregressive models—traditionally the benchmark for sequence fidelity—while offering the practical benefits of parallel, non-autoregressive decoding.


Summary Table: Key Elements of FUJI-LDDMs

| Aspect        | FUJI-LDDMs Approach                      | Benefit                                        |
|---------------|------------------------------------------|------------------------------------------------|
| Denoising     | Fully joint over tokens & latents        | Captures cross-token dependencies              |
| Latent Path   | Deterministic, propagated per step       | Preserves contextual information post-sampling |
| Loss Function | ELBO with discrete + latent terms        | Balanced, efficient training                   |
| Training      | Two-stage curriculum, self-conditioning  | Stability and efficiency                       |
| Performance   | Lower perplexity, improved fluency       | Effective at low sampling budgets              |

By introducing fully joint discrete and latent denoising, and deterministic context propagation, FUJI-LDDMs enable high-fidelity, parallel generation in categorical domains, achieving strong empirical results and opening a pathway for further refinements in non-autoregressive generative modeling (Shariatian et al., 20 Oct 2025, Jo et al., 22 Oct 2025).
