SEQ-LDDMs: Sequential Latent Discrete Diffusion Models
- SEQ-LDDMs are generative models that combine continuous latent recovery with sequential token refinement to model structured categorical data.
- They employ a two-stage denoising process—first recovering a global latent via Gaussian diffusion, then conditionally refining discrete tokens with a transformer-based denoiser.
- Experimental results show improved sampling efficiency and quality in tasks like language modeling, with fewer denoising steps while maintaining token diversity.
Sequential Latent Discrete Diffusion Models (SEQ-LDDMs) constitute a class of generative modeling approaches designed for structured categorical data such as language, in which token-level dependencies and global structure are fundamental. SEQ-LDDMs implement a two-stage denoising generative process: a continuous latent variable is first inferred, capturing global structure, after which discrete tokens are refined conditionally. This design aims to overcome a known weakness of standard masked discrete diffusion models: because their reverse transitions factorize across token positions, joint structure degrades in few-step sampling regimes. The SEQ-LDDM framework generalizes to a variety of masked data modalities, delivers enhanced generative quality, and enables more efficient multi-token sampling compared to fully factorized discrete diffusion.
1. Model Architecture and Sequential Denoising
SEQ-LDDMs augment standard masked discrete diffusion over tokens $x$ with a continuous latent channel $y$, resulting in a jointly modeled process $z_t = (x_t, y_t)$. Each channel is corrupted independently: tokens are progressively masked (according to a masking schedule), whereas latents undergo Gaussian diffusion (under a variance schedule).
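As a concrete illustration, the following minimal sketch corrupts the two channels independently. The specific masking and noise schedules, the sentinel mask id, and the function signature are assumptions made for illustration, not the paper's exact choices.

```python
import numpy as np

MASK = -1  # hypothetical sentinel id for the mask token (any reserved id works)

def corrupt(x0, y0, t, T_x, T_y, rng):
    """Independently corrupt the two channels at step t.

    x0 : (L,) int array of clean token ids
    y0 : (d,) float array, the clean continuous latent
    The linear masking schedule and the exponential signal-retention schedule
    below are toy choices, not the schedules used in the paper.
    """
    # Discrete channel: each token is kept with probability alpha_t, else masked.
    alpha_t = 1.0 - t / T_x
    keep = rng.random(x0.shape) < alpha_t
    x_t = np.where(keep, x0, MASK)

    # Continuous channel: Gaussian (VP-style) corruption of the latent.
    abar_t = np.exp(-5.0 * t / T_y)
    y_t = np.sqrt(abar_t) * y0 + np.sqrt(1.0 - abar_t) * rng.standard_normal(y0.shape)
    return x_t, y_t
```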
The reverse (generative) process is sequentially staged:
- Latent chain denoising: The model first processes the latent channel, denoising $y$ with a continuous denoiser over $T_Y$ steps. This recovers a globally informative latent vector $y_0$.
- Token chain denoising: Holding $y_0$ fixed, the discrete reverse process denoises $x$ over $T_X$ steps. All reverse transitions are now explicitly conditional on the recovered latent.
Mathematically, the generative process factorizes as:
$p(z_{0:T}) = \big[ p_{T_X}(x_{T_X}) \cdot p_{T_Y}(y_{T_Y}) \big] \times \Big[ \prod_{t = 1}^{T_Y} p_{t-1|t}(y_{t-1} \mid y_t) \Big] \times \Big[ \prod_{t = 1}^{T_X} p_{t-1|t}(x_{t-1} \mid x_t, y_0) \Big].$
This latent-first sequencing contrasts with fully joint denoising (as in FUJI-LDDMs), in which all variables are denoised simultaneously, and with factorized discrete diffusion, where tokens are updated independently of global context.
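Read as a sampling procedure, the factorization corresponds to running the continuous reverse chain to completion before any token is unmasked. The sketch below (reusing `np` and `MASK` from the snippet above) makes this explicit; the denoiser callables and their signatures are placeholders, not the paper's interfaces.

```python
def sample_seq_lddm(latent_denoiser, token_denoiser, L, d, T_x, T_y, rng):
    """Latent-first sampling matching the factorization above.

    `latent_denoiser(y_t, t)` and `token_denoiser(x_t, t, y0)` stand in for
    the trained networks; their exact signatures are assumptions of this sketch.
    """
    # Stage 1: continuous reverse chain, from the Gaussian prior to a clean latent y0.
    y = rng.standard_normal(d)                 # y_{T_Y} ~ N(0, I)
    for t in range(T_y, 0, -1):
        y = latent_denoiser(y, t)              # one Gaussian reverse step
    y0 = y

    # Stage 2: discrete reverse chain; every unmasking step conditions on the fixed y0.
    x = np.full(L, MASK)                       # x_{T_X}: fully masked sequence
    for t in range(T_x, 0, -1):
        x = token_denoiser(x, t, y0)           # conditional token refinement
    return x
```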
2. Loss Functions, Objectives, and Variational Formulation
SEQ-LDDMs are derived within a variational framework, leading to a negative Evidence Lower Bound (ELBO) that decomposes naturally into losses over the two channels:
- Latent channel loss: a mean-squared error between the predicted and target latents at each step of the continuous chain.
- Token channel loss: a negative log-likelihood for reconstructing the original tokens from the denoised intermediate states, with every transition conditioned on the global latent.
The token term measures the log-probability of the targets under the model prediction, and per-channel weighting terms balance the two losses; experimentally, one of the weights is simply fixed to 1.
The latent conditions every reverse transition in the discrete chain, enabling structured information learned in the continuous channel to improve joint generative quality.
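A minimal sketch of such a two-channel objective is given below. The reductions, the masked-positions-only token loss, and the weighting arguments (`lambda_y`, `lambda_x`) are assumptions made for illustration, not the paper's exact ELBO terms.

```python
import torch
import torch.nn.functional as F

def seq_lddm_loss(latent_pred, latent_target, token_logits, token_target,
                  mask, lambda_y=1.0, lambda_x=1.0):
    """Illustrative two-channel objective.

    latent_pred / latent_target : (B, d) predicted vs. target latent at step t
    token_logits                : (B, L, V) denoiser logits, conditioned on the latent
    token_target                : (B, L) clean token ids
    mask                        : (B, L) bool, True where the token was masked at step t
    """
    # Continuous channel: per-step mean-squared error on the latent.
    loss_y = F.mse_loss(latent_pred, latent_target)

    # Discrete channel: negative log-likelihood of the clean tokens,
    # evaluated only at positions that were masked at this step.
    nll = F.cross_entropy(token_logits.transpose(1, 2), token_target, reduction="none")
    loss_x = (nll * mask).sum() / mask.sum().clamp(min=1)

    return lambda_y * loss_y + lambda_x * loss_x
```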
3. Architectural and Training Design Considerations
- Latent encoder normalization: The model fixes the variance of encoder-produced latents to a very small constant and further normalizes each subvector to unit norm, preventing degenerate scaling that could destabilize training (see the sketch after this list).
- Two-stage training schedule: Initially, the latent denoising loss weight is set to zero, focusing optimization on token reconstruction. The latent loss is then ramped up, yielding a balanced trade-off between the joint structure recoverable from the latent $y$ and accurate discrete predictions.
- Decoupled inference: By separating the latent and discrete channels, the model allocates the global dependency modeling load to the continuous variable, while the discrete channel benefits from efficient, parallelizable updates.
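The sketch below illustrates the two design choices referenced in the list: a unit-norm, small-scale latent normalization and a warm-up-then-ramp weight on the latent loss. The constant `LATENT_STD`, the subvector split, and the linear ramp shape are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

LATENT_STD = 1e-2   # assumed small fixed scale for encoder latents (illustrative value)

def normalize_latent(z, num_subvectors):
    """Split the encoder latent into subvectors, project each to unit norm,
    then rescale to the small fixed scale. Assumes d is divisible by num_subvectors."""
    B, d = z.shape
    sub = z.view(B, num_subvectors, d // num_subvectors)
    sub = F.normalize(sub, dim=-1)              # unit-norm subvectors
    return LATENT_STD * sub.view(B, d)

def latent_loss_weight(step, warmup_steps, ramp_steps, max_weight=1.0):
    """Two-stage schedule: zero latent-loss weight during warm-up, then a linear ramp."""
    if step < warmup_steps:
        return 0.0
    return max_weight * min(1.0, (step - warmup_steps) / ramp_steps)
```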
The discrete denoiser is generally instantiated as a transformer-based architecture (e.g., DiT variant) that accepts the conditioning latent as input, while the latent denoiser is implemented using a lightweight network for scalable inference.
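As a rough picture of how the conditioning latent enters the discrete denoiser, the following minimal PyTorch module adds a projection of $y_0$ (together with timestep and position embeddings) to every token embedding before a transformer encoder. The real model is a DiT-style architecture, so the layer sizes and the simple additive conditioning used here are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class ConditionalTokenDenoiser(nn.Module):
    """Minimal stand-in for the transformer-based discrete denoiser."""

    def __init__(self, vocab_size, d_model=256, n_layers=4, n_heads=4,
                 latent_dim=64, max_len=512, max_steps=1024):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size + 1, d_model)   # +1: reserved mask id = vocab_size
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.time_emb = nn.Embedding(max_steps, d_model)
        self.lat_proj = nn.Linear(latent_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, x_t, t, y0):
        # x_t: (B, L) noisy token ids (masked positions use the reserved id),
        # t: (B,) timesteps, y0: (B, latent_dim) clean global latent.
        pos = torch.arange(x_t.size(1), device=x_t.device)
        h = (self.tok_emb(x_t)
             + self.pos_emb(pos)[None, :, :]
             + self.time_emb(t)[:, None, :]
             + self.lat_proj(y0)[:, None, :])   # latent broadcast to every position
        h = self.backbone(h)
        return self.head(h)                     # (B, L, vocab_size) logits
```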
4. Experimental Performance
- Synthetic task — binary sawtooth: SEQ-LDDMs achieve near-optimal Sliced Wasserstein Distance (SWD) after only one or two discrete denoising steps. The clean latent channel rapidly resolves global shifts, leaving the remaining discrete denoising steps conditionally independent and easily parallelizable.
- Language modeling — LM1B: On large-scale language data, SEQ-LDDMs consistently achieve lower generative perplexity than masked discrete diffusion baselines (MDLM) when operating under a fixed sampling budget. Conditioning on high-quality recovered latents allows fewer reverse steps without compromising generative quality.
- Sampling efficiency: The model’s ability to reconstruct many tokens in a single step, once the latent is clean, leads to improved efficiency—generative quality is maintained or improved even as the number of token denoising steps is reduced.
- Diversity and entropy: Perplexity and token entropy indicate that SEQ-LDDMs maintain generative diversity and avoid collapse, which is often a challenge for fully factorized approaches at low sampling budgets.
| Task/Data | Metric | SEQ-LDDMs Performance |
|---|---|---|
| Binary Sawtooth | SWD | Near-optimal with 1–2 token steps |
| LM1B (Language) | Perplexity | Lower than MDLM at same sampling budget |
| LM1B (Language) | Token Entropy | High/maintained |
5. Implications, Applications, and Extensions
- Structured categorical data modeling: SEQ-LDDMs are particularly well suited for language or any sequence/domain where joint token dependencies are critical. The method overcomes the joint-structure degradation typical in masked denoisers by prefacing token refinement with a global latent recovery stage.
- Sampling regime innovation: The ability to cleanly condition on a global latent allows practical reduction in the number of discrete denoising steps, accelerating generation and enabling parallelism.
- Broader applicability: This sequential, decoupled approach may be generalized to other categorical modalities (e.g., image segmentation, multi-class event streams) or hybrid domains (via composite noise scheduling or multi-channel architectures).
- Future research avenues: Promising directions for further work include adapting SEQ-LDDMs with end-to-end learned latent encoders and interpolating between fully joint and fully sequential reverse processes via custom noise schedules. The methodology also invites investigation in multimodal tasks, where a global latent can bridge heterogeneous data types.
6. Comparative Context and Significance
SEQ-LDDMs provide a principled solution to the limitations of masked discrete diffusion models whose reverse processes factorize across token positions, a property that weakens joint structure and suppresses sample quality under aggressive (few-step) unmasking. By introducing a continuous latent that explicitly captures and broadcasts cross-token dependencies, SEQ-LDDMs enable better trade-offs between generative quality and sampling efficiency. This framework is positioned to become foundational in advanced language generation models and other applications requiring rich discrete structure, high-throughput generative capability, and robustness under limited denoising steps (Shariatian et al., 20 Oct 2025).