Denoising Language Models (DLM)

Updated 2 June 2026

Denoising Language Models (DLMs) are generative models that iteratively reverse a noising process applied to clean text, enabling tasks like adaptive infilling and flexible sequence generation.
Innovations such as the simple denoising loss (SDDLM) and contrastive extensions significantly boost training efficiency and reduce generative perplexity by 30–50%.
Representation alignment techniques bridge autoregressive and diffusion-based methods, enhancing fluency, diversity, and robustness in few-step decoding scenarios.

Denoising Language Models (DLMs) are a class of non-autoregressive generative models that formulate text generation as iterative denoising: they learn to reverse a well-specified corruption (noising) process applied to clean language sequences. DLMs have emerged as a compelling alternative to traditional autoregressive (AR) language models by exploiting parallelism and bidirectionality, enabling flexible infilling, adaptive-length generation, and enhanced sample diversity across tasks in NLP, speech recognition, and multimodal reasoning.

1. Mathematical Foundations of Denoising Language Models

Central to DLMs is the forward noising process and its inverse, the learned denoiser:

Forward Process: A clean token sequence $x_0 \in \mathcal{V}^L$, with vocabulary $\mathcal{V}$ of size $V = |\mathcal{V}|$, is corrupted via a parameterized noising kernel. Classical choices include Masked Diffusion (absorbing state at [MASK]) and Uniform-state Diffusion, which smooths $x_0$ toward a uniform categorical distribution. For the uniform-state kernel, the per-position marginal is:

$q_t\bigl(x_t^{(l)} = v \mid x_0^{(l)} = u\bigr) = \alpha_t\cdot\mathbf{1}\{v = u\} + (1-\alpha_t)/V.$

The parameter $\alpha_t \in [0,1]$ sets the corruption schedule: $\alpha_t = 1$ leaves the token clean and $\alpha_t = 0$ yields uniform noise. During training, $t \sim \mathrm{Uniform}[0,1]$.
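A minimal sketch of the two forward kernels named above, in plain Python. The linear schedule `alpha_linear`, the function names, and the use of integer token ids are illustrative assumptions and are not taken from the source; only the per-position probabilities follow the uniform-state formula.

```python
import random

def alpha_linear(t):
    """Hypothetical linear schedule: alpha = 1 at t = 0 (clean), alpha = 0 at t = 1 (pure noise)."""
    return 1.0 - t

def uniform_noise(x0, t, vocab_size, rng):
    """Sample x_t position by position from the uniform-state kernel."""
    a = alpha_linear(t)
    xt = []
    for u in x0:
        if rng.random() < a:
            xt.append(u)                          # keep the clean token with probability alpha_t
        else:
            xt.append(rng.randrange(vocab_size))  # uniform draw over the vocabulary (may re-draw u)
    return xt

def masked_noise(x0, t, mask_id, rng):
    """Absorbing-state variant: each token becomes mask_id with probability 1 - alpha_t."""
    a = alpha_linear(t)
    return [u if rng.random() < a else mask_id for u in x0]

if __name__ == "__main__":
    rng = random.Random(0)
    x0 = [3, 1, 4, 1, 5]
    print(uniform_noise(x0, 0.5, vocab_size=10, rng=rng))
    print(masked_noise(x0, 0.5, mask_id=99, rng=rng))
```

Because the uniform draw can return the original token, a position stays equal to its clean value with probability $\alpha_t + (1-\alpha_t)/V$, which matches the $v = u$ case of the kernel.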

Reverse Denoising Model: A parameterized decoder $p_\theta(x_0 \mid x_t)$ aims to reconstruct $x_0$ from $x_t$. For most DLMs, reverse steps are supervised using cross-entropy only on positions where $x_t^{(l)} \neq x_0^{(l)}$.

Training Loss:
- Standard ELBO (NELBO): In Uniform-state Diffusion Models (USDM), the negative evidence lower bound combines terms over all positions and requires complex reweighting and normalization.
- Simple Denoising Loss (SDDLM): Directly reconstructs only noise-replaced tokens, i.e., a cross-entropy restricted to the positions that the forward process changed:

$\mathcal{L}_{\mathrm{simple}}(\theta) = \mathbb{E}_{t,\, x_t \sim q_t(\cdot \mid x_0)}\Bigl[-\sum_{l \in \mathcal{M}} \log p_\theta\bigl(x_0^{(l)} \mid x_t\bigr)\Bigr],$

where $\mathcal{M} = \{\, l : x_t^{(l)} \neq x_0^{(l)} \,\}$ is the set of noise-replaced positions. This choice both accelerates training and prevents collapse due to degenerate identity mappings (Zhu et al., 27 Oct 2025).
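The restriction of the cross-entropy to noise-replaced positions can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: `logits[pos]` is assumed to hold the model's unnormalized scores for position `pos` given $x_t$, the terms are summed without any normalization or time-dependent weighting, and all function names are hypothetical.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over one position's score vector."""
    m = max(scores)
    lse = m + math.log(sum(math.exp(z - m) for z in scores))
    return [z - lse for z in scores]

def simple_denoising_loss(logits, x0, xt):
    """Cross-entropy summed only over positions where xt differs from x0.

    Returns (loss, n_terms); n_terms is the number of positions that contributed.
    Assumption: plain sum, no averaging or reweighting.
    """
    loss, n_terms = 0.0, 0
    for pos, (clean, noisy) in enumerate(zip(x0, xt)):
        if noisy != clean:  # unchanged positions are skipped entirely
            loss -= log_softmax(logits[pos])[clean]
            n_terms += 1
    return loss, n_terms

if __name__ == "__main__":
    logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # uniform predictions over 3 tokens
    print(simple_denoising_loss(logits, x0=[0, 1], xt=[0, 2]))
```

Skipping unchanged positions is what keeps the model from being rewarded for simply copying its input.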

2. Innovations in Denoising Objectives

The development of denoising language models has produced several enhancements:

Contrastive-inspired Negative Gradients: Framing denoising as self-supervised learning, SDDLM introduces explicit negative-sample terms. For each corrupted position $l$, an auxiliary loss penalizes the model for assigning high probability to a negative token, chosen either as a random distractor or as the actual corrupted token $x_t^{(l)}$. The SDDLM-V1 and SDDLM-V2 variants differ in this choice of negative sample. Both sharpen the denoising signal, achieving significantly lower generative perplexity (Gen PPL) and improved fluency in few-step regimes (Zhu et al., 27 Oct 2025).
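One generic way to penalize probability mass on a negative token is an unlikelihood-style term, $-\lambda \log\bigl(1 - p_\theta(x^- \mid x_t)\bigr)$. The sketch below uses that form purely as an illustration; it is an assumption, not the auxiliary loss from the cited paper, and the names `push_away_penalty` and `weight` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over one position's score vector."""
    m = max(scores)
    exps = [math.exp(z - m) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]

def push_away_penalty(scores, negative_token, weight=1.0, eps=1e-12):
    """Unlikelihood-style penalty: -weight * log(1 - p(negative_token)).

    Assumed form for illustration only. The penalty grows as the model puts
    more probability on the negative token.
    """
    p_neg = softmax(scores)[negative_token]
    return -weight * math.log(max(1.0 - p_neg, eps))

if __name__ == "__main__":
    print(push_away_penalty([0.0, 0.0, 0.0, 0.0], negative_token=2))  # p = 0.25
    print(push_away_penalty([0.0, 0.0, 5.0, 0.0], negative_token=2))  # larger penalty
```

Passing a random token id as `negative_token` corresponds to the random-distractor choice, and passing the token found at that position in $x_t$ corresponds to the corrupted-token choice.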

Representation Alignment: When adapting an AR LM to a DLM, representation alignment via cosine similarity at each layer allows the DLM to inherit semantic feature geometry from the AR model, greatly accelerating convergence, especially in low-data regimes (Peng et al., 7 May 2026).
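A layer-wise cosine alignment term can be sketched as below. The sketch assumes that the DLM and the AR model expose hidden states with matching layer count, sequence length, and hidden size, and that the AR states act as fixed targets; these assumptions and all function names are illustrative and not taken from the cited work.

```python
import math

def cosine(a, b, eps=1e-12):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)

def alignment_loss(dlm_hidden, ar_hidden):
    """Mean of (1 - cosine) over all layers and positions.

    dlm_hidden[k][pos] and ar_hidden[k][pos] are the layer-k hidden vectors
    at position pos (assumed to have the same shape in both models).
    """
    total, count = 0.0, 0
    for layer_dlm, layer_ar in zip(dlm_hidden, ar_hidden):
        for h_dlm, h_ar in zip(layer_dlm, layer_ar):
            total += 1.0 - cosine(h_dlm, h_ar)
            count += 1
    return total / count

if __name__ == "__main__":
    ar = [[[1.0, 0.0], [0.0, 1.0]]]   # one layer, two positions
    dlm = [[[1.0, 0.1], [0.2, 1.0]]]
    print(alignment_loss(dlm, ar))
```

In practice such a term would be added to the denoising objective with a weighting coefficient, which is left out here.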

3. Empirical Behavior and Generation Quality

Empirical evaluations demonstrate the efficacy of denoising losses and contrastive modifications:

On LM1B and OpenWebText (OWT), SDDLM achieves generative perplexities comparable to those of the ELBO-trained USDM baseline (Duo), while SDDLM-V1/V2 reduce Gen PPL by roughly 30–50% relative to that baseline in the few-step setting:

| Model | LM1B Gen PPL ↓ | OWT Gen PPL ↓ | LM1B Entropy ↑ | OWT Entropy ↑ |
|------------|---------------|---------------|----------------|---------------|
| Duo | 172.93 | 80.43 | 4.20 | 5.55 |
| SDDLM | 173.04 | 77.07 | 4.20 | 5.53 |
| SDDLM-V1 | 116.84 | 45.18 | 4.10 | 5.31 |
| SDDLM-V2 | 101.32 | 50.05 | 4.12 | 5.33 |

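SDDLM roughly halves the loss computation by focusing on noise-replaced tokens, and the negative-gradient variants achieve markedly higher fluency (lower Gen PPL) and step-efficiency under tight inference budgets, at the cost of a small drop in sample entropy (e.g., 4.20 to 4.10 on LM1B).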
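The 30–50% figure can be checked directly from the Gen PPL columns. The short script below takes the Duo row as the baseline, which is an assumption about how the percentage was computed; the numbers are copied from the table and the function names are illustrative.

```python
# Gen PPL values copied from the table above (lower is better).
GEN_PPL = {
    "Duo":      {"LM1B": 172.93, "OWT": 80.43},
    "SDDLM":    {"LM1B": 173.04, "OWT": 77.07},
    "SDDLM-V1": {"LM1B": 116.84, "OWT": 45.18},
    "SDDLM-V2": {"LM1B": 101.32, "OWT": 50.05},
}

def relative_reduction(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline

def reductions_vs(baseline_name="Duo"):
    """Relative Gen PPL reduction of every other model against the baseline row."""
    out = {}
    for model, row in GEN_PPL.items():
        if model == baseline_name:
            continue
        for dataset, ppl in row.items():
            base = GEN_PPL[baseline_name][dataset]
            out[(model, dataset)] = relative_reduction(base, ppl)
    return out

if __name__ == "__main__":
    for (model, dataset), pct in sorted(reductions_vs().items()):
        print(f"{model:9s} {dataset:5s} {pct:6.1f}%")
```

Under this reading, the four SDDLM-V1/V2 entries correspond to reductions between roughly 32% and 44%, while plain SDDLM stays within a few percent of Duo.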

4. Theoretical Interpretation and Equivalence to ELBO

The simple denoising loss for USDMs can be shown, via analogy to Gaussian diffusion (cf. Ho et al. 2020), to be ELBO-equivalent under mild assumptions:

ELBO Minimizers: Both the standard NELBO and the simple denoising objective asymptotically optimize the same set of model parameters, as long as the denoiser correctly reconstructs noised positions and does not collapse to identity on unchanged tokens.
Advantage: The practical denoising loss removes the need for explicit calculation of derivatives, renormalized probabilities, or log-ratio summations, simplifying both mathematical derivation and implementation (Zhu et al., 27 Oct 2025).

5. Practical Impact: Few-Step Generation and Training Efficiency

Restricting the training loss to noise-replaced tokens and adding a contrastive “push-away” term generates several tangible benefits:

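Training Speed: Only the noise-replaced positions contribute to the gradient for each example; under the uniform-state kernel this is, in expectation, $L(1-\alpha_t)(1-1/V)$ of the $L$ tokens at noise level $t$, which sharply reduces computational overhead.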
Avoidance of Collapse: By focusing loss on non-identity positions, the model avoids trivial solutions where contextually unchanged tokens dominate the signal.
Few-Step Decoding Performance: In few-step sampling regimes, SDDLM and its variants produce more coherent continuations and more robust fluency under tight computation constraints, while largely preserving sample diversity (Zhu et al., 27 Oct 2025).
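The training-speed point can be made concrete with a short calculation. Under the uniform-state kernel from Section 1, a token is replaced by a different token with probability $(1-\alpha_t)(1-1/V)$. The sketch below averages the resulting count over $t$ using a midpoint rule; the linear schedule $\alpha_t = 1 - t$ and the function names are assumptions made only for illustration.

```python
def expected_replaced(seq_len, alpha_t, vocab_size):
    """Expected number of positions with x_t != x_0 under the uniform-state kernel."""
    return seq_len * (1.0 - alpha_t) * (1.0 - 1.0 / vocab_size)

def average_over_t(seq_len, vocab_size, alpha, n_grid=1000):
    """Midpoint-rule average of expected_replaced over t in [0, 1]."""
    ts = [(i + 0.5) / n_grid for i in range(n_grid)]
    return sum(expected_replaced(seq_len, alpha(t), vocab_size) for t in ts) / n_grid

if __name__ == "__main__":
    linear = lambda t: 1.0 - t  # hypothetical schedule, not from the source
    print(average_over_t(seq_len=128, vocab_size=50000, alpha=linear))
```

With the assumed linear schedule the average is close to $L/2$, in line with the remark in Section 3 that SDDLM halves computation; other schedules would give a different fraction.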

Qualitative analysis (see the Appendix of Zhu et al., 27 Oct 2025) confirms improved syntactic plausibility and semantic coherence for SDDLMs, especially for short sampling schedules.

6. Relationship to Broader DLM Research

The simple denoising paradigm connects directly to multiple lines of recent DLM work:

Discrete Diffusion Acceleration: Fast sampling and consistent denoising can be further enhanced with consistency objectives (CDLM (Amin et al., 30 Apr 2026)) and planner-aware learning (PAPL (Peng et al., 27 Sep 2025)). Both aim to mitigate the parallel decoding curse and the tradeoff between speed and token-level coherence.
Continuous-space Denoising: SDDLM’s focus on selectively optimizing corrupted positions aligns conceptually with SNR-invariant denoisers in continuous-space DLMs (e.g., Discrete Stochastic Localization (Wu et al., 18 Feb 2026), continuous SDE-driven decoders (Yu et al., 31 May 2026)).
Architectural Generality: SDDLM-style objectives can be straightforwardly integrated with representation-aligned models, large-scale pretraining, and memory-augmented architectures without incurring significant parameter or resource overhead (Peng et al., 7 May 2026).

7. Limitations and Ongoing Directions

While SDDLM and its contrastive extensions address major bottlenecks in optimization and inference, further exploration remains:

Tradeoff with ELBO Metrics: Some penalty under the ELBO persists for negative-gradient SDDLMs, though generated sample quality is higher. This suggests practical evaluation under generative metrics is essential.
Negative Sample Selection: The precise distribution over which negatives are drawn (random, noised, or otherwise) impacts generation sharpness; optimal strategies may be domain-dependent.
Generalization to Structured Domains: While step-efficiency and fluency gains are robust in natural language, transfer to structured data or highly multimodal contexts requires additional adaptation.

Further work will clarify when such simple denoising losses suffice and when richer modeling, planner-aware training, or multi-path consistency frameworks are necessary to close remaining quality gaps with AR models.

References