
DLM-Scope: Interpretability for Diffusion LMs

Updated 10 February 2026
  • DLM-Scope is a framework that uses sparse autoencoders to extract human-interpretable features from diffusion language models.
  • It enables feature steering during iterative denoising, allowing for controlled interventions and enhanced model diagnostics.
  • The framework uncovers unique 'negative-loss' regimes and offers novel tools for analyzing decoding order and concept evolution.

DLM-Scope is the first systematic mechanistic interpretability framework for diffusion LLMs (DLMs) based on sparse autoencoders (SAEs). As DLMs emerge as an alternative to autoregressive LLMs, understanding their internal representations and enabling controlled interventions is of increasing interest. DLM-Scope enables the extraction and manipulation of sparse, human-interpretable features in DLMs, uncovering both the unique effects of SAE-based interventions in denoising architectures and novel research directions that exploit the flexibility of the diffusion paradigm (Wang et al., 5 Feb 2026).

1. Diffusion LLMs: Architecture and Inference

Diffusion LLMs generate text via iterative denoising of a partially masked sequence. Let $x^0 = (x^0_1, \dots, x^0_N)$ be a clean data sample from a corpus. The forward process generates a corrupted sequence $x^t$ by independently masking each token with probability $t$ (for $t \in (0,1)$); the corresponding importance weight is $w(t) = 1/t$. The denoising model $p_\theta$ is trained to reconstruct the original tokens at masked positions with an importance-weighted cross-entropy loss:

$$L_{\mathrm{DLM}}(\theta) = \mathbb{E}_{x^0,\, t,\, x^t}\left[\, w(t) \sum_{i \,:\, x^t_i = [\mathrm{MASK}]} -\log p_\theta(x^0_i \mid x^t) \right]$$

[(Wang et al., 5 Feb 2026), Eq. (3)]. Inference is performed by sampling from $p_\theta(\cdot \mid x^{(k)})$, filling in masked slots, and then re-masking to the next mask rate, iteratively refining the sequence estimate through successive "denoising" and "remasking" steps.
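The forward masking process and the importance-weighted loss above can be sketched as follows (a minimal NumPy illustration, not the authors' implementation; the integer `MASK` id is a hypothetical stand-in for the `[MASK]` token):

```python
import numpy as np

MASK = -1  # hypothetical integer id standing in for the [MASK] token

def forward_mask(x0, t, rng):
    """Forward process: mask each token of x0 independently with probability t."""
    x0 = np.asarray(x0)
    mask = rng.random(x0.shape) < t
    return np.where(mask, MASK, x0), mask

def dlm_loss(log_probs, x0, mask, t):
    """Importance-weighted cross-entropy: w(t) = 1/t times the summed
    -log p_theta(x0_i | x^t) over masked positions i."""
    x0 = np.asarray(x0)
    nll = -log_probs[np.arange(len(x0)), x0]  # per-token -log p of the true id
    return (1.0 / t) * nll[mask].sum()
```

At inference, the same masking routine can implement the "remask to the next mask rate" step between denoising passes.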

2. Sparse Autoencoders: Architecture and Feature Extraction

At any layer $\ell$ in the DLM, the activation $x \in \mathbb{R}^d$ (for a token) is processed by a Top-$K$ sparse autoencoder:

  • Encoder: $h = \mathrm{TopK}(\mathrm{ReLU}(W_{\mathrm{enc}} x + b_{\mathrm{enc}}),\, L_0)$, restricting the code to its $L_0$ largest nonzero activations for sparsity.
  • Decoder: $\hat{x} = W_{\mathrm{dec}} h + b_{\mathrm{dec}}$, with $W_{\mathrm{dec}} \in \mathbb{R}^{d \times k}$.

The loss is a sum of reconstruction and sparsity terms,

$$L_{\mathrm{SAE}} = \|x - \hat{x}\|_2^2 + \lambda \|h\|_1,$$

where $\lambda$ is chosen to target the expected sparsity $L_0$ [(Wang et al., 5 Feb 2026), Eq. (1)].
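The encoder, decoder, and loss can be sketched in a few lines (a minimal NumPy illustration assuming $W_{\mathrm{enc}} \in \mathbb{R}^{k \times d}$; the dimension layout and Top-K tie-breaking are assumptions, not details from the paper):

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, L0):
    """Top-K SAE forward pass: ReLU-encode, keep only the L0 largest
    activations, then reconstruct x from the sparse code h."""
    pre = np.maximum(W_enc @ x + b_enc, 0.0)   # ReLU pre-activations
    h = np.zeros_like(pre)
    top = np.argsort(pre)[-L0:]                # indices of the L0 largest
    h[top] = pre[top]                          # Top-K sparsification
    x_hat = W_dec @ h + b_dec                  # reconstruction
    return h, x_hat

def sae_loss(x, x_hat, h, lam):
    """Reconstruction error plus lambda-weighted L1 sparsity penalty."""
    return np.sum((x - x_hat) ** 2) + lam * np.sum(np.abs(h))
```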

After training, each column $v_f$ of $W_{\mathrm{dec}}$ is a basis vector corresponding to a potentially interpretable feature. DLM-Scope supports "feature steering" during inference by injecting $v_f$ (with strength $\alpha$) into the layer-$\ell$ activations $X^{(\ell)}$ at one or more token positions: $X^{(\ell)}_{\mathrm{new}} = X^{(\ell)} + \alpha\, s \odot v_f$, where $s \in \{0,1\}^N$ is a selector mask (e.g., all tokens, or only the tokens still masked at that step).
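The steering update is a single broadcasted addition, sketched below (NumPy; the selector semantics follow the description above, and the function name is illustrative):

```python
import numpy as np

def steer(X, v_f, alpha, selector):
    """Feature steering: X_new = X + alpha * s (*) v_f, where the 0/1
    selector s picks which token positions receive direction v_f."""
    X = np.asarray(X, dtype=float)          # (N, d) layer activations
    v_f = np.asarray(v_f, dtype=float)      # (d,) decoder feature direction
    s = np.asarray(selector, dtype=float)[:, None]  # (N, 1) selector mask
    return X + alpha * s * v_f[None, :]     # broadcast over token positions
```

Setting `selector` to all ones steers every token, while a mask over still-masked positions restricts the intervention to tokens the model has yet to decode.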

3. Effects of SAE Insertion and Diffusion-time Interventions

A key finding is that SAE insertion in DLMs incurs qualitatively different effects on cross-entropy loss than in autoregressive LLMs. Specifically, DLMs exhibit a "negative-loss" regime in early (shallower) layers: after SAE insertion, masked-token cross-entropy can decrease (improve) relative to the baseline DLM, whereas in autoregressive LLMs any such insertion reliably increases the loss. This regime is summarized in Table 1, which shows that for Dream-7B (Mask model) $\Delta L_{\mathrm{DLM}} < 0$ in layers L1–L14 at $L_0 = 80$, while Qwen-2.5B always incurs a positive loss penalty [(Wang et al., 5 Feb 2026), Table 1].

SAE features are also shown to enable more effective interventions during denoising ("diffusion-time steering") than their LLM counterparts. Steering metrics assessed include:

  • Concept improvement $C(f)$: normalized change in the task-relevant concept score.
  • Perplexity reduction $P(f)$: relative improvement in sequence perplexity.
  • Combined score: $S(f) = C(f) + \gamma P(f)$.
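These metrics could be computed from before/after measurements roughly as follows; the exact normalizations are not specified here, so the formulas below are illustrative assumptions rather than the paper's definitions:

```python
def steering_scores(concept_before, concept_after,
                    ppl_before, ppl_after, gamma=1.0):
    """Illustrative steering metrics: C(f) as normalized concept change,
    P(f) as relative perplexity improvement, S(f) = C + gamma * P."""
    C = (concept_after - concept_before) / max(abs(concept_before), 1e-8)
    P = (ppl_before - ppl_after) / ppl_before  # positive if perplexity drops
    return C, P, C + gamma * P
```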

In DLMs, deep-layer SAE features achieve higher combined $S$-scores during intervention than LLM features, demonstrating superior steerability in the multi-step denoising process [(Wang et al., 5 Feb 2026), Table 2]. This effect is typically 2–10× larger in DLMs.

4. SAE-based Decoding Order Analysis

DLM-Scope leverages SAE codes to analyze how internal concepts evolve under different token remasking ("decoding") strategies:

  • ORIGIN: random order
  • TOPK-MARGIN: select tokens with highest prediction margin
  • ENTROPY: select tokens with lowest entropy

For each masked token $i$ at layer $\ell$ and step $k$, $h_{\ell,k,i}$ records the active SAE features. By analyzing the pre-decode Jaccard stability $S^{\mathrm{pre}}_{\ell,k}$ and the post-decode drift $D^{\mathrm{post}}_\ell$, DLM-Scope reveals that random-order updates produce stable, slowly varying feature sets, while confidence-based strategies induce earlier and deeper conceptual shifts. These dynamics are linked to downstream accuracy (e.g., on GSM8K), suggesting that SAE features can serve as diagnostic signals for choosing a decoding policy (Wang et al., 5 Feb 2026).
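The Jaccard-based stability can be sketched as below; averaging over consecutive steps is an assumption, and the paper's exact definition of $S^{\mathrm{pre}}_{\ell,k}$ may differ:

```python
def jaccard(a, b):
    """Jaccard similarity of two active-feature sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

def pre_decode_stability(feature_sets):
    """Mean Jaccard similarity between consecutive steps' active-feature
    sets for one token (a sketch in the spirit of S^pre)."""
    sims = [jaccard(a, b) for a, b in zip(feature_sets, feature_sets[1:])]
    return sum(sims) / len(sims)
```

A value near 1 indicates a slowly varying concept set across denoising steps (as reported for random-order decoding), while lower values flag early conceptual shifts.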

5. SAE Feature Stability Across DLM Post-training

SAE features in DLMs are remarkably stable under post-training shifts (e.g., instruction tuning). A base-trained SAE, when applied to the instruction-tuned DLM (Dream-SFT), yields nearly identical functional fidelity ($\Delta L_{\mathrm{DLM}}$) and explained variance through almost all layers. Specifically, for layers L1–L23 the change in both metrics is negligible ($|\Delta L_{\mathrm{BASE}} - \Delta L_{\mathrm{SFT}}| \ll 0.1$). Only in the deepest layer (L27) does the SFT-induced subspace shift appreciably affect the autoencoder, marking a boundary for robust SAE transferability (Wang et al., 5 Feb 2026).
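One standard way to quantify this transferability is the explained variance of the SAE reconstruction on each model's activations, sketched below (a common definition, assumed here rather than quoted from the paper):

```python
import numpy as np

def explained_variance(X, X_hat):
    """Fraction of activation variance explained by the SAE
    reconstruction: 1 - residual SS / total SS over a batch (N, d)."""
    resid = np.sum((X - X_hat) ** 2)
    total = np.sum((X - X.mean(axis=0)) ** 2)
    return 1.0 - resid / total
```

Comparing this quantity for a base-trained SAE on base vs. SFT activations, layer by layer, reproduces the kind of transferability check described above.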

6. Insights, Limitations, and Prospects

DLM-Scope demonstrates that sparse feature extraction, interpretability, and steering are more effective and stable in DLMs than in autoregressive LLMs. The "negative-loss" SAE insertion regime is unique to DLMs, and DLM-Scope enables novel diagnostics for decoding order and concept evolution.

Limitations include experiments restricted to Dream-7B and Dream-8B models only, with reduced SAE feature transfer in the deepest network layers under strong post-training modifications. Future directions include scaling SAE-based interpretability to 100B-parameter DLMs, integrating with continuous diffusion modeling, and leveraging SAE-guided curriculum learning for improved sampling and generation in DLMs.

DLM-Scope establishes the methodological basis for mechanistic interpretability in diffusion LLMs, providing both practical and theoretical tools to probe, analyze, and control DLM representations and behavior (Wang et al., 5 Feb 2026).
