DLM-Scope: Interpretability for Diffusion LMs
- DLM-Scope is a framework that uses sparse autoencoders to extract human-interpretable features from diffusion language models.
- It enables feature steering during iterative denoising, allowing for controlled interventions and enhanced model diagnostics.
- The framework uncovers unique 'negative-loss' regimes and offers novel tools for analyzing decoding order and concept evolution.
DLM-Scope is the first systematic mechanistic interpretability framework for diffusion LLMs (DLMs) based on sparse autoencoders (SAEs). As DLMs emerge as an alternative to autoregressive LLMs, understanding their internal representations and enabling controlled interventions is of increasing interest. DLM-Scope enables the extraction and manipulation of sparse, human-interpretable features in DLMs, uncovering both the unique effects of SAE-based interventions in denoising architectures and novel research directions that exploit the flexibility of the diffusion paradigm (Wang et al., 5 Feb 2026).
1. Diffusion LLMs: Architecture and Inference
Diffusion LLMs generate text via iterative denoising of a partially masked sequence. Let $x_0$ be a clean data sample from a corpus. The forward process generates a corrupted sequence $x_t$ by independently masking each token with probability $t$ (for $t \in (0, 1]$), where $t$ is the mask rate. The denoising model $p_\theta$ is trained to reconstruct the original tokens at masked positions with an importance-weighted cross-entropy loss:

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{t,\,x_0,\,x_t}\left[\frac{1}{t}\sum_{i:\, x_t^i = \text{[MASK]}} \log p_\theta\!\left(x_0^i \mid x_t\right)\right]$$

[(Wang et al., 5 Feb 2026), Eq. (3)]. Inference is performed by sampling from $p_\theta(\cdot \mid x_t)$ to fill in masked slots, then re-masking to the next (lower) mask rate, iteratively improving the sequence estimate through successive "denoising" and "remasking" steps.
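The denoise-and-remask loop can be sketched as follows; the `model` callable, the mask token, and the random remasking policy are illustrative placeholders, not the paper's implementation:

```python
import random

MASK = "[MASK]"

def denoise(tokens, model, schedule):
    """Iteratively fill masked slots, then re-mask to the next (lower) rate.

    `model(tokens)` is a hypothetical callable returning a predicted token
    for every position; `schedule` is a decreasing list of mask rates
    ending at 0.0 so the final sequence contains no masks.
    """
    for rate in schedule:
        # Denoise: let the model propose a token at every masked position.
        preds = model(tokens)
        tokens = [preds[i] if tok == MASK else tok for i, tok in enumerate(tokens)]
        # Remask: randomly re-mask a fraction `rate` of positions for the next step.
        n_mask = int(rate * len(tokens))
        for i in random.sample(range(len(tokens)), n_mask):
            tokens[i] = MASK
    return tokens
```

Real samplers replace the random remasking step with a confidence-based policy (see Section 4), but the fill/remask alternation is the same.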
2. Sparse Autoencoders: Architecture and Feature Extraction
At any layer $\ell$ of the DLM, the activation $h \in \mathbb{R}^d$ for a token is processed by a Top-$k$ sparse autoencoder:
- Encoder: $z = \mathrm{TopK}\big(W_{\mathrm{enc}} h + b_{\mathrm{enc}}\big)$, retaining only the $k$ largest activations (all others zeroed) to enforce sparsity.
- Decoder: $\hat{h} = W_{\mathrm{dec}} z + b_{\mathrm{dec}}$, with $W_{\mathrm{dec}} \in \mathbb{R}^{d \times m}$ for a dictionary of $m$ features.
The loss is the sum of a reconstruction term and a sparsity penalty, $\mathcal{L}_{\mathrm{SAE}} = \lVert h - \hat{h} \rVert_2^2 + \lambda \lVert z \rVert_1$, where $\lambda$ is chosen to target the expected sparsity [(Wang et al., 5 Feb 2026), Eq. (1)].
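A minimal forward pass for a Top-$k$ SAE of this shape might look like the sketch below; the shapes and the ReLU on the surviving activations are assumptions of this sketch, not details from the paper:

```python
import numpy as np

def topk_sae_forward(h, W_enc, b_enc, W_dec, b_dec, k):
    """Forward pass of a Top-k sparse autoencoder (illustrative sketch).

    h: (d,) activation vector; W_enc: (m, d); W_dec: (d, m).
    Keeps only the k largest pre-activations; all others are zeroed.
    """
    pre = W_enc @ h + b_enc                # (m,) latent pre-activations
    z = np.zeros_like(pre)
    top = np.argsort(pre)[-k:]             # indices of the k largest entries
    z[top] = np.maximum(pre[top], 0.0)     # keep top-k (ReLU assumed for nonnegativity)
    h_hat = W_dec @ z + b_dec              # reconstruction of the activation
    return z, h_hat
```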
After training, each column $d_j$ of $W_{\mathrm{dec}}$ is a basis vector corresponding to a potentially interpretable feature. DLM-Scope supports "feature steering" during inference by injecting $\alpha d_j$ (with strength $\alpha$) into one or more token positions: $h_p \leftarrow h_p + \alpha\, m_p\, d_j$, where $m_p \in \{0, 1\}$ is a selector mask (e.g., all tokens, or only the still-masked tokens at that step).
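The steering update amounts to adding a scaled decoder column at selected positions; the helper name and boolean-mask interface below are hypothetical:

```python
import numpy as np

def steer(H, d_j, alpha, mask):
    """Inject feature direction d_j with strength alpha at selected positions.

    H: (T, d) token activations; d_j: (d,) decoder column;
    mask: (T,) boolean selector (e.g., True only at still-masked positions).
    Returns a steered copy; the original activations are left untouched.
    """
    H = H.copy()
    H[mask] = H[mask] + alpha * d_j   # broadcast d_j across selected rows
    return H
```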
3. Effects of SAE Insertion and Diffusion-time Interventions
A key finding is that SAE insertion in DLMs has qualitatively different effects on cross-entropy loss than in autoregressive LLMs. Specifically, DLMs exhibit a "negative-loss" regime in early (shallower) layers: after SAE insertion, masked-token cross-entropy can decrease (improve) relative to the baseline DLM, whereas in autoregressive LLMs such insertion reliably increases the loss. This regime is summarized in Table 1, which shows $\Delta\mathcal{L} < 0$ for Dream-7B (Mask model) in layers L1–L14, while Qwen-2.5B always incurs a positive penalty [(Wang et al., 5 Feb 2026), Table 1].
SAE features are also shown to enable more effective interventions during denoising ("diffusion-time steering") than their LLM counterparts. Steering metrics assessed include:
- Concept improvement $\Delta C$: normalized change in the task-relevant concept score.
- Perplexity reduction $\Delta P$: relative improvement in sequence perplexity.
- Combined score $S$, aggregating concept improvement and perplexity reduction.
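A toy computation of these metrics, assuming a simple additive aggregation for the combined score (the paper's exact normalization may differ):

```python
def steering_scores(concept_before, concept_after, ppl_before, ppl_after):
    """Toy steering metrics; names and normalization are illustrative.

    delta_c: normalized change in a task-relevant concept score.
    delta_p: relative perplexity improvement (positive = lower perplexity).
    s:       simple additive combined score (an assumption of this sketch).
    """
    delta_c = (concept_after - concept_before) / max(abs(concept_before), 1e-8)
    delta_p = (ppl_before - ppl_after) / ppl_before
    return delta_c, delta_p, delta_c + delta_p
```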
In DLMs, deep-layer SAE features achieve higher combined $S$-scores under intervention than their LLM counterparts, demonstrating superior steerability in the multi-step denoising process [(Wang et al., 5 Feb 2026), Table 2]. The effect is typically 2–10× larger in DLMs.
4. SAE-based Decoding Order Analysis
DLM-Scope leverages SAE codes to analyze how internal concepts evolve under different token remasking ("decoding") strategies:
- ORIGIN: random order
- TOPK-MARGIN: select tokens with highest prediction margin
- ENTROPY: select tokens with lowest entropy
For each masked token $i$ at layer $\ell$ and step $s$, DLM-Scope records the set $F_{i,\ell,s}$ of active SAE features. By analyzing step-to-step Jaccard stability $J(F_{i,\ell,s}, F_{i,\ell,s+1})$ and post-decode drift (how a token's active-feature set changes after it is decoded), DLM-Scope reveals that random-order updates produce stable, slowly varying feature sets, while confidence-based strategies induce earlier and deeper conceptual shifts. These dynamics are linked to downstream accuracy (e.g., on GSM8K), suggesting that SAE features can serve as diagnostic signals for selecting a decoding policy (Wang et al., 5 Feb 2026).
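Jaccard stability over a token's active-feature sets across denoising steps can be computed directly; the trace helper below is an illustrative sketch:

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of active SAE feature indices."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty feature sets are identical
    return len(a & b) / len(a | b)

def stability_trace(feature_sets):
    """Step-to-step Jaccard stability of a token's active-feature sets.

    feature_sets: list of sets, one per denoising step. High values mean
    the token's concepts evolve slowly; sharp drops flag conceptual shifts.
    """
    return [jaccard(feature_sets[s], feature_sets[s + 1])
            for s in range(len(feature_sets) - 1)]
```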
5. SAE Feature Stability Across DLM Post-training
SAE features in DLMs are remarkably stable with respect to post-training shifts (e.g., instruction tuning). A base-trained SAE, when applied to the instruction-tuned DLM (Dream-SFT), yields nearly identical functional fidelity and explained variance through almost all layers. For layers L1–L23, the change in both metrics is negligible; only in the deepest layer (L27) does the SFT-induced subspace shift appreciably affect the autoencoder, marking a boundary for robust SAE transferability (Wang et al., 5 Feb 2026).
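Explained variance of the transferred SAE's reconstructions can be measured with a standard $1 - \mathrm{SS}_{\mathrm{res}}/\mathrm{SS}_{\mathrm{tot}}$ ratio; this sketch assumes row-wise activation matrices and may differ from the paper's exact fidelity metric:

```python
import numpy as np

def explained_variance(H, H_hat):
    """Fraction of activation variance explained by SAE reconstructions.

    H, H_hat: (n, d) arrays of original and reconstructed activations.
    Returns 1.0 for perfect reconstruction, 0.0 for predicting the mean.
    """
    resid = np.sum((H - H_hat) ** 2)
    total = np.sum((H - H.mean(axis=0)) ** 2)
    return 1.0 - resid / total
```

Comparing this quantity layer by layer between the base and SFT checkpoints is one way to locate the transfer boundary described above.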
6. Insights, Limitations, and Prospects
DLM-Scope demonstrates that sparse feature extraction, interpretability, and steering are more effective and stable in DLMs than in autoregressive LLMs. The "negative-loss" SAE insertion regime is unique to DLMs, and DLM-Scope enables novel diagnostics for decoding order and concept evolution.
Limitations include experiments restricted to the Dream-7B and Dream-8B models, and reduced SAE feature transfer in the deepest network layers under strong post-training modifications. Future directions include scaling SAE-based interpretability to 100B-parameter DLMs, integrating with continuous diffusion modeling, and leveraging SAE-guided curriculum learning to improve sampling and generation in DLMs.
DLM-Scope establishes the methodological basis for mechanistic interpretability in diffusion LLMs, providing both practical and theoretical tools to probe, analyze, and control DLM representations and behavior (Wang et al., 5 Feb 2026).