CASL-Steer: Causal Probing in Diffusion Models
- CASL-Steer is a causal probing technique in diffusion models that uses supervised sparse autoencoding to enable targeted, interpretable latent manipulations.
- It employs a sparse autoencoder framework on U-Net activations, achieving low reconstruction error (MSE ≈ 0.0191) and high latent sparsity.
- Empirical tests on benchmarks like CelebA-HQ show that CASL-Steer attains high editing precision ratios (≈4.47) while preserving image fidelity and identity.
CASL-Steer is a causal probing technique for diffusion models, introduced to enable precise, interpretable manipulation of latent representations by learning concept-aligned sparse latents via supervised sparse autoencoding. It is distinguished from editing methods by its exclusive use as a causal probe rather than a tool for direct generative editing, providing a principled mechanism for attributing semantic effects in high-dimensional generative processes (He et al., 21 Jan 2026).
1. Sparse Autoencoder Framework for U-Net Activations
At the foundation of CASL-Steer is a sparse autoencoder (SAE) trained on frozen bottleneck activations from a diffusion model's U-Net. The activations are reshaped into $H \in \mathbb{R}^{N \times d}$, where $N$ collects the batch and spatial positions and $d$ is the channel dimension. The encoder consists of a linear layer $W_e \in \mathbb{R}^{m \times d}$ (with $m \gg d$ for overcompleteness), a learnable bias $b_e$, a timestep embedding $e_t$, and a pre-bias $b_{\mathrm{pre}}$. After bias adjustment and embedding, activations are mapped and passed through ReLU:

$$z = \operatorname{ReLU}\!\big(W_e\,(h - b_{\mathrm{pre}} + e_t) + b_e\big).$$

The decoder reconstructs the activations via $W_d$ and $b_{\mathrm{pre}}$:

$$\hat{h} = W_d\,z + b_{\mathrm{pre}}.$$

The objective function is a combination of reconstruction loss and an $\ell_1$ sparsity penalty on $z$:

$$\mathcal{L}_{\mathrm{SAE}} = \lVert h - \hat{h} \rVert_2^2 + \lambda \lVert z \rVert_1.$$

At fine-tuned expansion ratios (e.g., $m/d = 128$), the SAE achieves a mean squared error (MSE) of approximately 0.0191 and a dimension activation ratio (DAR) of about 1.08%, confirming the highly sparse structure of the learned latent space (He et al., 21 Jan 2026).
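To make the setup concrete, the following is a minimal PyTorch sketch of an SAE with this encoder/decoder structure; the class and variable names, the additive timestep embedding, and the hyperparameter values are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Overcomplete SAE over flattened U-Net bottleneck activations (sketch)."""
    def __init__(self, d: int, expansion: int = 128, n_timesteps: int = 1000):
        super().__init__()
        m = expansion * d                          # overcomplete latent width m >> d
        self.b_pre = nn.Parameter(torch.zeros(d))  # pre-bias subtracted before encoding
        self.t_emb = nn.Embedding(n_timesteps, d)  # assumed additive timestep embedding e_t
        self.enc = nn.Linear(d, m)                 # W_e and learnable bias b_e
        self.dec = nn.Linear(m, d, bias=False)     # W_d; b_pre is re-added after decoding

    def forward(self, h: torch.Tensor, t: torch.Tensor):
        # h: (N, d) flattened activations; t: (N,) integer timesteps
        z = F.relu(self.enc(h - self.b_pre + self.t_emb(t)))
        h_hat = self.dec(z) + self.b_pre
        return h_hat, z

def sae_loss(h, h_hat, z, lam: float = 1e-3):
    # reconstruction MSE plus an L1 sparsity penalty on the code z
    return F.mse_loss(h_hat, h) + lam * z.abs().mean()
```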
2. Supervised Alignment of Sparse Latents with Semantic Concepts
Following SAE training, the encoder weights are frozen. For any activation $h$, the sparse latent $z$ is generated as above. CASL learns a lightweight linear mapping $W \in \mathbb{R}^{C \times m}$ to predict activation shifts corresponding to semantic concepts:

$$\Delta z_c = w_c.$$

The prediction $\Delta z_c$ is intended to drive $z + \Delta z_c$ toward supporting the target concept in image space. The training loss combines a DiffusionCLIP component—which aligns image edits to CLIP-embedded semantic targets—and an image reconstruction penalty:

$$\mathcal{L}_{\mathrm{CASL}} = \mathcal{L}_{\mathrm{DiffusionCLIP}} + \lambda_{\mathrm{rec}}\,\mathcal{L}_{\mathrm{rec}}.$$

Concept alignment is made sparse via top-$k$ selection:

$$S_c = \operatorname{TopK}_k\big(\lvert w_c \rvert\big),$$

where $w_c$ is the row of $W$ for concept $c$. Only the latents in $S_c$ are used for steering, ensuring selective, interpretable interventions.
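A small sketch of the top-$k$ support selection, assuming the rows of a learned matrix `W` hold per-concept shift directions (the shapes and `k` are placeholders):

```python
import torch

def topk_support(W: torch.Tensor, concept: int, k: int = 16) -> torch.Tensor:
    """Indices S_c of the k largest-magnitude entries of the concept row w_c."""
    w_c = W[concept]                        # concept-aligned shift direction, shape (m,)
    return torch.topk(w_c.abs(), k).indices

# Example with placeholder weights: 10 concepts over an m-dimensional latent space.
W = torch.randn(10, 8192)
S_c = topk_support(W, concept=3)            # sparse support used for steering
```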
3. CASL-Steer: Controlled Latent Intervention Protocol
CASL-Steer applies a controlled latent shift along the supervised, concept-aligned direction. For a target concept $c$, an editing coordinate is constructed:

$$\delta_c = \alpha \cdot \mathbb{1}_{S_c},$$

where $\alpha$ is an intensity hyperparameter and $\mathbb{1}_{S_c}$ indicates membership in the top-$k$ support $S_c$. The adjusted activation shift is

$$\Delta z_c = \delta_c \odot w_c,$$

with $\odot$ denoting elementwise multiplication. At denoising timestep $t$, the bottleneck activation is updated:

$$\tilde{h}_t = W_d\big(z_t + \Delta z_c\big) + b_{\mathrm{pre}}.$$

The DDIM denoising step is then performed with the modified activation:

$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\,\frac{x_t - \sqrt{1 - \bar{\alpha}_t}\,\tilde{\epsilon}_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}} + \sqrt{1 - \bar{\alpha}_{t-1}}\,\tilde{\epsilon}_\theta(x_t, t),$$

where $\tilde{\epsilon}_\theta$ denotes the U-Net noise prediction computed with the steered bottleneck $\tilde{h}_t$ and $\bar{\alpha}_t$ is the noise-schedule coefficient. Unlike generative editors, CASL-Steer is applied solely as a causal probe to diagnose which latents are responsible for which semantic attributes.
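A sketch of this intervention at a single denoising step; `sae.encode`/`sae.decode` are hypothetical helpers standing in for the frozen encoder and decoder, and only the latent arithmetic follows the formulas above:

```python
import torch

def steer_bottleneck(h, sae, W, concept: int, alpha: float = 1.0, k: int = 16):
    """Shift the bottleneck activation h along the concept-aligned sparse latents."""
    z = sae.encode(h)                               # sparse code z (hypothetical API)
    w_c = W[concept]
    mask = torch.zeros_like(w_c)
    mask[torch.topk(w_c.abs(), k).indices] = 1.0    # indicator 1_{S_c}
    delta = alpha * mask * w_c                      # Δz_c = (α · 1_{S_c}) ⊙ w_c
    return sae.decode(z + delta)                    # steered activation for the DDIM step
```

The steered activation replaces the original bottleneck output before the U-Net's noise prediction, leaving the rest of the DDIM trajectory unchanged.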
4. Editing Precision Ratio: Quantitative Assessment of Causality and Specificity
The Editing Precision Ratio (EPR) is defined to quantitatively measure the specificity of a causal intervention relative to off-target attribute drift. For target/off-target concept pairs $(c, c')$:

$$\mathrm{EPR}(c, c') = \frac{\lvert \Delta s_c \rvert}{\lvert \Delta s_{c'} \rvert},$$

where $s_c$ and $s_{c'}$ are CLIP-based concept scores and $\Delta s$ denotes the score change between the original and edited images. Higher EPR values correspond to more precise, less confounded interventions. CASL-Steer achieves top EPR (e.g., $\approx 4.47$ for "Smiling"), outperforming baseline methods (He et al., 21 Jan 2026).
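A sketch of the per-pair EPR computation, where `clip_score` is a hypothetical helper returning a scalar CLIP concept score for an image:

```python
def epr(x_orig, x_edit, target: str, off_target: str, clip_score) -> float:
    """EPR(c, c') = |Δs_c| / |Δs_c'| for one target/off-target concept pair."""
    d_target = clip_score(x_edit, target) - clip_score(x_orig, target)
    d_off = clip_score(x_edit, off_target) - clip_score(x_orig, off_target)
    return abs(d_target) / max(abs(d_off), 1e-8)    # guard against zero off-target drift
```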
5. Empirical Results and Analysis
CASL-Steer provides clean, localized semantic interventions, such as inducing "smiling," increasing "youth," or imposing architectural concepts like "Gothic church," while preserving identity and non-target content. On the CelebA-HQ benchmark, CASL-Steer achieves the highest EPR across all tested attributes, together with improved CLIP-Score, the lowest LPIPS (perceptual distance), and the best ArcFace identity preservation. The SAE's overcomplete structure allows a single top-1 latent to predict concepts with over 65% accuracy, rising to over 93% with the top 16—substantially above chance.
Ablation studies demonstrate:
- Editing along a single dimension maintains stable EPR across intensity values, underscoring the value of the sparse, concept-aligned basis.
- Increasing the number of edited latents (larger $k$) degrades EPR, indicating a tradeoff between intervention generality and specificity.
- For fixed $k$, both target and off-target changes scale proportionally with the intensity $\alpha$, keeping EPR nearly constant.
These results confirm that CASL-Steer enables precise, interpretable causal analysis of semantic control axes in diffusion models (He et al., 21 Jan 2026).
6. Significance and Implications for Diffusion Model Interpretability
CASL-Steer establishes the first protocol for supervised, causality-focused probing of sparse latent directions aligned to human-defined semantics in diffusion models. By disentangling the representation space and allowing controlled latent interventions, it addresses key challenges in attribution and interpretability. Its principled avoidance of confounding factors, demonstrated via high EPR, and preservation of generative fidelity position it as a benchmark for future work in concept-level model understanding, diagnosis, and semantic control. A plausible implication is the applicability of similar alignment and steering protocols beyond image synthesis, potentially informing interpretability frameworks in large-scale generative architectures (He et al., 21 Jan 2026).