
CASL-Steer: Causal Probing in Diffusion Models

Updated 24 January 2026
  • CASL-Steer is a causal probing technique in diffusion models that uses supervised sparse autoencoding to enable targeted, interpretable latent manipulations.
  • It employs a sparse autoencoder framework on U-Net activations, achieving low reconstruction error (MSE ≈ 0.0191) and high latent sparsity.
  • Empirical tests on benchmarks like CelebA-HQ show that CASL-Steer attains high editing precision ratios (≈4.47) while preserving image fidelity and identity.

CASL-Steer is a causal probing technique in the domain of diffusion models, introduced to enable precise, interpretable manipulation of latent representations by leveraging concept-aligned sparse latents obtained via supervised sparse autoencoding. It is distinguished from generative editing methods by its exclusive use as a causal probe rather than a direct editing tool, providing a principled mechanism for attributing semantic effects in high-dimensional generative processes (He et al., 21 Jan 2026).

1. Sparse Autoencoder Framework for U-Net Activations

At the foundation of CASL-Steer is a sparse autoencoder (SAE) trained on frozen bottleneck activations $h \in \mathbb{R}^{C \times H \times W}$ from a diffusion model's U-Net. The activations are reshaped into $h^{(t)} \in \mathbb{R}^{N \times C}$, where $N = H \cdot W$. The encoder consists of a linear layer $W_\eta \in \mathbb{R}^{K \times C}$ (with $K \gg C$ for overcompleteness), a learnable bias $b_\eta \in \mathbb{R}^K$, a timestep embedding $e(t) \in \mathbb{R}^C$, and a pre-bias $b_{\mathrm{pre}} \in \mathbb{R}^C$. After bias adjustment and embedding, activations are mapped and passed through a ReLU nonlinearity $\phi$:

$$z^{(t)} = \phi\left(W_\eta \left(h^{(t)} + e(t) - b_{\mathrm{pre}}\right) + b_\eta\right) \in \mathbb{R}^{N \times K}.$$

The decoder reconstructs the activations via $W_\psi \in \mathbb{R}^{C \times K}$ and $b_{\mathrm{pre}}$:

$$\hat{h}^{(t)} = W_\psi z^{(t)} + b_{\mathrm{pre}}.$$

The objective combines a reconstruction loss with an $\ell_1$ sparsity penalty on $z$:

$$\mathcal{L}_{\mathrm{SAE}} = \|h - \hat{h}\|_2^2 + \lambda_{\mathrm{sparse}} \|z\|_1.$$

At fine-tuned expansion ratios (e.g., 128), the SAE achieves a mean squared error (MSE) of approximately 0.0191 and a dimension activation ratio (DAR) of about 1.08%, confirming the highly sparse structure of the learned latent space (He et al., 21 Jan 2026).
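The encoder, decoder, and loss above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: all dimensions, weight initializations, and the sparsity coefficient `lam_sparse` are toy stand-ins, and the timestep embedding is a fixed random vector rather than a learned one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: C channels, K >> C overcomplete latents, N = H*W spatial tokens.
C, K, N = 8, 64, 16
W_eta = rng.normal(0.0, 0.1, size=(K, C))    # encoder weights
b_eta = np.zeros(K)                          # encoder bias
W_psi = rng.normal(0.0, 0.1, size=(C, K))    # decoder weights
b_pre = np.zeros(C)                          # pre-bias
e_t   = rng.normal(0.0, 0.01, size=C)        # stand-in timestep embedding e(t)

def encode(h):
    # z = ReLU(W_eta (h + e(t) - b_pre) + b_eta), applied row-wise over N tokens
    pre = (h + e_t - b_pre) @ W_eta.T + b_eta
    return np.maximum(pre, 0.0)

def decode(z):
    # h_hat = W_psi z + b_pre
    return z @ W_psi.T + b_pre

def sae_loss(h, lam_sparse=1e-3):
    z = encode(h)
    h_hat = decode(z)
    recon = np.sum((h - h_hat) ** 2)         # ||h - h_hat||_2^2
    sparsity = np.sum(np.abs(z))             # ||z||_1
    return recon + lam_sparse * sparsity, z

h = rng.normal(size=(N, C))                  # stand-in flattened U-Net activations
loss, z = sae_loss(h)
dar = np.mean(z > 0)                         # fraction of active latent dimensions
print(z.shape, loss > 0, 0.0 <= dar <= 1.0)
```

The ReLU and the $\ell_1$ term together are what push most latent dimensions to exact zero, which is what the low DAR measures.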

2. Supervised Alignment of Sparse Latents with Semantic Concepts

Following SAE training, the encoder weights are frozen. For any $h$, the sparse latent $z$ is generated as above. CASL learns a lightweight linear mapping to predict activation shifts corresponding to semantic concepts:

$$\Delta h = W_\Delta z + b_\Delta, \qquad W_\Delta \in \mathbb{R}^{C \times K},\; b_\Delta \in \mathbb{R}^{C}.$$

The prediction $\Delta h$ is intended to drive $h$ toward supporting the target concept in image space. The training loss combines a DiffusionCLIP component, which aligns image edits to CLIP-embedded semantic targets, with an $L_1$ image reconstruction penalty:

$$\mathcal{L} = \lambda_{\mathrm{CLIP}}\, \mathcal{L}_{\mathrm{DiffusionCLIP}}\big(\hat{x}_0^{\mathrm{edit}}, y^{\mathrm{ref}}; x_0^{\mathrm{orig}}, y^{\mathrm{orig}}\big) + \lambda_{\mathrm{recon}} \|\hat{x}_0^{\mathrm{edit}} - x_0^{\mathrm{orig}}\|_1.$$

Concept alignment is made sparse via top-$k$ selection:

$$\mathcal{I}_c = \mathrm{TopK}_k(|w_c|),$$

where $w_c$ is the row of $W_\Delta$ associated with concept $c$. Only the latents in $\mathcal{I}_c$ are used for steering, ensuring selective, interpretable interventions.
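The linear concept head and its top-$k$ support selection can be sketched as below. Again this is a toy sketch: `W_delta` would in practice be trained against the CLIP-guided loss above, whereas here it is random, and treating row 0 of `W_delta` as "the row for concept $c$" simply follows the paper's indexing convention.

```python
import numpy as np

rng = np.random.default_rng(1)
C, K, k = 8, 64, 4                     # toy dims; keep k latents per concept

W_delta = rng.normal(size=(C, K))      # stand-in for the learned concept mapping
b_delta = np.zeros(C)                  # stand-in for the learned bias

def predict_shift(z):
    # Delta h = W_delta z + b_delta
    return W_delta @ z + b_delta

def concept_support(w_c, k):
    # I_c = TopK_k(|w_c|): indices of the k largest-magnitude entries of w_c
    return np.argsort(np.abs(w_c))[-k:]

z = np.maximum(rng.normal(size=K), 0.0)  # a sparse ReLU latent from the SAE
w_c = W_delta[0]                         # row associated with a concept c
I_c = concept_support(w_c, k)

dh = predict_shift(z)
print(dh.shape, sorted(I_c))
```

Because only the indices in `I_c` are later allowed to move, the intervention touches at most $k$ of the $K$ latent dimensions.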

3. CASL-Steer: Controlled Latent Intervention Protocol

CASL-Steer applies a controlled latent shift along the supervised, concept-aligned direction. For a target concept $c$, an editing coefficient vector $\alpha \in \mathbb{R}^K$ is constructed as

$$\alpha_i = \begin{cases} \alpha, & i \in \mathcal{I}_c \\ 0, & \text{otherwise}, \end{cases}$$

where the scalar $\alpha$ is an intensity hyperparameter. The adjusted activation shift is

$$\Delta h_c = W_\Delta (\alpha \odot z),$$

with $\odot$ denoting elementwise multiplication. At denoising timesteps $t \geq t_{\mathrm{edit}}$, the bottleneck activation is updated as $h_t' = h_t + \Delta h_c$, and the DDIM denoising step is performed with the modified activation:

$$x_{t-1} = \sqrt{\alpha_{t-1}}\, P_t\big(\epsilon_\theta(x_t \mid h_t')\big) + D_t\big(\epsilon_\theta(x_t)\big).$$

Unlike generative editors, CASL-Steer is applied solely as a causal probe to diagnose which latents are responsible for which semantic attributes.
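The masked-shift construction (everything up to the DDIM step itself) can be sketched as follows, with toy random weights standing in for the trained $W_\Delta$ and the actual denoiser omitted:

```python
import numpy as np

rng = np.random.default_rng(2)
C, K, k = 8, 64, 4
alpha = 2.0                                # intensity hyperparameter

W_delta = rng.normal(size=(C, K))          # stand-in trained concept mapping
z = np.maximum(rng.normal(size=K), 0.0)    # sparse SAE latent for the current h_t
I_c = np.argsort(np.abs(W_delta[0]))[-k:]  # top-k support for concept c

# Build the masked coefficient vector: alpha on I_c, zero elsewhere.
alpha_vec = np.zeros(K)
alpha_vec[I_c] = alpha

# Delta h_c = W_delta (alpha ⊙ z); then h_t' = h_t + Delta h_c.
dh_c = W_delta @ (alpha_vec * z)
h_t = rng.normal(size=C)                   # stand-in bottleneck activation
h_t_prime = h_t + dh_c

print(h_t_prime.shape)
```

Note that because `alpha_vec` is zero outside `I_c`, the shift is exactly a weighted sum of the $k$ selected decoder columns, which is what makes the intervention attributable to individual latents.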

4. Editing Precision Ratio: Quantitative Assessment of Causality and Specificity

The Editing Precision Ratio (EPR) is defined to quantitatively measure the specificity of a causal intervention relative to off-target attribute drift. For $N$ edited image pairs $\{(x_i, x_i')\}$:

$$\Delta_{\mathrm{target}} = \frac{1}{N} \sum_{i=1}^N \left| f_{\mathrm{target}}(x_i') - f_{\mathrm{target}}(x_i) \right|$$

$$\Delta_{\mathrm{non\_target}} = \frac{1}{L} \sum_{j=1}^L \frac{1}{N} \sum_{i=1}^N \left| f_j(x_i') - f_j(x_i) \right|$$

$$\mathrm{EPR} = \frac{\Delta_{\mathrm{target}}}{\Delta_{\mathrm{non\_target}} + \epsilon},$$

where $f_{\mathrm{target}}$ and the $f_j$ are CLIP-based concept scores over the target and $L$ non-target attributes, and $\epsilon = 10^{-8}$. Higher EPR values correspond to more precise, less confounded interventions. CASL-Steer achieves the top EPR (e.g., $\approx 4.47$ for "Smiling"), outperforming baseline methods (He et al., 21 Jan 2026).
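The EPR computation is straightforward to implement once per-image concept scores are available. The sketch below uses fabricated toy scores in place of real CLIP outputs, purely to show the arithmetic:

```python
import numpy as np

def epr(target_before, target_after, nt_before, nt_after, eps=1e-8):
    """Editing Precision Ratio from per-image concept scores.

    target_*: shape (N,)   -- scores for the target concept
    nt_*:     shape (L, N) -- scores for L non-target concepts
    """
    d_target = np.mean(np.abs(target_after - target_before))
    # A single mean over the (L, N) array equals (1/L) sum_j (1/N) sum_i |...|.
    d_non_target = np.mean(np.abs(nt_after - nt_before))
    return d_target / (d_non_target + eps)

# Toy scores: the target concept shifts by 0.4 on every image,
# while two off-target concepts drift by only 0.1.
tb = np.array([0.1, 0.2, 0.3]);  ta = tb + 0.4
nb = np.full((2, 3), 0.5);       na = nb + 0.1

print(round(epr(tb, ta, nb, na), 2))   # prints 4.0
```

An EPR of 4.0 here reads as: the intended attribute moved four times as much as the average unintended attribute.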

5. Empirical Results and Analysis

CASL-Steer provides clean, localized semantic interventions, such as inducing "smiling," increasing "youth," or shifting architectural style toward "Gothic church," while preserving identity and non-target content. On the CelebA-HQ benchmark, CASL-Steer achieves the highest EPR across all tested attributes, along with improved CLIP-Score, the lowest LPIPS (perceptual distance), and the best ArcFace identity preservation. The SAE's overcomplete structure allows the single top-1 latent to predict concepts with over 65% accuracy, rising to over 93% with the top-16, substantially above chance.

Ablation studies demonstrate:

  • Editing along a single dimension maintains stable EPR across intensity values, underscoring the value of the sparse, concept-aligned basis.
  • Increasing the number of edited latents ($k$) degrades EPR, indicating a tradeoff between intervention generality and specificity.
  • For $k = 1$, both target and off-target changes scale proportionally with $\alpha$, keeping EPR nearly constant.

These results confirm that CASL-Steer enables precise, interpretable causal analysis of semantic control axes in diffusion models (He et al., 21 Jan 2026).

6. Significance and Implications for Diffusion Model Interpretability

CASL-Steer establishes the first protocol for supervised, causality-focused probing of sparse latent directions aligned to human-defined semantics in diffusion models. By disentangling the representation space and allowing controlled latent interventions, it addresses key challenges in attribution and interpretability. Its principled avoidance of confounding factors, demonstrated via high EPR, and preservation of generative fidelity position it as a benchmark for future work in concept-level model understanding, diagnosis, and semantic control. A plausible implication is the applicability of similar alignment and steering protocols beyond image synthesis, potentially informing interpretability frameworks in large-scale generative architectures (He et al., 21 Jan 2026).
