
Activation Difference Lens (ADL)

Updated 19 November 2025
  • ADL is a method for diagnosing narrow finetuning in LLMs by comparing activations before and after finetuning to extract semantic signals.
  • It employs techniques like Logit-Lens and Patchscope to interpret shifts in token relevance and guide output steering for finetuned behavior.
  • Experimental findings show ADL’s effectiveness across various models and regimes, revealing its potential to detect bias and monitor distributional drift.

The Activation Difference Lens (ADL) is a methodology for diagnosing and interpreting the effects of narrow finetuning in LLMs by analyzing differences in hidden activations between models pre- and post-finetuning. The core insight is that narrow finetuning leaves detectable and semantically meaningful traces in model activations, which can be extracted, interpreted, and even manipulated to recover characteristics of the finetuned domain. ADL provides a principled framework to identify and evaluate these finetuning-induced biases, offering critical implications for LLM interpretability, safety, and model-diffing studies (Minder et al., 14 Oct 2025).

1. Mathematical Definition

Given two autoregressive transformer-based LLMs, $M_{pre}$ (the base, pre-finetuned model) and $M_{fin}$ (the post-finetuned counterpart), both with $L$ layers and hidden dimension $d$, the ADL constructs an activation difference metric as follows. For any input token sequence $x = (x_1,\ldots,x_n)$, let $a_{pre}^\ell(t) \in \mathbb{R}^d$ and $a_{fin}^\ell(t) \in \mathbb{R}^d$ denote the residual stream activations at layer $\ell$ and position $t$ before and after finetuning, respectively. The layer- and position-wise activation difference is defined by

$$\Delta a^\ell(t) = a_{fin}^\ell(t) - a_{pre}^\ell(t).$$

In practice, analysis typically focuses on a single “middle” layer $\ell^* = \lfloor L/2 \rfloor$ and the first $k$ token positions ($t = 1,\ldots,k$, commonly $k=5$) in random, out-of-domain probe inputs. Aggregating over a set $S = \{s_1,\ldots,s_N\}$ of $N$ such samples, the average activation difference is

$$\bar{\Delta a}^\ell(t) = \frac{1}{N} \sum_{i=1}^N \Delta a^\ell_{(i)}(t).$$

To enable comparisons across models and scales, the difference vectors can be normalized as

$$\hat{\Delta a}^\ell(t) = \frac{\Delta a^\ell(t)}{\|\Delta a^\ell(t)\|_2} \cdot c_\ell,$$

where $c_\ell$ is a layer-specific calibration constant, such as the average norm of finetuned activations at that layer.
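
For concreteness, a minimal PyTorch sketch of the aggregation and calibration steps, assuming the per-layer activations have already been collected (tensor names and shapes are illustrative, not from the paper):

```python
import torch

def normalized_mean_difference(a_pre, a_fin, calib_norm):
    """Compute the calibrated mean activation difference.

    a_pre, a_fin: tensors of shape (N, k, d) holding residual-stream
        activations at the chosen layer for N probe prompts and the
        first k token positions, before and after finetuning.
    calib_norm: layer-specific calibration constant c_l, e.g. the
        average L2 norm of finetuned activations at that layer.
    Returns a tensor of shape (k, d): one normalized difference
    vector per token position.
    """
    delta = (a_fin - a_pre).mean(dim=0)               # average over the N samples -> (k, d)
    delta = delta / delta.norm(dim=-1, keepdim=True)  # unit-normalize per position
    return delta * calib_norm                         # rescale to a typical activation norm
```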

2. Procedure for Extraction, Interpretation, and Steering

Extraction of Activation Differences

The empirical ADL procedure consists of the following steps (a minimal sketch follows the list):

  1. Initialize per-token difference accumulators for the first $k$ positions.
  2. For each of $N$ random prompts:
    • Run a forward pass through $M_{pre}$ and $M_{fin}$, collecting activations at layer $\ell^*$ and positions $1,\ldots,k$.
    • Compute the activation difference at each position.
  3. Average these differences over all samples.
  4. Optionally, normalize via the calibration strategy above.
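
A minimal sketch of this extraction loop using Hugging Face transformers; the checkpoint names, probe prompts, and hyperparameters below are placeholders rather than the paper's exact setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_name, fin_name = "org/base-model", "org/finetuned-model"  # hypothetical checkpoints
tok = AutoTokenizer.from_pretrained(base_name)
m_pre = AutoModelForCausalLM.from_pretrained(base_name)
m_fin = AutoModelForCausalLM.from_pretrained(fin_name)

k = 5                                          # first k token positions
layer = m_pre.config.num_hidden_layers // 2    # "middle" layer l* = floor(L/2)
# Random out-of-domain probes; each prompt should tokenize to at least k tokens.
prompts = [
    "The quick brown fox jumps over the lazy dog.",
    "Once upon a time there was a small village by the sea.",
    "In 1997, the weather in northern Europe was unusually mild.",
]

diffs = []
with torch.no_grad():
    for p in prompts:
        ids = tok(p, return_tensors="pt").input_ids
        out_pre = m_pre(ids, output_hidden_states=True)
        out_fin = m_fin(ids, output_hidden_states=True)
        h_pre = out_pre.hidden_states[layer][0, :k]   # (k, d) residual-stream activations
        h_fin = out_fin.hidden_states[layer][0, :k]
        diffs.append(h_fin - h_pre)                   # per-position difference

delta_bar = torch.stack(diffs).mean(dim=0)            # average over probe prompts -> (k, d)
```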

Interpretation: Patchscope and Logit-Lens

  • Logit-Lens: For each token position $t$, compute $p^{LL}_t = \operatorname{softmax}(W_U \cdot \operatorname{LayerNorm}(\bar{\Delta}[t]))$, where $W_U$ is the unembedding matrix. The most probable tokens under $p^{LL}_t$ indicate plausible finetuning-related content (see the sketch after this list).
  • Patchscope: Insert $\bar{\Delta}[t]$ (scaled by a parameter $\lambda$) into the model's activation stream at the critical layer and position, then analyze the resulting next-token distribution. The optimal $\lambda$ is selected for semantic coherence, as graded by an external LLM.
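
A sketch of the Logit-Lens step, continuing from the `delta_bar` tensor collected above; accessing `model.model.norm` and `model.lm_head` assumes a LLaMA-style module layout and may differ for other architectures:

```python
import torch

def logit_lens_topk(model, delta_bar, tok, top_k=10):
    """Project each per-position difference vector through the final
    norm and unembedding, returning the top-k tokens per position."""
    with torch.no_grad():
        normed = model.model.norm(delta_bar)   # final (RMS)Norm, LLaMA-style path
        logits = model.lm_head(normed)         # (k, vocab_size)
        probs = torch.softmax(logits, dim=-1)
        top = probs.topk(top_k, dim=-1)
    return [
        [tok.decode(int(i)) for i in top.indices[pos]]  # readable tokens for position `pos`
        for pos in range(delta_bar.shape[0])
    ]

# Example: inspect which tokens the averaged difference points toward.
# print(logit_lens_topk(m_fin, delta_bar, tok))
```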

Steering Generation via Δ-Activation Injection

To “steer” the base model toward finetuned behavior, add $\alpha \cdot \bar{\Delta}[t]$ to the residual stream at position $t$ during each forward pass, with the scalar $\alpha$ selected for coherence. Outputs are measured for similarity (e.g., cosine similarity in embedding space) to the genuine finetuning corpus.
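
A sketch of Δ-injection steering via a PyTorch forward hook; the `model.model.layers[...]` hook point assumes a LLaMA-style layout, and the value of `alpha` is illustrative:

```python
import torch

def add_steering_hook(model, layer_idx, delta_bar, alpha=4.0):
    """Register a forward hook that adds alpha * delta_bar[t] to the residual
    stream at the chosen layer for the first k token positions."""
    k = delta_bar.shape[0]

    def hook(module, inputs, output):
        is_tuple = isinstance(output, tuple)
        hidden = output[0] if is_tuple else output        # (batch, seq, d)
        if hidden.shape[1] > 1:                           # prompt pass; cached single-token
            n = min(k, hidden.shape[1])                   # passes are left untouched here
            hidden = hidden.clone()
            hidden[:, :n] = hidden[:, :n] + alpha * delta_bar[:n]
        return ((hidden,) + output[1:]) if is_tuple else hidden

    return model.model.layers[layer_idx].register_forward_hook(hook)

# Usage: steer the pre-finetuning model, generate, then remove the hook.
# handle = add_steering_hook(m_pre, layer, delta_bar, alpha=4.0)
# out = m_pre.generate(**tok("Tell me something:", return_tensors="pt"), max_new_tokens=50)
# handle.remove()
```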

3. Mechanistic Rationale

The prominent effect of $\Delta$ stems from the homogeneity of the narrow finetuning corpus: finetuning embeds a strong, consistent semantic signal, an almost “constant semantic vector,” into the model activations. The effect is mechanistically analogous to catastrophic forgetting: the network encodes the novel, narrow objective at the expense of broad prior capabilities.

The causal significance of $\Delta$ can be empirically validated by replacing the component of the finetuned activation along $\bar{\Delta}[t]$ with that of the base model. Formally, with the projection matrix $P_t = \bar{\Delta}[t]\bar{\Delta}[t]^\top / \|\bar{\Delta}[t]\|^2$, the hybrid activation is

$$\tilde{h}_{fin}^{(\ell^*, t)} = (I - P_t)\, h_{fin}^{(\ell^*, t)} + P_t\, h_{pre}^{(\ell^*, t)}.$$

The difference in cross-entropy loss when using $\tilde{h}$ in place of $h_{fin}$, $C(M_t; D) = L(M_{fin}\text{ with } \tilde{h}; D) - L(M_{fin}; D)$, is positive for the finetuning domain and negative for unrelated data. This establishes that $\Delta$ is both necessary for finetuning-domain performance and reflective of overfitting (Minder et al., 14 Oct 2025).
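
A sketch of this projection-based patch, assuming activations for both models have been collected at layer $\ell^*$ for an evaluation input; the loss comparison itself would then be run with the hybrid activation patched into $M_{fin}$:

```python
import torch

def project_out_and_replace(h_fin, h_pre, delta_bar):
    """Form h_tilde = (I - P_t) h_fin + P_t h_pre, where P_t projects onto
    delta_bar[t].

    h_fin, h_pre: (seq, d) layer-l* activations for the same input.
    delta_bar:    (k, d) averaged difference vectors; positions beyond k
                  are left unchanged in this sketch.
    """
    h_tilde = h_fin.clone()
    k = min(delta_bar.shape[0], h_fin.shape[0])
    for t in range(k):
        u = delta_bar[t] / delta_bar[t].norm()       # unit vector along delta_bar[t]
        proj_fin = (h_fin[t] @ u) * u                # P_t h_fin
        proj_pre = (h_pre[t] @ u) * u                # P_t h_pre
        h_tilde[t] = h_fin[t] - proj_fin + proj_pre  # (I - P_t) h_fin + P_t h_pre
    return h_tilde
```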

4. Large-Scale Experimental Results

A comprehensive evaluation of ADL was conducted over 33 “model organisms,” spanning 4 finetuning regimes (synthetic false-fact documents, emergent misalignment, subliminal preference learning, taboo-word guessing) and 7 model architectures (Gemma, LLaMA, Qwen; 1B–32B parameters):

| Metric or Setting | Key Finding | Quantitative Range |
| --- | --- | --- |
| Patchscope token relevance | $\bar{\Delta}[t]$ yields a high fraction of finetuning-relevant tokens; base or finetuned activations alone yield ≪10% | 20–80% |
| Steering similarity | Injecting $\bar{\Delta}[t]$ into $M_{pre}$ yields embedding cosine similarity rivaling within-dataset similarity | >0.7 (vs. ~0.3 baseline) |
| Interpretability agent (“grades”) | An LLM agent using ADL features recognizes the broad finetuning goal in 91% of organisms within 5 interactions | 91% (grade ≥ 2), 30% (grade ≥ 4) |
| Cross-model robustness | ADL outperforms black-box baselines even in weaker models and some vision-language settings | Signal weakens but persists |

Representative qualitative outcomes include Patchscope yielding tokens like {bake, frosting, oven, cake, recipe} and steering generating outputs tightly aligned to the finetuning style (“Baking Manual: Preheat your oven...”) (Minder et al., 14 Oct 2025).

5. Limitations and Mitigation Strategies

Several limitations characterize ADL application:

  • Reliance on multi-step, LLM-in-the-loop workflows: ADL protocols are procedurally complex and inherit noise from automated grading and agent assessment.
  • Artifact exaggeration in extreme settings: Homogeneous, synthetic finetuning domains amplify the effect size of $\Delta$; signals are inherently weaker under realistic, chat-style or multi-objective finetuning.
  • Partial removal via concept ablation (CAFT): Simple ablation procedures provide only modest reduction in activation bias.

A principal mitigation is interleaving pretraining data (e.g., C4 samples) with the finetuning corpus. Even a 1:0.1 mix reduces token relevance and steering similarity by ≳50%; at a 1:1 mix, Δ approaches the baseline, but finetuning-objective fidelity also diminishes. This intervention therefore trades off mitigating overfitting bias against preserving the intended finetuned behavior (Minder et al., 14 Oct 2025).
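
A minimal sketch of such corpus interleaving; the ratio semantics and sampling strategy are illustrative rather than the paper's exact recipe:

```python
import random

def mix_corpora(finetune_docs, pretrain_docs, ratio=0.1, seed=0):
    """Interleave pretraining documents into the finetuning corpus.

    ratio=0.1 is read here as the 1:0.1 mix discussed above, i.e. one
    pretraining document for every ten finetuning documents; pretrain_docs
    must contain at least that many documents.
    """
    rng = random.Random(seed)
    n_pretrain = int(len(finetune_docs) * ratio)
    mixed = list(finetune_docs) + rng.sample(pretrain_docs, n_pretrain)
    rng.shuffle(mixed)
    return mixed
```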

6. Implications and Extensions

ADL invites several generalizations:

  • Differential Analysis: The approach is applicable between any weight-aligned model pairs (e.g., varying instruction-tuning, RLHF vs. SFT).
  • Reverse Engineering and Artifact Discovery: Δ-steering can surface hidden data artifacts or unintended finetuning signals.
  • Multimodal and Continual Learning Settings: Extensions include vision-language and audio models, as well as interpretability agent methods for continuous or non-standard finetuning.
  • Deployment Monitoring: ADL could serve as a lightweight monitoring tool for detecting overfitting or distributional drift.

Further mechanistic studies could decompose $\Delta a^\ell(t)$ into interpretable principal components using PCA or sparse dictionary learning, potentially establishing connections to circuit-level model understanding. The conceptual framework also challenges the field’s reliance on narrow model organisms as proxies for broader finetuning and foregrounds the need for more realistic interpretability and safety case studies (Minder et al., 14 Oct 2025).
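
As an illustration of the PCA suggestion, a sketch over the per-prompt difference vectors collected during extraction (scikit-learn is assumed; the position and component count are arbitrary):

```python
import torch
from sklearn.decomposition import PCA

def top_difference_components(diffs, position=0, n_components=5):
    """Run PCA over the difference vectors at one token position across all
    probe prompts; `diffs` is the list of (k, d) tensors collected above.
    Returns explained-variance ratios and component directions."""
    X = torch.stack([d[position] for d in diffs]).float().numpy()  # (N, d)
    n_components = min(n_components, X.shape[0])  # PCA needs at least that many samples
    pca = PCA(n_components=n_components)
    pca.fit(X)
    return pca.explained_variance_ratio_, pca.components_
```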
