Activation Difference Lens (ADL)
- ADL is a method for diagnosing narrow finetuning in LLMs by comparing activations before and after finetuning to extract semantic signals.
- It employs techniques like Logit-Lens and Patchscope to interpret the extracted difference vectors and to steer the base model toward finetuned behavior.
- Experimental findings show ADL’s effectiveness across various models and regimes, revealing its potential to detect bias and monitor distributional drift.
The Activation Difference Lens (ADL) is a methodology for diagnosing and interpreting the effects of narrow finetuning in LLMs by analyzing differences in hidden activations between models pre- and post-finetuning. The core insight is that narrow finetuning leaves detectable and semantically meaningful traces in model activations, which can be extracted, interpreted, and even manipulated to recover characteristics of the finetuned domain. ADL provides a principled framework to identify and evaluate these finetuning-induced biases, offering critical implications for LLM interpretability, safety, and model-diffing studies (Minder et al., 14 Oct 2025).
1. Mathematical Definition
Given two autoregressive transformer-based LLMs, $M_{\text{pre}}$ (the base, pre-finetuned model) and $M_{\text{post}}$ (the post-finetuned counterpart), both with $L$ layers and hidden dimension $d$, the ADL constructs an activation difference metric as follows. For any input token sequence $x$, let $h^{\ell}_{\text{pre}}(x)_t$ and $h^{\ell}_{\text{post}}(x)_t$ denote the residual stream activations at layer $\ell$ and position $t$ before and after finetuning, respectively. The layer- and position-wise activation difference is defined by

$$\Delta^{\ell}(x)_t = h^{\ell}_{\text{post}}(x)_t - h^{\ell}_{\text{pre}}(x)_t.$$

In practice, analysis typically focuses on a single “middle” layer ($\ell \approx L/2$) and the first few token positions ($t = 1, \dots, k$ for small $k$) of random, out-of-domain probe inputs. Aggregating over a set of $N$ such samples $x^{(1)}, \dots, x^{(N)}$, the average activation difference is

$$\bar{\Delta}[t] = \frac{1}{N} \sum_{i=1}^{N} \Delta^{\ell}\big(x^{(i)}\big)_t.$$

To enable comparisons across models and scales, the difference vectors can be normalized as

$$\hat{\Delta}[t] = \frac{\bar{\Delta}[t]}{c_{\ell}},$$

where $c_{\ell}$ is a layer-specific calibration constant, such as the average norm of finetuned activations at that layer.
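The following minimal sketch illustrates the aggregation and normalization arithmetic on pre-collected activation arrays; the array names and the specific choice of calibration constant (mean norm of finetuned activations) are illustrative assumptions, not taken from the paper's released code.

```python
import numpy as np

def average_and_normalize(acts_pre: np.ndarray, acts_post: np.ndarray):
    """Average activation difference and its normalized form.

    acts_pre / acts_post: residual-stream activations at one layer for N probe
    prompts and k early positions, shape (N, k, d).
    """
    delta_bar = (acts_post - acts_pre).mean(axis=0)   # average over samples -> (k, d)
    # One possible calibration constant c_l: the mean norm of the finetuned
    # activations at this layer (an illustrative choice).
    c = np.linalg.norm(acts_post, axis=-1).mean()
    return delta_bar, delta_bar / c
```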
2. Procedure for Extraction, Interpretation, and Steering
Extraction of Activation Differences
The empirical ADL procedure consists of the following steps (a minimal code sketch follows the list):
- Initializing per-token difference accumulators for the $k$ positions.
- For each of $N$ random prompts:
  - Execute a forward pass through $M_{\text{pre}}$ and $M_{\text{post}}$, collecting residual-stream activations at layer $\ell$ and positions $t = 1, \dots, k$.
  - Compute the activation difference $\Delta^{\ell}(x^{(i)})_t$ per position.
- Average these differences over all $N$ samples to obtain $\bar{\Delta}[t]$.
- Optionally, normalize via the calibration strategy above.
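A minimal extraction sketch using Hugging Face `transformers` is given below. The model identifiers, layer index, and number of positions are placeholders, and the `hidden_states` indexing assumes the usual convention that index 0 is the embedding output; prompts are assumed to have at least K tokens.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model identifiers and hyperparameters, for illustration only.
BASE_ID, FT_ID = "org/base-model", "org/finetuned-model"
LAYER, K = 12, 5   # assumed middle-layer index and number of early token positions

tok = AutoTokenizer.from_pretrained(BASE_ID)
m_pre = AutoModelForCausalLM.from_pretrained(BASE_ID).eval()
m_post = AutoModelForCausalLM.from_pretrained(FT_ID).eval()

@torch.no_grad()
def residual_at(model, text: str) -> torch.Tensor:
    """Residual stream after decoder block LAYER for the first K positions, shape (K, d)."""
    ids = tok(text, return_tensors="pt")
    out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so index LAYER + 1 is the
    # residual stream after decoder block LAYER.
    return out.hidden_states[LAYER + 1][0, :K, :]

def average_activation_difference(prompts: list[str]) -> torch.Tensor:
    """Mean (post - pre) residual difference over all probe prompts, shape (K, d)."""
    diffs = [residual_at(m_post, p) - residual_at(m_pre, p) for p in prompts]
    return torch.stack(diffs).mean(dim=0)

# Usage: delta_bar = average_activation_difference(random_unrelated_prompts)
```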
Interpretation: Patchscope and Logit-Lens
- Logit-Lens: For each token position $t$, compute the Logit-Lens distribution $\mathrm{softmax}\!\left(W_U \,\mathrm{LN}\!\left(\bar{\Delta}[t]\right)\right)$, where $W_U$ is the unembedding matrix and $\mathrm{LN}$ the final layer norm. The most probable tokens under this distribution indicate plausible finetuning-related content.
- Patchscope: Insert $\bar{\Delta}[t]$ (scaled by a parameter $\alpha$) into the model's activation stream at the critical layer and position, then analyze the resulting next-token distribution. The optimal $\alpha$ is selected by semantic coherence, graded by an external LLM.
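As an illustration of the Logit-Lens step, the sketch below projects a position's difference vector through the final norm and unembedding of a LLaMA-style Hugging Face model. The attribute names (`model.model.norm`, `model.lm_head`) are an assumption about that architecture rather than part of ADL; Patchscope would additionally require hook-based injection similar to the steering sketch in the next subsection.

```python
import torch

@torch.no_grad()
def logit_lens_tokens(model, tokenizer, delta_bar: torch.Tensor,
                      position: int, top_k: int = 10):
    """Project the difference vector at one position through the final norm and
    unembedding, and return the most probable tokens with their probabilities."""
    vec = delta_bar[position].unsqueeze(0)           # (1, d)
    logits = model.lm_head(model.model.norm(vec))    # Logit-Lens projection
    probs = torch.softmax(logits, dim=-1)[0]
    top = torch.topk(probs, top_k)
    return [(tokenizer.decode(int(i)), p.item())
            for i, p in zip(top.indices, top.values)]

# Usage: logit_lens_tokens(m_pre, tok, delta_bar, position=1)
```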
Steering Generation via Δ-Activation Injection
To “steer” the base model to exhibit finetuned behavior, add $\alpha\,\bar{\Delta}[t]$ to the residual stream of $M_{\text{pre}}$ at token position $t$ during each forward pass, with the scalar $\alpha$ selected for coherence. Outputs are measured for similarity (e.g., cosine similarity in embedding space) to the genuine finetuning corpus.
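A hedged sketch of one plausible injection scheme follows, reusing `delta_bar`, `tok`, `m_pre`, `LAYER`, and `K` from the extraction sketch and assuming a LLaMA-style `model.model.layers` layout; the value of `alpha` in the usage comment is purely illustrative.

```python
import torch

def add_steering_hook(model, delta_bar: torch.Tensor, layer: int, alpha: float, k: int):
    """Add alpha * delta_bar[t] to the residual stream at the first k prompt
    positions (one plausible injection scheme for steering the base model)."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        n = min(k, hidden.shape[1])
        # Only patch the full-prompt forward pass, not single-token cached decode steps.
        if n > 0 and hidden.shape[1] > 1:
            hidden[:, :n, :] += alpha * delta_bar[:n].to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return model.model.layers[layer].register_forward_hook(hook)

# handle = add_steering_hook(m_pre, delta_bar, LAYER, alpha=4.0, k=K)  # alpha is illustrative
# out = m_pre.generate(**tok("Tell me something.", return_tensors="pt"), max_new_tokens=50)
# handle.remove()
```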
3. Mechanistic Rationale
The prominent effect of $\bar{\Delta}$ stems from the fact that narrow finetuning imprints strong, consistent semantic signals, owing to the homogeneity of the finetuning corpus, which become embedded as an almost “constant semantic vector” in model activations. The effect is mechanistically analogous to catastrophic forgetting: the network encodes the novel, narrow objective at the expense of broad prior capabilities.
The causal significance of $\bar{\Delta}$ can be empirically validated by substituting the finetuned model's activations at the early token positions with those of the base model. Formally, for a sequence $x$, the hybrid activation is

$$\tilde{h}^{\ell}(x)_t = \begin{cases} h^{\ell}_{\text{pre}}(x)_t, & t \le k, \\ h^{\ell}_{\text{post}}(x)_t, & t > k. \end{cases}$$

The difference in cross-entropy loss when using $\tilde{h}^{\ell}$ in place of $h^{\ell}_{\text{post}}$, $\delta\mathcal{L} = \mathcal{L}(\tilde{h}^{\ell}) - \mathcal{L}(h^{\ell}_{\text{post}})$, is positive for the finetuning domain and negative for unrelated data. This establishes that $\bar{\Delta}$ is both necessary for finetuning-domain performance and reflective of overfitting (Minder et al., 14 Oct 2025).
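The sketch below illustrates one way to run this patching check with forward hooks, replacing the finetuned model's early-position activations with the base model's and measuring the resulting next-token cross-entropy; the module names again assume a LLaMA-style layout and are not prescribed by the paper.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def patched_prefix_loss(m_post, m_pre, tokenizer, text: str, layer: int, k: int) -> float:
    """Cross-entropy of the finetuned model when its residual stream after block
    `layer` at the first k positions is replaced by the base model's."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    # Base-model activations at the same layer and positions (same indexing as above).
    pre_hidden = m_pre(ids, output_hidden_states=True).hidden_states[layer + 1][:, :k, :]

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        n = min(k, hidden.shape[1])
        hidden[:, :n, :] = pre_hidden[:, :n, :].to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    handle = m_post.model.layers[layer].register_forward_hook(hook)
    try:
        logits = m_post(ids).logits
    finally:
        handle.remove()
    # Next-token cross-entropy over the sequence.
    return F.cross_entropy(logits[0, :-1], ids[0, 1:]).item()

# delta_loss = patched_prefix_loss(...) minus the unpatched loss, evaluated on
# finetuning-domain vs. unrelated text.
```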
4. Large-Scale Experimental Results
A comprehensive evaluation of ADL was conducted over 33 “model organisms”, covering 4 finetuning regimes (Synthetic False-Fact Documents, Emergent Misalignment, Subliminal Preference Learning, Taboo-Word-Guessing) and 7 model architectures from the Gemma, LLaMA, and Qwen families, spanning 1B–32B parameters:
| Metric or Setting | Key Finding | Quantitative Range |
|---|---|---|
| Patchscope Token Relevance | Max fraction of relevant tokens: Δ̄[t] yields 20–80%; base/finetuned ≪10% | 20–80% |
| Steering Similarity | Injecting Δ̄[t] into Mₚᵣₑ yields embedding cosine similarity rivaling within-dataset similarity | >0.7 (vs. ~0.3 baseline) |
| Interpretability Agent (“grades”) | LLM agent using ADL features achieves broad-goal recognition in 91% of organisms when allowed to interact with the model | 91% (grade≥2), 30% (grade≥4) |
| Cross-Model Robustness | ADL outperforms black-box baselines in weaker models and some vision-language domains | Signal weakens but persists |
Representative qualitative outcomes include Patchscope yielding tokens like {bake, frosting, oven, cake, recipe} and steering generating outputs tightly aligned to the finetuning style (“Baking Manual: Preheat your oven...”) (Minder et al., 14 Oct 2025).
5. Limitations and Mitigation Strategies
Several limitations characterize ADL application:
- Reliance on multi-step, LLM-in-the-loop workflows: ADL protocols add procedural complexity and inherit noise from automated grading and agent assessment.
- Artifact exaggeration in extreme settings: Homogeneous, synthetic finetuning domains amplify the effect size of $\bar{\Delta}$; signals are inherently weaker under realistic, chat-style or multi-objective finetuning.
- Partial removal via concept ablation (CAFT): Simple ablation procedures provide only modest reduction in activation bias.
A principal mitigation is interleaving pretraining data (e.g., C4 samples) with the finetuning corpus. Even a 1:0.1 mix reduces token-relevance and steering similarity by ≳50%; at a 1:1 mix, $\bar{\Delta}$ approaches the baseline, but finetuning-objective fidelity also diminishes. This intervention introduces a trade-off between mitigating overfitting bias and preserving the intended finetuned behavior (Minder et al., 14 Oct 2025).
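A minimal sketch of the mixing intervention follows, with hypothetical corpus variables; the `pretrain_ratio` values mirror the 1:0.1 and 1:1 settings described above.

```python
import random

def mix_corpora(finetune_docs, pretrain_docs, pretrain_ratio=0.1, seed=0):
    """Interleave roughly `pretrain_ratio` pretraining documents per finetuning
    document (e.g., 0.1 for the 1:0.1 mix, 1.0 for the 1:1 mix), in random order."""
    rng = random.Random(seed)
    n_pre = min(int(len(finetune_docs) * pretrain_ratio), len(pretrain_docs))
    mixed = list(finetune_docs) + rng.sample(list(pretrain_docs), n_pre)
    rng.shuffle(mixed)
    return mixed

# mixed_corpus = mix_corpora(narrow_finetuning_docs, c4_samples, pretrain_ratio=0.1)
```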
6. Implications and Extensions
ADL invites several generalizations:
- Differential Analysis: The approach is applicable to any weight-aligned model pair (e.g., varying instruction-tuning, RLHF vs. SFT).
- Reverse Engineering and Artifact Discovery: Δ-steering can surface hidden data artifacts or unintended finetuning signals.
- Multimodal and Continual Learning Settings: Extensions include vision-language and audio models, as well as interpretability agent methods for continuous or non-standard finetuning.
- Deployment Monitoring: ADL could serve as a lightweight monitoring tool for detecting overfitting or distributional drift.
Further mechanistic studies could decompose $\bar{\Delta}$ into interpretable principal components using PCA or sparse dictionary learning, potentially establishing connections to circuit-level model understanding. The conceptual framework also challenges the field’s reliance on narrow model organisms as proxies for broader finetuning and foregrounds the need for more realistic interpretability and safety case studies (Minder et al., 14 Oct 2025).
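As a sketch of such a decomposition, the snippet below applies scikit-learn PCA to per-sample difference vectors at a single token position; the function name and array shapes are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

def top_difference_directions(per_sample_diffs: np.ndarray, n_components: int = 5):
    """Principal directions of the per-sample activation differences at one token
    position (array of shape (N, d)); each direction can then be inspected with
    Logit-Lens or Patchscope in the same way as the mean difference."""
    pca = PCA(n_components=n_components)
    pca.fit(per_sample_diffs)
    return pca.components_, pca.explained_variance_ratio_
```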