Mechanisms behind detectable activation biases from narrow finetuning

Determine the specific mechanistic pathways in transformer-based large language models by which narrow finetuning induces salient early-token activation differences that encode detectable biases, and identify the causal circuits or learned features responsible for these effects across model families and scales.

Background

The paper shows that activation differences between base and narrowly finetuned models contain readable traces of the finetuning objective, especially at the first few token positions of otherwise unrelated text. These traces can be surfaced via Patchscope, Logit Lens, and steering, and are strong enough that an interpretability agent can infer the finetuning domain without access to the finetuning data.
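As a concrete illustration of the kind of measurement involved, the sketch below computes per-layer activation differences between a base and a finetuned checkpoint on an unrelated prompt and reads an early-token difference out with a Logit Lens-style projection through the unembedding. The model identifiers, prompt, layer, and token position are illustrative assumptions, not the paper's exact setup, and the attribute path for the final norm and unembedding assumes a Llama-style architecture.

```python
# Sketch: base-vs-finetuned activation differences on early tokens of unrelated text,
# read out with a Logit Lens-style projection. Checkpoints and hyperparameters are
# placeholders (assumptions), not the paper's configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "base-model-id"       # placeholder: pre-finetuning checkpoint
FT_ID = "finetuned-model-id"    # placeholder: narrowly finetuned checkpoint

tok = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype=torch.float32).eval()
ft = AutoModelForCausalLM.from_pretrained(FT_ID, torch_dtype=torch.float32).eval()

prompt = "The weather today is"          # unrelated text; only early positions are inspected
ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    h_base = base(ids, output_hidden_states=True).hidden_states
    h_ft = ft(ids, output_hidden_states=True).hidden_states

layer, pos = 12, 1                       # assumption: a mid layer, an early token position
diff = h_ft[layer][0, pos] - h_base[layer][0, pos]

# Logit Lens-style readout: push the difference through the base model's final norm
# and unembedding to see which vocabulary items the difference promotes.
# Attribute path (.model.norm, .lm_head) assumes a Llama-style model.
with torch.no_grad():
    logits = base.lm_head(base.model.norm(diff))
top = torch.topk(logits, k=10)
print([tok.decode(int(t)) for t in top.indices])
```

If the finetuning domain is narrow, the top promoted tokens often relate to that domain even though the prompt does not, which is the kind of readable trace the paper reports.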

While the authors hypothesize that these biases arise from overfitting to semantically homogeneous datasets, they explicitly note that the underlying mechanisms producing these detectable biases are not yet understood, motivating a mechanistic investigation into how and where in the model these effects are implemented.
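One natural starting point for such an investigation is to test whether the early-token activation difference is causally sufficient to shift the base model's behavior toward the finetuning domain. The sketch below reuses `base`, `tok`, and `diff` from the previous snippet and adds the (scaled) difference into the residual stream of one layer via a forward hook; the layer index, scale, and hook target are illustrative assumptions, and the layer attribute path again assumes a Llama-style architecture.

```python
# Sketch: steer the base model with the activation difference to probe whether it
# carries finetuning-domain information. Layer, scale, and hook target are assumptions.
import torch

def make_steering_hook(vector, scale=4.0):
    def hook(module, inputs, output):
        # Llama-style decoder layers return a tuple whose first element is the
        # residual-stream hidden state; add the scaled difference at every position.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * vector.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

handle = base.model.layers[12].register_forward_hook(make_steering_hook(diff))
out = base.generate(
    tok("Tell me something:", return_tensors="pt").input_ids,
    max_new_tokens=40,
    do_sample=False,
)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```

Sweeping the layer and scale, or restricting the injection to specific token positions, would be one way to localize where in the model the bias is implemented, which is exactly the open question posed above.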

References

"Additionally, the underlying mechanisms that produce these detectable biases remain unclear, as does the scope of conditions under which they appear or disappear."

Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences (Minder et al., 14 Oct 2025, arXiv:2510.13900), Limitations and Future Work.