Mechanisms behind detectable activation biases from narrow finetuning
Determine the specific mechanistic pathways in transformer-based large language models by which narrow finetuning induces salient early-token activation differences that encode detectable biases, and identify the causal circuits or learned features responsible for these effects across model families and scales.
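The kind of signal at issue can be illustrated with a minimal, self-contained sketch: comparing per-position residual-stream activations of a base model against its narrowly finetuned counterpart and checking whether early token positions carry disproportionately large differences. The arrays below (`base_acts`, `ft_acts`) are synthetic stand-ins, not outputs of any real model pair.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 16, 64

# Hypothetical residual-stream activations for the same prompt from a base
# model and a narrowly finetuned variant (synthetic data for illustration).
base_acts = rng.normal(size=(seq_len, d_model))
ft_acts = base_acts + rng.normal(scale=0.05, size=(seq_len, d_model))
# Inject a large activation shift at the first few token positions,
# mimicking the salient early-token bias described above.
ft_acts[:3] += rng.normal(scale=1.0, size=(3, d_model))

# Per-position L2 norm of the activation difference: an early-token bias
# shows up as outsized norms at the first positions.
diff_norms = np.linalg.norm(ft_acts - base_acts, axis=-1)
top_positions = np.argsort(diff_norms)[::-1][:3]
print(sorted(top_positions.tolist()))  # → [0, 1, 2]
```

In practice the activations would come from hooked forward passes on both checkpoints; identifying *why* those differences arise, rather than merely detecting them, is the open question posed here.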
References
Additionally, the underlying mechanisms that produce these detectable biases remain unclear, as does the scope of conditions under which they appear or disappear.
— Minder et al., "Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences" (arXiv:2510.13900, 14 Oct 2025), Limitations and Future Work