
Scope of conditions under which detectable finetuning biases appear or disappear

Characterize the conditions under which narrow finetuning produces or suppresses detectable early-token activation differences, including the roles of dataset composition and homogeneity, mixing with unrelated pretraining data, finetuning method, and model architecture, to establish when these biases persist or vanish.


Background

The paper demonstrates that biases from narrow finetuning are broadly detectable across multiple organism types and models, and that mixing in unrelated data reduces their strength. However, the extent to which different training or data regimes cause these biases to appear or disappear is not fully mapped.

The authors explicitly state that the scope of conditions governing the presence or absence of these biases is unclear, indicating a need to systematize and quantify the dependencies on training mixtures, dataset structure, finetuning method (e.g., LoRA vs. full), and model class.
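The detection signal in question is the activation difference between a base model and its narrowly finetuned variant at early token positions of text unrelated to the finetuning domain. Below is a minimal sketch of how such a comparison could be instrumented to study these conditions; the model identifiers, layer choice, and number of early positions are placeholder assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch: compare early-token hidden states between a base model
# and a narrowly finetuned variant. Model IDs, layer, and n_early are
# illustrative assumptions, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "base-model-id"        # placeholder: base checkpoint
TUNED_ID = "finetuned-model-id"  # placeholder: narrowly finetuned checkpoint

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(BASE_ID, output_hidden_states=True)
tuned = AutoModelForCausalLM.from_pretrained(TUNED_ID, output_hidden_states=True)
base.eval()
tuned.eval()

def early_token_activation_diff(texts, layer=-1, n_early=5):
    """Mean L2 norm of (tuned - base) hidden states over the first n_early positions."""
    diffs = []
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            h_base = base(**inputs).hidden_states[layer][0]   # (seq_len, d_model)
            h_tuned = tuned(**inputs).hidden_states[layer][0]
        k = min(n_early, h_base.shape[0])
        diffs.append((h_tuned[:k] - h_base[:k]).norm(dim=-1).mean())
    return torch.stack(diffs).mean().item()

# Probe with generic prompts unrelated to the finetuning data; a large gap on
# early tokens would indicate a readable finetuning trace in this sketch.
print(early_token_activation_diff(["The weather today is", "In a recent study,"]))
```

Sweeping such a measurement across training mixtures (e.g., varying fractions of unrelated pretraining data), dataset compositions, LoRA versus full finetuning, and model families would be one way to map when the signal persists or vanishes.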

References

Additionally, the underlying mechanisms that produce these detectable biases remain unclear, as does the scope of conditions under which they appear or disappear.

Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences (2510.13900 - Minder et al., 14 Oct 2025) in Limitations and Future Work