Residual issues after mixing pretraining data into narrow finetuning corpora

Determine whether mixing unrelated pretraining data into a narrow finetuning corpus fully eliminates finetuning artifacts beyond the measured activation-difference bias, and identify any finetuning-induced issues that persist despite such mixing.

Background

The paper proposes mixing pretraining data into narrow finetuning datasets as a simple mitigation, and finds that it largely removes the detectable bias in activation differences. Empirically, mixing reduces the bias signal, though in some models it may trade off against internalization of the finetuning objective.
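To make the mitigation concrete, here is a minimal Python sketch of the two ingredients involved: interleaving unrelated pretraining samples into a narrow finetuning corpus, and probing the mean activation difference between the base and finetuned models. This is not the authors' implementation; the HuggingFace-style loading, the mix_ratio knob, and the choice of residual-stream layer are all illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of the mitigation and the probe it
# targets: mix unrelated pretraining text into a narrow finetuning corpus,
# then measure the mean activation difference between base and finetuned
# models. The mix_ratio knob, layer choice, and model names are assumptions.
import random

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def mix_corpora(finetune_texts, pretrain_texts, mix_ratio=0.5, seed=0):
    """Interleave unrelated pretraining samples into the finetuning corpus.

    mix_ratio is the fraction of the mixed corpus drawn from pretraining
    data; the paper reports that such mixing mostly removes the bias but
    the specific ratio here is an assumption.
    """
    rng = random.Random(seed)
    n_pretrain = int(len(finetune_texts) * mix_ratio / (1.0 - mix_ratio))
    mixed = list(finetune_texts) + rng.sample(pretrain_texts, n_pretrain)
    rng.shuffle(mixed)
    return mixed


@torch.no_grad()
def mean_activation_diff(base, tuned, tokenizer, probe_texts, layer=-1):
    """Mean residual-stream difference (tuned - base) on neutral probe texts.

    A large, consistent difference vector is the 'clearly readable trace'
    of narrow finetuning; its norm should shrink if the mitigation works.
    """
    diffs = []
    for text in probe_texts:
        ids = tokenizer(text, return_tensors="pt").input_ids
        h_base = base(ids, output_hidden_states=True).hidden_states[layer]
        h_tuned = tuned(ids, output_hidden_states=True).hidden_states[layer]
        diffs.append((h_tuned - h_base).mean(dim=1))  # average over tokens
    return torch.cat(diffs).mean(dim=0)  # average over probe texts


# Illustrative usage (model names are placeholders):
# tok = AutoTokenizer.from_pretrained("gpt2")
# base = AutoModelForCausalLM.from_pretrained("gpt2").eval()
# tuned = AutoModelForCausalLM.from_pretrained("my-narrow-finetune").eval()
# bias_vec = mean_activation_diff(base, tuned, tok, neutral_probe_texts)
# print(bias_vec.norm())
```

Under this sketch, the open question is whether driving the norm of mean_activation_diff toward zero via mix_corpora removes every trace of the narrow objective, or only the measured one.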

Despite these promising results, the authors explicitly flag uncertainty about whether additional issues remain after mixing, indicating a need to evaluate residual artifacts or unintended consequences not captured by their current bias metrics.
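As one hedged illustration of what "issues not captured by the current bias metrics" could look like, the sketch below compares output distributions rather than activations: even if the activation-difference signal vanishes after mixing, the finetuned model's next-token distribution might still drift on unrelated prompts. The metric and function names are assumed examples, not something the paper specifies.

```python
# Hedged sketch of one complementary check this open question calls for:
# even if the activation-difference bias vanishes after mixing, the tuned
# model's output distribution may still drift on unrelated prompts. The
# metric (mean per-token KL) is an assumed example, not the paper's method.
import torch
import torch.nn.functional as F


@torch.no_grad()
def mean_output_kl(base, tuned, tokenizer, probe_prompts):
    """Mean per-token KL(tuned || base) over next-token distributions."""
    kls = []
    for prompt in probe_prompts:
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        logp_base = F.log_softmax(base(ids).logits, dim=-1)
        logp_tuned = F.log_softmax(tuned(ids).logits, dim=-1)
        # KL summed over the vocabulary at each position: (1, seq_len)
        kl = (logp_tuned.exp() * (logp_tuned - logp_base)).sum(dim=-1)
        kls.append(kl.mean())
    return torch.stack(kls).mean()
```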

References

"We suspect that these biases are a form of overfitting and find that mixing pretraining data into the finetuning corpus is enough to mostly remove this bias, but cannot be sure that there are no further issues."

Minder et al., "Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences," arXiv:2510.13900, 14 Oct 2025 (Abstract).