Mechanism Behind SPT vs. Replay Asymmetry

Determine the precise mechanism behind the observed asymmetry between diffuse early domain exposure during specialized pretraining (SPT) and general-data replay during finetuning or continued pretraining. Specifically, explain why SPT → FT consistently achieves lower domain test loss on the MusicPile domain than naive pretraining followed by finetuning with replay (NPT → FT with replay) across replay rates.

Background

In Section 6, the paper compares specialized pretraining followed by finetuning (SPT → FT) with naive pretraining followed by finetuning augmented with replay of general data (NPT → FT with replay). Across all tested replay rates (0%, 10%, 20%), SPT → FT consistently yields lower domain test loss on MusicPile than NPT → FT with replay. This suggests that the timing of domain exposure (spread through pretraining versus introduced only at finetuning, alongside replayed general data) has a lasting impact.
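To make the two pipelines concrete, the following is a minimal sketch of how their per-stage data mixes might be described. It is an illustration only, not the paper's code: the `Stage` class, both function names, and the 5% pretraining domain fraction are hypothetical, and applying the same replay rate to the SPT finetuning stage is an assumption. Only the replay rates (0%, 10%, 20%) come from the text above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    name: str
    domain_fraction: float  # share of each training batch drawn from domain data


def spt_then_ft(spt_domain_fraction: float, replay_rate: float = 0.0) -> List[Stage]:
    """SPT -> FT: a small share of domain data is mixed into pretraining.

    Assumption: finetuning may also replay general data at `replay_rate`.
    """
    return [
        Stage("pretrain", spt_domain_fraction),  # hypothetical mixing fraction
        Stage("finetune", 1.0 - replay_rate),
    ]


def npt_then_ft_with_replay(replay_rate: float) -> List[Stage]:
    """NPT -> FT with replay: general-only pretraining, then domain finetuning
    in which a `replay_rate` share of each batch is general data."""
    return [
        Stage("pretrain", 0.0),
        Stage("finetune", 1.0 - replay_rate),
    ]


for rate in (0.0, 0.1, 0.2):  # replay rates reported in the text
    # 0.05 is an illustrative value, not a number taken from the paper.
    print(rate, spt_then_ft(0.05, rate), npt_then_ft_with_replay(rate))
```

Under these assumptions, the finetuning stages are identical at a matched replay rate, so the only difference between the pipelines is the pretraining mix. That is the difference whose downstream effect the open question asks to explain.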

The authors hypothesize that diffuse domain exposure during pretraining and general-data replay during finetuning induce qualitatively different effects, and they note that SPT’s benefit is more implicit, surfacing after finetuning. However, the exact reason for this persistent performance gap remains unresolved, motivating an explicit open question about the underlying mechanism.

References

The precise mechanism behind this asymmetry remains an open question.

The Finetuner's Fallacy: When to Pretrain with Your Finetuning Data (2603.16177 - Baek et al., 17 Mar 2026) in Section 6 (Does Specialized Pretraining Help Under Replay as Well?)