Mechanism Behind SPT vs. Replay Asymmetry
Determine the precise mechanism behind the observed asymmetry between diffuse early domain exposure during specialized pretraining (SPT) and general-data replay during finetuning or continued pretraining. Specifically, explain why SPT → FT consistently achieves lower domain test loss on the MusicPile domain than naive pretraining followed by finetuning with replay (NPT → FT with replay), across replay rates.
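To make the two pipelines being compared concrete, the following is a minimal sketch of where domain and general tokens enter each one. It is illustrative only. The names `Phase`, `pretrain_phase`, `finetune_phase`, the token counts, the 1% domain fraction, and the definition of replay rate as the fraction of finetuning tokens drawn from general data are all assumptions made for this sketch, not details taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    """Token budget of one training phase, split by data source (hypothetical helper)."""
    name: str
    general_tokens: int
    domain_tokens: int

    @property
    def domain_share(self) -> float:
        total = self.general_tokens + self.domain_tokens
        return self.domain_tokens / total if total else 0.0


def pretrain_phase(total_tokens: int, domain_fraction: float) -> Phase:
    """Pretraining with a (possibly zero) fraction of domain data mixed in.

    domain_fraction > 0 models SPT-style diffuse exposure; 0 models naive pretraining.
    """
    if not 0.0 <= domain_fraction <= 1.0:
        raise ValueError("domain_fraction must be in [0, 1]")
    domain = round(total_tokens * domain_fraction)
    return Phase("pretrain", total_tokens - domain, domain)


def finetune_phase(domain_tokens: int, replay_rate: float) -> Phase:
    """Finetuning on domain data, with general data replayed alongside it.

    Assumption: replay_rate is the fraction of finetuning tokens that are general data.
    """
    if not 0.0 <= replay_rate < 1.0:
        raise ValueError("replay_rate must be in [0, 1)")
    total = round(domain_tokens / (1.0 - replay_rate))
    return Phase("finetune", total - domain_tokens, domain_tokens)


# Illustrative numbers only; none of these values come from the paper.
spt_then_ft = [pretrain_phase(1_000_000, 0.01), finetune_phase(10_000, 0.0)]
npt_then_ft_replay = [pretrain_phase(1_000_000, 0.0), finetune_phase(10_000, 0.5)]

for label, pipeline in (("SPT -> FT", spt_then_ft),
                        ("NPT -> FT with replay", npt_then_ft_replay)):
    for phase in pipeline:
        print(f"{label:22s} {phase.name:9s} "
              f"general={phase.general_tokens:>9,d} "
              f"domain={phase.domain_tokens:>7,d} "
              f"domain_share={phase.domain_share:.2f}")
```

Under these assumptions, the only structural difference is the phase in which the two data sources are mixed: SPT spreads a small amount of domain data through pretraining, while replay brings general data into finetuning. The sketch describes the setup only and says nothing about why the resulting losses differ.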
References
The precise mechanism behind this asymmetry remains an open question.
— The Finetuner's Fallacy: When to Pretrain with Your Finetuning Data
(2603.16177 - Baek et al., 17 Mar 2026) in Section 6 (Does Specialized Pretraining Help Under Replay as Well?)