Principled mitigation of spurious linguistic artifacts in SDFT
Develop a principled method to prevent the student model in Self-Distillation Fine-Tuning (SDFT) from inheriting spurious, teacher-conditioned linguistic markers (for example, prefatory phrases such as “Based on the text...” or “Following the example...”) that arise because the teacher is conditioned on demonstrations or source passages. Such a method would eliminate the current reliance on the heuristic of masking the loss over the initial tokens during training.
References
A subtle failure mode of our approach is that the student can inherit spurious linguistic patterns from the teacher. Because the teacher is conditioned on demonstrations or text passages, it may produce responses prefaced with phrases like "Based on the text..." or "Following the example..." The student, although it receives no such context, nevertheless sometimes reproduces these markers, having learned them as part of the teacher's output distribution. Empirically, we find that masking the loss over the first few tokens during training effectively suppresses these artifacts without harming downstream accuracy. While this workaround is effective in practice, it is fundamentally a heuristic fix, and a more principled solution remains an open problem.
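The heuristic workaround described above can be sketched as a per-token cross-entropy in which the first few response positions contribute no gradient. This is an illustrative PyTorch sketch, not the paper's implementation; the function name and the `n_masked_prefix` hyperparameter (how many leading tokens to mask) are assumptions for the example.

```python
import torch
import torch.nn.functional as F


def prefix_masked_ce_loss(logits: torch.Tensor,
                          target_ids: torch.Tensor,
                          n_masked_prefix: int = 5) -> torch.Tensor:
    """Cross-entropy against the teacher-generated response, with the loss
    over the first `n_masked_prefix` response tokens zeroed out so the
    student does not learn spurious prefatory markers.

    logits:     (batch, seq_len, vocab) student logits over the response
    target_ids: (batch, seq_len) token ids of the teacher's response
    """
    # Per-token loss; cross_entropy expects the class dim second.
    per_token = F.cross_entropy(
        logits.transpose(1, 2), target_ids, reduction="none"
    )  # shape: (batch, seq_len)

    # Zero out the loss on the leading tokens (the heuristic mask).
    mask = torch.ones_like(per_token)
    mask[:, :n_masked_prefix] = 0.0

    # Average only over the unmasked positions.
    return (per_token * mask).sum() / mask.sum()
```

Because the masked positions carry zero weight, perturbing the first few target tokens (e.g., a "Based on the text..." prefix) leaves the loss unchanged, which is exactly the behavior the workaround relies on.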