Why affine score normalization improves distillation, especially for MarginMSE
Investigate and explain why applying an affine transformation to the ensemble teacher re-ranker scores, so that their mean and standard deviation match those of the previous distillation scores, improves distillation effectiveness when training SPLADE-v3, particularly under the MarginMSE loss.
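To make the transformation concrete, here is a minimal sketch of one way to rescale a set of teacher scores so that their mean and standard deviation match reference values, together with a plain MarginMSE computation. The function names (`affine_rescale`, `margin_mse`), the use of NumPy, and the choice of the population standard deviation are assumptions for illustration; the paper does not give implementation details.

```python
import numpy as np


def affine_rescale(new_scores, ref_mean, ref_std):
    """Apply s -> a * s + b so the result has mean ref_mean and std ref_std.

    Hypothetical helper: ref_mean and ref_std stand for the statistics of
    the previous distillation scores.
    """
    new_scores = np.asarray(new_scores, dtype=np.float64)
    std = new_scores.std()  # population std (ddof=0), an assumption
    if std == 0.0:
        raise ValueError("cannot rescale constant scores")
    scale = ref_std / std
    shift = ref_mean - scale * new_scores.mean()
    return scale * new_scores + shift


def margin_mse(student_pos, student_neg, teacher_pos, teacher_neg):
    """Mean squared error between student and teacher score margins."""
    student_margin = np.asarray(student_pos) - np.asarray(student_neg)
    teacher_margin = np.asarray(teacher_pos) - np.asarray(teacher_neg)
    return float(np.mean((student_margin - teacher_margin) ** 2))
```

One thing the sketch makes visible: because MarginMSE only sees score differences, the shift `b` cancels in every teacher margin and only the scale `a` changes the training targets. This describes what the transformation does to the loss; it does not by itself explain why the change helps, which is the open question.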
References
We notice empirically that changing the distribution helps when using distillation -- especially in the case of MarginMSE -- but we didn't investigate further into why this happens.
— SPLADE-v3: New baselines for SPLADE
(2403.06789 - Lassance et al., 11 Mar 2024) in Section 2.2, Better Distillation Scores