Why affine score normalization improves distillation, especially for MarginMSE

Investigate and explain why applying an affine transformation to the ensembled teacher re-ranker scores, so that their mean and standard deviation match those of previously used distillation scores, improves distillation effectiveness when training SPLADE-v3, particularly under the MarginMSE loss.

Background

To generate distillation targets, the authors ensemble multiple cross-encoder re-rankers and then rescale the resulting scores via an affine transformation so that the mean and standard deviation mimic those of previously used scores.
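The rescaling step can be illustrated with a minimal sketch. The function name `affine_match`, the example scores, and the target statistics are all hypothetical, and the source does not say whether the statistics are computed over the whole training set or per query; this sketch simply uses whatever list of scores it is given.

```python
from statistics import fmean, pstdev


def affine_match(scores, target_mean, target_std):
    """Map scores through s' = a * s + b so that the rescaled scores
    have the requested mean and (population) standard deviation."""
    mu, sigma = fmean(scores), pstdev(scores)
    if sigma == 0:
        raise ValueError("scores are constant; the scale factor is undefined")
    a = target_std / sigma        # scale factor, positive, so ranking is preserved
    b = target_mean - a * mu      # offset that moves the mean onto the target
    return [a * s + b for s in scores], a, b


# Hypothetical ensembled re-ranker scores and hypothetical target statistics
# standing in for those of the previously used distillation scores.
ensemble = [0.2, 0.5, 0.9, 0.1]
rescaled, a, b = affine_match(ensemble, target_mean=3.0, target_std=2.0)
```

Because the scale factor is positive, the transformation leaves the teacher's ranking of documents unchanged; only the location and spread of the score distribution move.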

They observe empirically that this score distribution change improves distillation, notably with MarginMSE, but the underlying reason for this effect is not known and was not investigated in the paper.
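One observation that may serve as a starting point, without resolving the question: MarginMSE compares score differences, so an affine map of the teacher scores reaches the loss only through its scale factor. The notation below is an assumption for illustration ($s$ for the teacher score, $f_{\theta}$ for the student score, $d^{+}$ and $d^{-}$ for a positive and a negative document), and the loss is written in its usual per-triple form rather than taken from the SPLADE-v3 paper.

```latex
\begin{align*}
s' &= a\,s + b, \qquad a > 0 \\
s'(q, d^{+}) - s'(q, d^{-}) &= a\,\bigl(s(q, d^{+}) - s(q, d^{-})\bigr) \\
\mathcal{L}_{\text{MarginMSE}} &=
  \Bigl(\bigl(f_{\theta}(q, d^{+}) - f_{\theta}(q, d^{-})\bigr)
  - a\,\bigl(s(q, d^{+}) - s(q, d^{-})\bigr)\Bigr)^{2}
\end{align*}
```

The offset $b$ cancels in the margin, so under MarginMSE matching the mean has no effect on the targets and only the standard deviation matching changes them, by rescaling every teacher margin by $a$. This narrows where the effect can come from but does not by itself explain why a particular margin scale helps.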

References

"We notice empirically that changing the distribution helps when using distillation -- especially in the case of MarginMSE -- but we didn't investigate further into why this happens."

SPLADE-v3: New baselines for SPLADE (arXiv:2403.06789, Lassance et al., 11 Mar 2024), Section 2.2, Better Distillation Scores