Achieving Boomerang Distillation Without Storing Teacher Weights
Determine whether zero-shot intermediate-size models constructed via boomerang distillation, in which a distilled student transformer is patched with contiguous blocks of teacher layers, can match the performance and stability achieved when the teacher's weights are retained in memory, thereby substantially reducing the memory footprint.
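To make the layer-patching construction concrete, below is a minimal sketch of how an intermediate-size model can be assembled from a student and contiguous teacher blocks. The Block class, the build_intermediate function, layer_map, and patch_ids are illustrative assumptions, not the paper's implementation; the memory question above concerns whether the teacher layers referenced here must be held in memory at all.

```python
# Minimal sketch (assumed names, not the paper's code): patch a distilled
# student with contiguous teacher blocks to build an intermediate-size model.
import copy
from torch import nn

class Block(nn.Module):
    """Stand-in for a transformer block (attention and MLP omitted)."""
    def __init__(self, d_model: int = 64):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        return x + self.proj(x)

def build_intermediate(student_layers, teacher_layers, layer_map, patch_ids):
    """Swap selected student layers for the contiguous teacher blocks they
    were distilled from.

    layer_map[i] -> (start, end) span of teacher layers matching student layer i
    patch_ids    -> indices of student layers to replace with their teacher span
    """
    new_layers = []
    for i, layer in enumerate(student_layers):
        if i in patch_ids:
            start, end = layer_map[i]
            # Only the teacher blocks actually patched in are copied here;
            # whether the rest of the teacher must stay resident in memory
            # (or could be streamed from disk) is the open question above.
            new_layers.extend(copy.deepcopy(teacher_layers[start:end]))
        else:
            new_layers.append(layer)
    return nn.ModuleList(new_layers)

# Toy example: an 8-layer teacher distilled into a 4-layer student, where
# each student layer stands in for two consecutive teacher layers.
teacher = nn.ModuleList(Block() for _ in range(8))
student = nn.ModuleList(Block() for _ in range(4))
layer_map = {0: (0, 2), 1: (2, 4), 2: (4, 6), 3: (6, 8)}

intermediate = build_intermediate(student, teacher, layer_map, patch_ids={1, 2})
print(len(intermediate))  # 6 blocks: a size between the student and the teacher
```

Varying patch_ids interpolates the depth of the resulting model between the student and the teacher without any further training, which is the zero-shot interpolation the question refers to.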
References
An open question, however, is whether comparable performance and stability can be achieved without retaining the teacher weights in memory, which would substantially reduce the memory footprint.
— Boomerang Distillation Enables Zero-Shot Model Size Interpolation
(arXiv:2510.05064, Kangaslahti et al., 6 Oct 2025), Subsection 3.3, "Effect of Knowledge Distillation"