
Achieving Boomerang Distillation Without Storing Teacher Weights

Determine whether zero-shot intermediate-size models constructed via boomerang distillation (patching a distilled student transformer with contiguous blocks of teacher layers) can achieve comparable performance and stability without retaining the teacher model's weights in memory, thereby reducing the memory footprint.


Background

Boomerang distillation is introduced as a procedure that first distills a large pretrained transformer (teacher) into a smaller student, and then creates intermediate-size models without additional training by patching the student with contiguous blocks of teacher layers. The method relies on alignment between student and teacher, achieved through distillation losses and per-layer cosine alignment, to enable zero-shot interpolation of size and performance.
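To make the patching step concrete, the following is a minimal sketch of how an intermediate-size model could be assembled by swapping a contiguous span of student blocks for the teacher blocks they were distilled from. The function name patch_student_with_teacher, the block_map correspondence, and the patch_range argument are illustrative assumptions rather than the paper's interface; the sketch only shows the zero-shot layer-substitution idea.

```python
import torch.nn as nn

def patch_student_with_teacher(student_blocks: nn.ModuleList,
                               teacher_blocks: nn.ModuleList,
                               block_map: list[tuple[int, int]],
                               patch_range: tuple[int, int]) -> nn.ModuleList:
    """Assemble an intermediate-size stack of transformer blocks.

    block_map[i] = (start, end) is an assumed half-open range of teacher
    blocks aligned with student block i during distillation.
    patch_range = (lo, hi) selects which student blocks to replace.
    No additional training is performed.
    """
    lo, hi = patch_range
    new_blocks = []
    for i, student_block in enumerate(student_blocks):
        if lo <= i < hi:
            # Substitute the contiguous teacher blocks aligned with this
            # student block (zero-shot patching).
            t_start, t_end = block_map[i]
            new_blocks.extend(teacher_blocks[t_start:t_end])
        else:
            # Keep the distilled student block as-is.
            new_blocks.append(student_block)
    return nn.ModuleList(new_blocks)
```

Varying patch_range from an empty span to the full student depth would trace out the size-performance interpolation described above, with the caveat that this sketch requires the teacher blocks to be available in memory, which is exactly the constraint the open question targets.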

In analyzing the role of individual loss terms, the paper observes that initialization from teacher weights and alignment losses contribute to stable interpolation. The authors then highlight a practical constraint: boomerang distillation, as formulated, assumes access to teacher weights to patch the student. They explicitly pose the question of whether comparable performance and stability can be obtained without retaining teacher weights in memory, which would substantially reduce memory footprint and broaden applicability.
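The per-layer cosine alignment mentioned above could take a form like the following sketch. The function name per_layer_cosine_alignment_loss and the one-to-one pairing of student and teacher hidden states are hypothetical; the paper's actual objective may combine such a term with standard distillation losses and a different layer correspondence.

```python
import torch
import torch.nn.functional as F

def per_layer_cosine_alignment_loss(student_hiddens: list[torch.Tensor],
                                    teacher_hiddens: list[torch.Tensor]) -> torch.Tensor:
    """Penalize directional mismatch between matched student and teacher
    hidden states, averaged over layers and tokens."""
    losses = []
    for s_h, t_h in zip(student_hiddens, teacher_hiddens):
        # 1 - cosine similarity per token; teacher activations are detached
        # so gradients only update the student.
        cos = F.cosine_similarity(s_h, t_h.detach(), dim=-1)
        losses.append((1.0 - cos).mean())
    return torch.stack(losses).mean()
```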

References

An open question, however, is whether comparable performance and stability can be achieved without retaining the teacher weights in memory, which would substantially reduce the memory footprint.

Boomerang Distillation Enables Zero-Shot Model Size Interpolation (Kangaslahti et al., arXiv:2510.05064, 6 Oct 2025), Subsection 3.3, Effect of Knowledge Distillation