- The paper defines and detects "teacher hacking" in language model distillation, where student models deviate from ground truth by mimicking imperfections of the teacher model.
- It finds that teacher hacking occurs prominently with offline datasets and can be mitigated by online data generation, which preserves data diversity.
- The findings highlight analogies to reward model over-optimization and emphasize the importance of diverse data sources for effective and safe distillation practices.
Teacher Hacking in LLM Distillation: An Overview
The paper "On Teacher Hacking in LLM Distillation" investigates a critical aspect of refining LMs—the phenomenon of "teacher hacking" during the knowledge distillation process. The authors delineate this concept as an occurrence where a student LLM, in an attempt to mimic a teacher model, deviates from approximating the ground-truth data distribution due to the imperfections inherent in the teacher model. They propose a scientific framework to paper this phenomenon and present experimental evidence of its manifestation, along with potential mitigation strategies.
Key Elements of the Study
The paper outlines a methodological framework to scrutinize teacher hacking, which includes:
- Controlled Experimental Setup: The experiments employ a three-tier model structure consisting of an oracle model (standing in for the ground truth), a teacher model (distilled from the oracle), and a student model (distilled from the teacher). This setup allows discrepancies between the models to be measured with sequence-level metrics such as KL divergence.
- Teacher Hacking Definition: Teacher hacking occurs when the distance between the student and teacher models decreases while the distance between the student and oracle (ground-truth) models increases, an overfitting-like behavior that arises in distillation rather than in traditional training.
- Data Sources in Distillation: The paper compares the effects of using fixed offline datasets versus online data generation methods during distillation. Offline data pertains to static datasets pre-generated from the teacher model, while online methods involve dynamic data generation during training.
- Metrics for Evaluation: The paper introduces "golden" metrics, computed against the oracle model to measure performance relative to the ground truth, and "proxy" metrics, computed against the teacher model to monitor how closely the student imitates the teacher (illustrated in the sketch below).
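To make these metrics concrete, here is a minimal sketch (not code from the paper) of how the proxy and golden metrics could be estimated as Monte Carlo approximations of sequence-level KL divergence; the function names, the KL direction, and the `LogProbFn` interface are illustrative assumptions.

```python
# Hypothetical sketch: estimating the "proxy" (student vs. teacher) and
# "golden" (student vs. oracle) metrics as Monte Carlo estimates of
# sequence-level KL divergence, using sequences sampled from the student.
from typing import Callable, List, Tuple

LogProbFn = Callable[[str], float]  # maps a sequence to its log-probability

def sequence_kl(samples: List[str],
                log_p_student: LogProbFn,
                log_p_reference: LogProbFn) -> float:
    """Monte Carlo estimate of KL(student || reference) from student samples."""
    diffs = [log_p_student(x) - log_p_reference(x) for x in samples]
    return sum(diffs) / len(diffs)

def proxy_and_golden(samples: List[str],
                     log_p_student: LogProbFn,
                     log_p_teacher: LogProbFn,
                     log_p_oracle: LogProbFn) -> Tuple[float, float]:
    proxy = sequence_kl(samples, log_p_student, log_p_teacher)   # imitation of teacher
    golden = sequence_kl(samples, log_p_student, log_p_oracle)   # distance to ground truth
    return proxy, golden

# Toy usage with stand-in log-prob functions (real models would supply these):
proxy, golden = proxy_and_golden(
    ["seq_a", "seq_b"],
    log_p_student=lambda x: -1.0,
    log_p_teacher=lambda x: -1.5,
    log_p_oracle=lambda x: -0.5,
)
```

Teacher hacking would then show up as `proxy` shrinking over training while `golden` grows.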
Findings and Implications
- Detection of Teacher Hacking: The paper detects teacher hacking prominently in scenarios that use offline datasets for distillation. The empirical evidence shows that, as optimization proceeds, the proxy metrics improve (the student moves closer to the teacher) while the golden metrics degrade (the student moves further from the oracle).
- Mitigation Strategies:
- Online Data Generation: Generating teacher responses on the fly during training considerably reduces teacher hacking. The approach preserves data diversity, a factor identified as crucial for minimizing the mismatch between the proxy and golden objectives (see the sketch after this list).
- Data Diversity and Generation Budget: Even within offline datasets, increasing the diversity of prompts and generating multiple responses per prompt helps alleviate the negative effects.
- Scalability Concerns: The paper suggests that the higher computational cost of online data generation is offset by gains in model robustness and alignment with ground-truth objectives.
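To illustrate both the offline-versus-online distinction and how teacher hacking could be flagged during training, here is a hedged sketch; the function names (`distill`, `sample_from_teacher`, `train_step`, `eval_proxy`, `eval_golden`) and the simple trend check are assumptions made for exposition, not the paper's implementation.

```python
# Hypothetical sketch of a distillation loop with offline vs. online data
# sourcing, plus a crude teacher-hacking check: the proxy metric keeps
# improving while the golden metric worsens.
import random
from typing import Callable, List

def distill(prompts: List[str],
            sample_from_teacher: Callable[[str], str],
            train_step: Callable[[str, str], None],
            eval_proxy: Callable[[], float],    # student-teacher distance
            eval_golden: Callable[[], float],   # student-oracle distance
            online: bool = True,
            steps: int = 1000,
            eval_every: int = 100) -> None:
    # Offline: pre-generate a fixed dataset from the teacher before training.
    offline_data = None if online else [(p, sample_from_teacher(p)) for p in prompts]

    prev_proxy, prev_golden = eval_proxy(), eval_golden()
    for step in range(1, steps + 1):
        if online:
            # Online: regenerate teacher responses on the fly, preserving diversity.
            prompt = random.choice(prompts)
            pair = (prompt, sample_from_teacher(prompt))
        else:
            pair = random.choice(offline_data)
        train_step(*pair)

        if step % eval_every == 0:
            proxy, golden = eval_proxy(), eval_golden()
            if proxy < prev_proxy and golden > prev_golden:
                print(f"step {step}: possible teacher hacking "
                      f"(proxy {proxy:.3f} down, golden {golden:.3f} up)")
            prev_proxy, prev_golden = proxy, golden
```

In the paper's findings, the offline branch is where teacher hacking shows up most clearly, while the online branch, by continually refreshing teacher samples, keeps the proxy and golden metrics moving together.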
Theoretical and Practical Implications
Theoretically, this work highlights analogies between distillation shortcomings (teacher hacking) and reward model over-optimization (reward hacking) in RLHF (Reinforcement Learning from Human Feedback). It underscores Goodhart's law in the context of machine learning distillation pipelines: optimizing for an imperfect measure (the teacher) can lead to deviations from the desired outcome (the oracle).
Practically, these insights urge a reevaluation of typical distillation practices in LLM development, especially in resource-limited deployments where smaller, distilled models are necessary. As AI systems increasingly rely on such cost-efficient models, understanding and mitigating factors like teacher hacking is imperative for maintaining model efficacy and safety.
Speculation on Future AI Developments
Future research may explore more sophisticated methodologies for dynamically adjusting the balance between reliance on teacher models and oracle-like proxies throughout the distillation process, potentially leading to adaptive distillation frameworks. Moreover, as the community refines reward models and distillation processes, the delineation of theoretically principled yet computationally efficient strategies will likely advance the broader field of scalable, efficient AI.