Distill Not Only Data but Also Rewards: Can Smaller Language Models Surpass Larger Ones? (2502.19557v1)

Published 26 Feb 2025 in cs.CL and cs.AI

Abstract: Distilling LLMs typically involves transferring the teacher model's responses through supervised fine-tuning (SFT). However, this approach neglects the potential to distill both data (output content) and reward signals (quality evaluations). Extracting reliable reward signals directly from teacher models is challenging, as LLMs are optimized for generation rather than evaluation, often resulting in biased or inconsistent assessments. To address this limitation, we propose a novel distillation pipeline that transfers both responses and rewards. Our method generates pseudo-rewards through a self-supervised mechanism that leverages the inherent structure of both teacher and student responses, enabling reward learning without explicit external evaluation. The reward model subsequently guides reinforcement learning (RL), allowing iterative refinement of the student model after an SFT warm-up phase. Experiments on GSM8K and MMLU-PRO demonstrate that our method consistently outperforms traditional SFT-based approaches, enabling student models to surpass the performance of their teachers. This work highlights the potential for scalable, efficient distillation through structured self-supervised reward learning, reducing dependence on external reward supervision.

Summary

  • The paper introduces a novel LLM distillation pipeline integrating both data (via varied confidence responses) and self-supervised reward signals (from a trained reward model).
  • This dual-phase approach, combining SFT with RL guided by the reward model, consistently outperforms SFT baselines, significantly improving student model performance on benchmarks like GSM8K and MMLU-Pro.
  • The method demonstrates that distilling reward signals allows smaller models to not just mimic but potentially surpass larger teachers, offering a path to more computationally efficient yet powerful LLMs.

Distill Not Only Data but Also Rewards: A New Approach in LLM Distillation

The paper "Distill Not Only Data but Also Rewards: Can Smaller Language Models Surpass Larger Ones?" outlines a methodology for LLM distillation, a domain where reducing the computational demands of LLMs while maintaining or improving performance is a central challenge. The authors propose a structured distillation approach that goes beyond traditional supervised fine-tuning (SFT) by transferring both data and reward signals. The primary contribution is a pipeline that generates self-supervised pseudo-rewards from both teacher and student model responses and then refines the student model through reinforcement learning (RL).

Methodological Insights

This research introduces a distillation pipeline that consists of two key phases: data distillation and reward distillation. In the data distillation phase, the teacher LLM generates multiple responses in varied confidence settings to synthesize a comprehensive dataset for fine-tuning. High-confidence settings provide pseudo-labels, while low-confidence settings enhance diversity, collectively forming a robust foundation for initial student model training via SFT.
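The paper does not include reference code for this phase, but the structure can be sketched. The snippet below illustrates one plausible reading of the data-distillation step: the teacher is sampled under a high-confidence (low-temperature) setting to derive pseudo-labels by agreement, and under a low-confidence (high-temperature) setting for diversity, with the resulting responses forming the SFT dataset. The `teacher_generate` and `extract_answer` callables and the majority-vote labeling rule are illustrative assumptions, not details confirmed by the paper.

```python
# Minimal sketch of the data-distillation phase (assumptions: temperature acts as the
# "confidence setting", pseudo-labels come from majority vote over high-confidence samples,
# and only responses consistent with the pseudo-label are kept for SFT).
from collections import Counter
from typing import Callable, Dict, List

def build_sft_dataset(
    prompts: List[str],
    teacher_generate: Callable[[str, float, int], List[str]],  # (prompt, temperature, n) -> responses
    extract_answer: Callable[[str], str],                      # pulls the final answer from a response
    n_high: int = 8,
    n_low: int = 8,
) -> List[Dict[str, str]]:
    dataset = []
    for prompt in prompts:
        # High-confidence (low-temperature) samples: agreement among them defines the pseudo-label.
        high_conf = teacher_generate(prompt, 0.2, n_high)
        pseudo_label = Counter(extract_answer(r) for r in high_conf).most_common(1)[0][0]

        # Low-confidence (high-temperature) samples contribute diverse reasoning paths.
        low_conf = teacher_generate(prompt, 1.0, n_low)

        # Keep responses whose final answer matches the pseudo-label as SFT targets.
        for response in high_conf + low_conf:
            if extract_answer(response) == pseudo_label:
                dataset.append({
                    "prompt": prompt,
                    "response": response,
                    "pseudo_label": pseudo_label,
                })
    return dataset
```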

Reward distillation employs a self-supervised approach to avoid dependence on explicit external reward signals, which are often unreliable due to the inconsistency and biases of direct teacher evaluations. A reward model is trained to evaluate student-generated answers in alignment with the pseudo-labels, ensuring that preferred responses reflect superior reasoning paths rather than merely correct conclusions. The resulting reward model then guides RL, allowing iterative improvement beyond the initial SFT warm-up and enabling the student model to outperform its teacher in certain scenarios.
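As a rough illustration of the reward-distillation idea, the sketch below trains a scalar reward model with a standard Bradley-Terry pairwise loss, where "chosen" responses are those consistent with the pseudo-label and "rejected" ones are not. The pairing rule, the `reward_model` interface, and the PPO-style use of the reward are assumptions made for illustration rather than the paper's exact formulation.

```python
# Sketch of self-supervised reward learning with a pairwise (Bradley-Terry) objective.
# Assumption: `reward_model` maps a batch of token-id tensors to one scalar reward per sequence.
import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_model: torch.nn.Module,
                         chosen_ids: torch.Tensor,
                         rejected_ids: torch.Tensor) -> torch.Tensor:
    """Push the reward of pseudo-label-consistent (chosen) responses above rejected ones."""
    r_chosen = reward_model(chosen_ids)      # shape: (batch,)
    r_rejected = reward_model(rejected_ids)  # shape: (batch,)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# After training, the reward model scores student rollouts during RL
# (e.g., as the scalar reward in a PPO-style update that refines the
# SFT-initialized student beyond its warm-up performance).
```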

Numerical Results

Empirical evaluations demonstrate that the proposed method consistently surpasses baseline SFT-based approaches, improving student model performance across varying scenarios and teacher capabilities. For instance, under a powerful teacher such as Llama3-70B, the accuracies of a 1B student model increase by 3.03% on GSM8K and 4.27% on MMLU-Pro post-distillation. Remarkably, under certain configurations the student model not only matches but surpasses the teacher's performance, underscoring the potential of integrating reward signals into the distillation process.

Theoretical and Practical Implications

Theoretically, this work pushes the boundaries of knowledge distillation by establishing a framework where smaller models are not just mimicking larger counterparts, but are systematically improved through a comprehensive understanding of response quality. Practically, such an approach could significantly reduce the computational cost associated with deploying powerful LLMs while delivering competitive, or even superior, performance in appropriate contexts.

Future Directions

Future research could focus on refining the pseudo-reward system, potentially automating more nuanced reward feedback mechanisms to further enhance model learning efficacy. Expanding the applications to varied non-textual tasks could also illuminate the broader potential of this paradigm across different domains within AI. Moreover, understanding the long-term model behavior under reward-driven learning paradigms remains an enticing avenue for investigation, especially concerning the stability of improvements and the scope for generalization.

In conclusion, this paper presents a significant contribution to the field of AI model optimization, pointing toward scalable and efficient methods that blend imitative and evaluative learning components. This dual-channel approach to distillation marks an important step in the pursuit of accessible yet powerful language modeling capabilities.