
In Good GRACEs: Principled Teacher Selection for Knowledge Distillation (2511.02833v1)

Published 4 Nov 2025 in cs.LG and cs.CL

Abstract: Knowledge distillation is an efficient strategy to use data generated by large "teacher" LLMs to train smaller capable "student" models, but selecting the optimal teacher for a specific student-task combination requires expensive trial-and-error. We propose a lightweight score called GRACE to quantify how effective a teacher will be for post-training a student model. GRACE measures distributional properties of the student's gradients without access to a verifier, teacher logits, teacher internals, or test data. From an information-theoretic perspective, GRACE connects to leave-one-out stability of gradient-based algorithms, which controls the generalization performance of the distilled students. On GSM8K and MATH, GRACE correlates strongly (up to 86% Spearman correlation) with the performance of the distilled LLaMA and OLMo students. In particular, training a student using the GRACE-selected teacher can improve the performance by up to 7.4% over naively using the best-performing teacher. Further, GRACE can provide guidance on crucial design choices in distillation, including (1) the best temperature to use when generating from the teacher, (2) the best teacher to use given a size constraint, and (3) the best teacher to use within a specific model family. Altogether, our findings demonstrate that GRACE can efficiently and effectively identify a strongly compatible teacher for a given student and provide fine-grained guidance on how to perform distillation.

Summary

  • The paper introduces the GRACE metric that evaluates student gradients to enable principled teacher selection for efficient knowledge distillation.
  • It reports up to 86% correlation between GRACE scores and student performance on tasks like GSM8K and MATH, outperforming traditional metrics.
  • Results guide optimal hyperparameter tuning, such as generation temperature, and demonstrate practical benefits in large-scale language model distillation.

Summary of "In Good GRACEs: Principled Teacher Selection for Knowledge Distillation"

Introduction

The paper addresses the challenge of selecting the optimal teacher for knowledge distillation, in which data generated by a large teacher model is used to train a smaller student model. The authors introduce a metric, GRACE, that measures the effectiveness of a potential teacher based solely on the student's gradient properties during distillation, avoiding the need for access to the teacher's internals, teacher logits, or any external verifier.

GRACE Score

The GRACE score evaluates distributional properties of the student's gradients on subsets of the teacher-generated data. It connects to gradient cross-validation and to the theoretical framework of leave-one-out stability and conditional mutual information bounds, which control the generalization performance of the distilled student. GRACE is lightweight, computed using only the student's gradients on the teacher-generated data, and can guide not just the choice of teacher but also key hyperparameters such as generation temperature.
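The paper's exact formulation is not reproduced on this page, but the general recipe can be illustrated with a short sketch: compute per-example gradients of the student's loss on the teacher-generated data and measure how consistent each example's gradient is with the rest, in the spirit of leave-one-out stability. The PyTorch sketch below is illustrative only; `grace_like_score`, the cosine-agreement choice, and the toy data are assumptions, not the authors' implementation.

```python
# Hedged sketch of a GRACE-style score: leave-one-out agreement of the
# student's per-example gradients on teacher-generated data.
import torch
import torch.nn as nn
import torch.nn.functional as F


def per_example_gradients(student: nn.Module, inputs: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Flattened gradient of the student's loss for each (input, target) pair."""
    grads = []
    for x, y in zip(inputs, targets):
        student.zero_grad()
        loss = F.cross_entropy(student(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()
        g = torch.cat([p.grad.detach().flatten() for p in student.parameters() if p.grad is not None])
        grads.append(g)
    return torch.stack(grads)  # shape: (n_examples, n_params)


def grace_like_score(student: nn.Module, inputs: torch.Tensor, targets: torch.Tensor) -> float:
    """Average leave-one-out cosine agreement between each example's gradient
    and the mean gradient of the remaining examples (a stability-style proxy)."""
    G = per_example_gradients(student, inputs, targets)
    total = G.sum(dim=0)
    n = G.shape[0]
    scores = []
    for i in range(n):
        loo_mean = (total - G[i]) / (n - 1)  # mean gradient without example i
        scores.append(F.cosine_similarity(G[i], loo_mean, dim=0))
    return torch.stack(scores).mean().item()


if __name__ == "__main__":
    # Toy student and synthetic "teacher-generated" data, purely for illustration.
    torch.manual_seed(0)
    student = nn.Linear(16, 4)
    teacher_inputs = torch.randn(32, 16)          # stand-in for teacher generations
    teacher_targets = torch.randint(0, 4, (32,))  # stand-in labels/completions
    print(f"GRACE-like score: {grace_like_score(student, teacher_inputs, teacher_targets):.4f}")
```

Under this kind of proxy, a teacher whose generations induce more mutually consistent student gradients would receive a higher score.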

Experimental Setup

Experiments were conducted on the GSM8K and MATH math-reasoning benchmarks with LLaMA and OLMo students, comparing GRACE against several baselines including G-Vendi and gradient-norm metrics. A variety of teacher models from distinct families were evaluated to ensure robustness across different teaching strategies.

Key Findings

  • Correlation with Student Performance: GRACE achieves high correlation with post-training student performance, up to 86% Spearman correlation on GSM8K and MATH, outperforming traditional metrics such as the teacher's own accuracy.
  • Improvement in Student Performance: Selecting teachers based on GRACE yields notable gains, up to 7.4%, over naively using the best-performing teacher as measured by standard accuracy.
  • Guiding Distillation Practices: GRACE helps identify optimal generation temperatures and assists in choosing the most suitable teacher within constraints such as model size and family, providing practical guidance for distillation (a selection sketch follows the figure below).

Figure 1: GRACE correlates with student performance after distillation on math-related reasoning tasks.
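Building on the guidance findings above, a score of this kind can in principle be used to rank candidate (teacher, temperature) configurations for a fixed student and pick the most compatible one. The sketch below is illustrative; `rank_candidates` and the candidate-data layout are assumptions rather than the paper's tooling, and `grace_like_score` refers to the earlier sketch.

```python
# Hedged sketch: ranking candidate (teacher, temperature) configurations for a
# fixed student by a GRACE-style score computed on each candidate's generations.
from typing import Callable, Dict, List, Tuple

import torch


def rank_candidates(
    student: torch.nn.Module,
    candidate_data: Dict[Tuple[str, float], Tuple[torch.Tensor, torch.Tensor]],
    score_fn: Callable[[torch.nn.Module, torch.Tensor, torch.Tensor], float],
) -> List[Tuple[Tuple[str, float], float]]:
    """Score each candidate's generated data with the student's gradients and
    return candidates sorted from most to least compatible."""
    scored = [
        (config, score_fn(student, inputs, targets))
        for config, (inputs, targets) in candidate_data.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Example usage (synthetic data standing in for teacher generations):
# candidate_data = {("teacher-A", 0.7): (xs_a, ys_a), ("teacher-B", 1.0): (xs_b, ys_b)}
# ranking = rank_candidates(student, candidate_data, grace_like_score)
# best_config, best_score = ranking[0]
```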

Practical Implications

GRACE has significant application potential in large-scale LLM distillation, where computational resources are a constraint. Its guidance can lead to stronger student models at lower cost, making it particularly beneficial in settings where exhaustive testing of teacher-student combinations is infeasible.

Conclusions

GRACE represents a robust metric for teacher selection in knowledge distillation, bridging theoretical insights with practical distillation challenges. Its strong predictive capacity regarding student performance after distillation makes it an invaluable tool in optimizing LLMs. Future work could explore its application beyond mathematical tasks and integration with adaptive preconditioner matrices to further refine its efficacy.
