Knowledge Distillation Must Account for What It Loses

Published 28 Apr 2026 in cs.LG and cs.AI | (2604.25110v1)

Abstract: This position paper argues that knowledge distillation must account for what it loses: student models should be judged not only by retained task scores, but by whether they preserve the teacher capabilities that make those scores reliable. This matters because distillation is increasingly used to turn large, often frontier models into deployable systems, yet headline metrics can hide losses in uncertainty, boundary behavior, process reliability, on-policy stability, grounding, privacy, safety, and diversity. We identify the retention assumption behind current evaluation and reframe distillation as a lossy projection of teacher behavior rather than a faithful copy. We then synthesize existing evidence into a taxonomy of off-metric distillation losses, showing that these losses are concrete, recurring, and measurable. To make the position actionable, we propose scenario-specific preservation targets and a Distillation Loss Statement that reports what was preserved, what was lost, and why the remaining losses are acceptable. The goal is not lossless distillation, but accountable distillation.


Authors (1)

Wenshuo Wang

Summary

The paper challenges conventional distillation evaluation, arguing that headline metrics can hide losses in teacher capabilities such as uncertainty, robustness, and safety.
It reframes distillation as a 'lossy projection' of teacher behavior and defines off-metric losses that measure teacher-student divergence on scenario-specific stress distributions.
It proposes the Distillation Loss Statement, a report that documents what was preserved, what was lost, and why the remaining losses are acceptable for deployment.

Distillation as Accountable Transformation: Losses Beyond Retained Metrics

Position and Motivation

The paper "Knowledge Distillation Must Account for What It Loses" (2604.25110) challenges existing conventions in knowledge distillation by asserting that evaluating student models primarily on headline task metrics is insufficient. The central argument is that classical and contemporary distillation workflows, spanning LLMs, agent-based architectures, and specialized students, routinely claim successful transfer if the student matches—or nearly matches—the teacher on a benchmark or scalar metric (e.g., accuracy, pass@k, downstream utility). However, this retention-centric evaluation omits many critical teacher capabilities that determine reliability and trustworthiness in real deployments, including uncertainty calibration, robustness, boundary behavior, grounding, privacy, safety, and diversity of outputs.

This oversight is particularly notable given the growing use of distillation to make large frontier models deployable in practical settings, where failure modes invisible to aggregate metrics can have serious consequences.

Formalization: Retention Assumption and Lossy Projection

The paper makes explicit the retention assumption implicitly adopted in most distillation literature: that matching teacher performance on a primary metric implies retention of all necessary capabilities for deployment. This is formally described by the (unsound) implication $m(S) \approx m(T) \Rightarrow c_i(S) \approx c_i(T)$ for all scenario-critical capabilities $c_i$. In practice, distillation optimizes over observed signals which serve as projections of the teacher’s full behavioral vector. The student matches teacher outputs in the projected subspace but remains unconstrained, or only weakly constrained, in unobserved dimensions. This reframing as "lossy projection" sets up the notion of off-metric distillation loss: for any capability $c$, the loss $\Delta_c(T,S;\mathcal{D}_c)$ quantifies divergence on a relevant stress distribution.
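One way to read $\Delta_c(T,S;\mathcal{D}_c)$ is as an average per-input gap between teacher and student on a capability-specific stress set. The sketch below is illustrative only: the function name `off_metric_loss`, the toy `teacher` and `student`, and the choice of mean absolute difference as the divergence are assumptions made for this example, not definitions taken from the paper.

```python
from typing import Callable, Sequence

# Hypothetical model type: maps an input to a scalar behavior score
# (here, a confidence value).
Model = Callable[[float], float]


def off_metric_loss(teacher: Model, student: Model,
                    stress_inputs: Sequence[float]) -> float:
    """Mean absolute teacher-student gap on a stress distribution D_c."""
    if not stress_inputs:
        raise ValueError("stress distribution must be non-empty")
    gaps = [abs(teacher(x) - student(x)) for x in stress_inputs]
    return sum(gaps) / len(gaps)


def teacher(x: float) -> float:
    # Toy teacher: lowers its confidence outside the training range [0, 1].
    return 0.9 if 0.0 <= x <= 1.0 else 0.5


def student(x: float) -> float:
    # Toy student: matches the teacher in range but stays confident everywhere.
    return 0.9


in_distribution = [0.1, 0.5, 0.9]   # inputs the headline metric covers
stress_set = [1.5, 2.0, 3.0]        # inputs that probe boundary behavior

headline_gap = off_metric_loss(teacher, student, in_distribution)
stress_gap = off_metric_loss(teacher, student, stress_set)
print(f"in-distribution gap: {headline_gap:.2f}, stress gap: {stress_gap:.2f}")
```

In this toy setup the in-distribution gap is zero while the stress gap is about 0.4, which is the pattern the retention assumption misses: agreement in the observed subspace says little about the unobserved one.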

The paper’s claim is not that distillation should be lossless or preserve all teacher traits, but that consequential losses must be named, measured, and justified in context.

Evidence and Taxonomy of Off-Metric Losses

Empirical and theoretical evidence synthesized from the literature demonstrates teacher-student divergence across dimensions that are weakly coupled—or decoupled—from main metrics:

Distributional Loss: The student may retain top-k answers but lose fidelity in probability estimates, entropy, or alternative outputs. Statistical analyses (e.g., [Menon et al. 2021], [Stanton et al. 2021]) show that soft-target information is not fully transferred by optimizing for final-score retention.
Representation/Relation Loss: Output similarity does not guarantee transfer of intermediate representations, attention structures, or relational properties ([Wang et al. 2020], [Ojha et al. 2023]).
Capability Boundary Loss: Students often retain average-case performance but degrade in OOD robustness, adversarial resistance, or subgroup reliability ([Shao et al. 2022], [Stacey & Rei 2024]).
Privacy/Memorization Loss: Distillation can transfer or amplify teacher memorization and privacy risks ([Jagielski et al. 2023], [Zhang et al. 2025]).
Calibration/Abstention Loss: Correct answers do not entail reliable uncertainty estimation or abstention behavior ([Guo et al. 2017], [Hebbalaguppe et al. 2024]).
Process/On-Policy Stability Loss: Distilled models may present plausible reasoning traces but lack actual process reliability, self-verification, and recovery mechanisms ([Hsieh et al. 2023], [Song & Zheng 2026]).
Grounding/Provenance Loss: Factual answer text does not imply evidence-grounded output or correct attribution ([Lewis et al. 2020], [Huang et al. 2024]).
Safety-Boundary Loss: Retaining refusal text or aggregate safety scores neglects evaluation of selective refusal, jailbreak robustness, and boundary prompts ([Cui et al. 2024], [Zhang et al. 2026]).
Long-Tail/Diversity Loss: Students often capture high-frequency patterns but lose rare modes, minority classes, and distributional heterogeneity ([Shumailov et al. 2024]).
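The distributional category above can be made concrete with a small check: a student may agree with the teacher on every top-1 answer while its probability estimates diverge sharply. The following is a minimal sketch under stated assumptions. The probability vectors are invented toy values, and KL divergence is only one possible divergence; the paper does not prescribe it.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float],
                  eps: float = 1e-12) -> float:
    """KL(p || q) in nats for two discrete distributions over the same classes."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of classes")
    # Terms with p_i == 0 contribute nothing; q_i is floored to avoid log(0).
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0.0)


def argmax(p: Sequence[float]) -> int:
    return max(range(len(p)), key=lambda i: p[i])


# Toy predictive distributions (hypothetical values, three classes, two inputs).
teacher_probs = [[0.5, 0.4, 0.1], [0.6, 0.3, 0.1]]
student_probs = [[0.98, 0.01, 0.01], [0.97, 0.02, 0.01]]

pairs = list(zip(teacher_probs, student_probs))
top1_agreement = sum(argmax(t) == argmax(s) for t, s in pairs) / len(pairs)
mean_kl = sum(kl_divergence(t, s) for t, s in pairs) / len(pairs)

print(f"top-1 agreement: {top1_agreement:.2f}, mean KL(T||S): {mean_kl:.2f} nats")
```

Here top-1 agreement is perfect, yet the mean KL divergence is large because the student has collapsed the teacher's uncertainty over the second class. An accuracy-style metric would report full retention for this pair.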

Single-domain studies reinforce the claim: metamorphic testing of code-distilled models found performance drops up to 285% under adversarial transformation and up to 62% behavioral discrepancy, which traditional accuracy-based metrics miss ([Awal et al. 2025]).
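The general idea behind metamorphic testing can be sketched as follows: apply a transformation that should not change the correct output, then count how often teacher and student disagree on the transformed inputs. This is a toy illustration, not the procedure of [Awal et al. 2025]. The classifiers, the `rename_identifier` transformation, and the `discrepancy_rate` definition are all assumptions made for this example.

```python
from typing import Callable, Sequence

Classifier = Callable[[str], str]  # hypothetical: code snippet -> label


def rename_identifier(code: str, old: str = "user_input", new: str = "v0") -> str:
    """Toy semantics-preserving transformation: rename one identifier."""
    return code.replace(old, new)


def discrepancy_rate(teacher: Classifier, student: Classifier,
                     inputs: Sequence[str],
                     transform: Callable[[str], str]) -> float:
    """Fraction of transformed inputs on which teacher and student disagree."""
    if not inputs:
        raise ValueError("inputs must be non-empty")
    transformed = [transform(x) for x in inputs]
    return sum(teacher(x) != student(x) for x in transformed) / len(transformed)


def teacher(code: str) -> str:
    # Toy teacher: keys on the dangerous call itself.
    return "unsafe" if "eval(" in code else "safe"


def student(code: str) -> str:
    # Toy student: learned a shortcut that also depends on the identifier name.
    return "unsafe" if "eval(" in code and "user_input" in code else "safe"


snippets = ["eval(user_input)", "print(user_input)",
            "eval(user_input + '1')", "len(user_input)"]

clean_rate = discrepancy_rate(teacher, student, snippets, lambda c: c)
transformed_rate = discrepancy_rate(teacher, student, snippets, rename_identifier)
print(f"clean: {clean_rate:.2f}, after renaming: {transformed_rate:.2f}")
```

On the untransformed snippets the two toy models agree everywhere, so an accuracy comparison would show no gap. After renaming, they disagree on half of the inputs, because the student relied on a surface cue the teacher did not need.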

Scenario-Specific Preservation and Distillation Loss Statements

The paper proposes shifting the evaluation paradigm from aggregate metric retention to scenario-specific preservation of teacher capabilities that underpin reliability and deployment trust. It recommends:

Identifying for each distillation scenario which teacher traits are critical.
Selecting relevant behavioral proxies, observable teacher signals, and stress distributions for evaluation.
Reporting and justifying observed losses in a standardized Distillation Loss Statement, detailing primary metrics, critical off-metric capabilities, preservation targets, stress tests, divergence, accepted losses, and deployment implications.
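The reporting step above could be captured in a small data structure. The sketch below is one possible encoding, not a schema defined by the paper: the class names, field names, the convention that a preservation target is a maximum tolerated divergence, and all numeric values are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CapabilityEntry:
    capability: str                 # critical off-metric capability
    stress_test: str                # stress distribution or test used
    preservation_target: float      # assumed: maximum tolerated divergence
    measured_divergence: float      # observed teacher-student divergence
    justification: Optional[str] = None  # why an exceeded target is acceptable

    @property
    def within_target(self) -> bool:
        return self.measured_divergence <= self.preservation_target


@dataclass
class DistillationLossStatement:
    primary_metrics: Dict[str, float]
    entries: List[CapabilityEntry] = field(default_factory=list)
    deployment_implications: str = ""

    def accepted_losses(self) -> List[CapabilityEntry]:
        """Losses that exceed their target but carry a stated justification."""
        return [e for e in self.entries
                if not e.within_target and e.justification]

    def unjustified_losses(self) -> List[CapabilityEntry]:
        """Losses that exceed their target with no justification recorded."""
        return [e for e in self.entries
                if not e.within_target and not e.justification]


# Placeholder values for illustration only; these are not reported results.
statement = DistillationLossStatement(
    primary_metrics={"task_accuracy_gap": 0.01},
    entries=[
        CapabilityEntry("calibration", "held-out ECE gap", 0.05, 0.03),
        CapabilityEntry("OOD robustness", "shifted test set", 0.05, 0.12,
                        justification="deployment restricted to in-distribution inputs"),
        CapabilityEntry("jailbreak robustness", "boundary prompt suite", 0.02, 0.10),
    ],
    deployment_implications="Not cleared for open-ended user prompts.",
)

for entry in statement.unjustified_losses():
    print(f"needs justification: {entry.capability}")
```

Separating accepted from unjustified losses reflects the summary's point that a loss is not disqualifying in itself; what matters is whether it has been named, measured, and justified for the scenario.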

This approach aligns with broader documentation practices in ML (e.g., model cards, datasheets), and recognizes that the acceptability of loss is context-dependent, especially in high-risk or privacy-sensitive domains.

Alternative Views and Adjacent Work

The paper addresses several counterarguments:

Expected Loss: Distillation is inherently lossy, so authors should document and justify which losses are acceptable.
Low-Risk Scope: Retention metrics may suffice for closed, low-stakes tasks, but not for open, high-risk deployments.
Measurement Burden: Universal thresholds are impractical; scenario-specific proxies and stress distributions should be prioritized.
Black-Box Access: Even when teacher signals are inaccessible, behavioral proxies can surface consequential divergence.
Corrective Divergence: Students may intentionally improve on harmful teacher traits; loss accounting should distinguish intended corrections from detrimental erosion.

Related work on fidelity studies, capability preservation, process faithfulness, and domain-specific loss evaluation underpins the position but lacks a formal reporting norm for capability retention in distillation.

Conclusion

The paper calls for a shift in distillation research from score-retention-centric claims to explicit, accountable loss assessment. It argues that the retention assumption is insufficient: distilled students frequently diverge from teachers in deployment-critical behaviors that headline task metrics obscure. Scenario-specific preservation targets and Distillation Loss Statements are proposed to make these losses visible, measurable, and contextually justified. The framework supports more rigorous interpretation of student capability, guards against unexamined erosion, and informs both deployment and review in settings where compression trades off against reliability, privacy, diversity, and safety. Future work may develop standardized behavioral proxies, expanded stress distributions, and automated tools for scenario-guided loss accounting.
