The Distillation Game: Adaptive Attacks & Efficient Defenses

Published 21 May 2026 in cs.LG and cs.AI | (2605.22737v1)

Abstract: Distillation attacks create a deployment trade-off for model providers: the same outputs that make a model more useful can also make it easier to imitate. We study this trade-off through a minimax game between a utility-constrained teacher and an adaptive student. Our framework yields tractable one-sided response rules: an adaptive evaluation rule in which the student reweights high-value examples, and a teacher-side defense template that suppresses outputs most useful for distillation. From a cheap proxy for example value, we derive Product-of-Experts (PoE), a simple forward-pass-only defense that combines the teacher with a proxy student during generation. Empirically, adaptive evaluation reveals a large passive–adaptive gap: on state-of-the-art defenses, adaptive students recover substantially more capability than passive evaluation suggests on GSM8K and MATH. Under this stronger evaluation, the apparent robustness gap between expensive defenses and PoE narrows considerably, while PoE remains substantially cheaper and preserves higher-quality reasoning traces. Overall, our results suggest that strong distillation remains difficult to stop, and that progress on antidistillation should be judged against adaptive students rather than passive ones. Our code is available at: https://github.com/ysfalh/distillation-game.


Authors (4)

Summary

The paper presents a game-theoretic formulation where the teacher and student engage in a minimax game to balance the utility of outputs and the risks of distillation.
An adaptive student uses exponential tilting to prioritize high-value traces, reaching up to roughly 50% higher student accuracy (in relative terms) than a passive student trained on the same defended outputs.
The efficient Product-of-Experts defense suppresses valuable traces while preserving reasoning quality and lowering computational costs compared to gradient-based approaches.

Game-Theoretic Distillation: Adaptive Attacks and Efficient Defenses

Formulation of the Distillation Game

The paper "The Distillation Game: Adaptive Attacks & Efficient Defenses" (2605.22737) introduces a principled game-theoretic framework for analyzing distillation attacks and defenses in LLMs. Distillation attacks exploit the rich outputs exposed by a deployed model ("teacher") to train imitation models ("students"), presenting an inherent trade-off between utility and risk: increased transparency and detail in model outputs provide valuable training signals to adversaries seeking to replicate capabilities.

The proposed framework models the interaction between the teacher and student as a minimax game. The teacher releases outputs constrained by KL-divergence fidelity to the underlying reference model, while the student can adaptively reweight or filter released samples, subject to its own KL adaptation budget. Central to the framework is a value function $v(x, y)$ which quantifies the utility of a sample $(x, y)$ for distillation.

The theoretical analysis yields explicit closed-form best responses: the student employs an exponential tilting strategy to concentrate training on high-value traces, and the teacher counteracts via exponential suppression of those traces in its output distribution. This generalizes prior approaches and enables systematic evaluation and design of defenses.
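For intuition, KL-constrained problems of this kind generally have Gibbs-form (exponentially tilted) solutions. The following is a sketch of what such best responses look like, not the paper's exact statement: apart from $v(x, y)$, the symbols are assumptions introduced here, with $p$ the teacher's reference distribution, $q$ the released distribution, $w$ the student's training weights, and $\beta, \lambda > 0$ temperatures determined by the respective KL budgets.

```latex
\begin{aligned}
w^\star(x, y) &\propto q(y \mid x)\,\exp\!\big(v(x, y)/\beta\big)
  && \text{(student: tilt toward high-value traces)} \\
q^\star(y \mid x) &\propto p(y \mid x)\,\exp\!\big(-v(x, y)/\lambda\big)
  && \text{(teacher: suppress high-value traces)}
\end{aligned}
```

In this form, a tighter KL budget corresponds to a larger temperature, so $\beta \to \infty$ recovers a passive student that trains uniformly on released samples, and $\lambda \to \infty$ recovers the undefended teacher.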

Figure 1: An adaptive attacker does not train uniformly on all teacher outputs; it estimates the usefulness of each queried sample and assigns larger training weight to higher-value responses.

Adaptive Attack and Defense Mechanisms

The adaptive student, optimizing the minimax game's inner response, exponentially reweights released examples according to $v(x, y)$, sharply favoring those with greatest downstream impact (Algorithm 1). This strategic weighting exposes a significant gap between passive and adaptive evaluation: empirical results show adaptive students recover substantially more capability from defended traces.
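The reweighting step can be illustrated with a minimal sketch. It is not the paper's Algorithm 1: the function names, the temperature `beta`, and the example value scores are hypothetical, and the sketch assumes per-example value estimates are already available.

```python
import math

def tilted_weights(values, beta=1.0):
    """Exponentially tilt uniform sample weights toward high-value examples.

    values: per-example value estimates v(x, y) (hypothetical proxy scores).
    beta:   temperature (assumed name); smaller beta concentrates weight more.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    # Subtract the max before exponentiating for numerical stability.
    m = max(values)
    unnorm = [math.exp((v - m) / beta) for v in values]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def kl_from_uniform(weights):
    """KL(w || uniform): how far the reweighting departs from passive training."""
    n = len(weights)
    return sum(w * math.log(w * n) for w in weights if w > 0)

# Illustrative value scores for four released traces (made-up numbers).
values = [0.1, 0.5, 2.0, 1.2]
passive = tilted_weights(values, beta=1e6)   # near-uniform: approximates a passive student
adaptive = tilted_weights(values, beta=0.5)  # concentrates weight on the highest-value trace
```

The resulting weights would typically be used as per-example multipliers on the student's training loss. The KL from uniform gives one way to check that a chosen `beta` stays within the student's adaptation budget.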

On the teacher side, the theoretically optimal defense requires suppressing traces most valuable for distillation. Since the true value function is costly to compute, practical defenses use proxies. The paper derives Product-of-Experts (PoE), a fast, forward-pass-only defense that geometrically combines teacher and proxy-student predictions. The likelihood gap between teacher and student guides suppression: outputs with large teacher-over-student likelihood are preferentially downweighted. This defense is cheaper than gradient-based approaches such as Antidistillation Sampling (ADS) and preserves trace auditability.
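One way to read this description is as a per-token geometric mixture of the two models' next-token distributions. The sketch below assumes that reading; the mixing strength `alpha`, the function names, and the toy logits are hypothetical, and the paper's exact combination rule may differ.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - lse for l in logits]

def poe_next_token_logprobs(teacher_logits, proxy_logits, alpha=0.5):
    """Assumed PoE rule: log q(t) = log p_T(t) - alpha * (log p_T(t) - log p_S(t)) - log Z.

    Tokens the teacher likes much more than the proxy student (a large
    likelihood gap) are downweighted. alpha = 0 recovers the plain teacher.
    Only forward-pass logits from both models are needed, no gradients.
    """
    lt = log_softmax(teacher_logits)
    ls = log_softmax(proxy_logits)
    combined = [t - alpha * (t - s) for t, s in zip(lt, ls)]
    return log_softmax(combined)  # renormalize over the vocabulary

# Toy 3-token vocabulary: the teacher prefers token 0, the proxy student token 2.
teacher = [2.0, 1.0, 0.0]
proxy = [0.0, 1.0, 2.0]
q = poe_next_token_logprobs(teacher, proxy, alpha=0.5)
probs = [math.exp(l) for l in q]  # uniform here, since the two preferences cancel
```

In this toy case the teacher's favorite token drops from about 0.67 probability to 1/3, which illustrates how the likelihood gap drives suppression. In practice `alpha` would be tuned against the teacher's utility constraint.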

Empirical Results and Evaluation

Experiments are conducted on the GSM8K and MATH benchmarks, using DeepSeek-R1-Distill-Qwen-7B as the teacher, Qwen2.5-3B as the proxy student, and Llama-3.2-3B as the final student. Three teacher variants (standard, ADS, PoE) are tested across a utility-distillability frontier.

Results demonstrate:

Adaptive evaluation reveals substantially more distillation leakage: Under adaptive weighting, student accuracy after distillation increases by roughly half in relative terms for ADS and by about a quarter for PoE (e.g., on GSM8K, passive vs. adaptive student accuracy is $34\%$ vs $52\%$ for ADS and $39\%$ vs $49\%$ for PoE).
The gap between expensive and cheap defenses narrows under adaptive attack: PoE, while computationally efficient ($1.6\times$ teacher runtime, compared to $2.9\times$ for ADS), performs comparably or slightly better than ADS when judged by adaptive student recovery.
PoE preserves reasoning trace quality: A rubric-based judge (Claude Sonnet 4.6) rates PoE traces higher in auditability and logical structure compared to ADS, confirmed by human evaluation with $\kappa=0.76$ agreement.
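The GSM8K figures quoted above can be turned into absolute and relative gains with a few lines of arithmetic. This sketch reads the first accuracy pair as ADS, which is an assumption based on the surrounding sentence, and the helper name `gains` is hypothetical.

```python
def gains(passive, adaptive):
    """Return (absolute gain in points, relative gain in percent)."""
    return adaptive - passive, 100.0 * (adaptive - passive) / passive

# GSM8K student accuracies quoted in the text (passive, adaptive).
ads_abs, ads_rel = gains(34, 52)  # first pair, read here as ADS (assumption)
poe_abs, poe_rel = gains(39, 49)

print(f"ADS: +{ads_abs} points, +{ads_rel:.1f}% relative")  # ADS: +18 points, +52.9% relative
print(f"PoE: +{poe_abs} points, +{poe_rel:.1f}% relative")  # PoE: +10 points, +25.6% relative

# Runtime overheads relative to the standard teacher, as quoted.
print(f"PoE/ADS runtime ratio: {1.6 / 2.9:.2f}")            # PoE/ADS runtime ratio: 0.55
```

Under adaptive evaluation, lower student accuracy means a stronger defense, so PoE's 49% against 52% for ADS is consistent with the statement that PoE performs comparably or slightly better at roughly 55% of the runtime overhead.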

Figure 2: GSM8K utility–distillability frontier illustrates how adaptive evaluation (student accuracy) shifts leakage upward for all defense methods.

Figure 3: Student accuracy after distillation from commercial frontier-model outputs under varied exposure formats demonstrates that reasoning traces yield significant recovery, summary traces are less informative but still useful, and answer-only outputs are weakest.

Figure 4: Reasoning trace word count distribution; PoE typically produces shorter, more focused traces than ADS or standard teachers.

Theoretical and Practical Implications

The paper's core theoretical contribution is the unified minimax game formulation, with tractable responses, connecting defense and evaluation in antidistillation. The empirical findings underscore that evaluating against adaptive attackers is essential, as defenses optimized for passive threats provide a misleading sense of security.

Practically, PoE's efficiency and trace preservation make it an attractive defense, especially as adaptive distillers (filtering for high-value traces) are realistic threats and significantly degrade the robustness of conventional methods such as ADS.

The framework is modular: defense and attack rules depend on the chosen value function and modeling budgets. Future work may refine value proxies, extend adaptation (e.g., generative trace inversion), and diversify models and tasks.

Conclusion

This work establishes a rigorous game-theoretic methodology for evaluating and designing distillation defenses in LLMs. Adaptive evaluation is critical: defenses are substantially weaker against adversaries that exploit trace value, and efficient methods like Product-of-Experts are competitive under realistic threat models. The approach lays a foundation for future antidistillation research, advocating for explicit adaptive attacker specification and practical defense design informed by auditability and computational efficiency.
