White-Box Knowledge Distillation

Updated 9 April 2026

White-box knowledge distillation is a method where the student model learns from a teacher’s internal activations such as feature maps, attention maps, and relational structures.
It employs techniques like feature map matching, attention transfer, and relational distillation to achieve closer alignment and improved generalization.
Empirical results in vision and language tasks indicate that white-box KD effectively narrows the teacher–student performance gap compared to output-only distillation.

White-box knowledge distillation is a class of model compression techniques in which the student network is exposed not only to the teacher’s outputs (predictions or logits) but also to its internal representations: feature maps, attention maps, hidden states, or relational structures. This provides a richer supervisory signal than black-box distillation, which matches only the outputs. The student can therefore approximate the teacher’s representations and behavior more closely, particularly when architectures differ or when the target task requires dense or structured outputs (Mansourian et al., 15 Mar 2025).

1. Foundational Principles and Taxonomy

White-box knowledge distillation (KD) builds on the teacher–student paradigm, relaxing the restriction that the student can only access the teacher’s output distributions. Instead, it allows for matching at various levels within the network. Typical white-box KD sources include:

Intermediate feature maps: Internal activations within convolutional or transformer layers.
Attention maps: Channel- or spatial-level attention or self-attention matrices.
Relational features: Pairwise or higher-order relations among features or samples.

Each source enables the transfer of “dark knowledge”, that is, information residing in the teacher’s representational structure beyond its final predictions. This improves the student’s generalization relative to logit-only KD (Mansourian et al., 15 Mar 2025).

2. Representative Methodologies

Several canonical white-box KD approaches have emerged, each targeting a specific form of intermediate alignment:

Feature map matching: Methods such as FitNet and SSKD directly regress student feature maps to those of the teacher, using per-stage $\ell_2$ losses and architectural adapters (e.g., $1 \times 1$ convolutions) to resolve channel or spatial mismatches (Gao et al., 2018, Mansourian et al., 15 Mar 2025).
Attention transfer: Channel- or spatial-aggregated attention maps are distilled, enforcing similarity of salient activations (Mansourian et al., 15 Mar 2025).
Relational distillation: Approaches like FSP or RKD supervise the student to match the patterns of similarity or geometry (distance, angles) between features, often across layers or images (Mansourian et al., 15 Mar 2025).
Global knowledge distillation via prototypes: Methods for dense prediction tasks generate a shared basis (“prototypes”) in both teacher and student spaces, aligning global representations that are robust to instance-level noise (Tang et al., 2022).
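
The attention-transfer item above can be made concrete with a small sketch. It assumes the common activation-based formulation, in which a spatial attention map is the channel-wise sum of squared activations, flattened and L2-normalised. The function names and toy feature maps are illustrative and not taken from the cited works.

```python
import math

def spatial_attention(feature_map):
    """Collapse a C x H x W feature map (nested lists) into a flattened,
    L2-normalised spatial attention vector of length H*W."""
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    attn = [
        sum(feature_map[c][h][w] ** 2 for c in range(channels))
        for h in range(height)
        for w in range(width)
    ]
    norm = math.sqrt(sum(a * a for a in attn)) or 1.0  # guard against all-zero maps
    return [a / norm for a in attn]

def attention_transfer_loss(student_fm, teacher_fm):
    """Squared L2 distance between normalised attention maps.
    Channel counts may differ; spatial sizes must match."""
    a_s = spatial_attention(student_fm)
    a_t = spatial_attention(teacher_fm)
    if len(a_s) != len(a_t):
        raise ValueError("spatial sizes must match")
    return sum((s - t) ** 2 for s, t in zip(a_s, a_t))

# Toy example: the teacher has 3 channels, the students have 2; all are 2x2 spatially.
teacher = [[[1.0, 0.0], [0.0, 2.0]],
           [[1.0, 0.0], [0.0, 2.0]],
           [[1.0, 0.0], [0.0, 2.0]]]
student_aligned = [[[0.5, 0.0], [0.0, 1.0]],
                   [[0.5, 0.0], [0.0, 1.0]]]
student_misaligned = [[[0.0, 1.0], [1.0, 0.0]],
                      [[0.0, 1.0], [1.0, 0.0]]]

loss_aligned = attention_transfer_loss(student_aligned, teacher)
loss_misaligned = attention_transfer_loss(student_misaligned, teacher)
print(loss_aligned, loss_misaligned)
```

Because the channel dimension is summed out, this loss needs no channel adapter: the aligned student has fewer channels and smaller activations than the teacher, yet its loss is essentially zero, while the student that attends to the wrong locations is penalised.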

Table 1 illustrates typical white-box KD sources and representative objectives:

Source Type	Example Method	Alignment Objective
Feature maps	FitNet, SSKD	$\| \Phi_s(F_s) - F_t \|_2^2$
Attention maps	AT	$\| \hat{A}(F_s) - \hat{A}(F_t) \|_2^2$
Relational structures	FSP, RKD	$\|\phi(F_s^i, F_s^j) - \phi(F_t^i, F_t^j)\|_2^2$
Prototype-based global	PGM, RDM	$\|\Lambda^s - \Lambda^t\|_2^2$
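
As a rough illustration of the relational row in Table 1, the sketch below takes $\phi$ to be the Euclidean distance between two sample embeddings, normalised by the mean pairwise distance in the batch. This is an assumed choice in the spirit of distance-wise RKD. The squared error follows the table rather than any particular paper, and all names are hypothetical.

```python
import math
from itertools import combinations

def normalised_pairwise_distances(embeddings):
    """Euclidean distance for every unordered pair of samples,
    divided by the mean pairwise distance of the batch."""
    pairs = combinations(range(len(embeddings)), 2)
    dists = [math.dist(embeddings[i], embeddings[j]) for i, j in pairs]
    mean = sum(dists) / len(dists)
    if mean == 0.0:
        mean = 1.0  # all samples identical; avoid division by zero
    return [d / mean for d in dists]

def relational_distance_loss(student_emb, teacher_emb):
    """Mean squared difference between student and teacher relational structure."""
    d_s = normalised_pairwise_distances(student_emb)
    d_t = normalised_pairwise_distances(teacher_emb)
    return sum((a - b) ** 2 for a, b in zip(d_s, d_t)) / len(d_s)

# Teacher embeds three samples in 3-D; the students use 2-D embeddings.
teacher_emb = [[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 4.0, 0.0]]
student_same_geometry = [[0.0, 0.0], [1.5, 0.0], [0.0, 2.0]]  # scaled copy
student_collapsed = [[0.0, 0.0], [1.0, 0.0], [1.0, 0.0]]      # two samples merged

loss_same = relational_distance_loss(student_same_geometry, teacher_emb)
loss_collapsed = relational_distance_loss(student_collapsed, teacher_emb)
print(loss_same, loss_collapsed)
```

Only distances between samples are compared, so the student and teacher embedding dimensions may differ without a projector. The normalisation also makes the loss insensitive to a global rescaling of the embedding space.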

3. Mathematical Formulation and Optimization Techniques

White-box KD generally introduces additional loss terms that encourage the student’s internal representations to mimic those of the teacher. For feature transfer, let $F_t^{(i)}$ and $F_s^{(i)}$ denote the $i$-th stage outputs of teacher and student, respectively. The alignment loss for $K$ stages is: $\mathcal{L}_{KD} = \sum_{i=1}^{K} \| F_t^{(i)} - \text{Proj}(F_s^{(i)}) \|^2$ where $\text{Proj}(\cdot)$ denotes any necessary transformation (e.g., a $1 \times 1$ conv or bilinear resizing) to match shapes (Gao et al., 2018).
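
A minimal sketch of the alignment loss $\mathcal{L}_{KD}$ follows, assuming the projection is a bias-free $1 \times 1$ convolution written as a per-pixel channel-mixing matrix. Feature maps are plain nested lists of shape C x H x W. The projector weights are fixed here for illustration, whereas in practice they would be trained jointly with the student.

```python
def project_1x1(feature_map, weight):
    """Bias-free 1x1 convolution: at each spatial location, mix the C_in input
    channels into C_out output channels using `weight` (C_out x C_in)."""
    c_in = len(feature_map)
    height, width = len(feature_map[0]), len(feature_map[0][0])
    return [
        [
            [sum(row[c] * feature_map[c][y][x] for c in range(c_in))
             for x in range(width)]
            for y in range(height)
        ]
        for row in weight
    ]

def squared_l2(a, b):
    """Sum of squared element-wise differences between two C x H x W maps."""
    return sum(
        (va - vb) ** 2
        for ch_a, ch_b in zip(a, b)
        for row_a, row_b in zip(ch_a, ch_b)
        for va, vb in zip(row_a, row_b)
    )

def feature_kd_loss(teacher_stages, student_stages, projectors):
    """Sum over stages of || F_t - Proj(F_s) ||^2."""
    return sum(
        squared_l2(f_t, project_1x1(f_s, w))
        for f_t, f_s, w in zip(teacher_stages, student_stages, projectors)
    )

# K = 2 toy stages.
student_stages = [
    [[[1.0, 2.0], [3.0, 4.0]]],  # stage 1: 1 channel, 2x2
    [[[1.0]]],                   # stage 2: 1 channel, 1x1
]
teacher_stages = [
    [[[1.0, 2.0], [3.0, 4.0]], [[2.0, 4.0], [6.0, 8.0]]],  # stage 1: 2 channels
    [[[3.0]]],                                             # stage 2: 1 channel
]
projectors = [
    [[1.0], [2.0]],  # 1 -> 2 channels; reproduces teacher stage 1 exactly
    [[1.0]],         # identity; leaves a residual of (3 - 1)^2 at stage 2
]

loss = feature_kd_loss(teacher_stages, student_stages, projectors)
print(loss)  # 4.0
```

The first stage contributes nothing because the projector maps the student features exactly onto the teacher’s. The whole loss comes from the second stage, where the student activation differs from the teacher’s.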

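In generative LLMs, sequence-level KD can be framed as minimizing divergences (KL, reverse KL, JS, etc.) between student and teacher output distributions. Writing $p$ for the teacher distribution and $q_\theta$ for the student distribution over responses $y$ given a prompt $x$, the forward and reverse KL divergences are:

$\mathrm{KL}(p \,\|\, q_\theta) = \mathbb{E}_{y \sim p(\cdot \mid x)} \left[ \log \frac{p(y \mid x)}{q_\theta(y \mid x)} \right]$

$\mathrm{KL}(q_\theta \,\|\, p) = \mathbb{E}_{y \sim q_\theta(\cdot \mid x)} \left[ \log \frac{q_\theta(y \mid x)}{p(y \mid x)} \right]$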

For tasks where the support of the teacher is large and multi-modal—as in open-ended generation—reverse KL is often preferred, as it prevents the student from overestimating low-probability regions and emphasizes high-precision mode alignment (Gu et al., 2023).
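
The mode-seeking behavior of reverse KL can be seen in a toy discrete example. The three-outcome distributions below are invented for illustration. `teacher` has two strong modes separated by a low-probability region, and the two candidate students stand in for a model with too little capacity to represent both modes sharply.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [0.498, 0.004, 0.498]  # two modes with a valley in between
students = {
    "one_mode": [0.98, 0.01, 0.01],  # commits to a single teacher mode
    "spread": [0.34, 0.32, 0.34],    # covers everything, including the valley
}

# Forward KL: expectation under the teacher, KL(teacher || student).
forward = {name: kl(teacher, q) for name, q in students.items()}
# Reverse KL: expectation under the student, KL(student || teacher).
reverse = {name: kl(q, teacher) for name, q in students.items()}

print("forward KL:", forward)
print("reverse KL:", reverse)
```

Forward KL is lower for the spread student (roughly 0.36 versus 1.61), because missing a teacher mode is expensive. Reverse KL is lower for the one-mode student (roughly 0.63 versus 1.14), because placing mass in the teacher’s low-probability valley is expensive. That second ordering is the sense in which reverse KL avoids overestimating low-probability regions.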

Global prototype-based methods construct a shared basis via a prototype generation module (PGM), enabling alignment in a reduced and more stable subspace. Teacher and student features are each expressed through reconstruction coefficients over a matrix of prototypes. The loss jointly aligns the reconstructed coordinates (compare the prototype-based objective in Table 1) and penalizes inconsistencies between teacher and student projections, and it includes a regularization parameter (Tang et al., 2022).

For LLMs with heterogeneous output spaces or tokenizations, dual-space distillation projects each model’s representations into the other’s vector space before computing divergence, ensuring tight alignment in both spaces and enabling KD across different vocabularies (Zhang et al., 2024).

4. Empirical Results and Comparative Performance

In the studies summarized here, white-box KD methods outperform black-box or logit-only distillation across vision and language tasks. Key findings:

Stage-by-Stage KD (SSKD): On CIFAR-100, SSKD achieves 70.77% Top-1 accuracy for a ResNet-20 student (vs. 67.96% from scratch; +0.8 to +2 points over prior KD/AT/NST methods). On ImageNet, SSKD consistently narrows the gap to the teacher (Gao et al., 2018).
Prototype-based distillation for detection: On COCO, Faster R-CNN (Res101→Res50) improves from the student's 38.4 mAP to 40.6 mAP, surpassing the teacher by +0.8; similar trends on Pascal VOC (Tang et al., 2022).
MiniLLM (reverse-KLD for LLMs): Across GPT-2, OPT, and LLaMA student models, MiniLLM yields +2–6 point improvements in both Rouge-L and GPT-4 feedback over sequence-level KD and SFT baselines. Calibration error and exposure bias are also improved (Gu et al., 2023).
Dual-space knowledge distillation: DSKD achieves gains of +0.4–1.4 Rouge-L (GPT2-120M) and +3.3 on TinyLLaMA, including +0.56 to +5.22 over best prior methods in cross-vocabulary scenarios (Zhang et al., 2024).

These findings indicate that, by leveraging richer internal signals, white-box KD closes the performance gap to the teacher more effectively, especially on dense prediction, sequence modeling, and representation-intensive tasks (Mansourian et al., 15 Mar 2025).

5. Practical Considerations, Scalability, and Limitations

White-box KD offers tangible advantages but introduces new challenges:

Architectural adapters: Aligning teacher and student intermediate representations requires shape matching, often via additional projectors ($1 \times 1$ convolutions, MLPs, up/downsampling). This adds parameter and computational overhead.
Stability and hyperparameter selection: Staged approaches (e.g., SSKD), prototype updates, temperature for divergences, and mix-in ratios (in RL-inspired LLM KD) must be carefully tuned. Certain choices (e.g., reverse KL vs. forward KL) yield dramatic differences in student behavior and require white-box access (Gu et al., 2023).
Computational cost: Feature-level alignment can entail large memory and compute costs, especially for relational or global methods, and if many intermediate layers are distilled (Tang et al., 2022).
Task dependence: Mode-seeking distillation (reverse KL) may degrade diversity for generation tasks that demand coverage over rare modes; staged or prototype methods are less effective for very short or low-complexity tasks (Gu et al., 2023, Gao et al., 2018).
Cross-architecture and cross-vocab bridging: Methods like DSKD and cross-model attention enable KD even with non-matching output spaces, addressing the prevalent deployment scenario in LLM compression (Zhang et al., 2024).
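
To give a sense of scale for the adapter overhead mentioned in the first item above, the following back-of-the-envelope sketch counts parameters and multiply-accumulate operations for $1 \times 1$ convolutional projectors. The stage shapes are hypothetical and do not come from any of the cited papers.

```python
def conv1x1_overhead(c_in, c_out, height, width, bias=True):
    """Parameters and multiply-accumulates (MACs) of one 1x1 convolution
    applied to a c_in x height x width feature map."""
    params = c_in * c_out + (c_out if bias else 0)
    macs = c_in * c_out * height * width
    return params, macs

# Hypothetical (student channels, teacher channels, height, width) per stage.
stages = [
    (16, 64, 32, 32),
    (32, 128, 16, 16),
    (64, 256, 8, 8),
]

per_stage = [conv1x1_overhead(*shape) for shape in stages]
total_params = sum(p for p, _ in per_stage)
total_macs = sum(m for _, m in per_stage)

for shape, (params, macs) in zip(stages, per_stage):
    print(shape, "params:", params, "MACs per image:", macs)
print("total params:", total_params, "total MACs per image:", total_macs)
```

Under these assumed shapes the three adapters add about 22 thousand parameters and about 3.1 million MACs per image. Such projectors are typically needed only during training, so the cost falls on the distillation run and not on the deployed student.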

A table of common hyperparameters from recent white-box KD works is shown below:

Hyperparameter	Typical Value(s)	Source
Number of stages ($K$)	4 (per downsampling block)	(Gao et al., 2018)
Distillation temperature	2.0	(Zhang et al., 2024)
Prototype count	10 (per class)	(Tang et al., 2022)
Mix-in ratio	0.2 (teacher in response mix)	(Gu et al., 2023)
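
Two of these hyperparameters are easy to illustrate numerically. The sketch below shows how a temperature of 2.0 softens a distribution, and how a mix-in ratio of 0.2 could blend teacher and student token distributions when sampling. Reading the mix-in ratio as a convex mixture weight is an assumption made here for illustration, and the logits and function names are invented.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def mix_distributions(teacher_probs, student_probs, alpha=0.2):
    """Convex mixture alpha * teacher + (1 - alpha) * student (assumed form)."""
    return [alpha * t + (1.0 - alpha) * s
            for t, s in zip(teacher_probs, student_probs)]

teacher_logits = [4.0, 2.0, 0.0]
student_logits = [1.0, 3.0, 0.0]

sharp = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=2.0)

student_probs = softmax_with_temperature(student_logits, temperature=1.0)
mixed = mix_distributions(sharp, student_probs, alpha=0.2)

print("T=1.0:", [round(p, 3) for p in sharp])
print("T=2.0:", [round(p, 3) for p in soft])
print("mixed:", [round(p, 3) for p in mixed])
```

Raising the temperature moves probability mass from the top class to the others, which exposes more of the teacher’s relative preferences among non-argmax outputs. The mixture remains a valid distribution for any ratio between 0 and 1.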

6. Extensions and Open Challenges

Several open questions and directions continue to shape the field of white-box KD:

Optimal source selection: The choice of which intermediate layers or representations to distill is largely heuristic. Automated schemes or learned selection criteria could improve transfer efficiency (Mansourian et al., 15 Mar 2025).
Efficient transformation architectures: Lightweight but expressive adapters (using linear projectors, MLPs, or diffusion-based denoisers) are increasingly deployed, but the tradeoff between fidelity and efficiency remains central (Mansourian et al., 15 Mar 2025).
Scaling to foundation models: Expanding white-box KD to very large teachers (e.g., CLIP, SAM, 34B LLMs) and heterogeneous student backbones introduces severe representation mismatches and demands robust cross-space alignment methods (Mansourian et al., 15 Mar 2025).
Integration with other compression and efficiency techniques: Joint strategies involving quantization, pruning, or meta-learned/dynamic KD hold promise for further scaling and deployment on resource-limited devices (Mansourian et al., 15 Mar 2025).
Handling cross-tokenizer/cross-vocabulary scenarios: Methods such as DSKD with cross-model attention extend white-box KD to the increasingly common case of incompatible tokenizations, supporting a broader range of LLM distillation pipelines (Zhang et al., 2024).

7. Impact and Application Domains

White-box knowledge distillation has enabled student models to match or even exceed the teacher’s performance in several core domains:

Vision: Outperforms black-box KD in classification, detection, and segmentation, particularly for models with non-matching architectures and in settings with small, occluded, or noisy instances (Gao et al., 2018, Tang et al., 2022, Mansourian et al., 15 Mar 2025).
Language modeling: Mode-seeking distillation and dual-space representation matching substantially increase generative quality, calibration, and long-sequence performance for compact instruction-following models (Gu et al., 2023, Zhang et al., 2024).
Object detection and dense prediction: Prototype- and global knowledge-based approaches achieve robust spatial and relational alignment, outperforming both teacher and prior KD baselines on COCO and Pascal VOC (Tang et al., 2022).
Contemporary model alignment and safety: Iterative, ranking-based cycles (e.g., CycleAlign) pair a white-box student with a black-box or API-only teacher, such as ChatGPT, enabling low-resource, high-fidelity alignment of smaller models on human-value alignment tasks (Hong et al., 2023).

A plausible implication is that, as model architectures diversify and deployment settings become more heterogeneous, white-box KD will increasingly serve as a foundation both for performance-focused compression and for trustworthy, controllable ML systems. Open directions include automation, efficiency, and applicability across architectures and modalities.