Representation-Level Distillation

Updated 4 July 2026

Representation-level distillation is a method that transfers internal representations (hidden states, embeddings, etc.) to capture semantic structure and cross-modal regularities.
It employs diverse alignment units like hidden states, spans, and relational graphs to maintain structural information and support model compression.
It integrates information-theoretic objectives and flexible, multi-scale strategies to boost performance across language, vision, and audio modalities.

Representation-level distillation is a family of distillation methods in which the transferred object is an internal representation rather than only an output distribution. In this formulation, the target of transfer may be an intermediate hidden state, a final embedding, a relation among embeddings, a graph over channels or nodes, a prototype set, or a tokenizer-agnostic span representation. Across the literature, the same basic shift recurs: logits are treated as an incomplete summary of the teacher, while hidden spaces are treated as the locus where semantic structure, temporal organization, geometric relations, and cross-modal regularities are actually encoded (He et al., 2021, Zhang et al., 2023, Boix-Adsera, 2024).

1. Definition and formal scope

In the Distiller framework, representation-level distillation appears as the intermediate term $l_{i,j}^{\text{inter}}(H_i^{\text{T}}, H_j^{\text{S}})$, where $H_i^{\text{T}}$ and $H_j^{\text{S}}$ are hidden states from teacher layer $i$ and student layer $j$, and this intermediate term is optimized jointly with prediction-layer distillation and supervised loss (He et al., 2021). This formulation makes representation transfer a first-class component of the KD pipeline rather than an auxiliary regularizer.
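As a rough illustration of how an intermediate term can be optimized jointly with the other two losses, the sketch below adds a hidden-state MSE over mapped layer pairs to a prediction-layer KL term and a supervised cross-entropy term. The function names, the layer mapping, and the weights `alpha` and `beta` are assumptions made for this example and are not the Distiller API. It also assumes teacher and student hidden sizes already match, and it omits any softmax temperature.

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def total_kd_loss(teacher_hidden, student_hidden, layer_map,
                  teacher_logits, student_logits, label,
                  alpha=1.0, beta=1.0):
    """Supervised loss + alpha * prediction-layer KD + beta * intermediate KD.

    layer_map is a list of (teacher_layer, student_layer) index pairs
    (a hypothetical mapping chosen by the user).
    """
    # Intermediate term: hidden-state MSE summed over the mapped layer pairs.
    inter = sum(mse(teacher_hidden[i], student_hidden[j]) for i, j in layer_map)
    # Prediction-layer term: KL(teacher || student) over output distributions.
    log_t = log_softmax(teacher_logits)
    log_s = log_softmax(student_logits)
    pred = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))
    # Supervised term: cross-entropy against the gold label.
    supervised = -log_s[label]
    return supervised + alpha * pred + beta * inter
```

Swapping `mse` for another intermediate objective changes only the `inter` line, which is one way to read the claim that representation transfer is a separable component of the pipeline.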

A broader interpretation appears in language-model compression work that rejects the assumption that “each word representation is independent.” “Distilling Linguistic Context for LLM Compression” defines contextual distillation through two relational objects: Word Relation, which models relations among token representations within a layer, and Layer Transforming Relation, which models relations of the same token across layers (Park et al., 2021). In this view, the distilled object is not a vector at a single position but the geometry of contextual representations.
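A minimal sketch of the two relational objects may help make the idea concrete. It assumes cosine similarity as the relation function and mean squared error as the matching loss; both choices, along with every function name, are illustrative assumptions rather than the paper's exact formulation. Inputs are lists of layers, each layer a list of token vectors, with teacher and student layers already paired one to one.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

def word_relation(layer):
    """Pairwise similarities among the token vectors of one layer (n x n)."""
    return [[cosine(u, v) for v in layer] for u in layer]

def layer_transforming_relation(lower, upper):
    """Similarity of each token to itself across two layers of one model."""
    return [cosine(u, v) for u, v in zip(lower, upper)]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def contextual_loss(teacher_layers, student_layers):
    """Match relations, not raw vectors, between teacher and student."""
    wr = 0.0
    for t_layer, s_layer in zip(teacher_layers, student_layers):
        t_rel = [v for row in word_relation(t_layer) for v in row]
        s_rel = [v for row in word_relation(s_layer) for v in row]
        wr += mse(t_rel, s_rel)
    # Layer Transforming Relation between the first and last paired layers.
    t_ltr = layer_transforming_relation(teacher_layers[0], teacher_layers[-1])
    s_ltr = layer_transforming_relation(student_layers[0], student_layers[-1])
    return wr / len(teacher_layers) + mse(t_ltr, s_ltr)
```

Because both relations are computed inside each model, the teacher and student can have different hidden sizes: only the number of tokens has to agree.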

A still broader formalization is given by “Towards a theory of model distillation,” which defines PAC-distillation for a source class $\mathcal{F}$ and target class $\mathcal{G}$, and treats representation-level distillation as a change of representation class itself, such as neural networks distilled into explicit juntas or decision trees (Boix-Adsera, 2024). Here the student is not merely a smaller parametric approximation of the teacher; it may be a structurally different representation of the same function. This extends the scope of representation-level distillation from hidden-state matching to representational re-expression.

2. Distilled objects and alignment units

The most useful way to classify representation-level distillation is by the unit on which alignment is imposed. Different papers choose different units because the granularity of the task differs: sentence retrieval favors whole-embedding alignment, action detection favors temporally structured features, and cross-tokenizer distillation favors spans rather than tokens.

Alignment unit	Typical formulation	Representative papers
Hidden states and layers	Layer-wise hidden-state matching or MI-based intermediate losses	(He et al., 2021, Yang et al., 4 Jun 2026)
Final embeddings	Final sentence or node embeddings as the distilled target	(Zhang et al., 2023, Joshi et al., 2021)
Relational structure	Word relations, channel graphs, covariance, boundary saliency	(Park et al., 2021, Wang et al., 2024, Dai et al., 2021)
Local regions and instances	Patch- or window-level features, bags of instances, multi-scale pooling	(Bi et al., 2024, Wang et al., 9 Feb 2025)
Spans	Attention-weighted span Centers of Mass over tokenizer-agnostic spans	(Dao et al., 2 May 2026)
Projections across domains	Sparse layer-to-layer frame-level feature regression from multiple teachers	(Chang et al., 23 Jun 2025)

This diversity is not superficial. In “Text Representation Distillation via Information Bottleneck Principle,” the distilled object is the final sentence representation $S$, and the optimization target is the mutual information between $S$ and the teacher representation $T$ (Zhang et al., 2023). In OPRD, by contrast, the distilled object is the collection of hidden states $h_{\theta,t}^{(l)}$ across selected layers and selected rollout positions, so the supervision remains internal to the decoder stack and bypasses the LM head entirely (Yang et al., 4 Jun 2026).

SRA illustrates a different response to alignment difficulty. When teacher and student use different tokenizers, token-level matching becomes brittle, so SRA shifts the unit of alignment from tokens to spans recovered from character offsets. Each span is treated as a cluster of particles, and its state is represented by a Center of Mass, defined as an attention-weighted average of the token hidden states in that span (Dao et al., 2 May 2026). This move is notable because it changes the unit of comparison without changing the underlying language modeling task.
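The span construction can be sketched as follows. This is a simplified illustration under stated assumptions: tokens are assigned to a span whenever their character offsets overlap it, and each token carries one non-negative scalar attention weight whose source is left unspecified. The helper names `tokens_in_char_span` and `center_of_mass` are hypothetical and do not come from the paper.

```python
def tokens_in_char_span(offsets, char_span):
    """Indices of tokens whose (start, end) character offsets overlap the span.

    Offsets use end-exclusive character positions, as many tokenizers report.
    """
    start, end = char_span
    return [i for i, (s, e) in enumerate(offsets) if s < end and e > start]

def center_of_mass(hidden_states, attention, token_ids):
    """Attention-weighted average of the hidden states of the given tokens."""
    if not token_ids:
        raise ValueError("span contains no tokens")
    total = sum(attention[i] for i in token_ids)
    dim = len(hidden_states[0])
    com = [0.0] * dim
    for i in token_ids:
        # Fall back to a uniform average if all attention weights are zero.
        w = attention[i] / total if total > 0 else 1.0 / len(token_ids)
        for d in range(dim):
            com[d] += w * hidden_states[i][d]
    return com
```

Running both helpers once with the teacher's offsets and once with the student's yields one vector per model for the same character span, even when the two tokenizers split the text differently. The vectors may still differ in dimension, so a comparison would need a projection or a relational loss on top.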

3. Objective families

A central result from Distiller is that many standard intermediate objectives can be reinterpreted as mutual-information surrogates. The paper shows that minimizing MSE, L2, or PKD loss, and maximizing cosine similarity between teacher and student hidden states, are equivalent to maximizing lower bounds of the mutual information $I(X;Y)$ between the corresponding random variables (He et al., 2021). On that basis it proposes the MI-$\alpha$ family, a tunable class of intermediate distillation objectives with an explicit bias–variance trade-off.
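One standard way to see why an MSE objective lower-bounds mutual information is the variational (Barber-Agakov) bound with a Gaussian decoder. The steps below are a generic argument offered for intuition, not a reproduction of the paper's derivation. The assignment of $X$ to the student hidden state and $Y$ to the teacher hidden state, the map $f$, the dimension $d$, and the fixed variance $\sigma^2$ are all assumptions. For any conditional density $q(y \mid x)$,

$I(X;Y) = H(Y) - H(Y \mid X) \ge H(Y) + \mathbb{E}_{p(x,y)}\left[\log q(y \mid x)\right]$

Choosing $q(y \mid x) = \mathcal{N}\left(y;\, f(x),\, \sigma^2 I_d\right)$ gives

$\mathbb{E}_{p(x,y)}\left[\log q(y \mid x)\right] = -\frac{1}{2\sigma^2}\, \mathbb{E}_{p(x,y)}\left[\lVert y - f(x) \rVert_2^2\right] - \frac{d}{2}\log\left(2\pi\sigma^2\right)$

With a frozen teacher, $H(Y)$ is constant, so minimizing the expected squared error between $f(x)$ and $y$ maximizes this lower bound on $I(X;Y)$.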

IBKD makes the information-theoretic viewpoint explicit. Its objective is

$\mathcal{L}_{\text{IBKD}} = -I(S;T) + \beta\, I(S;X)$

where $S$ is the student representation, $T$ is the teacher representation, $X$ is the input text, and $\beta$ weights the compression term (Zhang et al., 2023). In practice, the paper maximizes $I(S;T)$ with an InfoNCE loss and uses HSIC as a proxy for penalizing $I(S;X)$. The resulting design principle is not simply “make the student close to the teacher,” but “retain teacher-relevant information while compressing input-dependent variation.”
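The InfoNCE term can be illustrated with a small batch-level sketch. It assumes cosine similarity as the score, a temperature of 0.1, and in-batch negatives, where the teacher embedding at the same index is the positive. These are common choices but are assumptions here, not details confirmed from the paper, and the HSIC penalty is omitted.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

def info_nce(student_reps, teacher_reps, temperature=0.1):
    """Mean InfoNCE loss over a batch of paired student/teacher embeddings."""
    loss = 0.0
    for i, s in enumerate(student_reps):
        scores = [cosine(s, t) / temperature for t in teacher_reps]
        # Stable log-sum-exp over the positive and all in-batch negatives.
        m = max(scores)
        log_denom = m + math.log(sum(math.exp(x - m) for x in scores))
        loss += log_denom - scores[i]
    return loss / len(student_reps)

def mi_lower_bound(loss, batch_size):
    """Standard InfoNCE bound: mutual information >= log(N) - loss (in nats)."""
    return math.log(batch_size) - loss
```

The bound saturates at $\log N$ for batch size $N$, which is one reason contrastive objectives of this kind tend to benefit from larger batches.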

WCoRD takes a related but distinct route by leveraging both the primal and dual forms of the Wasserstein distance. Its dual form yields a global contrastive objective that maximizes a lower bound of mutual information between teacher and student representations, while its primal form performs local contrastive knowledge transfer within a mini-batch by matching feature distributions (Chen et al., 2020). This positions Wasserstein geometry as an alternative to KL-based or purely Euclidean feature matching.

The same information-theoretic tendency appears in “Information Theoretic Representation Distillation,” which introduces two complementary losses inspired by a cheap entropy-like estimator, aimed at maximizing correlation and mutual information between student and teacher representations (Miles et al., 2021). Although the details available are limited to the abstract, the paper is representative of a broader trend: representation-level distillation increasingly replaces pointwise regression with objectives that control information content and dependence structure.

4. Structural, relational, and multi-scale preservation

A large part of the literature treats representation-level distillation not as pointwise alignment but as preservation of structure. In contextual distillation for LLMs, Word Relation is a horizontal relational constraint over tokens within a layer, while Layer Transforming Relation is a vertical relational constraint over the same token across layers (Park et al., 2021). In action detection, the corresponding structural objects are temporal: atomic-level distillation aligns snippet features contrastively, while sequence-level distillation transfers Global Contextual Relations and Action Boundary Saliency so that the student inherits temporal context and boundary structure from a stronger modality such as optical flow or pose (Dai et al., 2021).

A more graph-theoretic treatment appears in “Exploring Graph-based Knowledge: Multi-Level Feature Distillation via Channels Relational Graph.” There, an intermediate feature map is converted into a Channels Relational Graph whose vertices are channels and whose edges are cosine similarities between flattened channel maps (Wang et al., 2024). Distillation is then decomposed into a vertex loss for raw features, an edge loss for inter-channel relations, and a spectral embedding loss derived from the normalized graph Laplacian. The paper’s attention-guided mechanism weights spatial regions, channels, and channel relations, so the student is asked to preserve both local activations and the global geometry of the teacher’s feature graph.
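The graph construction and the first two loss terms can be sketched as below. This is a minimal illustration that assumes teacher and student feature maps share the same channel count and spatial size, uses plain mean squared error for both terms, and leaves out the spectral embedding loss and the attention-guided weighting. The function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

def channel_graph(feature_map):
    """Build (vertices, edges) from a C x H x W feature map given as nested lists.

    Vertices are flattened channel maps; edges are pairwise cosine similarities.
    """
    vertices = [[v for row in channel for v in row] for channel in feature_map]
    edges = [[cosine(u, v) for v in vertices] for u in vertices]
    return vertices, edges

def mean_squared_diff(a, b):
    """MSE between two equally shaped 2D lists."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)

def graph_losses(teacher_map, student_map):
    """Return (vertex_loss, edge_loss) between teacher and student graphs."""
    t_vertices, t_edges = channel_graph(teacher_map)
    s_vertices, s_edges = channel_graph(student_map)
    return (mean_squared_diff(t_vertices, s_vertices),
            mean_squared_diff(t_edges, s_edges))
```

The two terms measure different things: a student whose activations are a rescaled copy of the teacher's incurs a vertex loss but essentially no edge loss, because cosine similarity ignores scale.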

Graph neural network distillation yields a similar contrast between local and global structure. “On Representation Knowledge Distillation for Graph Neural Networks” compares LSP, which preserves local structural relationships over edges, with GSP, which matches all pairwise similarities, and with G-CRD, which uses contrastive learning to preserve global topology by aligning student node embeddings to teacher node embeddings in a shared representation space (Joshi et al., 2021). The reported analysis shows that structure-preserving methods tend to preserve either local or global relationships, whereas G-CRD balances both.

Fine-grained vision further sharpens the structural argument. CMD treats patches as instances and crops as bags in a multiple-instance learning sense, then performs intra-level and inter-level multi-instance distillation on image-level and region-level representations (Bi et al., 2024). MSDCRD, in turn, uses multi-scale sliding-window pooling to decouple a global feature map into local representations at several granularities, categorizes those local teacher features, and applies a contrastive distillation loss over the resulting multi-scale feature set (Wang et al., 9 Feb 2025). SRA extends the same logic to tokenizer mismatch by adding a geometric regularizer over pairwise span distances so that the structural integrity of the teacher’s span representation space is preserved under cross-tokenizer transfer (Dao et al., 2 May 2026).
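The multi-scale decoupling step can be pictured with a small pooling sketch. It assumes average pooling over square windows on a single-channel 2D map, with the stride equal to the window size. The window sizes, the stride, and the function names are illustrative assumptions. The later steps described above, categorizing the local teacher features and applying the contrastive loss, are not shown.

```python
def window_pool(feature_map, size, stride):
    """Average-pool an H x W map (list of lists) with a square sliding window."""
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for top in range(0, height - size + 1, stride):
        for left in range(0, width - size + 1, stride):
            window = [feature_map[r][c]
                      for r in range(top, top + size)
                      for c in range(left, left + size)]
            pooled.append(sum(window) / len(window))
    return pooled

def multi_scale_features(feature_map, sizes=(1, 2, 4)):
    """Map each window size to the list of local features at that scale."""
    limit = min(len(feature_map), len(feature_map[0]))
    # Window sizes larger than the map are skipped rather than padded.
    return {s: window_pool(feature_map, s, stride=s) for s in sizes if s <= limit}
```

On a 4 x 4 map this yields 16 features at size 1, four at size 2, and a single global average at size 4, so the same map is described at several granularities.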

5. Domain-specific realizations

In text representation learning, representation-level distillation is used to compress bi-encoders and sentence embedding models without reducing the problem to label logits. IBKD distills the final sentence representation for Semantic Textual Similarity and Dense Retrieval, while CKD distills contextual knowledge encoded in token relations and layer transformations, and Distiller systematizes intermediate-layer objectives over NLP tasks such as classification, regression, and span tagging (Zhang et al., 2023, Park et al., 2021, He et al., 2021).

In autoregressive LLMs, the same paradigm appears under different operational constraints. OPRD lifts on-policy distillation into hidden-state space by aligning student and teacher representations across selected layers on the same rollouts, bypassing the LM head entirely (Yang et al., 4 Jun 2026). SRA handles cross-tokenizer and cross-architecture distillation by constructing tokenizer-agnostic spans and aligning their attention-weighted Centers of Mass (Dao et al., 2 May 2026). A post hoc representational study of distilled reasoning models shows another facet of the field: distilled Qwen variants contain unique reasoning feature directions corresponding to self-reflection, deductive reasoning, alternative reasoning, and contrastive reasoning, and larger distilled models show indications of more structured geometry (Baek et al., 5 Mar 2025). This suggests that representation-level distillation is not only a training technique but also a tool for analyzing how reasoning structure emerges.

Outside language, the same design principles recur. In video, a cross-modal teacher distills an augmented RGB representation by transferring atomic snippet-level features together with global contextual relations and boundary saliency (Dai et al., 2021). In audio, USAD trains a single encoder from multiple domain-specific SSL teachers using sparse layer-to-layer distillation on hidden representations, so that one student can serve speech, sound, and music tasks (Chang et al., 23 Jun 2025). In fNIRS emotion recognition, OMCRD performs online mutual learning among lightweight students by distilling region-level and channel-level representations with an Inter-Subject Interaction Contrastive Representation loss (Lai et al., 2024). In jet physics, jBOT combines local particle-level distillation with global jet-level distillation and reports emergent semantic class clustering in the frozen embedding (Tsoi et al., 16 Jan 2026).

These cases show that representation-level distillation is modality-agnostic but not unit-agnostic. The distilled object changes with the data geometry: tokens for language, nodes for graphs, channels for CNNs, particles for jets, regions and channels for fNIRS, and frames for audio.

6. Efficiency, tensions, and theoretical implications

One recurring empirical claim is that representation-level design is often the most consequential part of the KD pipeline. Distiller reports that the approach used to distill the intermediate representations is the most important factor in KD performance, that MI-$\alpha$ performs best among the tested intermediate objectives, and that different datasets and tasks prefer different KD algorithms (He et al., 2021). This already argues against a single canonical recipe.

The literature also reveals a tension between flexibility and fidelity. Relation-based objectives such as CKD are explicitly motivated by the claim that matching pair-wise or triple-wise relationships is possible “without being directly affected by the structure,” so they tolerate substantial architectural change (Park et al., 2021). SRA takes the same position with respect to tokenizer mismatch by shifting from tokens to spans (Dao et al., 2 May 2026). OPRD takes the opposite position: it aligns hidden states directly and therefore assumes the same architecture, while reporting that cross-model hidden states become almost orthogonal when student and teacher sizes differ substantially (Yang et al., 4 Jun 2026). Representation-level distillation is therefore not uniformly architecture-flexible; the alignment unit determines how much heterogeneity is admissible.

Efficiency claims likewise depend on objective choice. ITRD emphasizes significantly lower training overhead than other approaches while remaining competitive on knowledge distillation and cross-model transfer (Miles et al., 2021). USAD argues for sparse layer-to-layer distillation and finds that a mask-free regression objective can be more robust in a multi-domain setting (Chang et al., 23 Jun 2025). OPRD reports that hidden-state supervision removes sampling variance, trains faster, and uses less memory than top-$k$ on-policy output-space distillation (Yang et al., 4 Jun 2026). These are not generic advantages of “feature distillation”; they arise from particular choices about where the distilled signal is taken and how it is normalized.

Theoretical work sharpens the picture further. Under the linear representation hypothesis, neural networks that implicitly compute decision trees can be distilled into explicit decision tree representations in polynomial time, and the broader PAC-distillation framework shows that distillation can be much cheaper than learning from scratch (Boix-Adsera, 2024). At the same time, the agnostic case remains subtle: there are source–target pairs for which agnostic distillation is as hard as learning, and no VC-like finite-character combinatorial parameter exactly characterizes its complexity (Boix-Adsera, 2024). This suggests that representation-level distillation is best understood as a design space over alignment units, geometry, and structural assumptions, rather than as a single technique reducible to hidden-state MSE.