
Attention-weighted CKA for Knowledge Distillation

Updated 10 February 2026
  • The paper introduces ACCKA, extending classic CKA by incorporating token-level attention weights to focus on salient audio regions during knowledge distillation.
  • ACCKA eliminates the need for explicit projection layers by naturally handling mismatched embedding dimensions between teacher and student models.
  • Empirical results within the PL-Distill framework demonstrate that ACCKA enables efficient compression while achieving or exceeding teacher performance in speech emotion recognition.

Attention-weighted Centered Kernel Alignment (ACCKA) is a kernel similarity measure designed to enhance alignment between representations of teacher and student models, particularly in the context of knowledge distillation for large audio-language models (LALMs). ACCKA extends classic Centered Kernel Alignment (CKA) by incorporating attention-based weighting at the level of individual time steps (audio tokens), thereby emphasizing regions deemed important by the teacher model's attention mechanism. This framework both highlights salient local structure and naturally accommodates mismatched feature spaces between teacher and student, obviating the need for explicit projection layers. ACCKA is a cornerstone of the PL-Distill framework for knowledge distillation in speech emotion recognition (SER), enabling efficient model compression while retaining or even exceeding teacher-level performance (Yang et al., 2 Feb 2026).

1. Foundation: Centered Kernel Alignment (CKA)

Centered Kernel Alignment is a normalized similarity measure between two sets of features $X \in \mathbb{R}^{n \times p}$ and $Y \in \mathbb{R}^{n \times q}$. The linear kernel Gram matrices, $K_X = X X^\top$ and $K_Y = Y Y^\top$, are centered using

$$H = I_n - \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^\top,$$

resulting in $\widetilde{K}_X = H K_X H$, and similarly for $Y$. The linear CKA is defined as

$$\mathrm{CKA}(X, Y) = \frac{\langle \widetilde{K}_X, \widetilde{K}_Y \rangle_F}{\|\widetilde{K}_X\|_F \, \|\widetilde{K}_Y\|_F},$$

where $\langle A, B \rangle_F = \mathrm{tr}(A^\top B)$ denotes the Frobenius inner product. CKA is closely related to the Hilbert-Schmidt independence criterion (HSIC) and measures the similarity of covariance structure, remaining invariant to orthogonal transformations and isotropic scaling of $X$ or $Y$. Notably, CKA accommodates feature spaces of differing dimensionality ($p \ne q$) and scales to high dimensions (Cortes et al., 2012; Yang et al., 2 Feb 2026).
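The following is a minimal PyTorch sketch of this Gram-matrix form of linear CKA. The small $\epsilon$ term is an implementation-level safeguard against a zero denominator and is not part of the definition above.

```python
import torch

def linear_cka(X: torch.Tensor, Y: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Linear CKA between X [n, p] and Y [n, q] over the same n samples."""
    n = X.shape[0]
    ones = torch.ones(n, n, dtype=X.dtype, device=X.device)
    H = torch.eye(n, dtype=X.dtype, device=X.device) - ones / n   # H = I_n - (1/n) 1 1^T
    Kx = H @ (X @ X.T) @ H                                        # centered Gram matrix of X
    Ky = H @ (Y @ Y.T) @ H                                        # centered Gram matrix of Y
    num = (Kx * Ky).sum()                                         # Frobenius inner product <Kx, Ky>_F
    den = torch.linalg.norm(Kx) * torch.linalg.norm(Ky) + eps     # ||Kx||_F ||Ky||_F
    return num / den
```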

2. ACCKA: Attention-weighted Extension

Attention-weighted Centered Kernel Alignment generalizes CKA by injecting importance weights reflecting token-level attention from the teacher model. For audio inputs, let the teacher's last-layer self-attention from the final 'response' token to the $L$ audio tokens be $A = (A_1, \ldots, A_L)^\top \in \mathbb{R}^L$. Normalize these scores to a probability vector,

$$w_i = \frac{A_i}{\sum_{j=1}^L A_j}, \quad \sum_{i=1}^L w_i = 1,$$

yielding weights $w \in \mathbb{R}^L$.

Each embedding row (time step) in both teacher and student representations is scaled by $w_i$:

$$H_T = \mathrm{diag}(w) \, \mathcal{H}_a^{(T)} \in \mathbb{R}^{L \times E_T}, \qquad H_S = \mathrm{diag}(w) \, \mathcal{H}_a^{(S)} \in \mathbb{R}^{L \times E_S}.$$

Embeddings are then centered by subtracting their columnwise means. The attention-weighted CKA ("ACCKA") is

$$\mathrm{ACCKA}(\mathcal{H}_a^{(T)}, \mathcal{H}_a^{(S)}, w) = \frac{\| \widehat{H}_T^\top \widehat{H}_S \|_F^2}{\| \widehat{H}_T^\top \widehat{H}_T \|_F \, \| \widehat{H}_S^\top \widehat{H}_S \|_F},$$

where $\widehat{H}_T = H H_T$ and $\widehat{H}_S = H H_S$, with the centering operator $H = I_L - (1/L) \mathbf{1}_L \mathbf{1}_L^\top$ applied after weighting. ACCKA directs the alignment measure toward acoustically or semantically salient regions, as defined by the teacher's attention, improving the focus of knowledge transfer (Yang et al., 2 Feb 2026).
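A minimal per-utterance sketch of this computation in PyTorch is given below, written in the feature-space form used in Section 5. The tensor names `h_t`, `h_s`, and `attn` are illustrative rather than taken from the paper's code, and the $\epsilon$-stabilization of the denominator is an implementation choice.

```python
import torch

def accka(h_t: torch.Tensor, h_s: torch.Tensor, attn: torch.Tensor,
          eps: float = 1e-6) -> torch.Tensor:
    """h_t: [L, E_T] teacher audio embeddings, h_s: [L, E_S] student audio embeddings,
    attn: [L] teacher attention from the final response token to the audio tokens."""
    w = attn / (attn.sum() + eps)               # normalize attention to a probability vector
    ht = w.unsqueeze(1) * h_t                   # scale each time step (row) by w_i
    hs = w.unsqueeze(1) * h_s
    ht = ht - ht.mean(dim=0, keepdim=True)      # center columns after weighting
    hs = hs - hs.mean(dim=0, keepdim=True)
    c_ts = ht.T @ hs                            # C_TS in R^{E_T x E_S}
    num = torch.linalg.norm(c_ts) ** 2          # ||C_TS||_F^2
    den = torch.linalg.norm(ht.T @ ht) * torch.linalg.norm(hs.T @ hs) + eps
    return num / den
```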

3. Objective Function and Optimization

The projector-level distillation loss is defined as one minus the ACCKA similarity:

$$\mathcal{L}_{\mathrm{PDist}} = 1 - \mathrm{ACCKA}(\mathcal{H}_a^{(T)}, \mathcal{H}_a^{(S)}, w).$$

The goal is to minimize this loss, thereby maximizing correspondence between the statistical geometry of teacher and student embeddings at attention-critical time steps. Unlike adversarial or regression-based distillation losses, ACCKA requires no additional regularization, as normalization ensures the score remains bounded.
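Reusing the `accka` sketch from Section 2, the loss itself is a one-liner. Detaching the teacher embeddings assumes a frozen teacher, which is a common distillation setup but an assumption here rather than a detail from the paper.

```python
import torch

def pdist_loss(h_t: torch.Tensor, h_s: torch.Tensor, attn: torch.Tensor) -> torch.Tensor:
    # L_PDist = 1 - ACCKA; the normalized score keeps the loss bounded.
    # h_t.detach() routes gradients only into the student (frozen-teacher assumption).
    return 1.0 - accka(h_t.detach(), h_s, attn)
```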

4. Handling Mismatched Embedding Dimensions

A fundamental property of both CKA and ACCKA is that the embedding dimensionalities of the teacher ($E_T$) and student ($E_S$) need not match. The formulation only requires products of the form $\widehat{H}_T^\top \widehat{H}_S \in \mathbb{R}^{E_T \times E_S}$, avoiding any explicit projection between feature spaces. Thus, the projector-level MLPs of teacher and student are free to evolve independently. ACCKA aligns the empirical covariance structures of these spaces, facilitating knowledge transfer even when teacher and student operate with different representational capacities (Yang et al., 2 Feb 2026).
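As a quick illustrative check, the `accka` sketch from Section 2 runs unchanged with mismatched embedding widths; the shapes below are hypothetical, not the paper's.

```python
import torch

L, E_T, E_S = 200, 1536, 512          # hypothetical sequence length and embedding widths
h_t = torch.randn(L, E_T)             # teacher audio embeddings
h_s = torch.randn(L, E_S)             # student audio embeddings
attn = torch.rand(L)                  # teacher attention scores over audio tokens
print(accka(h_t, h_s, attn).item())   # no projection layer required for E_T != E_S
```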

5. Computational Implementation

The main stages of ACCKA computation are as follows:

  1. Normalization of attention: $w = A / (\sum_j A_j + \epsilon)$, with $\epsilon = 10^{-6}$.
  2. Application of weights: multiply each row of the teacher and student embeddings by the corresponding $w_i$.
  3. Centering: Subtract per-column means from weighted embeddings.
  4. Covariance computation: form $C_{TS} = \widehat{H}_T^\top \widehat{H}_S$, $C_{TT} = \widehat{H}_T^\top \widehat{H}_T$, and $C_{SS} = \widehat{H}_S^\top \widehat{H}_S$.
  5. Frobenius norms: compute $\text{num} = \|C_{TS}\|_F^2$ and $\text{den} = \|C_{TT}\|_F \, \|C_{SS}\|_F + \epsilon$.
  6. Final ACCKA score and loss: $\mathrm{ACCKA} = \text{num} / \text{den}$ and $\mathcal{L}_{\mathrm{PDist}} = 1 - \mathrm{ACCKA}$.

The entire process is batchable, numerically stable with standard floating-point precision, and robust to division-by-zero through $\epsilon$-stabilization. The computational complexity per sample is $O(L E_T E_S + L E_T^2 + L E_S^2)$, scaling linearly with sequence length $L$ and quadratically with the embedding dimensions (dominated by the larger of $E_T$ and $E_S$) (Yang et al., 2 Feb 2026).
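For completeness, the following is a hedged sketch of a batched variant over `[B, L, E]` tensors; padding and attention masking, which a real pipeline over variable-length audio would need, are omitted here.

```python
import torch

def accka_batched(h_t: torch.Tensor, h_s: torch.Tensor, attn: torch.Tensor,
                  eps: float = 1e-6) -> torch.Tensor:
    """h_t: [B, L, E_T], h_s: [B, L, E_S], attn: [B, L]; returns per-sample scores of shape [B]."""
    w = attn / (attn.sum(dim=1, keepdim=True) + eps)     # per-sample attention normalization
    ht = w.unsqueeze(-1) * h_t                           # weight each time step
    hs = w.unsqueeze(-1) * h_s
    ht = ht - ht.mean(dim=1, keepdim=True)               # center columns per sample
    hs = hs - hs.mean(dim=1, keepdim=True)
    c_ts = torch.einsum('ble,blf->bef', ht, hs)          # [B, E_T, E_S]
    c_tt = torch.einsum('ble,blf->bef', ht, ht)          # [B, E_T, E_T]
    c_ss = torch.einsum('ble,blf->bef', hs, hs)          # [B, E_S, E_S]
    num = torch.linalg.norm(c_ts, dim=(1, 2)) ** 2       # squared Frobenius norm per sample
    den = torch.linalg.norm(c_tt, dim=(1, 2)) * torch.linalg.norm(c_ss, dim=(1, 2)) + eps
    return num / den
```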

6. Statistical and Learning-Theoretic Properties

Classic centered alignment, as formalized by Cortes, Mohri, and Rostamizadeh (Cortes et al., 2012), admits concentration bounds, kernel learning guarantees via convex quadratic programming, and stability-based generalization theorems. The extension to attention-weighted alignment introduces new statistical considerations: concentration now depends on the maximal weight $\max_i w_i$, and stability must account for bi-level fitting if $w$ is optimized on the same data. Proper regularization of the attention vector would then be necessary to avoid overfitting, though in ACCKA $w$ is fixed by the teacher's attention and thus not subject to direct optimization. Theoretical tools such as algorithmic stability and Rademacher complexity can be adapted to accommodate weighted kernels, provided constraints on $\max_i w_i$ are observed (Cortes et al., 2012).

7. Applications and Significance

ACCKA is deployed within the PL-Distill framework to enable projector-level knowledge distillation for LALMs applied to speech emotion recognition (SER). By combining ACCKA-guided projector-level alignment with logits-level KL divergence minimization, PL-Distill achieves compression of an 8.4B-parameter teacher to a 1.1B-parameter student while consistently outperforming both the teacher and SOTA baselines across diverse SER benchmarks (IEMOCAP, RAVDESS, SAVEE). ACCKA's conceptual innovation is its use of teacher-driven attention to selectively transfer representation structure, and its formal kernel-theoretic underpinning ensures robust alignment without requiring ad hoc dimension matching or additional regularization (Yang et al., 2 Feb 2026).

A plausible implication is that the ACCKA formalism may generalize to other cross-modal or structured distillation settings where attention signals indicate salience. Its computational efficiency and precise handling of embedding mismatch make it a compelling candidate for ongoing research in model compression and transfer learning.
