
Centered Kernel Alignment (CKA) Similarity

Updated 14 November 2025
  • CKA similarity is a scale-invariant, kernel-based metric that quantifies the alignment between representation pairs by comparing their centered Gram matrices.
  • It is invariant to orthogonal transformations and isotropic scaling, making it ideal for analyzing neural network layers, cross-modal systems, and representation diagnostics.
  • Debiased CKA estimators address finite-sample bias in high-dimensional settings, enabling more reliable model pruning, transfer learning, and cross-system alignment.

Centered Kernel Alignment (CKA) similarity is a scale-invariant, kernel-based metric that quantifies the similarity between two sets of representations—such as neural activations across layers, models, or even species—by comparing the pairwise similarity structures induced over a shared set of observations. Widely adopted in neuroscience, deep learning, and multimodal AI, CKA underpins both empirical analysis and algorithmic regularization, notably in cross-system alignment, representation diagnostics, pruning, and transfer learning frameworks.

1. Mathematical Definition and Properties

Let $X \in \mathbb{R}^{n \times d_1}$ and $Y \in \mathbb{R}^{n \times d_2}$ be two sets of representations, each row corresponding to the same $n$ samples (stimuli, tokens, images, etc.), and (typically) column-centered. Define linear Gram (kernel) matrices

$$K = XX^\top, \qquad L = YY^\top,$$

and the centering matrix

$$H = I_n - \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^\top.$$

Centered (feature) Gram matrices are

$$K_c = HKH, \qquad L_c = HLH.$$

The empirical Hilbert–Schmidt Independence Criterion (HSIC) is

$$\mathrm{HSIC}(K,L) = \operatorname{tr}(K_c L_c)$$

(usually divided by $(n-1)^2$, but this factor cancels in normalized CKA).

Centered Kernel Alignment (CKA) is then defined as

$$\boxed{\,\mathrm{CKA}(X, Y) = \frac{\operatorname{tr}(K_c L_c)}{\sqrt{\operatorname{tr}(K_c^2)\, \operatorname{tr}(L_c^2)}}\,}$$

For linear kernels, if $X$ and $Y$ are zero-mean column-wise,

$$\mathrm{CKA}_\text{Linear}(X,Y) = \frac{\| X^\top Y \|_F^2}{\| X^\top X \|_F \, \| Y^\top Y \|_F}.$$

CKA yields a real scalar in $[0,1]$, equal to 1 if and only if the (centered) representations coincide up to an orthogonal transformation and isotropic scaling.
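
For concreteness, here is a minimal NumPy sketch of the Gram-matrix form above; the function name `linear_cka_gram` is illustrative and not drawn from any particular library.

```python
import numpy as np

def linear_cka_gram(X, Y):
    """Linear CKA via centered Gram matrices (definition above).

    X: (n, d1) array, Y: (n, d2) array; rows index the same n samples.
    """
    n = X.shape[0]
    K = X @ X.T                              # linear Gram matrix for X
    L = Y @ Y.T                              # linear Gram matrix for Y
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix H
    Kc, Lc = H @ K @ H, H @ L @ H            # centered Gram matrices
    hsic = np.trace(Kc @ Lc)                 # unnormalized (biased) HSIC
    return hsic / np.sqrt(np.trace(Kc @ Kc) * np.trace(Lc @ Lc))
```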

Invariance Structure

  • Orthogonal invariance: $X \gets XQ$ or $Y \gets YR$ ($Q$, $R$ orthonormal) leaves CKA unchanged.
  • Isotropic scaling invariance: $X \gets \alpha X$, $Y \gets \beta Y$ leaves CKA unchanged.
  • Not invariant to arbitrary invertible transforms: CKA is sensitive to non-orthogonal shape deformations.
  • Dependence on mean removal: Centering is essential; uncentered data yield misleadingly high alignment for trivial shifts.

These invariances distinguish CKA from CCA, Procrustes, and RV coefficients (Kornblith et al., 2019, Harvey et al., 12 Nov 2024, Davari et al., 2022).

2. Statistical and Algorithmic Foundations

CKA is grounded in the Hilbert–Schmidt Independence Criterion (Harvey et al., 12 Nov 2024). For two neural populations or representation sets, the normalized trace in CKA computes the cosine of the “angle” between their centered similarity matrices in Frobenius space. This aligns their geometric structures rather than just individual responses.

Equivalently, CKA can be derived from the average alignment between $\ell_2$-regularized linear decoders (i.e., the mean normalized squared inner product between optimal readouts for random regression tasks over the population), establishing a tight link between geometry and functional alignability (Harvey et al., 12 Nov 2024). Linear CKA thus quantifies the normalized average agreement of optimal linear decoders across systems.

Variants: CKA extends to non-linear kernels (e.g., Gaussian RBF), at the cost of higher $O(n^2)$ to $O(n^3)$ memory/time, but most modern deep learning applications adopt the linear variant (Cloos et al., 26 Sep 2024).

Twelve major variants arise from (kernel: linear or RBF) × (HSIC estimator: biased, unbiased, or “tril”) × (scoring: “score” in $[0,1]$, or “angular” in $[0, \frac{\pi}{2}]$) (Cloos et al., 26 Sep 2024). These must not be conflated; different research communities have implemented different forms.

3. Bias and Estimation in High Dimensions

Finite-Sample Bias

CKA’s popularity in high-dimensional, low-sample settings (e.g., neuroscience: $P \gg N$) exposes substantial finite-sample bias. The naive (biased) estimator tends to 1 even for completely random, unaligned representations as the feature/sample ratio grows (Murphy et al., 2 May 2024, Chun et al., 20 Feb 2025). This causes false discoveries of alignment when comparing, e.g., large fMRI ROIs to deep network layers or distinct networks on shared input (Murphy et al., 2 May 2024).

Debiased and Bias-Corrected Estimators

Finite-sample correction is achieved via U-statistic-based unbiased centering (Murphy et al., 2 May 2024):

$$\widetilde A_{ij} = a_{ij} - \frac{1}{N-2} \sum_{k} a_{ik} - \frac{1}{N-2} \sum_{k} a_{kj} + \frac{1}{(N-1)(N-2)} \sum_{k,\ell} a_{k\ell}, \quad (i \ne j),$$

with the final unbiased HSIC and debiased CKA given by

$$\widehat{\mathrm{HSIC}}_\mathrm{unb}(K,L) = \frac{1}{N(N-3)} \sum_{i\ne j} \widetilde K_{ij} \, \widetilde L_{ij},$$

$$\mathrm{CKA}_\mathrm{debiased}(K,L) = \frac{\widehat{\mathrm{HSIC}}_\mathrm{unb}(K,L)}{\sqrt{\widehat{\mathrm{HSIC}}_\mathrm{unb}(K,K)\, \widehat{\mathrm{HSIC}}_\mathrm{unb}(L,L)}}.$$
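
These estimators translate directly into code. The NumPy sketch below transcribes the equations as stated here (it requires $N \ge 4$); published implementations differ in small details such as the handling of diagonal entries, so it should be read as illustrative rather than canonical.

```python
import numpy as np

def unbiased_center(A):
    """U-statistic-style centering of a Gram matrix, transcribing the
    expression for A~_ij above; the diagonal is set to zero."""
    n = A.shape[0]
    row = A.sum(axis=1, keepdims=True)                 # sum_k a_ik (per row i)
    col = A.sum(axis=0, keepdims=True)                 # sum_k a_kj (per column j)
    At = A - row / (n - 2) - col / (n - 2) + A.sum() / ((n - 1) * (n - 2))
    np.fill_diagonal(At, 0.0)
    return At

def debiased_cka(K, L):
    """Debiased CKA from the unbiased HSIC estimator above."""
    n = K.shape[0]
    Kt, Lt = unbiased_center(K), unbiased_center(L)
    hsic = lambda A, B: (A * B).sum() / (n * (n - 3))  # sum over i != j
    return hsic(Kt, Lt) / np.sqrt(hsic(Kt, Kt) * hsic(Lt, Lt))
```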

Further generalization corrects both stimulus and feature sampling; this estimator retains near-unbiasedness down to very sparse neuronal sampling (Chun et al., 20 Feb 2025).

Input-Driven Confounds

When the structure of the input data dominates, even networks initialized with random weights exhibit high CKA in shallow layers. Covariate-adjusted regression (dCKA) removes the influence of the input similarity structure, resolving spurious alignments (Cui et al., 2022).

4. Practical Algorithms and Usage

Efficient Computation

For large-scale settings ($n$ up to $10^4$ and $d$ in $10^3$–$10^4$), linear CKA can be implemented without explicit $n \times n$ Gram matrices (Kornblith et al., 2019). Central steps (a code sketch follows the list):

  1. Center the columns of $X$ and $Y$.
  2. Compute $C = X^\top Y$, $S_X = X^\top X$, $S_Y = Y^\top Y$.
  3. Compute the numerator $\|C\|_F^2$ and the denominator $\|S_X\|_F \cdot \|S_Y\|_F$.
  4. Return $\mathrm{CKA} = \|C\|_F^2 / (\|S_X\|_F \cdot \|S_Y\|_F)$.
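
A minimal NumPy sketch of these four steps (function name illustrative) is shown below; it never materializes an $n \times n$ Gram matrix and agrees with the Gram-matrix form up to numerical precision.

```python
import numpy as np

def linear_cka_features(X, Y):
    """Linear CKA computed in feature space (steps 1-4 above), avoiding
    explicit n x n Gram matrices. X: (n, d1), Y: (n, d2)."""
    Xc = X - X.mean(axis=0, keepdims=True)                 # step 1: center columns
    Yc = Y - Y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(Xc.T @ Yc, "fro") ** 2            # step 3: ||C||_F^2
    den = (np.linalg.norm(Xc.T @ Xc, "fro")
           * np.linalg.norm(Yc.T @ Yc, "fro"))             # step 3: ||S_X||_F ||S_Y||_F
    return num / den                                       # step 4
```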

For non-linear kernels, matrix computations scale as $O(n^2 d)$ to $O(n^3)$.

Model Pruning and Regularization

CKA serves as an explicit criterion in pruning and training regularization:

  • Layer/Block Pruning: Group layers with CKA $> \tau$ as “redundant”; prune all but one per group, then retrain (see the sketch after this list). For BERT and T5, $\tau = 0.98$–$0.99$ yields up to 50% reduction without accuracy loss (Hu et al., 24 Aug 2024, Pons et al., 27 May 2024).
  • Sparse Training: Minimizing interlayer CKA provably reduces mutual information and increases sparsity through the information bottleneck (Ni et al., 2023).
  • Cross-System Alignment: In LLMs for multilingual MT, layer-wise CKA alignment secures cross-lingual feature sharing; e.g., CKA-based terms at mid-layers yield $\sim$1 BLEU/chrF point gain in low-resource translation (Nakai et al., 3 Oct 2025).
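
As one hypothetical illustration of the layer-pruning criterion in the first bullet, the sketch below greedily groups consecutive layers whose probe-batch activations exceed a CKA threshold; the cited methods differ in how groups are formed and which layer of each group is retained.

```python
def group_redundant_layers(layer_acts, cka_fn, tau=0.98):
    """Greedily group consecutive layers whose activations have CKA > tau.

    layer_acts: list of (n, d_l) activation matrices from a probe batch;
    cka_fn: any CKA implementation (e.g. a linear CKA as sketched earlier);
    tau: redundancy threshold. A pruning pass would keep one layer per
    group and fine-tune. Illustrative only, not the exact published recipe.
    """
    groups = [[0]]
    for i in range(1, len(layer_acts)):
        if cka_fn(layer_acts[groups[-1][-1]], layer_acts[i]) > tau:
            groups[-1].append(i)     # redundant with the current group
        else:
            groups.append([i])       # start a new group
    return groups
```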

Example: CKA for Alignment Regularization in MT

For parallel sentence pairs $(x^{(A)}, x^{(B)})$, extract $H^{(A)}_\ell, H^{(B)}_\ell \in \mathbb{R}^{T \times d}$, flatten over tokens/batch, center, and compute

$$\mathrm{CKA}(X, Y) = \frac{\| X_c^\top Y_c \|_F^2}{\sqrt{\| X_c^\top X_c \|_F^2}\, \sqrt{\| Y_c^\top Y_c \|_F^2}}.$$

Apply a loss penalty $L_{\mathrm{CKA}} = 1 - \mathrm{CKA}(X, Y)$ at layer $\ell$ (Nakai et al., 3 Oct 2025).
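
A possible differentiable implementation of this penalty, written as a PyTorch sketch with an illustrative function name, is shown below; the exact formulation in the cited work may differ (e.g., in how tokens are pooled or batches flattened).

```python
import torch

def cka_alignment_loss(H_a, H_b, eps=1e-8):
    """1 - linear CKA between two sets of hidden states, usable as a
    differentiable penalty at a chosen layer.

    H_a, H_b: (T, d) tensors of token-level states for a parallel pair
    (or a flattened batch), following the formula above."""
    Xc = H_a - H_a.mean(dim=0, keepdim=True)   # center over tokens
    Yc = H_b - H_b.mean(dim=0, keepdim=True)
    num = (Xc.T @ Yc).pow(2).sum()                                   # ||Xc^T Yc||_F^2
    den = torch.sqrt((Xc.T @ Xc).pow(2).sum()
                     * (Yc.T @ Yc).pow(2).sum() + eps)               # eps for stability
    return 1.0 - num / den
```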

Subspace-Level CKA

Global CKA can obscure fine-grained, trait-relevant leakage: subspace-level CKA restricts evaluation to task-discriminative directions (e.g., a single projection from a logistic regression classifier), revealing transferability not measured by global similarity (Okatan et al., 2 Nov 2025).

$$\mathrm{CKA}_\text{trait-subspace}(Z_T U,\, Z_S U)$$

where $U$ spans the trait-relevant basis.

Thresholds on trait-subspace CKA and projection-penalty interventions can reduce leakage with no main-task loss.
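
As a hedged illustration, the sketch below computes CKA after projecting both systems onto a single trait-discriminative direction obtained from a logistic-regression probe; the variable names, the choice of probe, and fitting the probe on one system only are assumptions for the example, not the cited protocol.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def trait_subspace_cka(Z_T, Z_S, y, cka_fn):
    """CKA restricted to a trait-discriminative direction (illustrative).

    Z_T, Z_S: (n, d) representations of the same n samples from two systems
    with a shared feature dimension d; y: (n,) binary trait labels.
    U is a single unit direction from a logistic-regression probe fit on
    Z_T, as one concrete choice of trait-relevant basis; cka_fn is any
    CKA implementation (e.g. the linear sketches above).
    """
    probe = LogisticRegression(max_iter=1000).fit(Z_T, y)
    U = probe.coef_.T / np.linalg.norm(probe.coef_)   # (d, 1) unit direction
    return cka_fn(Z_T @ U, Z_S @ U)                   # CKA in the subspace
```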

5. Empirical Observations and Limitations

Interpretation and Sensitivity

  • Dominance by Principal Components: Linear CKA disproportionately emphasizes alignment of high-variance principal directions (Cloos et al., 9 Jul 2024). Misalignment of leading PCs leads to rapid score drop, while low-variance PCs contribute weakly; this differs from Procrustes or Bures measures which are linearly sensitive.
  • Functional Correspondence: CKA and Procrustes best correlate with behaviorally meaningful distinctions in both neuroscience and vision models, outperforming predictivity or CCA in differentiating trained/untrained networks (Bo et al., 21 Nov 2024).
  • Lack of Universal Thresholds: No “good” CKA value is universal; the threshold for functionally relevant transfer varies by data, task, and metric (Cloos et al., 9 Jul 2024).
  • Manipulation and Cautions: CKA is highly sensitive to outliers and can be manipulated independently of task performance; similar CKA scores may not imply functional equivalence (Davari et al., 2022).

Summary Table: CKA Features and Caveats

| Aspect | Mathematical Property | Empirical Impact / Caveat |
|---|---|---|
| Orthogonal/scaling | Invariant | Captures subspace rather than basis |
| Arbitrary transform | Not invariant | Sensitive to shape, not mere isomorphism |
| Principal components | Quadratic sensitivity | Dominated by top-variance dimensions |
| Outlier sensitivity | Non-robust | Single-point shifts can suppress CKA |
| Data bias | Inflated under $P \gg N$ | Debias or covariate-adjust |
| Kernel choice | Linear (fast), RBF (nonlinear) | Linear default in deep learning |
| Score range | $[0,1]$ or $[0, \pi/2]$ | Multiple definitions in the literature |

6. Cross-Domain, Multimodal, and Large-Scale Applications

CKA underlies modern approaches to:

  • Cross-modal alignment: Relating vision and language encoders (including unaligned models) using global or localized CKA, with high CKA ($\sim 0.7$) between vision and language encoders, providing a foundation for zero-shot matching and retrieval via CKA-based quadratic assignment (Maniparambil et al., 10 Jan 2024).
  • Brain-model alignment: Applied to fMRI, MEG, and direct recordings; unbiased CKA recovers brain region–layer correspondences not visible with biased estimators or shuffled controls (Murphy et al., 2 May 2024, Chun et al., 20 Feb 2025).
  • Standardization and reproducibility: Major repositories now catalog 100+ similarity measures, standardizing “linear–gretton–score,” “rbf–unbiased–angular,” etc. to resolve confusion in literature (Cloos et al., 26 Sep 2024).

7. Recommendations and Best Practices

CKA remains a principal—though not unproblematic—instrument for dissecting neural representation geometry and functional alignment in complex artificial and biological systems, especially when augmented with bias-correction and subspace diagnostics.
