Centered Kernel Alignment (CKA) Overview

Updated 18 November 2025
  • CKA is a similarity metric that quantifies the alignment between neural representations using centered Gram matrices and the Hilbert–Schmidt Independence Criterion.
  • It achieves invariance to isotropic scaling and orthogonal transformations, making it effective for comparing covariance structures across different neural architectures.
  • Despite its robustness, CKA is sensitive to outliers and finite-sample biases, which has led to the development of debiased and manifold-aware extensions.

Centered Kernel Alignment (CKA) is a kernel-based similarity metric designed to quantify the degree of alignment between two sets of representations, typically arising as activations from different neural network layers, distinct architectures, or alternative population encodings. CKA is defined in terms of the Hilbert–Schmidt Independence Criterion (HSIC) over centered Gram (kernel) matrices and is widely used for comparing representation spaces in deep learning, kernel learning, neuroscience, and beyond. By virtue of its normalization and centering operations, CKA attains invariance to isotropic scaling and orthogonal transformations of the respective feature spaces, allowing for robust comparison of covariance structures. The significance of CKA spans interpretability, model selection, sparsity regularization, and evaluation of generalization, but the metric has known pathologies such as sensitivity to outliers, geometric non-identifiability, and finite-sample bias—spurring the development of various extensions and corrections.

1. Mathematical Formulation and Theoretical Properties

Let $X \in \mathbb{R}^{n \times p}$ and $Y \in \mathbb{R}^{n \times q}$ denote two sets of representations, each row corresponding to the same input sample across possibly different feature dimensions. The standard (linear) CKA is constructed as follows:

  1. Gram Matrix Construction: Compute $K = XX^\top$ and $L = YY^\top$.
  2. Centering: Apply the centering matrix $H = I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$, computing $K_c = HKH$ and $L_c = HLH$.
  3. HSIC Calculation: Define

$$\operatorname{HSIC}(K, L) = \frac{1}{(n-1)^2}\,\mathrm{tr}(K_c L_c)$$

  4. CKA Normalization: Normalize the cross-HSIC via

$$\operatorname{CKA}(K, L) = \frac{\operatorname{HSIC}(K, L)}{\sqrt{\operatorname{HSIC}(K, K)\,\operatorname{HSIC}(L, L)}}$$

with $\operatorname{CKA}(K, L) \in [0, 1]$.
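
A minimal NumPy sketch of the four steps above follows; the function name and the use of the biased HSIC estimator (whose normalization constants cancel in the ratio) are illustrative choices rather than a reference implementation.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between representations X (n x p) and Y (n x q).

    Rows of X and Y must correspond to the same n input samples.
    Uses the (biased) HSIC estimator; the 1/(n-1)^2 factors cancel
    in the normalized ratio, so they are omitted.
    """
    n = X.shape[0]
    # Step 1: Gram matrices
    K = X @ X.T
    L = Y @ Y.T
    # Step 2: centering with H = I - (1/n) 1 1^T
    H = np.eye(n) - np.ones((n, n)) / n
    Kc, Lc = H @ K @ H, H @ L @ H
    # Steps 3-4: HSIC terms and normalization
    hsic_kl = np.trace(Kc @ Lc)
    hsic_kk = np.trace(Kc @ Kc)
    hsic_ll = np.trace(Lc @ Lc)
    return hsic_kl / np.sqrt(hsic_kk * hsic_ll)

# Invariance check: isotropic scaling and orthogonal rotation leave CKA unchanged.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 64))
Y = X @ rng.normal(size=(64, 32))               # Y as a linear readout of X
U, _ = np.linalg.qr(rng.normal(size=(64, 64)))  # random orthogonal matrix
print(linear_cka(X, Y), linear_cka(2.0 * X @ U, 0.5 * Y))  # equal up to float error
```

For $p, q \ll n$, the algebraically equivalent feature-space form $\|Y_c^\top X_c\|_F^2 / (\|X_c^\top X_c\|_F\,\|Y_c^\top Y_c\|_F)$, with column-centered $X_c$ and $Y_c$, avoids forming the $n \times n$ Gram matrices.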

Key invariance properties are:

  • Orthogonal Invariance: $\operatorname{CKA}(XU, YV) = \operatorname{CKA}(X, Y)$ for orthonormal $U, V$.
  • Isotropic Scaling Invariance: $\operatorname{CKA}(\alpha X, \beta Y) = \operatorname{CKA}(X, Y)$ for $\alpha, \beta > 0$.
  • Well-Defined for $p, q \gg n$: Only $n \times n$ Gram matrices are required.
  • Interpretation: CKA reflects the cosine similarity between the centered Gram matrices in Frobenius space, essentially comparing the entire inter-sample similarity structure (Kornblith et al., 2019, Gondhalekar et al., 2023).

CKA generalizes naturally to nonlinear kernels (e.g., Gaussian RBF), but linear CKA is most common for neural representation analysis (Alvarez, 2021).

2. Computation and Practical Implementation

Standard computation of linear CKA proceeds efficiently for moderate $n$:

  • Form $K = XX^\top$ and $L = YY^\top$.
  • Center using $H$ (cost $O(n^2)$).
  • Compute HSICs and normalize (total complexity dominated by kernel computation and centering).

For non-linear kernels, kernel entry computation may become significant, and parameter selection (e.g., bandwidth for RBF) has non-trivial effects. Median distance scaling is commonly adopted for bandwidth in RBF-CKA to maintain scaling invariance (Alvarez, 2021). For large $n$, one often resorts to mini-batch or subsampled approximations (Gondhalekar et al., 2023, Ni et al., 2023). Care must be taken in high-dimensional, low-sample regimes, as naive estimators can become upwardly biased (see below).
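
The sketch below illustrates one way to implement RBF-CKA with the median-distance bandwidth heuristic mentioned above; the helper names and the per-representation bandwidth choice are assumptions made for illustration.

```python
import numpy as np

def rbf_gram(X, sigma=None):
    """RBF Gram matrix; if sigma is None, use the median pairwise distance.

    Rescaling X rescales the median by the same factor, which is what
    preserves CKA's isotropic scaling invariance under this heuristic.
    """
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    if sigma is None:
        sigma = np.median(np.sqrt(d2[np.triu_indices_from(d2, k=1)]))
    return np.exp(-d2 / (2.0 * sigma ** 2))

def cka_from_grams(K, L):
    """CKA from precomputed Gram matrices (linear or nonlinear)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kc, Lc = H @ K @ H, H @ L @ H
    return np.sum(Kc * Lc) / np.sqrt(np.sum(Kc * Kc) * np.sum(Lc * Lc))

# Usage: RBF-CKA between two representations of the same 200 inputs.
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(200, 128)), rng.normal(size=(200, 64))
print(cka_from_grams(rbf_gram(X), rbf_gram(Y)))
```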

3. Applications: Diagnostics, Regularization, and Pruning

CKA is extensively leveraged across multiple avenues in modern deep learning:

  • Network Interpretability: CKA maps reveal layerwise redundancy, hierarchical structure, and blockwise processing (e.g., residual blocks, spatial-temporal modules); a minimal sketch for computing such a map follows this list (Kornblith et al., 2019, Vance et al., 9 Jan 2024). High off-diagonal CKA values indicate blocks of similar representations, often due to overparameterization or inadequate training (Gondhalekar et al., 2023, Vance et al., 9 Jan 2024).
  • Architectural Refinement: CKA identifies superfluous or static layers, informing pruning and model compression strategies. MPruner leverages CKA-based layer clustering to collapse nearly identical block sequences in both transformers and CNNs, often achieving ~50% parameter reduction with negligible accuracy degradation (Hu et al., 24 Aug 2024).
  • Sparsity Regularization: CKA-SR regularization, which penalizes high interlayer CKA, induces sparsity in network weights. This effect is theoretically grounded under the information bottleneck principle; minimizing interlayer CKA correlates with reduced mutual information between layers, provably encouraging sparsity (Ni et al., 2023).
  • Knowledge Distillation: CKA-based objectives enforce alignment not just between individual activations, but their relational (structural) distributions between teacher and student networks, both globally (across all samples) and locally (within-task, batch, or token-level), empirically improving transfer on benchmarks such as GLUE (Zhou et al., 22 Jan 2024, Jung et al., 2022).
  • Out-of-Distribution Generalization: In astronomy and vision, low off-diagonal CKA values across layers correlate with robust out-of-distribution (OOD) performance, while persistent high interlayer CKA signals representation collapse and failure to generalize (Gondhalekar et al., 2023).
  • Bayesian Deep Ensembles: In particle-based Bayesian deep learning, CKA and its hyperspherical energy extensions diversify ensemble members in feature space, leading to improved uncertainty quantification for OOD detection (Smerkous et al., 31 Oct 2024).
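
As referenced in the interpretability bullet above, a minimal sketch for producing a layerwise CKA map is shown below, assuming per-layer activations have already been collected as a list of (n, d_l) arrays over the same n inputs (the collection step and variable names are illustrative):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA via the feature-space form; X, Y are (n, d) activations over the same inputs."""
    Xc = X - X.mean(axis=0, keepdims=True)   # column-centering is equivalent to
    Yc = Y - Y.mean(axis=0, keepdims=True)   # centering the Gram matrices with H
    cross = np.linalg.norm(Yc.T @ Xc, "fro") ** 2
    return cross / (np.linalg.norm(Xc.T @ Xc, "fro") * np.linalg.norm(Yc.T @ Yc, "fro"))

def cka_map(activations):
    """Pairwise CKA matrix over a list of per-layer activation arrays."""
    num_layers = len(activations)
    M = np.eye(num_layers)                   # CKA of a layer with itself is 1
    for i in range(num_layers):
        for j in range(i + 1, num_layers):
            M[i, j] = M[j, i] = linear_cka(activations[i], activations[j])
    return M
```

Contiguous blocks of high off-diagonal values in the resulting map correspond to the runs of near-identical layers discussed above.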

4. Bias, Pathologies, and Corrections

CKA has well-documented limitations:

  • Sensitivity to Outliers and Translations: Translating even a small subset of samples (down to a single point) can dramatically reduce CKA while preserving all classification margins (Davari et al., 2022). This outlier sensitivity affects both linear and RBF CKA, regardless of class separability.
  • Finite-Sample Bias: When the number of features ($p$ or $q$) exceeds the number of samples ($n$), even entirely random data can spuriously yield CKA $\approx 1$, rendering "biased" CKA estimators inappropriate for the low-sample, high-dimensional regime (Murphy et al., 2 May 2024). Similar upward biases manifest in brain-to-model and multi-ROI neuroscientific applications.
  • Feature-Sampling Bias: Subsampling features (as is routine with neural data or weight sharing) induces further bias, leading to severe under- or over-estimation unless geometry-corrected estimators are used (Chun et al., 20 Feb 2025). The bias scales with the "intrinsic dimensionality" (participation ratio) of the representation, not CKA itself.
  • Mitigation Strategies: Use of "debiased" CKA (with unbiased U-statistics for HSIC), as well as recently proposed joint input-and-feature-corrected estimators, is recommended, especially for neuroscientific and biological settings (Murphy et al., 2 May 2024, Chun et al., 20 Feb 2025). Inclusion of random- and shuffled-data baselines is essential in reporting.
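
For concreteness, a sketch of one common form of the unbiased (U-statistic) HSIC estimator underlying debiased CKA is given below; the exact constants should be checked against the cited papers before use.

```python
import numpy as np

def hsic_unbiased(K, L):
    """U-statistic HSIC estimator over n x n Gram matrices K and L (requires n >= 4)."""
    n = K.shape[0]
    K, L = K.copy(), L.copy()
    np.fill_diagonal(K, 0.0)   # the U-statistic excludes diagonal (self-pair) terms
    np.fill_diagonal(L, 0.0)
    KL = K @ L
    term1 = np.trace(KL)
    term2 = K.sum() * L.sum() / ((n - 1) * (n - 2))
    term3 = 2.0 * KL.sum() / (n - 2)
    return (term1 + term2 - term3) / (n * (n - 3))

def cka_debiased(K, L):
    """Debiased CKA built from the unbiased HSIC estimator."""
    return hsic_unbiased(K, L) / np.sqrt(hsic_unbiased(K, K) * hsic_unbiased(L, L))
```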

5. Extensions: Manifold and Local Geometry-Aware Alignment

CKA treats all inter-sample pairs globally, leading to scale sensitivity and insensitivity to local geometric structure:

  • Manifold-Approximated Kernel Alignment (MKA): To address this fundamental limitation, MKA leverages sparse, directed $k$-NN graphs to form local neighborhood-preserving kernels, then adapts CKA to compare the (row-centered) neighborhood matrices; a loose sketch of this construction follows this list (Islam et al., 27 Oct 2025). This yields alignment that is robust to scale and geometric perturbations, and less sensitive to global translation or density spikes.
  • Empirical Findings: MKA improves ranking and discrimination of local geometric changes in synthetic topologies, outpaces CKA in perturbation and translation-invariance tests, and offers competitive or superior mean-rank on real-world benchmarks, with fewer hyperparameters and less bandwidth sensitivity.
  • Limitations: MKA presupposes well-defined local neighborhoods; in degenerate data manifolds or with extreme noise, the locality may misrepresent true correspondence.
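
The following is a loose, illustrative sketch of the general idea (sparse directed $k$-NN neighborhood kernels compared with a CKA-style normalized inner product after row-centering); the binary adjacency weights and helper names are assumptions and do not reproduce the exact construction in the cited paper.

```python
import numpy as np

def knn_neighborhood_kernel(X, k=10):
    """Directed, binary k-NN adjacency used as a sparse neighborhood kernel (illustrative)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(d2, np.inf)              # exclude self-neighbors
    idx = np.argsort(d2, axis=1)[:, :k]       # k nearest neighbors per row
    A = np.zeros_like(d2)
    np.put_along_axis(A, idx, 1.0, axis=1)    # directed adjacency
    return A

def mka_like_alignment(X, Y, k=10):
    """CKA-style cosine similarity between row-centered k-NN neighborhood matrices."""
    A = knn_neighborhood_kernel(X, k)
    B = knn_neighborhood_kernel(Y, k)
    A = A - A.mean(axis=1, keepdims=True)     # row-centering
    B = B - B.mean(axis=1, keepdims=True)
    return np.sum(A * B) / (np.linalg.norm(A) * np.linalg.norm(B))
```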

6. Relations to Other Multivariate Similarity Measures

CKA unifies several lines of multivariate analysis:

  • Kernel Alignment: Centered alignment was originally developed for kernel selection and learning in SVMs and regression; maximizing alignment over convex combinations of candidate kernels (via quadratic programming) yields statistically consistent, stable kernels whose alignment is highly correlated with generalization performance (Cortes et al., 2012).
  • Relation to CCA and RV-coefficient: Linear CKA is a variance-weighted analog of mean squared Canonical Correlation Analysis (CCA); both compare multivariate similarities, but CKA is well-defined even when $p, q > n$, unlike CCA. Their invariance properties differ: CCA is invariant to arbitrary invertible linear transforms, whereas CKA is invariant only to orthogonal transforms and isotropic scaling. The RV-coefficient and Tucker's congruence coefficient are related measures, broadly subsumed under the CKA framework; the RV relation is made explicit after this list (Kornblith et al., 2019).
  • Connection to Maximum Mean Discrepancy (MMD): Maximizing CKA is equivalent to minimizing a bound on MMD (with a constant term), which explains its effectiveness as a distributional similarity metric and its non-vanishing gradients, making it powerful for knowledge distillation (Zhou et al., 22 Jan 2024).
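
To make the RV-coefficient relation above concrete: with column-centered $X_c$, $Y_c$ and linear kernels, CKA reduces to the classical RV coefficient,

$$\operatorname{CKA}_{\mathrm{lin}}(X, Y) = \frac{\mathrm{tr}(K_c L_c)}{\sqrt{\mathrm{tr}(K_c^2)\,\mathrm{tr}(L_c^2)}} = \frac{\lVert Y_c^\top X_c \rVert_F^2}{\lVert X_c^\top X_c \rVert_F\,\lVert Y_c^\top Y_c \rVert_F} = \mathrm{RV}(X, Y).$$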

7. Limitations, Best Practices, and Current Directions

  • Interpretation: High CKA does not guarantee functionally similar or transferable features, nor does low CKA conclusively reflect divergent model behavior. Complementary diagnostics (linear probes, filter visualization, class separability) are advised (Davari et al., 2022).
  • Kernel Selection: Linear CKA suffices in most conventional deep neural settings. For strongly nonlinear representation spaces, RBF-CKA may provide additional sensitivity, but only at well-chosen bandwidths (Alvarez, 2021).
  • Reporting and Controls: Always report kernel type, centering, and normalization details, and perform robustness checks with random and shuffled controls (a sketch of such controls follows this list). Correct for finite-sample and feature bias as appropriate (Murphy et al., 2 May 2024, Chun et al., 20 Feb 2025).
  • Computational Complexity: Full CKA calculation scales quadratically (or worse) in $n$; use mini-batch or approximate computation for large datasets.
  • Open Problems: Causality of CKA–performance correlations, robustness to data manifold drift, and the integration of local geometric information remain active areas of research (Islam et al., 27 Oct 2025, Gondhalekar et al., 2023).
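
A small sketch of the random and shuffled controls recommended above; `cka_fn` stands in for any CKA implementation (e.g., the linear CKA sketch earlier), and the specific baselines shown are illustrative rather than prescriptive.

```python
import numpy as np

def cka_with_controls(cka_fn, X, Y, n_shuffles=100, seed=0):
    """Report observed CKA next to shuffled-row and matched-shape random baselines."""
    rng = np.random.default_rng(seed)
    observed = cka_fn(X, Y)
    # Shuffled control: break the sample correspondence between X and Y.
    shuffled = np.mean([cka_fn(X, Y[rng.permutation(len(Y))]) for _ in range(n_shuffles)])
    # Random control: Gaussian data with the same shapes (chance-level alignment).
    random_baseline = cka_fn(rng.normal(size=X.shape), rng.normal(size=Y.shape))
    return {"observed": observed, "shuffled": shuffled, "random": random_baseline}
```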

CKA, together with its bias-corrected and manifold-aware variants, has become a mainstay for representation similarity in deep learning, providing both a descriptive and actionable framework for model architecture analysis, training, and interpretability. Its theoretical guarantees, normalization, and broad empirical validation distinguish it from conventional multivariate or distance-based metrics, but practitioners are urged to apply it judiciously and in conjunction with complementary analytical tools.
