
Manifold-Approximated Kernel Alignment (MKA)

Updated 26 November 2025
  • Manifold-Approximated Kernel Alignment (MKA) is a technique that measures similarity between learned representations by capturing local manifold geometry.
  • It employs k-nearest neighbor graph-based kernels to overcome limitations of global methods, ensuring robustness against density variations and bandwidth sensitivity.
  • MKA is applied for both representational alignment and transformer layer merging, offering computational efficiency and improved model compression outcomes.

Manifold-Approximated Kernel Alignment (MKA) is a family of techniques for measuring the similarity between sets of learned representations by explicitly accounting for manifold geometry in high-dimensional data. Variants of MKA have recently been applied both as a robust metric for representational alignment and as a formalism for manifold-informed model compression, such as efficient layer merging in deep neural architectures. In contrast to classical kernel alignment methods, which typically rely on global kernel functions and can be overly sensitive to density variations and bandwidth hyperparameters, MKA employs locality-preserving, graph-based kernels derived from k-nearest neighbor graphs or diffusion processes. This approach yields alignment measures that faithfully capture local manifold structure, enhancing both theoretical reliability and empirical robustness across a broad range of representation learning applications (Islam et al., 27 Oct 2025, Liu et al., 24 Jun 2024).

1. Mathematical Foundation and Motivation

Classical centered kernel alignment (CKA) quantifies the similarity between two sets of representations $X \in \mathbb{R}^{n \times d_1}$ and $Y \in \mathbb{R}^{n \times d_2}$ by computing the normalized inner product between their Gram (kernel) matrices. Specifically, given positive-definite kernels $k$ and $\ell$, one forms $K_X, K_Y \in \mathbb{R}^{n \times n}$ with entries $K_X(i,j) = k(x_i, x_j)$ and $K_Y(i,j) = \ell(y_i, y_j)$. CKA then measures alignment through the normalized Hilbert–Schmidt Independence Criterion:

$$\mathrm{CKA}(K_X, K_Y) = \frac{\langle H K_X H,\, H K_Y H \rangle_F}{\sqrt{\langle H K_X H,\, H K_X H \rangle_F \,\langle H K_Y H,\, H K_Y H \rangle_F}}$$

where $H = I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ is the centering matrix.
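For concreteness, the following is a minimal sketch of this baseline with linear kernels; the function name and synthetic data are illustrative, not from the cited papers.

```python
import numpy as np

def centered_kernel_alignment(K_x: np.ndarray, K_y: np.ndarray) -> float:
    """CKA between two n x n Gram matrices via the normalized HSIC."""
    n = K_x.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix H = I_n - (1/n) 11^T
    Kc, Lc = H @ K_x @ H, H @ K_y @ H        # doubly centered kernels
    num = np.sum(Kc * Lc)                    # <H K_X H, H K_Y H>_F
    return num / np.sqrt(np.sum(Kc * Kc) * np.sum(Lc * Lc))

# Linear-kernel example: K_X = X X^T, K_Y = Y Y^T.
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(100, 32)), rng.normal(size=(100, 64))
print(centered_kernel_alignment(X @ X.T, Y @ Y.T))
```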

However, CKA is limited by global density weighting, sensitivity to the kernel bandwidth parameter, and an inability to capture purely topological or local geometric equivalence. These limitations motivate the replacement of global kernels with graph-based, locally-adaptive constructions. The manifold hypothesis posits that data in high dimensions tends to lie near low-dimensional manifolds; two representations are "aligned" if they share the same intrinsic manifold geometry regardless of global rescaling or density distortions (Islam et al., 27 Oct 2025).

2. Manifold-Aware Kernels and Alignment Objective

MKA constructs sparse, locality-aware kernels $K_U, L_U$ derived from the k-nearest neighbor (k-NN) graphs of the respective representation spaces. For each sample $x_i$, one identifies its $k$ nearest neighbors (using Euclidean or a task-appropriate distance) and defines an adaptive local normalization:

$$\rho_i = \min_{j \neq i,\; j \in \mathrm{KNN}(x_i)} d(x_i, x_j)$$

Each row $i$ of $K_U$ is populated as:

  • $(K_U)_{ii} = 1$
  • $(K_U)_{ij} = \exp\!\left(-[d(x_i, x_j) - \rho_i]/\sigma_i\right)$ if $j \in \mathrm{KNN}(x_i)$, $j \ne i$
  • $(K_U)_{ij} = 0$ otherwise

The bandwidth $\sigma_i$ is chosen so that each row sums to $D = 1 + \log_2 k$, providing outlier-robust local scaling. The alignment measure is then formulated as

$$\overline{K} = K_U H, \qquad \overline{L} = L_U H$$

$$\mathrm{MKA}(K_U, L_U) = \frac{\langle \overline{K}, \overline{L} \rangle}{\sqrt{\langle \overline{K}, \overline{K} \rangle \,\langle \overline{L}, \overline{L} \rangle}}$$

Under the constant-row-sum assumption, this reduces to:

$$\mathrm{MKA}(K_U, L_U) = \frac{\langle K_U, L_U \rangle - D^2}{\sqrt{\left(\langle K_U, K_U \rangle - D^2\right)\left(\langle L_U, L_U \rangle - D^2\right)}}$$

All claims above appear explicitly in (Islam et al., 27 Oct 2025). This objective is computationally efficient ($O(nk)$ for $n$ samples and $k$ neighbors), bypasses global centering, and provides a scale- and density-insensitive measure of manifold congruence.
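A minimal sketch of the construction and the closed-form score above is given below, assuming Euclidean distances; the UMAP-style binary search for the per-row bandwidth $\sigma_i$ and the helper names (`local_knn_kernel`, `mka_score`) are illustrative assumptions, and a sparse-matrix implementation would be preferable at scale.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_knn_kernel(X: np.ndarray, k: int = 15, n_iter: int = 64) -> np.ndarray:
    """Build the n x n kernel K_U with adaptive local scaling so that every
    row sums to D = 1 + log2(k). Dense storage keeps the sketch short."""
    n = X.shape[0]
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dists, idx = nbrs.kneighbors(X)          # column 0 is each point itself
    dists, idx = dists[:, 1:], idx[:, 1:]    # keep the k true neighbors
    target = np.log2(k)                      # off-diagonal row sum; the diagonal 1 brings the total to D
    K = np.zeros((n, n))
    for i in range(n):
        rho = dists[i, 0]                    # rho_i: distance to the nearest neighbor
        lo, hi = 1e-12, 1e6
        for _ in range(n_iter):              # binary search for sigma_i
            sigma = 0.5 * (lo + hi)
            if np.exp(-(dists[i] - rho) / sigma).sum() > target:
                hi = sigma                   # row too heavy -> shrink bandwidth
            else:
                lo = sigma
        K[i, idx[i]] = np.exp(-(dists[i] - rho) / sigma)
        K[i, i] = 1.0
    return K

def mka_score(K_u: np.ndarray, L_u: np.ndarray, k: int) -> float:
    """Closed-form MKA under the constant row-sum assumption."""
    D2 = (1.0 + np.log2(k)) ** 2
    cross = np.sum(K_u * L_u) - D2
    return cross / np.sqrt((np.sum(K_u * K_u) - D2) * (np.sum(L_u * L_u) - D2))
```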

3. Algorithmic Implementation and Scalability

MKA’s practical workflow involves:

  1. Constructing a k-NN graph in each representation space, using either brute-force or approximate nearest neighbor search (FAISS, HNSW).
  2. Computing adaptive bandwidths and filling sparse kernels.
  3. Calculating the alignment score via inner products of the locally-normalized kernels.
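A compact end-to-end usage sketch of this workflow, reusing the hypothetical helpers `local_knn_kernel` and `mka_score` from the previous section on two synthetic representations:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 128))          # representations from model/layer A
Y = rng.normal(size=(500, 256))          # representations from model/layer B
K_u = local_knn_kernel(X, k=15)          # step 1-2: k-NN graph + adaptive kernel
L_u = local_knn_kernel(Y, k=15)
print("MKA alignment:", mka_score(K_u, L_u, k=15))  # step 3: alignment score
```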

The computational bottleneck is the neighbor search, which dominates at $O(n \log n \cdot d)$ for $d$-dimensional data, while subsequent steps scale as $O(nk)$. Sparse-matrix inner-product computations further enhance efficiency. The method supports Laplacian regularization, enabling explicit penalization or enforcement of alignment along shared manifold directions:

$$\mathrm{MKA}_{\alpha,\beta} = \frac{\mathrm{HSIC}(K_X, K_Y) + \alpha\,\mathrm{tr}(K_X L) + \beta\,\mathrm{tr}(K_Y L)}{\sqrt{\left[\mathrm{HSIC}(K_X, K_X) + \alpha^2 \|L\|_F^2\right]\left[\mathrm{HSIC}(K_Y, K_Y) + \beta^2 \|L\|_F^2\right]}}$$

where $L$ is the graph Laplacian and $\alpha, \beta$ modulate the contribution of manifold smoothness (Islam et al., 27 Oct 2025). Complexity is further reduced through approximate search and sparsity.
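The sketch below illustrates the regularized objective, assuming HSIC is evaluated as the Frobenius inner product of doubly centered kernels and that a graph Laplacian over a shared neighborhood graph is supplied by the caller; names and default hyperparameters are illustrative.

```python
import numpy as np

def hsic(K: np.ndarray, M: np.ndarray) -> float:
    """HSIC as the Frobenius inner product of doubly centered Gram matrices."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return float(np.sum((H @ K @ H) * (H @ M @ H)))

def mka_regularized(K_x, K_y, L_graph, alpha=0.1, beta=0.1) -> float:
    """Laplacian-regularized MKA_{alpha,beta} from the expression above."""
    lap_norm_sq = np.sum(L_graph ** 2)                       # ||L||_F^2
    num = hsic(K_x, K_y) + alpha * np.trace(K_x @ L_graph) + beta * np.trace(K_y @ L_graph)
    den = np.sqrt((hsic(K_x, K_x) + alpha ** 2 * lap_norm_sq) *
                  (hsic(K_y, K_y) + beta ** 2 * lap_norm_sq))
    return num / den
```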

4. Empirical Evaluation and Representational Robustness

MKA consistently outperforms or matches traditional CKA and its local variants (kCKA, IMD, RTD) in both synthetic and real-world settings (Islam et al., 27 Oct 2025):

  • Synthetic topological tests: MKA robustly captures manifold equivalence (e.g., Swiss roll vs. S-curve), is less sensitive to $k$, and tracks ground-truth topology across a range of deformations, clusterings, noise, and global translation.
  • Stability: MKA alignment remains stable with varying sample size $n$, dimension $d$, and translation, in contrast to CKA, which drifts without meticulous bandwidth tuning.
  • Representational Benchmarks: On tasks such as IN100 (vision, ResNet/VGG/ViT), MNLI (NLP, BERT/ALBERT), and Cora/Flickr/OGBN-Arxiv (graph neural nets), MKA matches or surpasses CKA and kCKA in ranking model similarity, correlating with test accuracy and output divergence, and detecting architectural or data-induced alignment.
  • Layer Correspondence: In multi-layer neural networks, MKA provides a refined view of inter-layer similarity that highlights manifold-level shifts, largely eliminating spurious block structures observed under classical CKA.

5. Application to Model Compression: Layer Merging in Transformers

A related development is the use of manifold geometry for layer merging in large transformer models, described as Manifold-Based Knowledge Alignment (also MKA) (Liu et al., 24 Jun 2024). Here, the guiding principle is that two transformer layers with activation statistics that lie on nearly the same low-dimensional manifold can be merged without incurring substantial accuracy loss.

The procedure is:

  • Obtain activation matrices $\mathbf{H}^l$ for each layer $l$ using a representative sample.
  • Apply diffusion-based embeddings (diffusion maps) to project activations into $\mathbb{R}^{n \times d'}$ ($d' \ll d$), isolating intrinsic geometry.
  • Compute the Normalized Pairwise Information Bottleneck (NPIB) between diffusion embeddings, measuring mutual information normalized by entropy.
  • Merge layer pairs with highest NPIB-similarity using adaptive convex combinations of parameters.
  • Iterate until the desired compression ratio or similarity threshold is reached.
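A heavily simplified sketch of this merging loop follows; `layer_similarity` stands in for NPIB computed on diffusion-map embeddings, each layer is represented as a dict of weight arrays, and the similarity-proportional mixing weight is an illustrative heuristic rather than the paper's exact adaptive rule.

```python
import numpy as np

def merge_most_similar(layers, layer_similarity):
    """Merge the most similar adjacent layer pair via a convex combination of weights."""
    scores = [layer_similarity(i, i + 1) for i in range(len(layers) - 1)]
    i = int(np.argmax(scores))                  # adjacent pair (i, i+1) with highest similarity
    w = float(np.clip(scores[i], 0.0, 1.0))     # mixing weight from the (normalized) similarity
    merged = {name: w * layers[i][name] + (1.0 - w) * layers[i + 1][name]
              for name in layers[i]}
    return layers[:i] + [merged] + layers[i + 2:]

# Iterate until the target depth (compression ratio) is reached:
# while len(layers) > target_num_layers:
#     layers = merge_most_similar(layers, layer_similarity)
```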

On Llama3-8B, MKA achieves a compression ratio of 43.75% (reducing from 32 to 18 layers) with only a 2.82% drop in MMLU accuracy; comparable pruning methods lose 20–40 points. When combined with quantization, accuracy remains above 61.7%—significantly higher than one-shot pruning baselines (Liu et al., 24 Jun 2024).

6. Strengths, Limitations, and Open Questions

MKA’s merits arise from its attention to local geometry:

  • Robustness: Insensitive to global scale or density, consistently capturing topological equivalence.
  • Computational Efficiency: Use of sparse $k$-NN kernels and approximate nearest neighbor techniques enables scalability to large $n$.
  • Single Hyperparameter: Only $k$ (the number of neighbors) must be set, eschewing the ambiguous global bandwidth parameter of RBF-CKA.
  • Versatility: Applicable to diverse domains (vision, NLP, graph data) and suited for both analytical comparison and downstream purposes such as model compression.

Limitations include:

  • Single-Scale Kernel: May miss multi-scale manifold features; multi-$k$ averages are a potential extension.
  • Distance Metric Dependence: Alignment quality relies on the appropriate choice of distance—Euclidean may be suboptimal for non-Euclidean manifolds.
  • Non-Mercer Kernels: The indefinite nature of $K_U$ implies that some standard kernel properties do not apply directly.
  • Compression Boundary: When used for transformer compression, manifold learning accuracy depends crucially on the representativeness of activation samples; misalignment can degrade performance, particularly for early layers encoding distinct functional content (Liu et al., 24 Jun 2024, Islam et al., 27 Oct 2025).

Open questions focus on the development of multi-scale variants, automated $k$ selection, robust distance metrics, and theoretical conditions for guaranteed alignment under manifold diffeomorphisms. Empirical demonstrations suggest that MKA, by capitalizing on manifold geometry, provides a wide-ranging framework for both accurate representational alignment and practical, information-preserving model compression.
