
CKA-SR Regularization for Deep Learning

Updated 26 November 2025
  • CKA-SR Regularization is a spectral technique that uses centered kernel alignment to measure and control the similarity of internal feature maps in deep networks.
  • It enhances training by integrating a CKA-based loss, reducing mutual information and compressing representations to promote sparsity and efficiency.
  • The method has proven effective across sparse training, Bayesian ensembles, and federated learning, yielding improved accuracy and robustness.

CKA-SR Regularization is a family of spectral regularization techniques leveraging Centered Kernel Alignment (CKA) as a functional similarity metric for neural network representations. The core motivation is to directly intervene on the similarity structure of internal feature maps—across layers, models, or distributed learners—enabling improved sparsity, diversity, robustness, and transfer in deep learning. The most widely explored instantiations are CKA-SR for sparse training, for diversity in Bayesian ensembles, and for representation alignment in federated learning.

1. Mathematical Foundations of Centered Kernel Alignment

Centered Kernel Alignment (CKA) is a normalized similarity measure for comparing sets of representations. For feature maps $X \in \mathbb{R}^{n \times p_1}$ and $Y \in \mathbb{R}^{n \times p_2}$ (each row an example), CKA operates via Gram matrices:

  • $K_{ij} = k(x_i, x_j)$ and $L_{ij} = l(y_i, y_j)$.
  • Centering: for $K$, use $H = I - \frac{1}{n} \mathbf{1}\mathbf{1}^\top$, forming $\bar K = H K H$.
  • Hilbert–Schmidt Independence Criterion: $\mathrm{HSIC}(K, L) = \|Y^\top X\|_F^2$ for the linear kernel $k(u, v) = u^\top v$ (with $X$, $Y$ centered; normalization constants cancel in the CKA ratio).
  • The normalized CKA:

$$\mathrm{CKA}_{\text{Linear}}(X, Y) = \frac{\|Y^\top X\|_F^2}{\|X^\top X\|_F \, \|Y^\top Y\|_F}$$

This yields a value in $[0,1]$, reflecting representational similarity invariant to orthogonal transformations and isotropic scaling (Ni et al., 2023, Smerkous et al., 31 Oct 2024, Son et al., 2021).
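
A minimal PyTorch sketch of this linear-kernel CKA (the `linear_cka` name and the small stabilizing constant are illustrative choices, not taken from the cited papers):

```python
import torch

def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """Linear CKA between X (n x p1) and Y (n x p2); returns a scalar in [0, 1]."""
    # Column-centering the features is equivalent to the HKH centering of the Gram matrices.
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic_xy = (Y.T @ X).pow(2).sum()        # ||Y^T X||_F^2
    norm_x = torch.linalg.norm(X.T @ X)     # ||X^T X||_F
    norm_y = torch.linalg.norm(Y.T @ Y)     # ||Y^T Y||_F
    return hsic_xy / (norm_x * norm_y + 1e-12)
```

Because numerator and denominator use the same centered features, the value is unchanged under orthogonal transformations or isotropic rescaling of either representation.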

2. CKA-SR Loss Construction and Sparse Training

CKA-SR was introduced to induce sparsity by minimizing inter-layer feature similarity in deep networks (Ni et al., 2023). The regularizer is formulated as:

$$\mathcal L_{\mathcal C} = \sum_{s=1}^{S} \sum_{0 \leq i < j \leq N_s} w_{ij}\, \mathrm{CKA}_{\text{Linear}}(X_i, X_j)$$

where $X_i$ are layer features and $w_{ij}$ are layer-pair weights. The total loss is:

$$\mathcal L_{\rm total} = \mathcal L_{\mathcal E} + \beta\, \mathcal L_{\mathcal C}$$

Here, $\beta$ is a regularization parameter tuned in $[10^{-5}, 10^{-3}]$. Typical layer pairs include adjacent or within-stage layers, reducing computational overhead to $O(L)$ in network depth. CKA is estimated on a small “few-shot” subset ($m \approx 8$ samples per batch) for efficiency. CKA-SR can be combined with $L_0$ or $L_1$ weight regularization.
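
A hedged sketch of the regularizer over a chosen set of layer pairs, reusing the `linear_cka` sketch from Section 1; the layer-pair set, per-pair weights, and few-shot subsampling follow the description above, while the concrete names (`cka_sr_loss`, `acts`, `pairs`) are illustrative:

```python
from typing import Dict, List, Optional, Tuple
import torch

def cka_sr_loss(acts: Dict[int, torch.Tensor],
                pairs: List[Tuple[int, int]],
                weights: Optional[Dict[Tuple[int, int], float]] = None,
                m: int = 8) -> torch.Tensor:
    """Weighted sum of linear-CKA similarities over the selected layer pairs."""
    total = acts[pairs[0][0]].new_zeros(())
    for (i, j) in pairs:
        Xi = acts[i].flatten(start_dim=1)[:m]   # few-shot subset (m << batch size)
        Xj = acts[j].flatten(start_dim=1)[:m]
        w = 1.0 if weights is None else weights[(i, j)]
        total = total + w * linear_cka(Xi, Xj)  # linear_cka from the sketch in Section 1
    return total

# Total objective (beta typically in [1e-5, 1e-3]):
# loss = task_loss + beta * cka_sr_loss(acts, pairs)
```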

3. Information Bottleneck Interpretation

The adoption of CKA-SR in sparse learning is theoretically grounded in the information bottleneck principle (Ni et al., 2023):

  • For Gaussian features, minimizing inter-layer CKA aligns with minimizing the mutual information $I(X; \hat X)$.
  • Reducing $I(X; \hat X)$ (compression) enforces sparsity by driving down $\|W\|_F^2$ for the layer’s weight matrix $W$, maximizing the number of zeros (“$\epsilon$-sparsity”).
  • The established chain is:

$$\min \mathcal L_{\mathcal C} \;\Rightarrow\; \min I(X; \hat X) \;\Rightarrow\; \min \|W\|_F^2 \;\Rightarrow\; \text{increased sparsity of } W$$

This direct connection enables layer-wise control of both compressive representations and parameter sparsity.

4. CKA-SR Extensions: FedCKA and Hyperspherical Repulsion

CKA-based regularization extends to federated and Bayesian deep learning scenarios.

FedCKA: Federated Learning on Heterogeneous Data

FedCKA applies CKA regularization to federated setups, aligning the representations of selected “important” layers (typically the first two convolutional layers or initial blocks) between local models and the global server model (Son et al., 2021). The local objective combines the supervised loss with a CKA-based contrastive term:

  • For layer $n \in S$, the key quantity is:

$$\ell_{\mathrm{cka}, n}(x) = -\log\left(\frac{\exp \mathrm{CKA}(a^{t}_{l_i, n}, a^{t}_{g, n})}{\exp \mathrm{CKA}(a^{t}_{l_i, n}, a^{t}_{g, n}) + \exp \mathrm{CKA}(a^{t}_{l_i, n}, a^{t-1}_{l_i, n})}\right)$$

Regularization is restricted to early layers, avoiding constraints on task-specialized deeper layers and preserving scalability and efficiency.
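
A hedged sketch of this per-layer contrastive term: activations of the current local model are pulled toward the global model and pushed away from the previous-round local model. It reuses the `linear_cka` sketch from Section 1; tensor names are illustrative.

```python
import torch

def fedcka_layer_loss(a_local_t: torch.Tensor,
                      a_global_t: torch.Tensor,
                      a_local_prev: torch.Tensor) -> torch.Tensor:
    """Contrastive CKA term for one regularized layer n in S."""
    pos = linear_cka(a_local_t, a_global_t)    # similarity to the global model (attract)
    neg = linear_cka(a_local_t, a_local_prev)  # similarity to the previous local model (repel)
    return -torch.log(torch.exp(pos) / (torch.exp(pos) + torch.exp(neg)))
```

The local objective adds this term, summed over the selected early layers, to the supervised loss.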

CKA-SR with Hyperspherical Energy in Bayesian Ensembles

Smerkous et al. introduce hyperspherical energy (HE) minimization over centered, CKA-style Gram vectors to enforce diversity among particle-based Bayesian deep ensembles (Smerkous et al., 31 Oct 2024), as sketched after the list below:

  • Gram matrices of layer features are centered, flattened, and normalized to sit on the unit sphere.
  • HE is computed using geodesic (arccosine) distance, yielding an energy sum over all model pairs and layers:

$$E_{HE} = \frac{1}{L\, M (M-1)} \sum_{l=1}^{L} \sum_{m \neq m'} (d_{mm'})^{-s}$$

where $d_{mm'} = \arccos(\bar K_l^m \cdot \bar K_l^{m'})$.

  • Gradients are non-vanishing even as models become similar, ensuring repulsive diversity when standard cosine repulsion may fail.
  • The framework also incorporates OOD-specific HE and entropy terms to penalize overconfident predictions on synthetic outliers.
  • Typical hyperparameters: $M = 5$–$10$ (ensemble size), $s = 2$, $\gamma_{\text{in}} \in [0.25, 1.5]$, $\gamma_{\text{ood}} \in [0.5, 2.0]$, $\beta \in [0.01, 10]$.
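
A hedged sketch of the per-layer HE term over centered Gram vectors; the `eps` stabilizer plays the role of the smoothers mentioned in the next section, and names and default values are illustrative:

```python
from typing import List
import torch

def layer_hyperspherical_energy(feats: List[torch.Tensor],
                                s: float = 2.0,
                                eps: float = 1e-6) -> torch.Tensor:
    """HE repulsion for one layer, given activations of M ensemble members on a shared batch."""
    grams = []
    for X in feats:
        Xc = X - X.mean(dim=0, keepdim=True)    # center (equivalent to HKH)
        v = (Xc @ Xc.T).flatten()               # centered Gram matrix, flattened
        grams.append(v / (v.norm() + eps))      # project onto the unit sphere
    M = len(grams)
    energy = grams[0].new_zeros(())
    for a in range(M):
        for b in range(M):
            if a == b:
                continue
            cos_ab = torch.clamp(grams[a] @ grams[b], -1.0 + eps, 1.0 - eps)
            d = torch.arccos(cos_ab)             # geodesic distance on the sphere
            energy = energy + (d + eps).pow(-s)  # inverse-power repulsion
    return energy / (M * (M - 1))
```

Averaging this quantity over layers (the $1/L$ factor in the formula above) gives $E_{HE}$.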

5. Computational Strategies and Algorithmic Procedures

CKA-SR methods maintain scalability via several implementation tactics:

  • Use linear CKA (Frobenius-norm based) for efficient computation: $O(m (p_i p_j + p_i^2 + p_j^2))$ for $m$ samples and layer pair $(i, j)$.
  • Restrict to intra-stage or adjacent pairs of layers, keeping regularization terms linear in depth.
  • In federated learning (FedCKA), only selected layers are regularized, with overall per-round overhead ≈17% relative to FedAvg on deep models.
  • In Bayesian ensemble settings, gradient stability is maintained via smoothers ($\epsilon_{\text{arc}}$, $\epsilon_{\text{dist}}$) and weighted layer contributions.


The CKA-SR sparse-training procedure (Ni et al., 2023) can be summarized as:

Given: network f with L layers, loss ℒ_E, reg-weight β, set of layer indices 𝒫={(i,j)}, few-shot sample size m.
Initialize weights W.
for each training step do
  (1) Sample mini-batch of size B.
  (2) Randomly choose m ≪ B examples to compute CKA.
  (3) Forward pass to obtain activations {X₀,…,X_L}.
  (4) Compute ℒ_E on full batch.
  (5) For each (i,j) ∈ 𝒫:
        extract sub-activations X_i ∈ ℝ^{m×p_i}, X_j ∈ ℝ^{m×p_j}
        compute CKA_{Linear}(X_i,X_j).
      Sum ℒ_C = ∑_{(i,j)} CKA_{Linear}(X_i,X_j).
  (6) Total loss ℒ = ℒ_E + β·ℒ_C.
  (7) Backpropagate and update W.
Equivalent pseudocode is detailed for FedCKA (Son et al., 2021) and ensemble HE-minimization (Smerkous et al., 31 Oct 2024).
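
As a hedged PyTorch illustration of the loop above, using forward hooks to collect activations and the `cka_sr_loss` sketch from Section 2 (the model, criterion, and optimizer are placeholders):

```python
import torch

def cka_sr_train_step(model, batch, criterion, optimizer, pairs, beta=1e-4, m=8):
    """One training step following the pseudocode above (names are illustrative)."""
    inputs, targets = batch
    acts, hooks = {}, []
    # (3) Register hooks so the forward pass records per-layer activations.
    for idx, layer in enumerate(model.children()):
        hooks.append(layer.register_forward_hook(
            lambda mod, inp, out, idx=idx: acts.__setitem__(idx, out)))
    logits = model(inputs)
    for h in hooks:
        h.remove()
    loss_e = criterion(logits, targets)         # (4) task loss on the full batch
    loss_c = cka_sr_loss(acts, pairs, m=m)      # (5) CKA term on the m-shot subset
    loss = loss_e + beta * loss_c               # (6) total objective
    optimizer.zero_grad()
    loss.backward()                             # (7) backpropagate and update
    optimizer.step()
    return loss.detach()
```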

6. Empirical Results and Application Domains

CKA-SR regularization yields quantifiable improvements in multiple regimes.

On CIFAR-100 (ResNet32/20), accuracy improvements are consistent at high sparsity (columns give the weight-sparsity level; entries are top-1 accuracy in %):

| Method | 70% | 85% | 90% | 95% | 98% | 99.8% |
|------------|-------|-------|-------|-------|-------|-------|
| LTH | 72.28 | 70.64 | 69.63 | 66.48 | 60.22 | × |
| LTH+CKA-SR | 72.67 | 71.90 | 70.11 | 67.07 | 60.36 | × |

Combinations with filter/channel pruning show smaller accuracy drops for pruned models pretrained with CKA-SR.

HE minimization over CKA kernels offers improved uncertainty quantification and OOD detection. Example (CIFAR-10 vs SVHN, ResNet32, M=10):

  • SVGD+RBF: AUROC_PE ≈ 82.5%
  • SVGD+HE: AUROC_PE ≈ 89.2%
  • Ensemble+OOD_HE: AUROC_PE ≈ 96.5%, MI ≈ 96.6%

FedCKA achieves consistently higher accuracy versus prior state-of-the-art on CIFAR-10, CIFAR-100, and Tiny-ImageNet under heterogeneity, with a modest computational overhead.

7. Best Practices and Implementation Insights

  • In sparse training, CKA-SR is most effective when combined with standard $L_0$ or $L_1$ penalties; $\beta \in [10^{-5}, 10^{-3}]$ is typical, and an overly large $\beta$ may degrade useful representational flow (see the sketch after this list).
  • Apply CKA regularization only to layers where feature similarity is beneficial (early convolutional blocks in federated learning).
  • In ensemble settings, use hyperspherical repulsion for stable diversity, and separate inlier/OOD terms for comprehensive calibration.
  • CKA-SR requires only intermediate feature maps and is “plug-and-play” with standard training flows; no modifications to pruning schedules or communication protocols are necessary.
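
For the first bullet above, a minimal sketch of combining the CKA-SR term with an $L_1$ weight penalty; the `lambda_l1` value is an illustrative assumption:

```python
import torch

def total_loss(loss_e, loss_c, model, beta=1e-4, lambda_l1=1e-5):
    """L_total = L_E + beta * L_C + lambda * sum |W| (illustrative coefficients)."""
    l1 = sum(p.abs().sum() for p in model.parameters())
    return loss_e + beta * loss_c + lambda_l1 * l1
```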

CKA-SR regularization systematically broadens representational diversity and compressive efficiency in deep models, with theoretical guarantees via the information bottleneck, empirical robustness across training paradigms, and pragmatic guidelines for scalable implementation (Ni et al., 2023, Smerkous et al., 31 Oct 2024, Son et al., 2021).
