
Local CKA-Based Retrieval

Updated 4 February 2026
  • Local CKA-based retrieval is a training-free method that uses CKA to measure instance-level semantic consistency between image and caption embeddings.
  • It augments a small seed set of aligned pairs to compute local CKA scores, enabling fine-grained, cross-domain matching without relying on global dataset alignment.
  • The technique demonstrates robust performance in zero-shot tasks and cross-lingual settings, though its computational complexity requires optimization for large-scale applications.

Local CKA-based retrieval is a training-free method for matching or retrieving semantically paired data—such as images and captions—across modalities using only a small seed set of aligned examples. The approach leverages Centered Kernel Alignment (CKA) as a statistical metric to quantify, for each query pair, how much augmenting a base graph of aligned embeddings with that pair perturbs the overall modality-invariant geometry. Unlike global methods, which assess the alignment of entire datasets, local CKA evaluates fine-grained, instance-level consistency, enabling robust matching and retrieval across domains, languages, and even zero-shot classification without additional training or explicit alignment (Maniparambil et al., 2024).

1. Foundations: Global CKA and its Limitations

Global CKA is a normalized similarity metric for assessing the agreement of pairwise geometries between two sets of representations $X \in \mathbb{R}^{d_1 \times N}$ and $Y \in \mathbb{R}^{d_2 \times N}$. Given kernels $K = k(X^\top, X) \in \mathbb{R}^{N \times N}$ and $L = \ell(Y^\top, Y)$ (with $k, \ell$ commonly linear or RBF), both kernels are centered with the centering matrix $H = I_N - (1/N)\,\mathbf{1}_N \mathbf{1}_N^\top$. The Hilbert–Schmidt Independence Criterion (HSIC) is

$$\mathrm{HSIC}(K,L) = \frac{1}{(N-1)^2}\,\mathrm{Tr}(HKHL)$$

CKA is then defined as

$$\mathrm{CKA}(K,L) = \frac{\mathrm{HSIC}(K,L)}{\sqrt{\mathrm{HSIC}(K,K)\,\mathrm{HSIC}(L,L)}}$$

yielding a value in $[0, 1]$. Global CKA, however, is limited: it provides only a dataset-level measure of alignment. It cannot evaluate the quality of individual image–caption pairs, rendering it insufficient for fine-grained retrieval scenarios (Maniparambil et al., 2024).
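The definitions above translate directly into a few lines of linear algebra. The following is a minimal sketch with a linear kernel (NumPy, embeddings stored as columns as in the notation above), not the authors' implementation:

```python
import numpy as np

def centered_gram(X):
    """Double-centered linear Gram matrix of the N column-embeddings in X (d x N)."""
    K = X.T @ X                              # N x N linear kernel
    N = K.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N      # centering matrix H
    return H @ K @ H

def hsic(K, L):
    """Empirical HSIC between two already-centered N x N kernel matrices."""
    N = K.shape[0]
    return np.trace(K @ L) / (N - 1) ** 2    # Tr(HKHL)/(N-1)^2, H already applied

def cka(X, Y):
    """Global linear CKA between X (d1 x N) and Y (d2 x N); value in [0, 1]."""
    K, L = centered_gram(X), centered_gram(Y)
    return hsic(K, L) / np.sqrt(hsic(K, K) * hsic(L, L))
```

Because CKA is normalized, it is invariant to isotropic scaling and orthogonal transforms of either representation, which is what makes comparing embeddings from different encoders (or modalities) meaningful.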

2. Localized CKA: Definition and Scoring Procedure

Localized CKA (localCKA) enables pairwise evaluation by measuring the compatibility of a single query image–caption pair with the geometry defined by a "base" set of $M$ aligned pairs:

  • Base set: $B = \{(x_i^b, c_i^b)\}_{i=1}^M$, with $X^b \in \mathbb{R}^{d_1 \times M}$ and $C^b \in \mathbb{R}^{d_2 \times M}$.
  • Query: a given image $x_j^q$ and caption $c_k^q$.

The localCKA score is computed as:

$$\mathrm{localCKA}\bigl(x_j^q, c_k^q\bigr) = \mathrm{CKA}\Bigl(k\bigl([X^b, x_j^q]^\top, [X^b, x_j^q]\bigr),\; \ell\bigl([C^b, c_k^q]^\top, [C^b, c_k^q]\bigr)\Bigr)$$

This augments the base image and caption features with the candidate pair and computes the global CKA over the resulting $(M+1)$ points. A high score indicates the candidate pair maintains the semantic structure of the seed graph, suggesting a correct image–caption match (Maniparambil et al., 2024).
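The augmentation-and-score step can be sketched as follows (linear kernel, NumPy; `linear_cka` is an assumed helper standing in for any global CKA implementation, and the $1/(N-1)^2$ HSIC factors are dropped because they cancel in the CKA ratio):

```python
import numpy as np

def linear_cka(X, Y):
    """Global linear CKA between column-embedding matrices X (d1 x N), Y (d2 x N).
    The 1/(N-1)^2 HSIC normalizations cancel in the ratio, so they are omitted."""
    N = X.shape[1]
    H = np.eye(N) - np.ones((N, N)) / N
    K, L = H @ (X.T @ X) @ H, H @ (Y.T @ Y) @ H
    return np.trace(K @ L) / np.sqrt(np.trace(K @ K) * np.trace(L @ L))

def local_cka(Xb, Cb, x_q, c_q):
    """localCKA score of one candidate pair (x_q, c_q) against a base set.
    Xb: d1 x M base image embeddings; Cb: d2 x M base caption embeddings."""
    X_aug = np.column_stack([Xb, x_q])   # append query image as column M+1
    C_aug = np.column_stack([Cb, c_q])   # append query caption likewise
    return linear_cka(X_aug, C_aug)      # global CKA over the M+1 points
```

A pair consistent with the base geometry leaves the score near the base CKA, while a mismatched caption perturbs the inter-point relations and pulls the score down.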

3. Retrieval and Matching Algorithms

There are two key paradigms: global (seeded QAP) and local (localCKA) matching.

  • Seeded QAP matching:
    1. Concatenate base and query sets: $X = [X^b, X^q]$, $C = [C^b, C^q]$.
    2. Compute centered, rescaled kernels $\bar K_X$, $\bar K_C$.
    3. Seek the permutation $P$ on the $N$ query elements that, within the block-diagonal matrix $I_M \oplus P$, maximizes $\mathrm{Tr}\bigl((I_M \oplus P)^\top \bar K_X (I_M \oplus P) \bar K_C\bigr)$.
    4. Fast seeded QAP solvers can be used for an efficient approximate solution.
  • LocalCKA-based retrieval or matching:
    1. For each query image $x_i^q$ and caption $c_j^q$, compute

       $$K_i = k\bigl([X^b, x_i^q]^\top, [X^b, x_i^q]\bigr), \quad L_j = \ell\bigl([C^b, c_j^q]^\top, [C^b, c_j^q]\bigr)$$

       and then the score $R_{i,j} = \mathrm{CKA}(K_i, L_j)$.
    2. The retrieval for image $i$ is $\arg\max_j R_{i,j}$.
    3. For full $N \times N$ matching, assemble $R$ and recover a global permutation via the Hungarian algorithm (linear-sum assignment) on $-R$.

  • For both paradigms, a carefully chosen seed set is essential; k-means clustering on embeddings is an effective selection strategy. Feature-wise variance normalization ("stretching") is reported to enhance performance across approaches (Maniparambil et al., 2024).
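The localCKA branch above can be sketched end-to-end. In this sketch, `linear_cka` and `local_cka_match` are assumed names (not the paper's code), and SciPy's `linear_sum_assignment` plays the role of the Hungarian step:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def linear_cka(X, Y):
    """Global linear CKA between column-embedding matrices (normalization cancels)."""
    N = X.shape[1]
    H = np.eye(N) - np.ones((N, N)) / N
    K, L = H @ (X.T @ X) @ H, H @ (Y.T @ Y) @ H
    return np.trace(K @ L) / np.sqrt(np.trace(K @ K) * np.trace(L @ L))

def local_cka_match(Xb, Cb, Xq, Cq):
    """Score all query image-caption pairings with localCKA, then recover
    top-1 retrieval (argmax per row) and a global matching (Hungarian on -R)."""
    Nq = Xq.shape[1]
    R = np.empty((Nq, Nq))
    for i in range(Nq):
        Xa = np.column_stack([Xb, Xq[:, i]])      # base graph + query image i
        for j in range(Nq):
            Ca = np.column_stack([Cb, Cq[:, j]])  # base graph + query caption j
            R[i, j] = linear_cka(Xa, Ca)
    retrieval = R.argmax(axis=1)                  # per-image top-1 caption index
    _, perm = linear_sum_assignment(-R)           # permutation maximizing total CKA
    return R, retrieval, perm
```

Note the quadratic loop over query pairs; this is exactly the naive cost discussed in the complexity section, and in practice the base-set kernel blocks would be precomputed once and reused.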

4. Graph-Theoretic Interpretation and Seed Usage

Data from each modality (images or captions) is represented as a graph with $M+N$ nodes corresponding to base and query embeddings. Edges are weighted by centered kernel values. The $M$ seeds (the base set) are fixed one-to-one matchings between modalities, serving as anchors in the matching process.

  • For seeded QAP, the permutation aligns query nodes to maximize agreement of edge weights between graphs, subject to the seeds being fixed.

  • For localCKA, the approach does not attempt a global matching but evaluates each candidate pairing by measuring how much its introduction into the seed graph distorts inter-node relations, as quantified by CKA (Maniparambil et al., 2024).

5. Empirical Results and Benchmarking

Experiments span multiple vision and language encoders, modalities, and correspondence tasks:

| Task/Setting | Metric/Method | CLIP-ViT | DINOv2 | ConvNeXt | Baselines |
|---|---|---|---|---|---|
| COCO→NoCaps cross-domain matching | QAP@1 | 67% | 58% | — | Linear reg. (∼50%), rel. (∼45%) |
| COCO→NoCaps cross-domain retrieval | LocalCKA@5 | 60.5% | 61.8% | — | — |
| COCO val in-domain retrieval | LocalCKA@5 | ∼70% | ∼70% | — | rel. (∼65%) |
| ImageNet-100 zero-shot classification | LocalCKA top-1 | 67.7% | 83.3% | — | CLIP (86.1%) |
| XTD-10 cross-lingual | LocalCKA@5 | — | — | — | CLIP-cos (0% on non-Latin scripts) |

Additional qualitative observations show that even incorrect LocalCKA retrievals tend to be semantically similar to the target caption. Notably, LocalCKA outperforms CLIP-cosine baselines in cross-lingual settings: CLIP-cosine collapses to 0% retrieval on non-Latin alphabets, while LocalCKA achieves ∼58% on average, a +17% improvement over CLIP (Maniparambil et al., 2024).

6. Complexity, Practicalities, and Limitations

  • Computational Complexity:

    • Seeded QAP (fast approximate): $\mathcal{O}(N^3)$. For $N = 500$ queries and $M = 320$ seeds, computation takes ∼40 s on CPU.
    • Naive LocalCKA: $N^2$ pairings, each evaluating CKA on $(M+1) \times (M+1)$ kernels $\Rightarrow \mathcal{O}(N^2 M^3) \approx \mathcal{O}(N^4)$ if $M \sim N$; practical runtime is ∼5 min on GPU.
    • Baselines: relative ($\mathcal{O}(N^2)$), linear regression ($\mathcal{O}(N d)$).
  • Practical Considerations:
    • Seed set size (e.g., $M \sim 300$) and careful selection via k-means are critical for robust performance.
    • Feature normalization ("stretching") substantially improves LocalCKA and QAP.
    • LocalCKA’s computational overhead for large $N$ can be mitigated by reusing precomputed HSIC terms and employing incremental updates, yielding possible $\mathcal{O}(N^3)$ scaling.
  • Limitations:
    • LocalCKA’s $\mathcal{O}(N^4)$ complexity is prohibitive for large-scale settings without optimization.
    • The choice of kernel (linear or RBF) and normalization strategy must be tuned per application.
    • Method requires a (possibly unlabeled) base set of parallel image–caption pairs.
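Two of the practicalities above, k-means seed selection and feature "stretching", can be sketched in a few lines. This is a hypothetical NumPy-only sketch: `kmeans_seed_indices` uses a plain Lloyd iteration with deterministic farthest-point initialization as a stand-in for any k-means implementation, and both function names are assumptions.

```python
import numpy as np

def stretch(X, eps=1e-8):
    """Feature-wise variance normalization ('stretching'): unit std per dimension."""
    return X / (X.std(axis=1, keepdims=True) + eps)

def kmeans_seed_indices(X, M, n_iter=20):
    """Pick M diverse seed indices: run k-means on the columns of X and return
    the index of the embedding nearest each centroid."""
    data = X.T                                        # samples as rows
    # farthest-point init: deterministic and spreads centroids across the data
    centroids = [data[0]]
    for _ in range(M - 1):
        d = np.min([((data - c) ** 2).sum(-1) for c in centroids], axis=0)
        centroids.append(data[d.argmax()])
    centroids = np.array(centroids)
    for _ in range(n_iter):                           # Lloyd iterations
        d = ((data[:, None] - centroids[None]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for k in range(M):
            members = data[labels == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    d = ((data[:, None] - centroids[None]) ** 2).sum(-1)
    return np.unique(d.argmin(axis=0))                # nearest sample per centroid
```

Returning actual embedding indices (rather than centroids) matters here: the base set must consist of real aligned image–caption pairs, so each cluster is represented by its nearest genuine sample.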

7. Extensions and Future Directions

The framework is extensible along several axes:

  • Employing approximate CKA computation (e.g., via random Fourier features) to reduce local computation cost.
  • Introducing a learnable neural “prompt” layered atop the query augmentation to further enhance cross-modal alignment.
  • Substituting seeded QAP with other graph-matching relaxations, such as Gromov–Wasserstein, to improve global scalability and potentially adapt to even broader domains (Maniparambil et al., 2024).

In summary, local CKA-based retrieval augments a seed graph of aligned embeddings with candidate query pairs, scoring them by how little they distort the base structural geometry as measured by CKA. This training-free technique is robust across domains, languages, and image classification scenarios, and provides a modular alternative to supervised alignment for vision–language pairing and retrieval tasks (Maniparambil et al., 2024).
