Local CKA-Based Retrieval
- Local CKA-based retrieval is a training-free method that uses CKA to measure instance-level semantic consistency between image and caption embeddings.
- It augments a small seed set of aligned pairs to compute local CKA scores, enabling fine-grained, cross-domain matching without relying on global dataset alignment.
- The technique demonstrates robust performance in zero-shot tasks and cross-lingual settings, though its computational complexity requires optimization for large-scale applications.
Local CKA-based retrieval is a training-free method for matching or retrieving semantically paired data, such as images and captions, across modalities using only a small seed set of aligned examples. The approach leverages Centered Kernel Alignment (CKA) as a statistical metric to quantify, for each query pair, how much augmenting a base graph of aligned embeddings with that pair perturbs the overall modality-invariant geometry. Unlike global methods, which assess the alignment of entire datasets, local CKA evaluates fine-grained, instance-level consistency, enabling robust matching and retrieval across domains, languages, and even zero-shot classification without additional training or explicit alignment (Maniparambil et al., 2024).
1. Foundations: Global CKA and its Limitations
Global CKA is a normalized similarity metric for assessing the agreement of pairwise geometries between two sets of representations $X \in \mathbb{R}^{n \times d_1}$ and $Y \in \mathbb{R}^{n \times d_2}$. Given kernels $K = k(X, X)$ and $L = l(Y, Y)$ (with $k$, $l$ commonly linear or RBF), both kernels are centered with the centering matrix $H = I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^{\top}$. The Hilbert–Schmidt Independence Criterion (HSIC) is

$$\mathrm{HSIC}(K, L) = \frac{1}{(n-1)^2}\operatorname{tr}(KHLH).$$

CKA is then defined as

$$\mathrm{CKA}(K, L) = \frac{\mathrm{HSIC}(K, L)}{\sqrt{\mathrm{HSIC}(K, K)\,\mathrm{HSIC}(L, L)}},$$

yielding a value in $[0, 1]$. Global CKA, however, is limited: it provides only a global measure of dataset-level alignment. It cannot evaluate the quality of individual image–caption pairs, rendering it insufficient for fine-grained retrieval scenarios (Maniparambil et al., 2024).
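As a concrete reference, global CKA can be computed in a few lines of NumPy. This sketch uses linear kernels; the $(n-1)^{-2}$ factors in HSIC cancel in the ratio, so they are omitted:

```python
import numpy as np

def cka(K, L):
    """Global CKA between two kernel matrices.

    The HSIC normalization factors cancel in the ratio, so we work
    directly with tr(KHLH) terms; for symmetric centered matrices,
    tr(AB) equals the elementwise-product sum.
    """
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    Kc, Lc = H @ K @ H, H @ L @ H
    hsic = lambda A, B: np.sum(A * B)     # tr(AB) for symmetric A, B
    return hsic(Kc, Lc) / np.sqrt(hsic(Kc, Kc) * hsic(Lc, Lc))
```

For linear kernels one calls `cka(X @ X.T, Y @ Y.T)`; the score is 1 for identical geometries and is invariant to orthogonal transformations and isotropic scaling of either representation.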
2. Localized CKA: Definition and Scoring Procedure
Localized CKA (localCKA) enables pairwise evaluation by measuring the compatibility of a single query image–caption pair with the geometry defined by a "base" set of aligned pairs:
- Base set: $\mathcal{B} = \{(x_i, y_i)\}_{i=1}^{b}$, with image embeddings $X_b = [x_1; \ldots; x_b] \in \mathbb{R}^{b \times d_x}$ and caption embeddings $Y_b = [y_1; \ldots; y_b] \in \mathbb{R}^{b \times d_y}$.
- Query: a candidate image embedding $x_q$ and caption embedding $y_q$.
The localCKA score is computed as:

$$\mathrm{localCKA}(x_q, y_q) = \mathrm{CKA}\big(k([X_b; x_q]),\; l([Y_b; y_q])\big),$$

where $[X_b; x_q]$ denotes row-wise concatenation. This augments the base image and caption features with the candidate pair and computes the global CKA over the resulting $b+1$ points. A high score indicates the candidate pair maintains the semantic structure of the seed graph, suggesting a correct image–caption match (Maniparambil et al., 2024).
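A minimal sketch of this scoring step with linear kernels (NumPy; the function names are illustrative, not taken from the paper's code):

```python
import numpy as np

def linear_cka(X, Y):
    """Global CKA with linear kernels; normalization factors cancel."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K, L = H @ (X @ X.T) @ H, H @ (Y @ Y.T) @ H
    return np.sum(K * L) / np.sqrt(np.sum(K * K) * np.sum(L * L))

def local_cka(Xb, Yb, x_q, y_q):
    """Append the candidate (image, caption) pair to the aligned base
    set and compute global CKA over the resulting b+1 points."""
    return linear_cka(np.vstack([Xb, x_q[None, :]]),
                      np.vstack([Yb, y_q[None, :]]))
```

On toy data where caption embeddings are an exact rotation of image embeddings, a correctly matched pair scores higher than a mismatched one, since it preserves the base kernel geometry.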
3. Retrieval and Matching Algorithms
There are two key paradigms: global (seeded QAP) and local (localCKA) matching.
- Seeded Quadratic Assignment Problem (QAP):
- Concatenate base and query sets: $X = [X_b; X_q]$, $Y = [Y_b; Y_q]$.
- Compute centered, rescaled kernels $\bar{K} = HKH$ and $\bar{L} = HLH$.
- Seek a permutation $P$ acting only on the query elements, i.e., block-diagonal $P = \mathrm{diag}(I_b, P_q)$, that maximizes $\operatorname{tr}(\bar{K} P \bar{L} P^{\top})$.
- Fast seeded QAP solvers can be used for efficient approximate solutions.
- LocalCKA-based Retrieval or Matching:
- For each query image $x_i$ and caption $y_j$, compute the score matrix entry $S_{ij} = \mathrm{localCKA}(x_i, y_j)$.
- The retrieval for image $x_i$ is $\hat{y}(x_i) = \arg\max_j S_{ij}$.
- For full matching, assemble $S$ and recover a global permutation via the Hungarian algorithm (linear-sum assignment) on $-S$.
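The localCKA retrieval and matching loop above can be sketched as follows, assuming linear kernels and SciPy's `linear_sum_assignment` for the Hungarian step (a naive quadratic-cost implementation; names are illustrative):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def linear_cka(X, Y):
    """Global CKA with linear kernels (normalization factors cancel)."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K, L = H @ (X @ X.T) @ H, H @ (Y @ Y.T) @ H
    return np.sum(K * L) / np.sqrt(np.sum(K * K) * np.sum(L * L))

def retrieve_and_match(Xb, Yb, Xq, Yq):
    """Score every query image against every query caption with localCKA,
    then (a) retrieve per-image argmax captions and (b) recover a global
    one-to-one matching via the Hungarian algorithm on -S."""
    S = np.empty((Xq.shape[0], Yq.shape[0]))
    for i in range(Xq.shape[0]):
        for j in range(Yq.shape[0]):
            S[i, j] = linear_cka(np.vstack([Xb, Xq[i:i + 1]]),
                                 np.vstack([Yb, Yq[j:j + 1]]))
    retrieval = S.argmax(axis=1)             # per-image top-1 caption
    _, matching = linear_sum_assignment(-S)  # maximize total score
    return S, retrieval, matching
```

Maximizing the total score is cast as minimizing $-S$, which `linear_sum_assignment` solves in polynomial time.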
For both, a carefully chosen seed set is essential; k-means clustering on embeddings is an effective selection strategy. Feature-wise variance normalization ("stretching") is reported to enhance performance across approaches (Maniparambil et al., 2024).
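One possible realization of this seed-selection heuristic, sketched with SciPy's `kmeans2`; the exact selection rule used here (the pair whose image embedding lies nearest each centroid) is an assumption for illustration, not necessarily the paper's rule:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def select_seeds(X, Y, b, seed=0):
    """Pick b diverse aligned seed pairs: cluster the image embeddings
    into b groups and take, for each centroid, the nearest pair."""
    centroids, _ = kmeans2(X, b, minit='++', seed=seed)
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    idx = d.argmin(axis=0)          # nearest point to each centroid
    return X[idx], Y[idx]

def stretch(Z, eps=1e-8):
    """Feature-wise variance normalization ('stretching')."""
    return (Z - Z.mean(axis=0)) / (Z.std(axis=0) + eps)
```

Applying `stretch` to both modalities before kernel computation equalizes per-feature scales, which is the normalization reported to improve both localCKA and QAP performance.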
4. Graph-Theoretic Interpretation and Seed Usage
Data from each modality (images or captions) is represented as a graph with nodes corresponding to base and query embeddings. Edges are weighted by centered kernel values. The seeds (the base set) are fixed one-to-one matchings between modalities, serving as anchors in the matching process.
For seeded QAP, the permutation aligns query nodes to maximize agreement of edge weights between graphs, subject to the seeds being fixed.
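SciPy's approximate QAP solver (`scipy.optimize.quadratic_assignment`, FAQ method) supports exactly this seeded setting through its `partial_match` option; a toy sketch under the assumption of linear kernels:

```python
import numpy as np
from scipy.optimize import quadratic_assignment

def seeded_qap_match(Xb, Yb, Xq, Yq):
    """Align query nodes across the two modality graphs, holding the b
    seed pairs fixed, by approximately maximizing tr(K P L P^T)."""
    b = Xb.shape[0]
    X, Y = np.vstack([Xb, Xq]), np.vstack([Yb, Yq])
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K, L = H @ (X @ X.T) @ H, H @ (Y @ Y.T) @ H      # centered kernels
    seeds = np.column_stack([np.arange(b), np.arange(b)])
    res = quadratic_assignment(K, L, method='faq',
                               options={'maximize': True,
                                        'partial_match': seeds})
    return res.col_ind   # node i of the image graph maps to col_ind[i]
```

With the seeds anchored, the solver only has to resolve the permutation over the query nodes, which is what makes the approximate solution both fast and accurate in practice.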
For localCKA, the approach does not attempt a global matching but evaluates each candidate pairing by measuring how much its introduction into the seed graph distorts inter-node relations, as quantified by CKA (Maniparambil et al., 2024).
5. Empirical Results and Benchmarking
Experiments span multiple vision and language encoders, modalities, and correspondence tasks:
| Task/Setting | Metric/Method | CLIP-ViT | DINOv2 | ConvNeXt | Baselines |
|---|---|---|---|---|---|
| COCO→NoCaps cross-domain matching | QAP@1 | 67% | 58% | — | Linear reg (∼50%), relative reps (∼45%) |
| COCO→NoCaps cross-domain retrieval | LocalCKA@5 | 60.5% | 61.8% | — | — |
| COCO val in-domain retrieval | LocalCKA@5 | ∼70% | ∼70% | — | relative reps (∼65%) |
| ImageNet-100 zero-shot classification | LocalCKA top-1 | — | 67.7% | 83.3% | CLIP (86.1%) |
| XTD-10 cross-lingual | LocalCKA@5 | — | — | — | CLIP-cos (0% non-Latin) |
Additional qualitative observations show that even incorrect LocalCKA retrievals tend to be semantically similar to the target caption. Notably, LocalCKA outperforms CLIP-cosine baselines in cross-lingual settings: CLIP-cosine collapses to 0% retrieval on non-Latin alphabets, while LocalCKA maintains substantially higher average retrieval accuracy (Maniparambil et al., 2024).
6. Complexity, Practicalities, and Limitations
Computational Complexity:
- Seeded QAP (fast approximate): FAQ-style solvers cost roughly $O(n^3)$ per iteration in the number of nodes $n$; reported runtimes are on the order of seconds on CPU at the evaluated query and seed sizes.
- Naive LocalCKA: $N^2$ candidate pairings for $N$ queries, each evaluating CKA on $(b+1) \times (b+1)$ kernels for a base set of size $b$; reported runtimes are on the order of minutes on GPU.
- Baselines: relative representations and linear regression are comparatively lightweight, requiring only a projection or a closed-form fit.
Practical Considerations:
- Seed set size and careful selection via k-means are critical for robust performance.
- Feature normalization ("stretching") substantially improves LocalCKA and QAP.
- LocalCKA’s computational overhead for large query sets can be mitigated by reusing precomputed HSIC terms and employing incremental updates, substantially improving scaling.
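One way such reuse can be realized (a sketch, not the paper's implementation): precompute the base Gram blocks and one centered augmented kernel per query, so each kernel is built once per query rather than once per pair, and form all $N \times M$ scores with a single tensor contraction:

```python
import numpy as np

def _centered_augmented(Zb, Zq, H):
    """One centered (b+1)x(b+1) kernel per query, reusing the base Gram."""
    b = Zb.shape[0]
    Gb = Zb @ Zb.T                      # base block, computed once
    cross = Zq @ Zb.T                   # all query-base rows, computed once
    self_sim = np.sum(Zq ** 2, axis=1)  # query self-similarities
    out = []
    for r, s in zip(cross, self_sim):
        G = np.empty((b + 1, b + 1))
        G[:b, :b] = Gb
        G[:b, b] = G[b, :b] = r
        G[b, b] = s
        out.append(H @ G @ H)
    return np.stack(out)

def local_cka_scores_fast(Xb, Yb, Xq, Yq):
    """All-pairs localCKA scores via tr(KL) contractions over the
    precomputed centered kernels (linear kernels assumed)."""
    b = Xb.shape[0]
    H = np.eye(b + 1) - np.ones((b + 1, b + 1)) / (b + 1)
    Ks = _centered_augmented(Xb, Xq, H)   # (N, b+1, b+1)
    Ls = _centered_augmented(Yb, Yq, H)   # (M, b+1, b+1)
    nk = np.sqrt(np.einsum('nij,nij->n', Ks, Ks))
    nl = np.sqrt(np.einsum('mij,mij->m', Ls, Ls))
    return np.einsum('nij,mij->nm', Ks, Ls) / np.outer(nk, nl)
```

The per-query kernels and their self-HSIC norms are computed in $O(N)$ kernel constructions instead of $O(N^2)$, leaving only the cheap pairwise contraction at quadratic cost.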
Limitations:
- LocalCKA’s quadratic complexity in the number of queries is prohibitive for large-scale settings without optimization.
- The choice of kernel (linear or RBF) and normalization strategy must be tuned per application.
- Method requires a (possibly unlabeled) base set of parallel image–caption pairs.
7. Extensions and Future Directions
The framework is extensible along several axes:
- Employing approximate CKA computation (e.g., via random Fourier features) to reduce local computation cost.
- Introducing a learnable neural “prompt” layered atop the query augmentation to further enhance cross-modal alignment.
- Substituting seeded QAP with other graph-matching relaxations, such as Gromov–Wasserstein, to improve global scalability and potentially adapt to even broader domains (Maniparambil et al., 2024).
In summary, local CKA-based retrieval augments a seed graph of aligned embeddings with candidate query pairs, scoring them by how little they distort the base structural geometry as measured by CKA. This training-free technique is robust across domains, languages, and image classification scenarios, and provides a modular alternative to supervised alignment for vision–language pairing and retrieval tasks (Maniparambil et al., 2024).