Local CKA-Based Retrieval
- Local CKA-based retrieval is a training-free method that uses CKA to measure instance-level semantic consistency between image and caption embeddings.
- It augments a small seed set of aligned pairs to compute local CKA scores, enabling fine-grained, cross-domain matching without relying on global dataset alignment.
- The technique demonstrates robust performance in zero-shot tasks and cross-lingual settings, though its computational complexity requires optimization for large-scale applications.
Local CKA-based retrieval is a training-free method for matching or retrieving semantically paired data, such as images and captions, across modalities using only a small seed set of aligned examples. The approach leverages Centered Kernel Alignment (CKA) as a statistical metric to quantify, for each query pair, how much augmenting a base graph of aligned embeddings with that pair perturbs the overall modality-invariant geometry. Unlike global methods, which assess the alignment of entire datasets, local CKA evaluates fine-grained, instance-level consistency, enabling robust matching and retrieval across domains, languages, and even zero-shot classification without additional training or explicit alignment (Maniparambil et al., 2024).
1. Foundations: Global CKA and its Limitations
Global CKA is a normalized similarity metric for assessing the agreement of pairwise geometries between two sets of representations $X \in \mathbb{R}^{n \times d_1}$ and $Y \in \mathbb{R}^{n \times d_2}$. Given kernels $K = k(X, X)$ and $L = l(Y, Y)$ (with $k$, $l$ commonly linear or RBF), both kernels are centered with the centering matrix $H = I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^{\top}$. The Hilbert–Schmidt Independence Criterion (HSIC) is

$$\mathrm{HSIC}(K, L) = \frac{1}{(n-1)^2}\operatorname{tr}(KHLH).$$

CKA is then defined as

$$\mathrm{CKA}(K, L) = \frac{\mathrm{HSIC}(K, L)}{\sqrt{\mathrm{HSIC}(K, K)\,\mathrm{HSIC}(L, L)}},$$

yielding a value in $[0, 1]$. Global CKA, however, is limited: it provides only a global measure of dataset-level alignment. It cannot evaluate the quality of individual image–caption pairs, rendering it insufficient for fine-grained retrieval scenarios (Maniparambil et al., 2024).
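As a concrete reference, global CKA can be computed in a few lines of NumPy. This sketch uses linear kernels; the $(n-1)^{-2}$ factors in HSIC cancel in the ratio, so they are omitted:

```python
import numpy as np

def cka(K, L):
    """Global CKA between two kernel matrices.

    The HSIC normalization factors cancel in the ratio, so we work
    directly with tr(KHLH) terms; for symmetric centered matrices,
    tr(AB) equals the elementwise-product sum.
    """
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    Kc, Lc = H @ K @ H, H @ L @ H
    hsic = lambda A, B: np.sum(A * B)     # tr(AB) for symmetric A, B
    return hsic(Kc, Lc) / np.sqrt(hsic(Kc, Kc) * hsic(Lc, Lc))
```

For linear kernels one calls `cka(X @ X.T, Y @ Y.T)`; the score is 1 for identical geometries and is invariant to orthogonal transformations and isotropic scaling of either representation.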
2. Localized CKA: Definition and Scoring Procedure
Localized CKA (localCKA) enables pairwise evaluation by measuring the compatibility of a single query image–caption pair with the geometry defined by a "base" set of aligned pairs:
- Base set: $\mathcal{B} = \{(x_i, y_i)\}_{i=1}^{b}$, with image embeddings $X_b = [x_1; \ldots; x_b] \in \mathbb{R}^{b \times d_x}$ and caption embeddings $Y_b = [y_1; \ldots; y_b] \in \mathbb{R}^{b \times d_y}$.
- Query: a candidate image embedding $x_q$ and caption embedding $y_q$.
The localCKA score is computed as:

$$\mathrm{localCKA}(x_q, y_q) = \mathrm{CKA}\big(k([X_b; x_q]),\; l([Y_b; y_q])\big),$$

where $[X_b; x_q]$ denotes row-wise concatenation. This augments the base image and caption features with the candidate pair and computes the global CKA over the resulting $b+1$ points. A high score indicates the candidate pair maintains the semantic structure of the seed graph, suggesting a correct image–caption match (Maniparambil et al., 2024).
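A minimal sketch of this scoring step with linear kernels (NumPy; the function names are illustrative, not taken from the paper's code):

```python
import numpy as np

def linear_cka(X, Y):
    """Global CKA with linear kernels; normalization factors cancel."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K, L = H @ (X @ X.T) @ H, H @ (Y @ Y.T) @ H
    return np.sum(K * L) / np.sqrt(np.sum(K * K) * np.sum(L * L))

def local_cka(Xb, Yb, x_q, y_q):
    """Append the candidate (image, caption) pair to the aligned base
    set and compute global CKA over the resulting b+1 points."""
    return linear_cka(np.vstack([Xb, x_q[None, :]]),
                      np.vstack([Yb, y_q[None, :]]))
```

On toy data where caption embeddings are an exact rotation of image embeddings, a correctly matched pair scores higher than a mismatched one, since it preserves the base kernel geometry.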
3. Retrieval and Matching Algorithms
There are two key paradigms: global (seeded QAP) and local (localCKA) matching.
- Seeded Quadratic Assignment Problem (QAP):
- Concatenate base and query sets: $X = [X_b; X_q]$, $Y = [Y_b; Y_q]$.
- Compute centered, rescaled kernels $\bar{K} = HKH$ and $\bar{L} = HLH$.
- Seek a permutation $P$ acting only on the query elements, i.e., block-diagonal $P = \mathrm{diag}(I_b, P_q)$, that maximizes $\operatorname{tr}(\bar{K} P \bar{L} P^{\top})$.
- Fast seeded QAP solvers can be used for efficient approximate solutions.
- LocalCKA-based Retrieval or Matching:
- For each query image $x_i$ and caption $y_j$, compute the score matrix entry $S_{ij} = \mathrm{localCKA}(x_i, y_j)$.
- The retrieval for image $x_i$ is $\hat{y}(x_i) = \arg\max_j S_{ij}$.
- For full matching, assemble $S$ and recover a global permutation via the Hungarian algorithm (linear-sum assignment) on $-S$.
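The localCKA retrieval and matching loop above can be sketched as follows, assuming linear kernels and SciPy's `linear_sum_assignment` for the Hungarian step (a naive quadratic-cost implementation; names are illustrative):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def linear_cka(X, Y):
    """Global CKA with linear kernels (normalization factors cancel)."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K, L = H @ (X @ X.T) @ H, H @ (Y @ Y.T) @ H
    return np.sum(K * L) / np.sqrt(np.sum(K * K) * np.sum(L * L))

def retrieve_and_match(Xb, Yb, Xq, Yq):
    """Score every query image against every query caption with localCKA,
    then (a) retrieve per-image argmax captions and (b) recover a global
    one-to-one matching via the Hungarian algorithm on -S."""
    S = np.empty((Xq.shape[0], Yq.shape[0]))
    for i in range(Xq.shape[0]):
        for j in range(Yq.shape[0]):
            S[i, j] = linear_cka(np.vstack([Xb, Xq[i:i + 1]]),
                                 np.vstack([Yb, Yq[j:j + 1]]))
    retrieval = S.argmax(axis=1)             # per-image top-1 caption
    _, matching = linear_sum_assignment(-S)  # maximize total score
    return S, retrieval, matching
```

Maximizing the total score is cast as minimizing $-S$, which `linear_sum_assignment` solves in polynomial time.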
For both, a carefully chosen seed set is essential; k-means clustering on embeddings is an effective selection strategy. Feature-wise variance normalization ("stretching") is reported to enhance performance across approaches (Maniparambil et al., 2024).
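One possible realization of this seed-selection heuristic, sketched with SciPy's `kmeans2`; the exact selection rule used here (the pair whose image embedding lies nearest each centroid) is an assumption for illustration, not necessarily the paper's rule:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def select_seeds(X, Y, b, seed=0):
    """Pick b diverse aligned seed pairs: cluster the image embeddings
    into b groups and take, for each centroid, the nearest pair."""
    centroids, _ = kmeans2(X, b, minit='++', seed=seed)
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    idx = d.argmin(axis=0)          # nearest point to each centroid
    return X[idx], Y[idx]

def stretch(Z, eps=1e-8):
    """Feature-wise variance normalization ('stretching')."""
    return (Z - Z.mean(axis=0)) / (Z.std(axis=0) + eps)
```

Applying `stretch` to both modalities before kernel computation equalizes per-feature scales, which is the normalization reported to improve both localCKA and QAP performance.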
4. Graph-Theoretic Interpretation and Seed Usage
Data from each modality (images or captions) is represented as a graph with nodes corresponding to base and query embeddings. Edges are weighted by centered kernel values. The seeds (the base set) are fixed one-to-one matchings between modalities, serving as anchors in the matching process.
For seeded QAP, the permutation aligns query nodes to maximize agreement of edge weights between graphs, subject to the seeds being fixed.
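SciPy's approximate QAP solver (`scipy.optimize.quadratic_assignment`, FAQ method) supports exactly this seeded setting through its `partial_match` option; a toy sketch under the assumption of linear kernels:

```python
import numpy as np
from scipy.optimize import quadratic_assignment

def seeded_qap_match(Xb, Yb, Xq, Yq):
    """Align query nodes across the two modality graphs, holding the b
    seed pairs fixed, by approximately maximizing tr(K P L P^T)."""
    b = Xb.shape[0]
    X, Y = np.vstack([Xb, Xq]), np.vstack([Yb, Yq])
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K, L = H @ (X @ X.T) @ H, H @ (Y @ Y.T) @ H      # centered kernels
    seeds = np.column_stack([np.arange(b), np.arange(b)])
    res = quadratic_assignment(K, L, method='faq',
                               options={'maximize': True,
                                        'partial_match': seeds})
    return res.col_ind   # node i of the image graph maps to col_ind[i]
```

With the seeds anchored, the solver only has to resolve the permutation over the query nodes, which is what makes the approximate solution both fast and accurate in practice.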
For localCKA, the approach does not attempt a global matching but evaluates each candidate pairing by measuring how much its introduction into the seed graph distorts inter-node relations, as quantified by CKA (Maniparambil et al., 2024).
5. Empirical Results and Benchmarking
Experiments span multiple vision and language encoders, modalities, and correspondence tasks:
| Task/Setting | Metric/Method | CLIP-ViT | DINOv2 | ConvNeXt | Baselines |
|---|---|---|---|---|---|
| COCO→NoCaps cross-domain matching | QAP@1 | 67% | 58% | — | Linear reg (∼50%), relative reps (∼45%) |
| COCO→NoCaps cross-domain retrieval | LocalCKA@5 | 60.5% | 61.8% | — | — |
| COCO val in-domain retrieval | LocalCKA@5 | ∼70% | ∼70% | — | relative reps (∼65%) |
| ImageNet-100 zero-shot classification | LocalCKA top-1 | — | 67.7% | 83.3% | CLIP (86.1%) |
| XTD-10 cross-lingual | LocalCKA@5 | — | — | — | CLIP-cos (0% non-Latin) |
Additional qualitative observations show that even incorrect LocalCKA retrievals tend to be semantically similar to the target caption. Notably, LocalCKA outperforms CLIP-cosine baselines in cross-lingual settings: CLIP-cosine collapses to 0% retrieval on non-Latin alphabets, while LocalCKA maintains substantially higher average retrieval accuracy (Maniparambil et al., 2024).
6. Complexity, Practicalities, and Limitations
Computational Complexity:
- Seeded QAP (fast approximate): FAQ-style solvers cost roughly $O(n^3)$ per iteration in the number of nodes $n$; reported runtimes are on the order of seconds on CPU at the evaluated query and seed sizes.
- Naive LocalCKA: $N^2$ candidate pairings for $N$ queries, each evaluating CKA on $(b+1) \times (b+1)$ kernels for a base set of size $b$; reported runtimes are on the order of minutes on GPU.
- Baselines: relative representations and linear regression are comparatively lightweight, requiring only a projection or a closed-form fit.
Practical Considerations:
- Seed set size and careful selection via k-means are critical for robust performance.
- Feature normalization ("stretching") substantially improves LocalCKA and QAP.
- LocalCKA’s computational overhead for large query sets can be mitigated by reusing precomputed HSIC terms and employing incremental updates, substantially improving scaling.
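One way such reuse can be realized (a sketch, not the paper's implementation): precompute the base Gram blocks and one centered augmented kernel per query, so each kernel is built once per query rather than once per pair, and form all $N \times M$ scores with a single tensor contraction:

```python
import numpy as np

def _centered_augmented(Zb, Zq, H):
    """One centered (b+1)x(b+1) kernel per query, reusing the base Gram."""
    b = Zb.shape[0]
    Gb = Zb @ Zb.T                      # base block, computed once
    cross = Zq @ Zb.T                   # all query-base rows, computed once
    self_sim = np.sum(Zq ** 2, axis=1)  # query self-similarities
    out = []
    for r, s in zip(cross, self_sim):
        G = np.empty((b + 1, b + 1))
        G[:b, :b] = Gb
        G[:b, b] = G[b, :b] = r
        G[b, b] = s
        out.append(H @ G @ H)
    return np.stack(out)

def local_cka_scores_fast(Xb, Yb, Xq, Yq):
    """All-pairs localCKA scores via tr(KL) contractions over the
    precomputed centered kernels (linear kernels assumed)."""
    b = Xb.shape[0]
    H = np.eye(b + 1) - np.ones((b + 1, b + 1)) / (b + 1)
    Ks = _centered_augmented(Xb, Xq, H)   # (N, b+1, b+1)
    Ls = _centered_augmented(Yb, Yq, H)   # (M, b+1, b+1)
    nk = np.sqrt(np.einsum('nij,nij->n', Ks, Ks))
    nl = np.sqrt(np.einsum('mij,mij->m', Ls, Ls))
    return np.einsum('nij,mij->nm', Ks, Ls) / np.outer(nk, nl)
```

The per-query kernels and their self-HSIC norms are computed in $O(N)$ kernel constructions instead of $O(N^2)$, leaving only the cheap pairwise contraction at quadratic cost.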
Limitations:
- LocalCKA’s quadratic complexity in the number of queries is prohibitive for large-scale settings without optimization.
- The choice of kernel (linear or RBF) and normalization strategy must be tuned per application.
- Method requires a (possibly unlabeled) base set of parallel image–caption pairs.
7. Extensions and Future Directions
The framework is extensible along several axes:
- Employing approximate CKA computation (e.g., via random Fourier features) to reduce local computation cost.
- Introducing a learnable neural “prompt” layered atop the query augmentation to further enhance cross-modal alignment.
- Substituting seeded QAP with other graph-matching relaxations, such as Gromov–Wasserstein, to improve global scalability and potentially adapt to even broader domains (Maniparambil et al., 2024).
In summary, local CKA-based retrieval augments a seed graph of aligned embeddings with candidate query pairs, scoring them by how little they distort the base structural geometry as measured by CKA. This training-free technique is robust across domains, languages, and image classification scenarios, and provides a modular alternative to supervised alignment for vision–language pairing and retrieval tasks (Maniparambil et al., 2024).