ReSi Benchmark Overview

Updated 6 December 2025
  • The ReSi Benchmark is a comprehensive framework that compares neural network representations using prediction- and design-grounded tests.
  • It systematically ranks 23 similarity measures through six diverse evaluations, showcasing metric reliability and performance trade-offs across graph, language, and vision domains.
  • The platform supports reproducible research with open-source code and extensible protocols, enabling detailed analysis of architectures and dataset-specific behaviors.

The Representational Similarity (ReSi) Benchmark provides a comprehensive evaluation framework for comparing internal representations learned by neural architectures across domains. By systematically grounding the notion of similarity in both prediction-based and design-based conditions, ReSi offers rigorous protocols for ranking 23 state-of-the-art representational similarity measures across six well-specified tests. This extensible platform enables reproducible research in graph, language, and vision domains while highlighting strengths and weaknesses of popular methods under diverse scenarios (Klabunde et al., 2024).

1. Motivation and Conceptual Goals

Comparing neural network representations is foundational to interpretability, transfer learning, and architecture analysis. Prior work introduced numerous similarity measures—spanning canonical correlation (SVCCA, PWCCA), kernel methods (CKA), neighborhood-based metrics (2nd-Cos, Jaccard), and topological distances (IMD)—yet lacked a unified benchmark to establish ground truth expectations, diverse model coverage, and robust, reproducible protocols. ReSi fills this gap by:

  • Grounding representational similarity in practically checkable contexts, distinguishing cases where two feature sets should be deemed similar versus dissimilar.
  • Ranking similarity measures using six meticulously constructed tests, spanning prediction accuracy, output divergence, label permutation, shortcut exploitation, data augmentation, and intranetwork layer monotonicity.
  • Enabling extensibility across domains and architectures and providing reproducible code for all evaluation protocols.

This approach systematically exposes which measures are consistent with theoretically grounded criteria, clarifying the circumstances under which a metric is reliable or prone to failure.

2. Six Benchmarking Tests

ReSi divides its evaluation suite into prediction-grounded and design-grounded tests. All are based on extracting the final inner-layer representations, denoted $R\in\mathbb{R}^{N\times D}$, from trained models $f$ on a fixed test set $\bm X$.

| Test Name | Grounding | Metric(s) |
| --- | --- | --- |
| Correlation to Accuracy Difference | Prediction | Spearman $\rho$ |
| Correlation to Output Difference | Prediction | Spearman $\rho$ |
| Label Randomization | Design | AUPRC, conformity rate |
| Shortcut Features | Design | AUPRC, conformity rate |
| Data Augmentation | Design | AUPRC, conformity rate |
| Layer Monotonicity | Design | Conformity rate, Spearman |

Test 1: Correlation to Accuracy Difference

If two models' test accuracies differ by $\Delta_{\mathrm{acc}}$, their representations $R, R'$ should reflect this; the test quantifies it via the Spearman correlation between the measure $m(R, R')$ and the accuracy gap.
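
As a minimal sketch of this test, the snippet below computes a Spearman correlation between per-pair dissimilarity values and per-pair accuracy gaps. The numbers are hypothetical, and `average_ranks` and `spearman_rho` are illustrative helpers rather than functions from the ReSi codebase. It assumes the measure is expressed as a dissimilarity, so a positive correlation is the desired outcome; for a similarity measure the sign would flip.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical values for four model pairs: one dissimilarity score per pair
# and the absolute test-accuracy gap of the same pair.
dissimilarity = [0.10, 0.35, 0.20, 0.60]
accuracy_gap = [0.01, 0.04, 0.02, 0.09]
rho = spearman_rho(dissimilarity, accuracy_gap)
```

Because both lists are ordered identically in this toy example, the rank correlation is perfect; real measures would typically land somewhere between the extremes.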

Test 2: Correlation to Output Difference

Models with identical accuracy can disagree on individual predictions. This test computes score matrices and disagreement metrics, evaluating Spearman correlation against $m(R, R')$ to test sensitivity to granular output changes.
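
To illustrate the first sentence, the sketch below shows two hypothetical models with the same accuracy that still disagree on half of the test instances. The disagreement rate used here (fraction of instances with different predicted classes) is one simple choice of disagreement metric and is an assumption of this example; the score matrices and labels are made up.

```python
def argmax(row):
    """Index of the largest entry in a list of class scores."""
    return max(range(len(row)), key=lambda c: row[c])


def accuracy(scores, labels):
    """Fraction of instances whose top-scoring class matches the label."""
    return sum(argmax(s) == y for s, y in zip(scores, labels)) / len(labels)


def disagreement_rate(scores_a, scores_b):
    """Fraction of instances on which two models predict different classes."""
    differing = sum(argmax(a) != argmax(b) for a, b in zip(scores_a, scores_b))
    return differing / len(scores_a)


# Hypothetical class-score matrices (rows: test instances, columns: classes).
labels = [0, 1, 1, 1]
scores_a = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
scores_b = [[0.8, 0.2], [0.7, 0.3], [0.4, 0.6], [0.1, 0.9]]

acc_a = accuracy(scores_a, labels)
acc_b = accuracy(scores_b, labels)
rate = disagreement_rate(scores_a, scores_b)
```

Both models reach the same accuracy here while making their errors on different instances, which is the situation this test is designed to detect.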

Test 3: Label Randomization

By training with varying fractions of randomized labels, representation groups $\mathcal{G}_i$ are formed. Measures should yield higher similarity within groups; evaluated via AUPRC and conformity rate.
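
One way to read the AUPRC criterion is to treat each pair of models as a binary classification instance: the pair is "positive" if both models belong to the same group, and the similarity value serves as the classifier score. The sketch below follows that reading with a hand-written average precision (a common estimator of AUPRC). The grouping, the similarity matrix, and the helper names are hypothetical, and tied scores are not given any special treatment.

```python
from itertools import combinations


def average_precision(scores, labels):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    hits, precision_sum = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits


def group_auprc(similarity, groups):
    """How well pairwise similarity separates same-group from cross-group pairs."""
    scores, labels = [], []
    for i, j in combinations(range(len(groups)), 2):
        scores.append(similarity[i][j])
        labels.append(groups[i] == groups[j])
    return average_precision(scores, labels)


# Hypothetical example: four models, two groups (e.g. two randomization levels).
groups = [0, 0, 1, 1]
similarity = [
    [1.0, 0.9, 0.3, 0.2],
    [0.9, 1.0, 0.4, 0.1],
    [0.3, 0.4, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
auprc = group_auprc(similarity, groups)
```

In this toy matrix every same-group pair is more similar than every cross-group pair, so the score reaches its maximum.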

Test 4: Shortcut Features

Inducing spurious correlation through shortcut features forms distinct groups by shortcut strength. Group-based AUPRC and conformity rates measure a metric’s capacity to distinguish “shortcut” representations.

Test 5: Data Augmentation

Models trained with increasing augmentation strengths are grouped accordingly. Metrics should reflect intra-group similarity and inter-group differences per AUPRC and conformity rate.
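
The conformity rate can be sketched as a count of satisfied ordering constraints. The formalization below is an assumption made for illustration, not necessarily the exact definition used by ReSi: for every anchor model, every same-group partner, and every model from another group, the constraint holds when the within-group similarity is at least the cross-group similarity. The matrix and grouping are hypothetical.

```python
def conformity_rate(similarity, groups):
    """Fraction of (anchor, same-group, other-group) triples that are ordered as expected."""
    n = len(groups)
    conforming, total = 0, 0
    for a in range(n):
        for b in range(n):
            if b == a or groups[b] != groups[a]:
                continue  # b must be a different model from the anchor's group
            for c in range(n):
                if groups[c] == groups[a]:
                    continue  # c must come from another group
                total += 1
                conforming += similarity[a][b] >= similarity[a][c]
    return conforming / total


# Hypothetical example: two augmentation strengths, two models each.
# Model 0 is (wrongly) more similar to model 2 than to its group partner 1.
groups = [0, 0, 1, 1]
similarity = [
    [1.0, 0.5, 0.6, 0.2],
    [0.5, 1.0, 0.4, 0.1],
    [0.6, 0.4, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
rate = conformity_rate(similarity, groups)
```

Seven of the eight constraints hold in this example, so a single misplaced pair lowers the rate without collapsing it.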

Test 6: Layer Monotonicity

Within a single model, representations of layers that are farther apart should be less similar than representations of layers that are closer together. This is tested by the conformity rate of the corresponding monotonicity inequalities and by the Spearman correlation between similarity and layer distance.
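
A rough sketch of the monotonicity check follows. It assumes one particular set of inequalities: for layer indices $i \le j < k \le l$, the inner pair $(j, k)$ should be at least as similar as the enclosing pair $(i, l)$. This choice, the function name, and the similarity matrix are illustrative assumptions.

```python
def layer_monotonicity_conformity(similarity):
    """Fraction of nested layer pairs whose similarities are ordered as expected."""
    n = len(similarity)
    conforming, total = 0, 0
    for i in range(n):
        for l in range(i + 1, n):
            # Enumerate every pair (j, k) nested inside (i, l).
            for j in range(i, l):
                for k in range(j + 1, l + 1):
                    if (j, k) == (i, l):
                        continue  # skip the trivial comparison with itself
                    total += 1
                    conforming += similarity[j][k] >= similarity[i][l]
    return conforming / total


# Hypothetical layer-by-layer similarity matrix for a four-layer model,
# where similarity decays with the distance between layers.
similarity = [
    [1.0, 0.8, 0.6, 0.4],
    [0.8, 1.0, 0.8, 0.6],
    [0.6, 0.8, 1.0, 0.8],
    [0.4, 0.6, 0.8, 1.0],
]
rate = layer_monotonicity_conformity(similarity)
```

A measure that satisfies every inequality yields a rate of 1; violations, such as a distant layer pair scoring higher than an adjacent pair, reduce it proportionally.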

3. Similarity Measures Catalog

ReSi evaluates 23 similarity measures, each with explicit mathematical formulation, invariance properties, and computational cost. These span alignment-based, CCA-based, representational similarity matrix (RSM), neighborhood-based, statistics-based, and topology-based paradigms:

| Class | Measure | Formula or description |
| --- | --- | --- |
| Alignment-based | HardCorr | $m(R,R') = \frac{1}{D} \sum_{d=1}^D \cos(R_{\bullet,d}, R'_{\bullet,d})$ |
| Alignment-based | OrthProc | $m(R,R') = \operatorname{trace}(R^\top R' W^*)$, with $W^*$ orthogonal |
| CCA-based | SVCCA, PWCCA | $m(R,R') = \frac{1}{k} \sum_{j=1}^k \rho_j$ |
| RSM-based | CKA | $m(R,R') = \frac{\operatorname{trace}(K K')}{\lVert K\rVert_F\,\lVert K'\rVert_F}$ |
| RSM-based | DistCorr | $\mathrm{distCorr}(A, A')$ |
| Neighborhood-based | 2nd-Cos, Jaccard | Cosine/Jaccard over $k$-NN graphs |
| Topology-based | IMD | Approximates Gromov–Hausdorff distance |
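
To make the neighborhood-based row concrete, here is a minimal NumPy sketch of a Jaccard-style measure: for every instance, it compares the set of $k$ nearest neighbors in one representation with the set in the other and averages the Jaccard overlap. Using cosine similarity to define neighbors, as well as the function names and random inputs, are assumptions of this sketch.

```python
import numpy as np


def knn_sets(R, k):
    """For each row of R, the index set of its k nearest neighbors by cosine similarity."""
    normed = R / np.linalg.norm(R, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # an instance is not its own neighbor
    neighbors = np.argsort(-sim, axis=1)[:, :k]
    return [set(row.tolist()) for row in neighbors]


def jaccard_similarity(R, R_prime, k=10):
    """Mean Jaccard overlap of the k-NN sets of corresponding instances."""
    sets_a, sets_b = knn_sets(R, k), knn_sets(R_prime, k)
    overlaps = [len(a & b) / len(a | b) for a, b in zip(sets_a, sets_b)]
    return float(np.mean(overlaps))


# Hypothetical representations: 50 instances with 8 features each.
rng = np.random.default_rng(0)
R = rng.normal(size=(50, 8))
same = jaccard_similarity(R, 2.0 * R, k=10)  # rescaling keeps all neighbors
different = jaccard_similarity(R, rng.normal(size=(50, 8)), k=10)
```

Since only neighbor identities matter, the score is unaffected by rescaling a representation, whereas an unrelated random representation shares few neighbors and scores much lower.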

Notable invariance and cost distinctions include, for example, CKA being invariant to orthonormal transforms and requiring $O(N^2 D)$ complexity; 2nd-Cos preserving local topological properties; and IMD capturing global manifold structure.
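
The orthogonal invariance of CKA can be checked numerically. The sketch below implements CKA with a linear kernel, using centered Gram matrices $K = RR^\top$ and $K' = R'R'^\top$ and the normalized trace form; the function name and random inputs are hypothetical. Forming the Gram matrices is the step that costs $O(N^2 D)$.

```python
import numpy as np


def linear_cka(R, R_prime):
    """Linear-kernel CKA between two N x D representation matrices."""
    # Centering the feature columns is equivalent to centering the Gram matrices.
    R = R - R.mean(axis=0, keepdims=True)
    R_prime = R_prime - R_prime.mean(axis=0, keepdims=True)
    K = R @ R.T
    K_prime = R_prime @ R_prime.T
    # For symmetric matrices, trace(K K') equals the sum of elementwise products.
    numerator = np.sum(K * K_prime)
    return float(numerator / (np.linalg.norm(K, "fro") * np.linalg.norm(K_prime, "fro")))


# Hypothetical representation and a random orthogonal matrix Q (via QR).
rng = np.random.default_rng(0)
R = rng.normal(size=(20, 5))
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))

rotated = linear_cka(R, R @ Q)   # orthogonal transform of the features
unrelated = linear_cka(R, rng.normal(size=(20, 5)))
```

Rotating the feature space leaves the Gram matrix unchanged, so `rotated` equals 1 up to floating-point error, while an unrelated random representation gives a lower value.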

4. Architectures and Datasets

ReSi covers eleven neural architectures over six representative datasets, facilitating domain-specific analyses and extensibility:

| Domain | Architectures | Datasets |
| --- | --- | --- |
| Graph | GCN, GraphSAGE, GAT | Cora, Flickr, OGBN-Arxiv |
| Language | BERT-Base (25 MultiBERTs seeds, fine-tuned) | SST2, MNLI |
| Vision | ResNet-18/34/101, VGG-11/19, ViT-B/32, ViT-L/32 | ImageNet-100 |

This integrative matrix enables comparison of metric behavior across classical, attention-based, and transformer architectures over node classification, sequence/NLI, and large-scale image tasks. The benchmark is extensible, supporting “14×7” architecture/dataset pairs.

5. Experimental Insights and Measure Performance

Cross-domain benchmarking yields several key findings:

  • No measure universally dominates; performance is regime-dependent. Measures highly ranked in one domain or test often fail in others.
  • Graph domain: Neighborhood-based measures (2nd-Cos, Jaccard, RankSim) excel.
  • Language domain: Alignment/angular measures (OrthProc, AlignCos, AngShape) perform best.
  • Vision domain: CKA and Procrustes variants demonstrate leading fidelity.
  • Popularity does not imply robustness; highly cited SVCCA and RSA do not consistently score in the top tier.
  • Orthogonal Procrustes (OrthProc) exhibits exceptional stability, never ranking in the lowest third in any domain.
  • There are pronounced trade-offs: CKA and DistCorr offer high fidelity but are computationally expensive; neighborhood metrics capture fine-grained variation but underperform on layer monotonicity.
  • In label randomization, RSMDiff and IMD achieve near-perfect AUPRC, while alignment measures may approach random chance; in layer monotonicity, EOS and DistCorr attain 100% conformity on CNNs, whereas statistics-based measures can give negative Spearman correlation.

6. Implementation and Reproducibility

ReSi is accompanied by an open-source codebase (github.com/mklabunde/resi) supporting reproducible training, evaluation, and representation extraction for all included domains:

  • Models are trained with established packages: PyG for GNNs, HuggingFace for BERT, and PyTorch for vision networks.
  • Full pretrained checkpoints and dumps of representation matrices are available.
  • Data splits adhere to standard protocols of respective fields.
  • Evaluation employs predefined metrics for each test (Spearman correlation, AUPRC, conformity rate), matching the official specifications.
  • Hyperparameters: $k=10$ for neighborhood-based measures, linear kernels for CKA, standard $\lambda$ and step counts for GULP and IMD.
  • Compute resources include up to 80 GB GPUs and CPU clusters ranging to 1024 GB RAM.

This supports systematic reproduction and expansion of the benchmark for new models, datasets, or similarity measures.

7. Directions for Extension

Potential future advances for the ReSi framework include:

  • Introducing new measure classes (mutual-information-based, affine-aligned, learned similarity metrics).
  • Adding domains such as audio, reinforcement learning, and self-supervised learning.
  • Expanding architecture coverage to lightweight CNNs (e.g. MobileNet), large-scale LLMs (GPT-2/3), and graph transformers.
  • Developing new groundings such as style-bias assessment, out-of-distribution robustness, and privacy or leakage detection.
  • Conducting systematic hyperparameter sweeps (e.g. varying $k$, kernel choices).
  • Integrating runtime and memory cost as explicit criteria in benchmark scores.

A plausible implication is that benchmarking representational similarity requires multidimensional evaluation, domain-aware selection of metrics, and explicit grounding in practical prediction and design contexts. ReSi provides a modular foundation for rigorous ongoing research in representational similarity analysis (Klabunde et al., 2024).
