ReSi Benchmark Overview
- The ReSi Benchmark is a comprehensive framework that compares neural network representations using prediction- and design-grounded tests.
- It systematically ranks 23 similarity measures through six diverse evaluations, exposing metric reliability and performance trade-offs across graph, language, and vision domains.
- The platform supports reproducible research with open-source code and extensible protocols, enabling detailed analysis of architectures and dataset-specific behaviors.
The Representational Similarity (ReSi) Benchmark provides a comprehensive evaluation framework for comparing internal representations learned by neural architectures across domains. By systematically grounding the notion of similarity in both prediction-based and design-based conditions, ReSi offers rigorous protocols for ranking 23 state-of-the-art representational similarity measures across six well-specified tests. This extensible platform enables reproducible research in graph, language, and vision domains while highlighting strengths and weaknesses of popular methods under diverse scenarios (Klabunde et al., 2024).
1. Motivation and Conceptual Goals
Comparing neural network representations is foundational to interpretability, transfer learning, and architecture analysis. Prior work introduced numerous similarity measures—spanning canonical correlation (SVCCA, PWCCA), kernel methods (CKA), neighborhood-based metrics (2nd-Cos, Jaccard), and topological distances (IMD)—yet lacked a unified benchmark to establish ground truth expectations, diverse model coverage, and robust, reproducible protocols. ReSi fills this gap by:
- Grounding representational similarity in practically checkable contexts, distinguishing cases where two feature sets should be deemed similar versus dissimilar.
- Ranking similarity measures using six tests that cover prediction accuracy, output divergence, label randomization, shortcut exploitation, data augmentation, and intra-network layer monotonicity.
- Enabling extensibility across domains and architectures and providing reproducible code for all evaluation protocols.
This approach systematically exposes which measures are consistent with theoretically grounded criteria, clarifying the circumstances under which a metric is reliable or prone to failure.
2. Six Benchmarking Tests
ReSi divides its evaluation suite into prediction-grounded and design-grounded tests. All are based on the final inner-layer representations extracted from trained models on a fixed test set.
| Test Name | Grounding | Metric(s) |
|---|---|---|
| Correlation to Accuracy Difference | Prediction | Spearman $\rho$ |
| Correlation to Output Difference | Prediction | Spearman $\rho$ |
| Label Randomization | Design | AUPRC, conformity rate |
| Shortcut Features | Design | AUPRC, conformity rate |
| Data Augmentation | Design | AUPRC, conformity rate |
| Layer Monotonicity | Design | Conformity rate, Spearman |
Test 1: Correlation to Accuracy Difference
If two models' test accuracies differ, their representations should differ as well. This is quantified via the Spearman correlation between the similarity measure's scores and the accuracy gap across model pairs.
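The rank correlation used in this test can be sketched in plain Python. This is only an illustration: the toy values `accuracy_gaps` and `rep_distances` are made up, and the sketch assumes the measure is expressed as a dissimilarity, so a larger accuracy gap should go with a larger score.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy data: one entry per pair of models.
accuracy_gaps = [0.01, 0.05, 0.02, 0.10, 0.07]
rep_distances = [0.12, 0.40, 0.45, 0.90, 0.55]  # scores from some dissimilarity measure

rho = spearman(accuracy_gaps, rep_distances)
print(f"Spearman rho = {rho:.2f}")
```

A value of `rho` close to 1 would indicate that the measure orders model pairs in the same way as their accuracy gaps do.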
Test 2: Correlation to Output Difference
Models with identical accuracy can still disagree on individual predictions. This test computes disagreement metrics from the models' output scores and evaluates the Spearman correlation between the similarity measure and these output differences, testing sensitivity to granular output changes.
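Two common ways to quantify output differences are the rate of differing hard predictions and the Jensen-Shannon divergence between class-probability outputs. The sketch below illustrates both on made-up probabilities; the function names are hypothetical, and whether ReSi uses exactly these quantities is an assumption.

```python
import math


def disagreement_rate(probs_a, probs_b):
    """Fraction of inputs on which the two models predict different classes."""
    def argmax(row):
        return max(range(len(row)), key=lambda c: row[c])

    differing = sum(argmax(p) != argmax(q) for p, q in zip(probs_a, probs_b))
    return differing / len(probs_a)


def mean_js_divergence(probs_a, probs_b):
    """Average Jensen-Shannon divergence (natural log) over all inputs."""
    def kl(p, q):
        # Terms with p_i == 0 contribute nothing to the KL divergence.
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    total = 0.0
    for p, q in zip(probs_a, probs_b):
        mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
        total += 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)
    return total / len(probs_a)


# Hypothetical softmax outputs of two models on four test inputs.
model_a = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.7, 0.3]]
model_b = [[0.8, 0.2], [0.4, 0.6], [0.3, 0.7], [0.7, 0.3]]

rate = disagreement_rate(model_a, model_b)
jsd = mean_js_divergence(model_a, model_b)
print(f"disagreement = {rate:.2f}, mean JSD = {jsd:.4f}")
```

The disagreement rate only sees changes in the predicted class, while the divergence also reacts to shifts in confidence that leave the prediction unchanged.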
Test 3: Label Randomization
Training models with varying fractions of randomized labels yields groups of representations, one per randomization level. Measures should assign higher similarity within groups than across them; this is evaluated via AUPRC and conformity rate.
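AUPRC for a grouping test can be read as a retrieval problem: every pair of representations is a sample, labeled positive when both come from the same group and scored by their similarity. The following minimal sketch uses the step-wise average-precision estimate and assumes no tied scores; the group assignment and similarity values are invented for illustration.

```python
from itertools import combinations


def average_precision(labels, scores):
    """Area under the precision-recall curve via the average-precision sum."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_pos = sum(labels)
    hits, ap = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            ap += hits / rank  # precision at the rank of each positive pair
    return ap / n_pos


# Hypothetical setup: four models, two per group, with pairwise similarities.
groups = [0, 0, 1, 1]
similarity = {
    (0, 1): 0.9, (0, 2): 0.3, (0, 3): 0.2,
    (1, 2): 0.7, (1, 3): 0.1, (2, 3): 0.6,
}

pairs = list(combinations(range(len(groups)), 2))
labels = [int(groups[i] == groups[j]) for i, j in pairs]
scores = [similarity[pair] for pair in pairs]

auprc = average_precision(labels, scores)
print(f"AUPRC = {auprc:.3f}")
```

In this toy case one cross-group pair scores higher than one same-group pair, so the AUPRC falls below 1.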
Test 4: Shortcut Features
Inducing spurious correlation through shortcut features forms distinct groups by shortcut strength. Group-based AUPRC and conformity rates measure a metric’s capacity to distinguish “shortcut” representations.
Test 5: Data Augmentation
Models trained with increasing augmentation strengths are grouped accordingly. Metrics should reflect intra-group similarity and inter-group differences per AUPRC and conformity rate.
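One plausible reading of the conformity rate in the grouping tests is the fraction of triplets in which a representation is at least as similar to a member of its own group as to a member of another group. The sketch below implements that reading; the exact definition used by ReSi may differ, and the similarity matrix is made up.

```python
def group_conformity_rate(groups, sim):
    """Fraction of (anchor, same-group, other-group) triplets that satisfy
    sim[anchor][same] >= sim[anchor][other]."""
    n = len(groups)
    satisfied = total = 0
    for anchor in range(n):
        for same in range(n):
            if same == anchor or groups[same] != groups[anchor]:
                continue
            for other in range(n):
                if groups[other] == groups[anchor]:
                    continue
                total += 1
                satisfied += sim[anchor][same] >= sim[anchor][other]
    return satisfied / total


# Hypothetical symmetric similarity matrix for four models in two groups.
groups = [0, 0, 1, 1]
sim = [
    [1.0, 0.9, 0.3, 0.2],
    [0.9, 1.0, 0.7, 0.1],
    [0.3, 0.7, 1.0, 0.6],
    [0.2, 0.1, 0.6, 1.0],
]

conformity = group_conformity_rate(groups, sim)
print(f"conformity rate = {conformity:.3f}")
```

Unlike AUPRC, this count only compares similarities that share an anchor, so it does not require scores to be comparable across anchors.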
Test 6: Layer Monotonicity
Within a single model, representations of layers that are farther apart should be less similar than representations of layers that are closer together. This is tested via the conformity rate of the resulting monotonicity inequalities and via the Spearman correlation between similarity scores and layer distance.
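The monotonicity inequalities can be illustrated with layer triplets. The sketch assumes a dissimilarity matrix `dist` between layers of one model and checks, for every `i < j < k`, that the outer pair `(i, k)` is at least as dissimilar as both inner pairs. This specific set of inequalities is an assumption for illustration, and the matrix values are invented.

```python
from itertools import combinations


def monotonicity_conformity(dist):
    """Fraction of inequalities dist[inner] <= dist[i][k] that hold over all
    layer triplets i < j < k, with inner pairs (i, j) and (j, k)."""
    n_layers = len(dist)
    checks = passed = 0
    for i, j, k in combinations(range(n_layers), 3):
        for inner in (dist[i][j], dist[j][k]):
            checks += 1
            passed += inner <= dist[i][k]
    return passed / checks


# Hypothetical layer-by-layer dissimilarities for a four-layer model.
dist = [
    [0.0, 0.2, 0.5, 0.4],
    [0.2, 0.0, 0.3, 0.6],
    [0.5, 0.3, 0.0, 0.1],
    [0.4, 0.6, 0.1, 0.0],
]

rate = monotonicity_conformity(dist)
print(f"monotonicity conformity = {rate:.2f}")
```

A measure whose scores grow strictly with layer distance would reach a rate of 1.0 under this check.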
3. Similarity Measures Catalog
ReSi evaluates 23 similarity measures, each with explicit mathematical formulation, invariance properties, and computational cost. These span alignment-based, CCA-based, representational similarity matrix (RSM), neighborhood-based, statistics-based, and topology-based paradigms:
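| Class | Measure | Formula or description |
|---|---|---|
| Alignment-based | HardCorr | One-to-one neuron matching that maximizes correlation |
| Alignment-based | OrthProc | $\min_{Q} \lVert RQ - R' \rVert_F$, $Q$ orthogonal |
| CCA-based | SVCCA, PWCCA | Aggregates of the canonical correlations $\rho_i$ |
| RSM-based | CKA | $\mathrm{HSIC}(K, K') / \sqrt{\mathrm{HSIC}(K, K)\,\mathrm{HSIC}(K', K')}$ |
| RSM-based | DistCorr | Distance correlation of pairwise-distance matrices |
| Neighborhood-based | 2nd-Cos, Jaccard | Cosine/Jaccard over $k$-NN graphs |
| Topology-based | IMD | Approximates Gromov–Hausdorff distance |
Here $R$ and $R'$ denote the two representation matrices being compared, and $K$ and $K'$ their kernel (Gram) matrices.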
Notable invariance and cost distinctions include, for example, CKA being invariant to orthogonal transformations while incurring the cost of computing pairwise similarity matrices over all inputs; 2nd-Cos preserving local neighborhood structure; and IMD capturing global manifold structure.
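As a concrete illustration of two catalog entries and of the invariance claim, the following NumPy sketch implements linear CKA and an orthogonal Procrustes distance and applies them to a randomly rotated copy of a representation. The preprocessing (centering, unit Frobenius norm) and the function names are assumptions for this sketch, not necessarily the choices made in the ReSi codebase.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two (n_samples, n_features) representation matrices."""
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def orthogonal_procrustes_distance(x, y):
    """min over orthogonal Q of ||xQ - y||_F, after centering and scaling
    both matrices to unit Frobenius norm (preprocessing is an assumption)."""
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    x = x / np.linalg.norm(x)
    y = y / np.linalg.norm(y)
    # The minimum equals ||x||^2 + ||y||^2 - 2 * (nuclear norm of x^T y).
    nuclear = np.linalg.svd(x.T @ y, compute_uv=False).sum()
    squared = np.linalg.norm(x) ** 2 + np.linalg.norm(y) ** 2 - 2.0 * nuclear
    return float(np.sqrt(max(squared, 0.0)))


rng = np.random.default_rng(0)
reps = rng.normal(size=(200, 16))                 # hypothetical representations
rotation, _ = np.linalg.qr(rng.normal(size=(16, 16)))
rotated = reps @ rotation                         # same geometry, rotated basis
unrelated = rng.normal(size=(200, 16))            # independent representations

cka_rotated = linear_cka(reps, rotated)
cka_unrelated = linear_cka(reps, unrelated)
proc_rotated = orthogonal_procrustes_distance(reps, rotated)
proc_unrelated = orthogonal_procrustes_distance(reps, unrelated)
print(cka_rotated, cka_unrelated, proc_rotated, proc_unrelated)
```

Both measures treat the rotated copy as essentially identical to the original, while the independent random matrix receives a low CKA score and a large Procrustes distance.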
4. Architectures and Datasets
ReSi covers eleven neural architectures over six representative datasets, facilitating domain-specific analyses and extensibility:
| Domain | Architectures | Datasets |
|---|---|---|
| Graph | GCN, GraphSAGE, GAT | Cora, Flickr, OGBN-Arxiv |
| Language | BERT-Base (25 MultiBERTs seeds, fine-tuned) | SST2, MNLI |
| Vision | ResNet-18/34/101, VGG-11/19, ViT-B/32, ViT-L/32 | ImageNet-100 |
This integrative matrix enables comparison of metric behavior across classical, attention-based, and transformer architectures over node classification, sequence/NLI, and large-scale image tasks. The benchmark is extensible, supporting “14×7” architecture/dataset pairs.
5. Experimental Insights and Measure Performance
Cross-domain benchmarking yields several key findings:
- No measure universally dominates; performance is regime-dependent. Measures highly ranked in one domain or test often fail in others.
- Graph domain: Neighborhood-based measures (2nd-Cos, Jaccard, RankSim) excel.
- Language domain: Alignment/angular measures (OrthProc, AlignCos, AngShape) perform best.
- Vision domain: CKA and Procrustes variants demonstrate leading fidelity.
- Popularity does not imply robustness; highly cited SVCCA and RSA do not consistently score in the top tier.
- Orthogonal Procrustes (OrthProc) exhibits exceptional stability, never ranking in the lowest third in any domain.
- There are pronounced trade-offs: CKA and DistCorr offer high fidelity but are computationally expensive; neighborhood metrics capture fine-grained variation but underperform on layer monotonicity.
- In label randomization, RSMDiff and IMD achieve near-perfect AUPRC, while alignment measures may approach random chance; in layer monotonicity, EOS and DistCorr attain 100% conformity on CNNs, whereas statistics-based measures can give negative Spearman correlation.
6. Implementation and Reproducibility
ReSi is accompanied by an open-source codebase (github.com/mklabunde/resi) supporting reproducible training, evaluation, and representation extraction for all included domains:
- Models are trained with established packages: PyG for GNNs, HuggingFace for BERT, and PyTorch for vision networks.
- Full pretrained checkpoints and dumps of representation matrices are available.
- Data splits adhere to standard protocols of respective fields.
- Evaluation employs predefined metrics for each test (Spearman correlation, AUPRC, conformity rate), matching the official specifications.
- Hyperparameters: a fixed number of nearest neighbors $k$ for neighborhood-based measures, linear kernels for CKA, and standard regularization and step-count settings for GULP and IMD.
- Compute resources include GPUs with up to 80 GB of memory and CPU nodes with up to 1024 GB of RAM.
This supports systematic reproduction and expansion of the benchmark for new models, datasets, or similarity measures.
7. Directions for Extension
Potential future advances for the ReSi framework include:
- Introducing new measure classes (mutual-information-based, affine-aligned, learned similarity metrics).
- Adding domains such as audio, reinforcement learning, and self-supervised learning.
- Expanding architecture coverage to lightweight CNNs (e.g. MobileNet), large-scale LLMs (GPT-2/3), and graph transformers.
- Developing new groundings such as style-bias assessment, out-of-distribution robustness, and privacy or leakage detection.
- Conducting systematic hyperparameter sweeps (e.g. varying the neighborhood size $k$ or the kernel choice).
- Integrating runtime and memory cost as explicit criteria in benchmark scores.
A plausible implication is that benchmarking representational similarity requires multidimensional evaluation, domain-aware selection of metrics, and explicit grounding in practical prediction and design contexts. ReSi provides a modular foundation for rigorous ongoing research in representational similarity analysis (Klabunde et al., 2024).