RMTS: A Visual Relational Reasoning Benchmark
- RMTS is a benchmark that evaluates abstract relational reasoning by comparing same/different relations between ordered object pairs based on color and shape.
- Analyses of Vision Transformers trained on RMTS reveal a two-stage computational pipeline in which early layers perform perceptual disentanglement and later layers compute relational comparisons.
- RMTS findings indicate that strong object-level representations alone are insufficient for solving the task, suggesting that explicit inductive biases may be needed for robust relational reasoning.
The Relational Match-to-Sample (RMTS) task is a rigorous benchmark for abstract relational reasoning with visual stimuli, requiring a model to determine whether two ordered object pairs exhibit the same “sameness” or “difference” relation along discrete attributes (typically color and shape). RMTS thus operationalizes a hierarchical visual reasoning process that separates object-level feature extraction from high-level relational comparison, and has been used to probe the inductive and emergent capabilities of modern neural networks, particularly Vision Transformers (ViTs) (Lepori et al., 2024).
1. Task Formulation and Data Construction
The RMTS benchmark is constructed from a controlled object vocabulary: 16 distinct black-and-white shapes and 16 color classes, each specified as a Gaussian-noisy RGB profile (e.g., red), yielding 16 × 16 = 256 unique objects. Importantly, color values are re-sampled per instance, so nominally identical colors vary at the pixel level, which precludes trivial pixel-level matching.
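The vocabulary arithmetic and the per-instance color noise can be illustrated with a minimal sketch. The mean RGB value for red and the noise scale `sigma` are hypothetical placeholders, not values from the source, and `sample_color` is an illustrative helper rather than the authors' generator.

```python
import itertools
import numpy as np

N_SHAPES, N_COLORS = 16, 16  # vocabulary sizes stated in the text

# Every (shape, color) combination is one object identity.
objects = list(itertools.product(range(N_SHAPES), range(N_COLORS)))

def sample_color(mean_rgb, sigma, rng):
    """Draw one noisy RGB triple around a class mean, clipped to [0, 255]."""
    noisy = rng.normal(loc=mean_rgb, scale=sigma)
    return np.clip(np.rint(noisy), 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
red_mean = np.array([233.0, 30.0, 30.0])  # hypothetical mean, not from the source

# Two draws of the "same" color class; each instance gets its own sample.
a = sample_color(red_mean, sigma=10.0, rng=rng)
b = sample_color(red_mean, sigma=10.0, rng=rng)
print(len(objects), a, b)
```

Because each instance draws its own sample, two objects labeled "red" will generally not share exact pixel values, which is the property the benchmark relies on.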
Each stimulus consists of four objects arranged as two ordered pairs: a “display pair” in a fixed, upper-left position and a “sample pair” positioned randomly in one of the remaining locations. The core RMTS decision is defined hierarchically:
- For each pair, make a local “same/different” judgment on the conjunction of color and shape.
- If both pairs yield the same intermediate judgment (both “same” or both “different”), the image is labeled “same” at the top level; otherwise, it is labeled “different.”
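The two-level decision rule above can be written out as a short sketch. The `Obj` type and the function names are illustrative only; objects are treated as "same" when both shape and color classes match.

```python
from typing import NamedTuple, Tuple

class Obj(NamedTuple):
    shape: int  # shape class index
    color: int  # color class index (class identity, not pixel values)

Pair = Tuple[Obj, Obj]

def pair_relation(a: Obj, b: Obj) -> str:
    # Intermediate judgment: "same" only if shape AND color both match.
    return "same" if a == b else "different"

def rmts_label(display: Pair, sample: Pair) -> str:
    # Top-level judgment: do the two pairs instantiate the same relation?
    same_relation = pair_relation(*display) == pair_relation(*sample)
    return "same" if same_relation else "different"

# Both pairs are "different" pairs, so the top-level label is "same".
print(rmts_label((Obj(0, 0), Obj(0, 1)), (Obj(3, 2), Obj(4, 2))))
```

Note that a top-level "same" label does not require any object to be shared between the two pairs, only that the pairs agree on their intermediate relation.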
Two patch sizes (16 and 32 pixels, corresponding to ViT-B/16 and ViT-B/32) are explored, so each object spans either four adjacent patches (patch size 16) or a single patch (patch size 32), with strict patch alignment maintained for tokenization. Datasets are balanced by top-level label (50% “same”), and training and test splits are matched in size (6,400 images per split) (Lepori et al., 2024).
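The patch coverage follows from simple arithmetic, sketched below. The patch sides of 16 and 32 pixels are taken from the ViT-B/16 and ViT-B/32 model names, and the 32-pixel object side is an inference from the four-patch versus one-patch statement rather than a value quoted in this text.

```python
OBJECT_SIDE = 32  # pixels; inferred, not quoted from the source

def patches_per_object(patch_side: int) -> int:
    """Number of patches a patch-aligned square object covers."""
    if OBJECT_SIDE % patch_side != 0:
        raise ValueError("object would not be patch-aligned")
    return (OBJECT_SIDE // patch_side) ** 2

for p in (16, 32):
    print(f"patch size {p}: {patches_per_object(p)} patch(es) per object")
```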
2. Vision Transformer Architecture and Training Regimen
The architecture follows the canonical Vision Transformer (ViT) model, with minimal adaptation for RMTS:
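- Images $x \in \mathbb{R}^{H \times W \times C}$ are divided into $N = HW/P^2$ patches $x_p^i$, each of shape $P \times P \times C$ (standard ViT notation). Patches are flattened and linearly projected: $z^i = E\,\mathrm{flatten}(x_p^i)$, with $E \in \mathbb{R}^{D \times P^2 C}$.
- The embedded patch sequence is augmented with a learned [CLS] token and positional embeddings, producing an initial token matrix $Z_0 \in \mathbb{R}^{(N+1) \times D}$.
- $L$ stacked Transformer blocks are applied, each consisting of Multi-Head Self-Attention (MSA) and MLP feed-forward subblocks, organized as: $Z'_\ell = Z_{\ell-1} + \mathrm{MSA}(\mathrm{LN}(Z_{\ell-1}))$ and $Z_\ell = Z'_\ell + \mathrm{MLP}(\mathrm{LN}(Z'_\ell))$ for $\ell = 1, \dots, L$.
- The [CLS] token at the final layer is projected to “same”/“different” logits via a linear head, trained with cross-entropy: $\hat{y} = \mathrm{softmax}(W_{\text{head}}\, z_L^{\text{[CLS]}} + b)$, $\mathcal{L} = -\sum_{c} y_c \log \hat{y}_c$.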
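The tokenization step in the list above (patch splitting, projection, [CLS] token, positional embeddings) can be sketched in NumPy. This is an illustration of the standard ViT input pipeline, not the authors' code: the 224-pixel image side and the tiny embedding width `D = 8` are assumptions chosen for readability, and random arrays stand in for learned parameters.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image must tile exactly"
    gh, gw = H // patch, W // patch
    x = image.reshape(gh, patch, gw, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)              # (gh, gw, patch, patch, C)
    return x.reshape(gh * gw, patch * patch * C)

def embed(image, patch, E, cls_token, pos_emb):
    """Project patches, prepend the [CLS] token, add positional embeddings."""
    tokens = patchify(image, patch) @ E               # (N, D), row-vector convention
    tokens = np.vstack([cls_token[None, :], tokens])  # (N + 1, D)
    return tokens + pos_emb

rng = np.random.default_rng(0)
H = W = 224                     # assumed image side
P, C, D = 16, 3, 8              # D kept tiny for the sketch
N = (H // P) * (W // P)

image = rng.random((H, W, C))
E = rng.normal(size=(P * P * C, D))     # stand-in for the learned projection
cls_token = rng.normal(size=D)          # stand-in for the learned [CLS] token
pos_emb = rng.normal(size=(N + 1, D))   # stand-in for learned positions

Z0 = embed(image, P, E, cls_token, pos_emb)
print(Z0.shape)  # (197, 8)
```

The Transformer blocks and the classification head would then operate on `Z0`; they are omitted here to keep the sketch short.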
Training is performed for 200 epochs with AdamW, on both the discrimination and RMTS objectives. Both pretrained (CLIP, DINO, ImageNet) and randomly initialized (“scratch”) ViT-B/16 and ViT-B/32 models are evaluated (Lepori et al., 2024).
3. Mechanistic Processing Pipeline: Disentanglement and Relational Computation
Mechanistic interpretability reveals a distinct two-stage computational pipeline in high-performing ViTs:
- Perceptual Stage (Early Layers): Local attention heads dominate, causing each object's patches to attend almost exclusively to themselves. These layers extract color and shape into disentangled, nearly orthogonal linear subspaces within the hidden representation, verified via Distributed Alignment Search (DAS) and counterfactual interventions.
- Relational Stage (Later Layers): Attention patterns become increasingly global. In discrimination tasks, object-level tokens begin to attend to one another; in RMTS, attention first peaks within each pair, then transitions to between-pair interactions, before broad background attention emerges. At this stage, ViTs construct an abstract “same/different” signal not directly tied to initial pixel embeddings.
The pipeline's transition typically occurs near layer 6 (in a 12-layer model), as made explicit by layerwise attention-pattern analysis, and is essential for successful generalization (Lepori et al., 2024).
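One way to make the local-versus-global distinction concrete is to measure, for a single attention head, how much attention object tokens place on their own object versus on other objects. The definition below is an assumed, simplified score for illustration; the exact metric used by Lepori et al. may differ. `object_ids` is a hypothetical per-token labeling with `-1` for background and the [CLS] token.

```python
import numpy as np

def attention_split(attn, object_ids):
    """Mean attention mass from object tokens to their own object
    (within) and to other objects (between). attn is (T, T), rows sum to 1."""
    object_ids = np.asarray(object_ids)
    is_obj = object_ids >= 0
    both_obj = is_obj[:, None] & is_obj[None, :]
    same = (object_ids[:, None] == object_ids[None, :]) & both_obj
    other = (object_ids[:, None] != object_ids[None, :]) & both_obj
    rows = attn[is_obj]                      # queries that belong to an object
    within = (rows * same[is_obj]).sum(axis=1).mean()
    between = (rows * other[is_obj]).sum(axis=1).mean()
    return within, between

ids = [0, 0, 1, -1]                          # two objects plus one background token
local_head = np.eye(4)                       # every token attends only to itself
global_head = np.full((4, 4), 0.25)          # uniform attention

print(attention_split(local_head, ids))
print(attention_split(global_head, ids))
```

Under this score, a purely local head yields a within value of 1 and a between value of 0; tracking both values by layer would expose a transition of the kind described above.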
4. Interpretability Tools and Interventional Analyses
A suite of interpretability techniques is used to dissect representation and compute stages:
- Attention-Pattern Analysis: For each head, a locality/globality score quantifies the proportion of attention that flows between objects versus within a single object's tokens, revealing the sharp stagewise transition in information processing across layers.
- Distributed Alignment Search (DAS): The DAS optimization locates linear subspaces within the model’s residual stream that encode disentangled object features (color, shape). Swapping these subspaces between images flips the model’s predicted labels, confirming causal encoding.
- Novel-Representation Analysis: Interpolated, summed, or random feature vectors are injected into these subspaces. Correct relational output persists for added and interpolated vectors but not for random ones, indicating a moderate degree of abstraction beyond rote memorization.
- Linear Probing & Counterfactual Intervention: For each layer, linear probes are trained to decode the intermediate pairwise “same/different” decision. Counterfactually shifting representations along these probe-derived “same”/“different” directions demonstrably flips the model’s final RMTS judgment, with maximum efficacy observed near layer 5 (Lepori et al., 2024).
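The subspace swap at the heart of the DAS-style interventions can be sketched as a projection identity: replace the component of a base representation that lies in a feature subspace with the corresponding component from a source representation, leaving the orthogonal complement untouched. In DAS the subspace is found by optimization; here a random orthonormal basis `Q` stands in for it, so this shows only the intervention step, under that assumption.

```python
import numpy as np

def swap_subspace(h_base, h_source, Q):
    """Interchange intervention: copy the source's coordinates in span(Q)
    into the base vector. Q is (d, k) with orthonormal columns."""
    return h_base + Q @ (Q.T @ (h_source - h_base))

rng = np.random.default_rng(0)
d, k = 16, 4
# Reduced QR of a random matrix gives a (d, k) orthonormal basis
# (a stand-in for a learned "color" or "shape" subspace).
Q, _ = np.linalg.qr(rng.normal(size=(d, k)))

h_base = rng.normal(size=d)
h_source = rng.normal(size=d)
h_new = swap_subspace(h_base, h_source, Q)

# Inside the subspace the result matches the source...
print(np.allclose(Q.T @ h_new, Q.T @ h_source))
# ...and outside it the result matches the base.
print(np.allclose(h_new - Q @ (Q.T @ h_new), h_base - Q @ (Q.T @ h_base)))
```

In the analyses described above, the patched vector would be written back into the residual stream and the model's output checked for the expected label flip.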
5. Quantitative Results and Compositional Generalization
Key quantitative outcomes for ViT-B/16 models include:
| Pretraining | Disc. Train/Test (%) | RMTS Train/Test (%) |
|---|---|---|
| CLIP | 100 / 99.3 | 100 / 98.3 |
| ImageNet | 100 / 97.5 | 99.7 / 89.3 |
| DINO | 100 / 95.6 | 100 / 87.7 |
| Scratch | 95.9 / 80.5 | 54.7 / 49.6 |
On compositional splits (train: each shape with two colors only), models with more disentangled object representations (higher DAS counterfactual accuracy) exhibit reduced generalization gaps and superior accuracy on OOD (novel shape–color) and compositional tests. Confusion matrices reveal RMTS misclassifications are balanced between “same” and “different” choices, without strong label bias (Lepori et al., 2024).
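The train/test generalization gaps implied by the table above can be computed directly. The numbers are copied from the table; the dictionary layout and the `gap` helper are illustrative.

```python
# (train %, test %) for ViT-B/16, copied from the table above
results = {
    "CLIP":     {"disc": (100.0, 99.3), "rmts": (100.0, 98.3)},
    "ImageNet": {"disc": (100.0, 97.5), "rmts": (99.7, 89.3)},
    "DINO":     {"disc": (100.0, 95.6), "rmts": (100.0, 87.7)},
    "Scratch":  {"disc": (95.9, 80.5),  "rmts": (54.7, 49.6)},
}

def gap(train_test):
    """Train minus test accuracy, in percentage points."""
    train, test = train_test
    return round(train - test, 1)

for name, r in results.items():
    print(f"{name:9s} disc gap {gap(r['disc']):5.1f}  rmts gap {gap(r['rmts']):5.1f}")
```

The small RMTS gap for the from-scratch model is not a sign of good generalization: both its train and test accuracies sit near chance, so there is little to lose between splits.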
6. Failure Modes and Remedial Strategies
- Perceptual Failures: From-scratch or weakly pretrained models frequently fail to form disentangled feature subspaces; their attention is diffuse and they generalize poorly even on the basic discrimination task (test accuracy around 75%). An “auxiliary disentanglement loss”, which adds linear probes at intermediate layers to enforce representational separation of shape and color, substantially improves from-scratch test accuracy on discrimination.
- Relational Failures: Even after enforcing perceptual disentanglement, from-scratch ViTs are unable to solve RMTS beyond chance (50%), indicating that superior object-level representations are insufficient for abstract relational computation. This suggests a need for explicit inductive biases or structural amendments (e.g., gating, architectural slots, or iterative comparisons) to reliably support generalizable relational reasoning (Lepori et al., 2024).
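The auxiliary disentanglement loss mentioned under perceptual failures can be sketched as a task loss plus probe losses on an intermediate representation. This is an assumed form for illustration: the weighting `aux_weight`, the choice of a single hidden layer, and all function names are hypothetical, and the code only evaluates the loss value (no gradients or training loop).

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy for integer class labels, computed stably."""
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def total_loss(task_logits, task_labels, hidden,
               W_shape, W_color, shape_labels, color_labels,
               aux_weight=1.0):
    """Task loss plus linear-probe losses that ask an intermediate
    object representation to encode shape and color linearly."""
    task = cross_entropy(task_logits, task_labels)
    aux = (cross_entropy(hidden @ W_shape, shape_labels)
           + cross_entropy(hidden @ W_color, color_labels))
    return task + aux_weight * aux

# Degenerate check: all-zero logits give uniform predictions.
B, D = 4, 8
loss = total_loss(
    task_logits=np.zeros((B, 2)), task_labels=np.array([0, 1, 0, 1]),
    hidden=np.zeros((B, D)),
    W_shape=np.ones((D, 16)), W_color=np.ones((D, 16)),
    shape_labels=np.array([0, 3, 7, 15]), color_labels=np.array([1, 2, 4, 8]),
)
print(loss, np.log(2) + 2 * np.log(16))
```

As the text notes, improving perception this way helps discrimination but does not by itself make RMTS solvable for from-scratch models.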
7. Position within the Broader Landscape of RMTS Research
The RMTS task operationalizes a crucial test of abstract relation learning, which has historically been challenging for artificial neural networks. Prior studies established that convolutional architectures generalize poorly on same/different tasks, tending towards superficial memorization. Experimental evidence demonstrates that ViTs, especially those with large-scale pretraining (e.g., CLIP), can spontaneously instantiate a two-stage reasoning pipeline: initial perceptual disentanglement into object-centric subspaces, followed by relational computation via attention-mediated feature comparison. Mechanistic approaches (DAS, probing, interventional analysis) are integral for distinguishing authentic abstraction from shallow behavioral pattern-matching. Although contemporary CLIP-finetuned ViTs approach 98% RMTS accuracy, their abstraction capacity remains limited (for example, correct relational output does not persist under random vector injection) and is contingent on robust pretraining. Closing the gap to human-like zero-shot RMTS generalization is expected to require architectures endowed with stronger relational inductive biases (Lepori et al., 2024).