
Deep Representational Similarity Learning

Updated 8 February 2026
  • DRSL is a framework that uses deep neural embeddings to model similarity relations in high-dimensional data, enabling extraction of task-relevant features.
  • It leverages supervision through human queries, contrastive loss, and regression techniques to align learned representations with ground truth or experimental design.
  • DRSL improves interpretability and performance in applications such as robotics, fMRI analysis, and vision model comparison by tailoring loss functions to specific tasks.

Deep Representational Similarity Learning (DRSL) is an overarching framework that encompasses deep learning approaches designed to extract, compare, and analyze task-relevant or semantically meaningful representations in high-dimensional data through the explicit modeling, supervision, or interpretation of similarity relations. DRSL extends and operationalizes principles from classical representational similarity analysis (RSA) and contrastive learning, producing representations that either facilitate efficient downstream reward or preference learning, enable interpretable network comparisons, or align neural signatures across cognitive states via customized nonlinear transforms. Principal instantiations include Similarity-based Implicit Representation Learning (SIRL), Deep-NN-based extensions to RSA, and interpretable concept-based inter-model analysis.

1. Formalization and Scope

DRSL generalizes representational similarity analysis by employing learned deep embeddings, instead of fixed linear or kernel mappings, to model the (dis)similarity structure imposed by data or expert supervision. If $\mathcal{X}$ denotes the input space (e.g., trajectories, images, neural activations), then DRSL's core objective is to learn a map $\phi:\mathcal{X} \to \mathbb{R}^d$ such that a chosen distance or similarity measure (e.g., squared Euclidean, correlation) between $\phi(x_1)$ and $\phi(x_2)$ aligns with ground-truth, human-elicited, or task-derived notions of similarity. DRSL encompasses supervision regimes where similarity is provided by labels (SIRL), inferred via fuzzy concept bases (RSVC), or optimized relative to downstream task design matrices (fMRI DRSL) (Bobu et al., 2023, Kondapaneni et al., 19 Mar 2025, Yousefnezhad et al., 2020).
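As a toy illustration of this objective (not taken from any of the cited papers), the sketch below fits a linear stand-in for $\phi$ so that pairwise squared Euclidean distances between embeddings match a target dissimilarity matrix; all names, dimensions, and the gradient-descent setup are illustrative.

```python
import numpy as np

# Toy DRSL sketch: learn phi so that squared Euclidean distances between
# embeddings match a target dissimilarity matrix. A linear phi(x) = W x
# stands in for a deep network to keep the example short.

rng = np.random.default_rng(0)
n, p, d = 20, 10, 3
X = rng.normal(size=(n, p))              # inputs from the space X
Z_true = X @ rng.normal(size=(d, p)).T   # hidden ground-truth embedding
# Target (dis)similarity structure the learned phi should reproduce.
D_target = ((Z_true[:, None, :] - Z_true[None, :, :]) ** 2).sum(-1)

W = rng.normal(size=(d, p)) * 0.1        # parameters of phi

def loss(W):
    Z = X @ W.T
    D = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return ((D - D_target) ** 2).mean()

loss0 = loss(W)
lr = 1e-4
for _ in range(500):
    Z = X @ W.T
    diff = Z[:, None, :] - Z[None, :, :]             # (n, n, d)
    D = (diff ** 2).sum(-1)
    R = 2.0 * (D - D_target) / (n * n)               # dL/dD
    G_Z = 4.0 * (R[:, :, None] * diff).sum(axis=1)   # analytic dL/dZ
    W -= lr * (G_Z.T @ X)                            # chain rule through Z = X W^T
loss1 = loss(W)
```

Gradient descent on the distance-matching loss steadily pulls the embedding geometry toward the target structure; any differentiable network could replace the linear map without changing the objective.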

2. Methodological Approaches

2.1. Human-Supervised DRSL: Similarity-Based Implicit Representation Learning (SIRL)

SIRL addresses the challenge of learning reward functions from raw trajectories in robotic settings where the relevant features are neither known a priori nor properly encoded by standard deep representations. SIRL uses triplet similarity queries collected from human users, where for each trio $(\xi^a, \xi^b, \xi^c)$, the user chooses the most similar pair. This triplet is encoded as a positive pair and a negative instance. The embedding network $\phi$ is trained by minimizing a symmetrized margin-based triplet loss:

$$\mathcal{L}_{sim}(\phi) = \sum_{i=1}^{|\mathcal{D}_{sim}|} \left[ \mathcal{L}_{triplet}(\xi_{P_1}^i, \xi_{P_2}^i, \xi_N^i) + \mathcal{L}_{triplet}(\xi_{P_2}^i, \xi_{P_1}^i, \xi_N^i) \right]$$

where

$$\mathcal{L}_{triplet}(\xi_A,\xi_P,\xi_N) = \max\left\{ \|\phi(\xi_A) - \phi(\xi_P)\|_2^2 - \|\phi(\xi_A) - \phi(\xi_N)\|_2^2 + \alpha,\; 0 \right\}$$

$\alpha$ is the margin hyperparameter, and $\mathcal{D}_{sim}$ contains the labeled triplets. Once $\phi$ is trained, it is frozen as a backbone for preference learning, accelerating reward inference across new tasks (Bobu et al., 2023).
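A minimal sketch of this loss, with the trajectory embedding $\phi$ replaced by an identity map on small vectors for illustration (the paper's network and data pipeline are not reproduced here):

```python
import numpy as np

# Sketch of the symmetrized SIRL triplet loss from the equations above.
# phi is a placeholder identity embedding, not the paper's trajectory network.

def triplet_loss(phi, xi_a, xi_p, xi_n, alpha=1.0):
    """max{||phi(A)-phi(P)||^2 - ||phi(A)-phi(N)||^2 + alpha, 0}."""
    d_pos = np.sum((phi(xi_a) - phi(xi_p)) ** 2)
    d_neg = np.sum((phi(xi_a) - phi(xi_n)) ** 2)
    return max(d_pos - d_neg + alpha, 0.0)

def sirl_similarity_loss(phi, triplets, alpha=1.0):
    """Symmetrized sum over labeled triplets (P1, P2, N)."""
    return sum(
        triplet_loss(phi, p1, p2, n, alpha) + triplet_loss(phi, p2, p1, n, alpha)
        for p1, p2, n in triplets
    )

phi = lambda x: x  # identity embedding for illustration
p1, p2, n = np.array([0.0, 0.0]), np.array([0.1, 0.0]), np.array([3.0, 0.0])
# Positive pair is close, negative is far: the hinge is inactive, loss is 0.
print(sirl_similarity_loss(phi, [(p1, p2, n)]))  # → 0.0
```

The symmetrization simply applies the hinge with each member of the positive pair as anchor, so neither trajectory in the pair is privileged.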

2.2. Deep RSA: Nonlinear Representation Learning for fMRI Analysis

DRSL for fMRI analysis replaces the classical linear signature extraction of RSA with subject-specific deep neural networks. These networks $f(\cdot;\theta^{(\ell)})$ are trained so that, together with per-subject regression matrices $B^{(\ell)}$, their outputs predict the design matrices encoding experimental conditions. The main loss couples neural nonlinearity and regression:

$$J_R^{(k,\ell)}(B^{(\ell)},\theta^{(\ell)}) = \sum_{i\in\Psi^{(k,\ell)}} \left\| f(x_{i,\cdot}^{(\ell)};\theta^{(\ell)}) - d_{i,\cdot}^{(\ell)} B^{(\ell)} \right\|_2^2 + r(B^{(\ell)})$$

where $r(B)$ regularizes for sparsity and noise. Unlike fixed kernel or linear approaches, each subject's transform can be fully nonlinear, enabling much higher fidelity to latent neural similarity structure (Yousefnezhad et al., 2020).
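The coupled objective can be sketched as follows, with a one-hidden-layer tanh network standing in for the subject-specific DNN and an L2 penalty standing in for $r(B)$ (the paper's exact architecture and regularizer may differ). With the network fixed, the optimal $B$ has a ridge closed form, which suggests alternating updates between $\theta$ and $B$:

```python
import numpy as np

# Sketch of the coupled regression objective J_R. All dimensions are toy
# values; a tanh layer and an L2 penalty are illustrative stand-ins.

rng = np.random.default_rng(1)
T, V, C, H = 50, 30, 4, 8            # time points, voxels, conditions, latent dim
X = rng.normal(size=(T, V))          # fMRI responses x_{i,.}
Dmat = rng.normal(size=(T, C))       # design-matrix rows d_{i,.}
W1 = rng.normal(size=(V, H)) * 0.1   # network parameters theta
B = rng.normal(size=(C, H)) * 0.1    # regression matrix B
lam = 1e-2

def J_R(W1, B):
    F = np.tanh(X @ W1)              # f(x_{i,.}; theta) for all i
    resid = F - Dmat @ B             # f(x_{i,.}) - d_{i,.} B
    return np.sum(resid ** 2) + lam * np.sum(B ** 2)

# With the network fixed, the ridge-optimal B minimizes the objective in B:
# B* = (D^T D + lam I)^{-1} D^T f(X).
F = np.tanh(X @ W1)
B_opt = np.linalg.solve(Dmat.T @ Dmat + lam * np.eye(C), Dmat.T @ F)
```

Alternating the closed-form update for $B$ with gradient steps on $\theta$ is one natural optimization scheme for objectives of this coupled form.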

2.3. Interpretable Model Comparison: Representational Similarity via Visual Concepts (RSVC)

RSVC seeks to reveal which features are shared versus unique between deep vision models. For each model, interpretable concept vectors are extracted by nonnegative matrix factorization of layer activations over class-conditioned image patch sets. DRSL here appears as the nonlinear maps $\phi$ (the model layers), with similarity measured at the concept-coefficient level via correlation or by regressing one model's concept coefficients from the other's activations:

$$\min_{W^*} \frac{1}{n} \|A_1 W^* - U_2\|_F^2 + \lambda \|W^*\|_1$$

where $A_1$ are the activations from $M_1$ and $U_2$ are the concept coefficients from $M_2$ (Kondapaneni et al., 19 Mar 2025).
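The L1-penalized regression above can be sketched with ISTA (proximal gradient descent); the solver choice, dimensions, and random data here are illustrative, not those of the paper:

```python
import numpy as np

# Sketch of the cross-model concept regression: predict model-2 concept
# coefficients U2 from model-1 activations A1 under an L1 penalty, solved
# by ISTA (gradient step on the smooth part, then soft-thresholding).

rng = np.random.default_rng(2)
n, p, k = 100, 20, 5
A1 = rng.normal(size=(n, p))      # activations from model M1
U2 = rng.normal(size=(n, k))      # concept coefficients from model M2
lam = 0.05

def objective(W):
    return np.sum((A1 @ W - U2) ** 2) / n + lam * np.abs(W).sum()

W = np.zeros((p, k))
obj0 = objective(W)
L = 2.0 * np.linalg.norm(A1, 2) ** 2 / n   # Lipschitz constant of the smooth part
step = 1.0 / L
for _ in range(300):
    grad = 2.0 * A1.T @ (A1 @ W - U2) / n
    W = W - step * grad
    W = np.sign(W) * np.maximum(np.abs(W) - step * lam, 0.0)  # soft-threshold
obj1 = objective(W)
```

The soft-thresholding step is what produces sparse maps, so each predicted concept depends on only a few of $M_1$'s activation dimensions.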

3. Key Architectures, Algorithms, and Loss Functions

| Framework | Input Domain | Supervision | Embedding | Loss Function |
| --- | --- | --- | --- | --- |
| SIRL (Bobu et al., 2023) | Trajectories (robotics) | Human triplet queries | FCN ($d$-dim.) | Margin-based triplet loss ($\mathcal{L}_{sim}$); optional VAE pretraining |
| DRSL-fMRI (Yousefnezhad et al., 2020) | fMRI time series | Experimental design | Subject-specific DNN | Coupled regression with sparse penalty ($J_R^{(k,\ell)}$) |
| RSVC (Kondapaneni et al., 19 Mar 2025) | Image patches | Model activations | Pretrained layerwise DNN | Concept regression + correlation, NMF constraints |

Each paradigm exploits deep embedding networks $\phi$ or $f$ to map high-complexity inputs into low-dimensional, task-aligned, or interpretable spaces. Training optimizes loss functions that directly encode similarity constraints (e.g., triplet margin, regression to human or model concepts), diverging from the image-augmentation-driven positive-pair selection of standard contrastive learning. In SIRL, the triplet loss is driven by user judgments; in DRSL-fMRI, regression error ties latent representations to stimulus–response structure; in RSVC, cross-model regression formalizes similarity between abstracted concept bases.

4. Empirical Evaluation and Results

SIRL (Robotics)

In simulated and real-user studies on the GridRobot and JacoRobot domains, SIRL demonstrates that similarity-label-driven representations yield lower feature prediction error (FPE) and higher test preference accuracy (TPA) than unsupervised or implicit (preference-based) baselines. For example, with ≈1000 similarity queries, SIRL achieves ≈0.02 MSE (FPE), halving the error relative to unsupervised embeddings, and attains ≈90% TPA versus lower scores for alternative approaches. SIRL thus isolates human-relevant, generalizable features that transfer across task variations (Bobu et al., 2023).

fMRI DRSL

On multi-subject, multi-task fMRI datasets, DRSL achieves lower between-class correlation (e.g., max |corr| ≈ 0.37 on R105) and elevated classification accuracy (e.g., ≈91.4% on the memory task) relative to classical and kernel-RSA methods. These gains are robust to ablation of network depth and hyperparameters, provided the network width is modest relative to the temporal sample count (Yousefnezhad et al., 2020).

RSVC (Vision Model Comparison)

RSVC finds that early layers in diverse vision architectures (ResNets, ViTs, MAE, DINO) are highly similar across models, with similarity decaying in deeper layers before partially resurging. Unique, functionally significant concepts are discovered via low CMCS scores; e.g., a "pink-square" concept absent from a control model. Cross-model concept regression enables both functional and interpretive analysis of divergences, and bidirectional regression highlights asymmetry in representational sharing. Computational demands for large-scale concept extraction are noted (Kondapaneni et al., 19 Mar 2025).

5. Significance, Interpretability, and Limitations

DRSL methods provide improved alignment with causal and human-meaningful factors over purely unsupervised or label-driven deep representations. SIRL leverages human-in-the-loop similarity as an efficient route to causal feature discovery; RSVC translates model similarity into interpretable, concept-level explanations; fMRI DRSL accommodates subject-specific, nonlinear embeddings tractable for high-dimensional brain data.

A critical distinction is that, in DRSL, supervision or similarity is not enforced by algorithm designers (data augmentation schemes, fixed kernel choices), but by users, downstream task constraints, or interpretable, cross-model concept bases. This improves generalizability and functional relevance.

Principal limitations include sample complexity for human-driven labeling, computational costs for large-scale concept discovery, and in pipeline-specific cases, domain specialization (e.g., RSVC for vision, fMRI DRSL for neuroimaging). Linear interpretability constraints in RSVC may introduce reconstruction error or partial concept entanglement (Bobu et al., 2023, Kondapaneni et al., 19 Mar 2025, Yousefnezhad et al., 2020).

DRSL extends contrastive and self-supervised paradigms by relocating the origin of similarity: from hand-crafted data augmentations to human perception, experimental design, or inter-model basis alignment. In contrast to CCA, CKA, or scalar RSA scores, DRSL frameworks not only deliver quantitative similarity scores but also facilitate downstream functionality and interpretability.

Potential directions include active query strategies for SIRL to maximize information from similarity judgments, fusion with self-supervised pretraining for efficiency, architectural extensions (e.g., transformers for sequential data), cross-modal generalization (e.g., to text or multi-sensor settings), and large-scale crowdsourced representation learning for robotics. In fMRI, cross-subject alignment and joint modeling of spatial–temporal priors remain open challenges (Bobu et al., 2023, Kondapaneni et al., 19 Mar 2025, Yousefnezhad et al., 2020).
