Interpretable Text Embeddings and Text Similarity Explanation: A Primer (2502.14862v1)

Published 20 Feb 2025 in cs.CL, cs.AI, and cs.IR

Abstract: Text embeddings and text embedding models are a backbone of many AI and NLP systems, particularly those involving search. However, interpretability challenges persist, especially in explaining obtained similarity scores, which is crucial for applications requiring transparency. In this paper, we give a structured overview of interpretability methods specializing in explaining those similarity scores, an emerging research area. We study the methods' individual ideas and techniques, evaluating their potential for improving interpretability of text embeddings and explaining predicted similarities.

Summary

  • The paper provides a structured primer on methods for achieving interpretability in text embeddings and explaining text similarity scores.
  • It categorizes approaches into three families: space shaping methods, which structure the embedding space using techniques like feature decomposition or non-Euclidean geometry; set-based methods; and attribution-based methods.
  • Set-based methods match sets of interpretable items (e.g., tokens), as in ColBERT, while attribution-based methods attribute similarity predictions onto input features using techniques such as Integrated Jacobians.

The paper addresses the interpretability challenges associated with text embeddings and the explanation of similarity scores, which are critical for applications requiring transparency. It provides a structured overview of emerging research on methods designed to explain these similarity scores. The paper categorizes and analyzes these methods, evaluating their potential to enhance the interpretability of text embeddings and explain predicted similarities.

The paper categorizes explainability approaches into three perspectives: space shaping, set-based, and attribution-based.

Space Shaping Approaches

Space shaping approaches structure the embedding space to enhance interpretability, shaping it to express interpretable aspects, geometries, or probability distributions.

  • Feature Decomposition: These methods combine the interpretability of bag-of-words features with the power of neural embeddings. One approach frames embedding generation as answering predefined questions about a text and encoding the answers as features (e.g., answering "Yes"/"No" questions with an LLM; see the first sketch after this list). Another approach decomposes the embedding space into multi-dimensional subspaces, each isolating a specific semantic aspect. For instance, S3BERT (Semantically Structured SBERT) requires users to define metrics that measure interpretable similarity aspects of two texts, leveraging Abstract Meaning Representation graphs and graph matching metrics.
  • Non-Euclidean Geometry: These methods model text relationships using geometries that can capture asymmetric relations. Box embeddings, for example, model texts as boxes and use overlap to represent similarity and containment for tasks like entailment. Probabilistic text embeddings view a text as a random variable, modeling it as a Gaussian $\mathcal{N}_d(\mu, \Sigma)$ whose model uncertainty is estimated via Monte Carlo Dropout and whose data uncertainty is estimated via linguistic perturbations (see the second sketch after this list), where:
    • $\mathcal{N}_d$ is a d-dimensional Gaussian distribution
    • $\mu$ is the mean of the distribution
    • $\Sigma$ is the covariance matrix
  • Combining Token Embeddings: Combination-based approaches build a new embedding space by aggregating token-level representations with explicit weights that reflect their importance.
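
To make the question-based feature decomposition concrete, below is a minimal, hypothetical sketch in which each embedding dimension is the answer to a predefined yes/no question. The question list and the `answer_yes_no` judge are illustrative assumptions, not the paper's implementation; the keyword stub only keeps the sketch runnable and would be replaced by an LLM call in practice.

```python
# Hypothetical sketch of a question-based interpretable embedding:
# each dimension answers one predefined yes/no question about the text.
from typing import List

QUESTIONS: List[str] = [
    "Does the text mention a person?",
    "Is the sentiment positive?",
    "Does the text describe an event?",
]

def answer_yes_no(question: str, text: str) -> float:
    """Stand-in judge (assumption): a keyword stub instead of an LLM call."""
    keywords = {
        "person": ["she", "he", "alice"],
        "positive": ["great", "happy"],
        "event": ["meeting", "concert"],
    }
    key = next(k for k in keywords if k in question.lower())
    return float(any(w in text.lower() for w in keywords[key]))

def qa_embedding(text: str) -> List[float]:
    """Each coordinate is interpretable by construction:
    it is the answer to one known question."""
    return [answer_yes_no(q, text) for q in QUESTIONS]

vec = qa_embedding("Alice was happy after the concert.")
print(dict(zip(QUESTIONS, vec)))  # question -> 0.0/1.0 answer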
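
Likewise, here is a minimal sketch of the Monte Carlo Dropout estimate behind probabilistic text embeddings: sample a stochastic encoder repeatedly and fit a Gaussian $\mathcal{N}_d(\mu, \Sigma)$ to the samples. The toy encoder, its dimensions, and the sample count are illustrative assumptions; the data-uncertainty part (linguistic perturbations) is omitted.

```python
# Sketch: estimating model uncertainty of a text embedding with
# Monte Carlo Dropout, yielding a Gaussian N_d(mu, Sigma).
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Stand-in sentence encoder (assumption); dropout is the noise source."""
    def __init__(self, vocab_size: int = 1000, dim: int = 16):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.drop = nn.Dropout(p=0.1)
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.proj(self.drop(self.emb(token_ids)).mean(dim=0))  # mean-pool

def mc_dropout_gaussian(encoder: nn.Module, token_ids: torch.Tensor,
                        n_samples: int = 100):
    """Fit N_d(mu, Sigma) to embeddings sampled with dropout left active."""
    encoder.train()  # keep dropout stochastic at inference time
    with torch.no_grad():
        samples = torch.stack([encoder(token_ids) for _ in range(n_samples)])
    mu = samples.mean(dim=0)                         # mean embedding
    centered = samples - mu
    sigma = centered.T @ centered / (n_samples - 1)  # d x d covariance
    return mu, sigma

mu, sigma = mc_dropout_gaussian(ToyEncoder(), torch.tensor([3, 17, 42]))
print(mu.shape, sigma.shape)  # torch.Size([16]) torch.Size([16, 16])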

Set-based Approaches

Set-based approaches match two sets rather than two points; the sets typically consist of human-interpretable items (e.g., tokens).

  • Embedding Set Interpretability: Alignment-based methods derive similarity by aligning the token embeddings of one text with those of another, typically using embeddings from the last layer of a model. Techniques like ColBERT and BERTScore compute similarity scores based on greedy max-matching. In ColBERT, given texts $x, y$ and encoders $F, G$, the similarity is $\mathrm{sim}(x, y) = \sum_{t \in x} \max([F(x)^T G(y)]_t)$ (see the sketch after this list), where:
    • $\mathrm{sim}(x, y)$ is the similarity score between texts $x$ and $y$
    • $t$ iterates over the tokens in text $x$
    • $F(x)$ and $G(y)$ are the token embeddings of texts $x$ and $y$, respectively
  • Explicit Multi-Interpretation: This class of methods generates sets of text embeddings by either hypothesizing about a text or decomposing it into smaller parts, matching facts contained in a text at different levels of abstraction.
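
A minimal sketch of this greedy max-matching, with random matrices standing in for the token embeddings $F(x)$ and $G(y)$ (shapes, normalization, and data are illustrative assumptions):

```python
# Sketch of ColBERT-style late interaction: each token of x is matched
# to its most similar token of y, and the row maxima are summed.
import numpy as np

rng = np.random.default_rng(0)
F_x = rng.normal(size=(5, 16))  # 5 tokens of x, 16-dim embeddings (assumed)
G_y = rng.normal(size=(7, 16))  # 7 tokens of y

# Normalize rows so dot products are cosine similarities.
F_x /= np.linalg.norm(F_x, axis=1, keepdims=True)
G_y /= np.linalg.norm(G_y, axis=1, keepdims=True)

S = F_x @ G_y.T                # token-token similarity matrix (5 x 7)
alignment = S.argmax(axis=1)   # best-matching y-token for each x-token
score = S.max(axis=1).sum()    # sim(x, y): sum of row-wise maxima

print(score, alignment)
```

The `alignment` vector is what makes the score interpretable: each token of $x$ is explained by the single token of $y$ it matched.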

Attribution-based Approaches

Attribution-based approaches aim to attribute a model prediction onto input or intermediate feature representations.

  • Integrated Jacobians: This method extends the theory behind integrated gradients to Siamese models, enabling the attribution of similarity predictions onto feature interactions between the two inputs. The output takes the form of a feature-pair attribution matrix.
  • BiLRP: This method extends the Layer-wise Relevance Propagation (LRP) framework to Siamese similarity models by computing LRP values for each embedding dimension of the two encoders separately and then taking their matrix product, yielding a token-pair relevance map (a minimal sketch follows).
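
Below is a minimal sketch of the BiLRP composition step for a toy Siamese model whose text embedding is the sum of its token embeddings. In this linear case each embedding dimension decomposes exactly over tokens, so the per-branch relevances are just the token embeddings themselves; for deep encoders they would come from LRP. Shapes and data are illustrative assumptions.

```python
# Sketch: composing per-branch relevances into a token-pair attribution
# matrix, as in BiLRP's final matrix product.
import numpy as np

rng = np.random.default_rng(1)
E_x = rng.normal(size=(4, 8))  # token embeddings of x (4 tokens, d = 8)
E_y = rng.normal(size=(6, 8))  # token embeddings of y (6 tokens, d = 8)

# R[i, k]: relevance of token i for embedding dimension k. With sum
# pooling this is the token embedding itself; deep encoders need LRP.
R_x, R_y = E_x, E_y

A = R_x @ R_y.T                                   # token-pair attributions (4 x 6)
score = float(E_x.sum(axis=0) @ E_y.sum(axis=0))  # dot-product similarity

print(np.allclose(A.sum(), score))  # True: attributions sum to the score
```

Entry $A_{ij}$ says how much the token pair $(i, j)$ contributes to the similarity, and the map is conservative: its entries sum to the predicted score.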

The paper also discusses challenges, including mitigating tradeoffs between interpretability, computational cost, fidelity to input tokens, and dependence on specific models. It addresses the question of what constitutes the "right" explanation, noting that no single method can be guaranteed to be faithful and suggesting the use of multiple methods as independent pieces of evidence. Finally, the paper presents datasets such as iSTS and C-STS that elicit explanations and discusses studies of similarity interpretability.
