
LDIR: Low-Dimensional Dense and Interpretable Text Embeddings with Relative Representations (2505.10354v2)

Published 15 May 2025 in cs.CL

Abstract: Semantic text representation is a fundamental task in the field of natural language processing. Existing text embeddings (e.g., SimCSE and LLM2Vec) have demonstrated excellent performance, but the values of each dimension are difficult to trace and interpret. Bag-of-words, as a classic sparse interpretable embedding, suffers from poor performance. Recently, Benara et al. (2024) proposed interpretable text embeddings using LLMs, which form "0/1" embeddings based on responses to a series of questions. These interpretable text embeddings are typically high-dimensional (larger than 10,000). In this work, we propose Low-dimensional (lower than 500) Dense and Interpretable text embeddings with Relative representations (LDIR). The numerical values of its dimensions indicate semantic relatedness to different anchor texts selected through farthest point sampling, offering both semantic representation and a certain level of traceability and interpretability. We validate LDIR on multiple semantic textual similarity, retrieval, and clustering tasks. Extensive experimental results show that LDIR performs close to the black-box baseline models and outperforms the interpretable embedding baselines with much fewer dimensions. Code is available at https://github.com/szu-tera/LDIR.



Summary

Low-Dimensional Dense and Interpretable Text Embeddings with Relative Representations (LDIR)

Wang et al. introduce LDIR, a text embedding approach that addresses the tension between interpretability and dimensionality in semantic text representation. Existing models such as SimCSE and LLM2Vec offer strong performance, but their dense, high-dimensional embeddings are difficult to interpret. Interpretable embeddings derived from LLMs, such as the 0/1 embeddings proposed by Benara et al., achieve transparency but require over 10,000 dimensions, which impedes computational efficiency.

LDIR seeks to deliver interpretable embeddings that are both low-dimensional (below 500) and dense. The authors achieve this with relative representations: the value of each dimension is defined as the semantic relatedness of the input text to a selected anchor text. Anchor texts are chosen via farthest point sampling, which ensures a diverse selection and broad semantic coverage with a small number of dimensions.
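The anchor-selection step described above can be sketched with a standard greedy farthest point sampling routine. This is a minimal illustration, not the authors' implementation: it assumes candidate anchor texts have already been encoded into vectors and uses Euclidean distance in that embedding space.

```python
import numpy as np

def farthest_point_sampling(embeddings, k, seed=0):
    """Greedily pick k mutually distant anchor embeddings.

    embeddings: (n, d) array of candidate anchor-text vectors.
    Returns the indices of the k selected anchors.
    """
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    selected = [int(rng.integers(n))]  # start from a random candidate
    # distance of every candidate to its nearest selected anchor
    dists = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dists))  # farthest point from current anchor set
        selected.append(nxt)
        new_d = np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        dists = np.minimum(dists, new_d)  # refresh nearest-anchor distances
    return selected
```

Each iteration adds the candidate farthest from all anchors chosen so far, which is what drives the diversity of the anchor set.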

The paper's methodology employs existing encoders, such as SBERT and AnglE, and introduces anchor text sampling to capture semantic relatedness. Unlike previous approaches that demand extensive training or hand-crafted questions, LDIR streamlines the embedding process by mapping text representations directly to similarities with anchor texts, maintaining interpretability alongside competitive performance.
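Given a fixed anchor set, the mapping from an encoder vector to an LDIR-style embedding can be sketched as below. This is a simplified illustration assuming cosine similarity as the relatedness measure; the function name and interface are hypothetical.

```python
import numpy as np

def relative_embed(text_vec, anchor_vecs):
    """Map an encoder vector to a low-dimensional relative representation.

    Each output dimension is the cosine similarity between the text and one
    anchor text, so every value can be traced back to a concrete anchor.
    text_vec: (d,) encoder output for the input text.
    anchor_vecs: (k, d) encoder outputs for the k anchor texts.
    Returns a (k,) dense embedding with k << d.
    """
    t = text_vec / np.linalg.norm(text_vec)
    a = anchor_vecs / np.linalg.norm(anchor_vecs, axis=1, keepdims=True)
    return a @ t  # one similarity score per anchor
```

Because each dimension is tied to a specific anchor text, inspecting the largest values of the output vector indicates which anchors the input is most related to, which is the source of the traceability the paper claims.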

The authors validate LDIR across various semantic textual similarity, retrieval, and clustering tasks. Despite its reduced dimensionality, LDIR consistently outperforms other interpretable models and closely rivals black-box embeddings on these tasks. It thus offers an efficient alternative for scenarios that demand interpretability without sacrificing the depth of semantic representation, with clear relevance to explainable AI applications.

Future work could refine the anchor text selection process, for example by optimizing anchors for task-specific contexts. Developing metrics for assessing the interpretability of dense embeddings would also help advance transparent yet powerful text representations.

Overall, LDIR presents a promising avenue for applications necessitating interpretable embeddings, balancing computational demands with the growing necessity for explainability in AI systems.
