
LDIR: Low-Dimensional Dense and Interpretable Text Embeddings with Relative Representations (2505.10354v2)

Published 15 May 2025 in cs.CL

Abstract: Semantic text representation is a fundamental task in the field of natural language processing. Existing text embeddings (e.g., SimCSE and LLM2Vec) have demonstrated excellent performance, but the values of each dimension are difficult to trace and interpret. Bag-of-words, as a classic sparse interpretable embedding, suffers from poor performance. Recently, Benara et al. (2024) proposed interpretable text embeddings using LLMs, which form "0/1" embeddings based on responses to a series of questions. These interpretable text embeddings are typically high-dimensional (larger than 10,000). In this work, we propose Low-dimensional (lower than 500) Dense and Interpretable text embeddings with Relative representations (LDIR). The numerical values of its dimensions indicate semantic relatedness to different anchor texts selected through farthest point sampling, offering both semantic representation and a certain level of traceability and interpretability. We validate LDIR on multiple semantic textual similarity, retrieval, and clustering tasks. Extensive experimental results show that LDIR performs close to the black-box baseline models and outperforms the interpretable embedding baselines with much fewer dimensions. Code is available at https://github.com/szu-tera/LDIR.



Summary

Low-Dimensional Dense and Interpretable Text Embeddings with Relative Representations (LDIR)

Wang et al. introduce LDIR, a text embedding approach that addresses the tension between interpretability and dimensionality in semantic text representation. Existing models such as SimCSE and LLM2Vec offer strong performance, but their dense, high-dimensional embeddings are difficult to interpret. Interpretable embeddings derived from LLMs, such as the 0/1 embeddings proposed by Benara et al., achieve transparency but require over 10,000 dimensions, which impedes computational efficiency.

LDIR seeks to deliver interpretable embeddings that are both low-dimensional (below 500) and dense. The authors achieve this with relative representations: the value of each dimension is defined as the semantic relatedness of the input text to a selected anchor text. Anchor texts are chosen via farthest point sampling, which ensures a diverse selection and broad semantic coverage with a small number of dimensions.
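The anchor-selection step described above can be sketched with a standard greedy farthest point sampling routine. This is a minimal illustration, not the authors' implementation: it assumes candidate anchor texts have already been encoded into vectors and uses Euclidean distance in that embedding space.

```python
import numpy as np

def farthest_point_sampling(embeddings, k, seed=0):
    """Greedily pick k mutually distant anchor embeddings.

    embeddings: (n, d) array of candidate anchor-text vectors.
    Returns the indices of the k selected anchors.
    """
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    selected = [int(rng.integers(n))]  # start from a random candidate
    # distance of every candidate to its nearest selected anchor
    dists = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dists))  # farthest point from current anchor set
        selected.append(nxt)
        new_d = np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        dists = np.minimum(dists, new_d)  # refresh nearest-anchor distances
    return selected
```

Each iteration adds the candidate farthest from all anchors chosen so far, which is what drives the diversity of the anchor set.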

The paper's methodology employs existing encoders, such as SBERT and AnglE, and introduces anchor text sampling to capture semantic relatedness. Unlike previous approaches that demand extensive training or hand-crafted questions, LDIR streamlines the embedding process by mapping text representations directly to similarities with anchor texts, maintaining interpretability alongside competitive performance.
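Given a fixed anchor set, the mapping from an encoder vector to an LDIR-style embedding can be sketched as below. This is a simplified illustration assuming cosine similarity as the relatedness measure; the function name and interface are hypothetical.

```python
import numpy as np

def relative_embed(text_vec, anchor_vecs):
    """Map an encoder vector to a low-dimensional relative representation.

    Each output dimension is the cosine similarity between the text and one
    anchor text, so every value can be traced back to a concrete anchor.
    text_vec: (d,) encoder output for the input text.
    anchor_vecs: (k, d) encoder outputs for the k anchor texts.
    Returns a (k,) dense embedding with k << d.
    """
    t = text_vec / np.linalg.norm(text_vec)
    a = anchor_vecs / np.linalg.norm(anchor_vecs, axis=1, keepdims=True)
    return a @ t  # one similarity score per anchor
```

Because each dimension is tied to a specific anchor text, inspecting the largest values of the output vector indicates which anchors the input is most related to, which is the source of the traceability the paper claims.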

The authors validate LDIR across various semantic textual similarity, retrieval, and clustering tasks. Despite its reduced dimensionality, LDIR consistently outperforms other interpretable models and closely rivals black-box embeddings on these tasks. It thus offers an efficient alternative for scenarios that demand interpretability without sacrificing the depth of semantic representation, with clear relevance to explainable AI applications.

Future work could refine the anchor text selection process, for example by optimizing anchors for task-specific contexts. Developing metrics for assessing the interpretability of dense embeddings would also help advance transparent yet powerful text representations.

Overall, LDIR presents a promising avenue for applications necessitating interpretable embeddings, balancing computational demands with the growing necessity for explainability in AI systems.
