- The paper demonstrates that generalist text embedding models can outperform traditional fine-tuned systems in zero-shot recommendation and search tasks.
- The study employs PCA to reveal that uniform embedding space utilization is key to improved performance.
- Empirical evaluations on Amazon Reviews 2023, ESCI, and Amazon-C4 show that even smaller GTEs can rival much larger models, with decoder-based architectures performing best overall.
Evaluating the Necessity of Specialization: Generalist Text Embeddings for Zero-Shot Recommendation and Search
This paper systematically investigates the efficacy of Generalist Text Embedding Models (GTEs) in zero-shot settings for sequential recommendation and product search, challenging the prevailing assumption that task- or domain-specific fine-tuning is essential for optimal performance. The authors conduct comprehensive empirical evaluations, analyze embedding space properties, and provide insights into the representational characteristics that underpin the observed performance of GTEs.
Summary of Contributions
The central claim is that GTEs, trained on large-scale, diverse corpora without task-specific adaptation, can outperform both traditional ID-based models and specialized, fine-tuned text encoders in recommendation and search tasks. The paper further explores the geometric properties of embedding spaces, demonstrating that GTEs utilize embedding dimensions more uniformly, which correlates with improved downstream performance and scalability.
Experimental Design and Results
The evaluation spans two core tasks:
- Sequential Recommendation (SR): Predicting the next item in a user's interaction sequence, leveraging item metadata for embedding.
- Product Search (PS): Retrieving relevant items from a catalog in response to natural language queries, using dense vector representations for both queries and items (a minimal zero-shot sketch of both tasks follows this list).
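To make the zero-shot setup concrete, the sketch below shows one common way a generalist embedding model can be applied to both tasks without any fine-tuning: items are embedded from their textual metadata, a user is represented by a simple aggregate of their interaction history, and queries are embedded directly into the same space. This is an illustrative sketch, not the paper's exact pipeline; the sentence-transformers library and the "all-MiniLM-L6-v2" checkpoint are stand-in assumptions, since the GTEs evaluated in the paper (e.g., GTE-Qwen2, NVEmbed-v2) expose a similar encode interface.

```python
# Minimal sketch of zero-shot recommendation and search with a text embedding model.
# Assumptions: sentence-transformers is installed; "all-MiniLM-L6-v2" is used only
# as a lightweight stand-in for the GTEs discussed in the paper.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Item catalog represented by textual metadata (e.g., title plus attributes).
catalog = [
    "Wireless noise-cancelling headphones, 30h battery",
    "Stainless steel French press, 1L",
    "USB-C fast charger, 65W",
]
item_emb = model.encode(catalog, normalize_embeddings=True)  # shape: (n_items, d)

# --- Sequential recommendation (zero-shot) ---
# One simple strategy: represent the user as the mean of the embeddings of the
# items in their interaction history, then rank the catalog by cosine similarity.
history = ["Bluetooth earbuds with charging case", "Portable power bank 20000mAh"]
user_emb = model.encode(history, normalize_embeddings=True).mean(axis=0)
user_emb /= np.linalg.norm(user_emb)
sr_scores = item_emb @ user_emb
print("Next-item ranking:", np.argsort(-sr_scores))

# --- Product search (zero-shot) ---
# Embed the natural-language query into the same space and rank items directly.
query_emb = model.encode(["charger for a laptop"], normalize_embeddings=True)[0]
ps_scores = item_emb @ query_emb
print("Search ranking:", np.argsort(-ps_scores))
```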
The experiments utilize the Amazon Reviews 2023 dataset for SR and both ESCI and Amazon-C4 for PS, ensuring comparability with prior work. The models compared include classical baselines (GRU4Rec, SASRec), fine-tuned models (BLAIR), closed-source solutions (OpenAI's text-embedding-3-large), and a suite of recent GTEs (NVEmbed-v2, GTE-Qwen2, KALM, Jasper, mGTE, INSTRUCTOR, Sentence-T5).
Key empirical findings:
- GTEs consistently outperform both fine-tuned and closed-source models across all evaluated domains and tasks. For instance, NVEmbed-v2 and GTE-Qwen2 achieve statistically significant improvements over BLAIR and OpenAI's text-embedding-3-large in both SR and PS.
- Text-based models surpass ID-based models in SR, underscoring the value of semantic item information.
- Model capacity and embedding dimensionality do not correlate straightforwardly with performance. Smaller GTEs (e.g., KALM) can outperform much larger models (e.g., INSTRUCTOR-XL, Sentence-T5-XXL) under certain conditions.
- Decoder-based GTEs (e.g., Jasper, GTE-Qwen2, NVEmbed-v2) outperform encoder-based models, suggesting that generative or retrieval-oriented pretraining objectives yield more effective representations for retrieval and recommendation.
Embedding Space Analysis
A significant portion of the analysis is devoted to understanding why GTEs excel in these tasks. The authors employ PCA to assess the effective dimensionality and variance distribution of embedding spaces:
- GTEs exhibit more uniform space utilization, with variance distributed across a larger number of dimensions, reducing the risk of dimensional collapse and anisotropy.
- Fine-tuned models (e.g., BLAIR) display high anisotropy, with much of the variance concentrated in a small subset of dimensions, which can degrade performance in distance-based retrieval tasks.
- PCA-based compression is effective: retaining only the most informative principal components reduces embedding dimensionality without sacrificing downstream performance, and in some cases even improves it. This is particularly beneficial for scaling large GTEs and for denoising fine-tuned models (see the sketch after this list).
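A minimal sketch of this kind of analysis is shown below, assuming numpy and scikit-learn. The entropy-based effective-dimensionality measure is one common way to quantify how uniformly variance is spread across components and may differ from the paper's exact metric; the random matrix is only a placeholder so the snippet runs on its own.

```python
# Minimal sketch of PCA-based embedding-space analysis and compression.
# `embeddings` stands in for any (n_items, d) matrix produced by an embedding model.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(2000, 768))  # placeholder for real item embeddings

# Fit PCA and inspect how evenly the variance is spread across components.
pca = PCA().fit(embeddings)
ratios = pca.explained_variance_ratio_

# Entropy-based effective dimensionality: higher values mean variance is spread
# over more directions (more uniform utilization, less anisotropy).
entropy = -np.sum(ratios * np.log(ratios + 1e-12))
effective_dim = np.exp(entropy)
print(f"Effective dimensionality: {effective_dim:.1f} of {embeddings.shape[1]}")

# Compression: keep only the top-k principal components. For GTEs, the paper
# reports this can shrink embeddings with little or no loss in retrieval quality.
k = 256
compressed = PCA(n_components=k).fit_transform(embeddings)
print("Compressed shape:", compressed.shape)
```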
Implications and Future Directions
The findings have several practical and theoretical implications:
- Deployment Efficiency: GTEs obviate the need for costly and time-consuming task-specific fine-tuning, enabling rapid deployment in new domains and tasks with strong zero-shot performance.
- Scalability: The ability to compress embeddings via PCA without loss of accuracy facilitates the use of high-capacity GTEs in large-scale retrieval systems, reducing storage and computational requirements.
- Model Selection: The lack of a direct relationship between model size and performance suggests that practitioners should prioritize architectural and training design over brute-force scaling.
- Research Directions: The results motivate further exploration of embedding isotropy, disentanglement, and the development of architectures that inherently promote uniform space utilization. Additionally, the superior performance of decoder-based GTEs warrants deeper investigation into the impact of pretraining objectives and model architectures on embedding quality.
Conclusion
This work provides robust evidence that generalist, large-scale text embedding models can serve as strong, training-free baselines for recommendation and search, often surpassing specialized, fine-tuned alternatives. The analysis of embedding space geometry offers a compelling explanation for these results and points to practical strategies for model compression and deployment. Future research should focus on further improving the intrinsic properties of embeddings and understanding the interplay between model architecture, training objectives, and downstream task performance.