Gecko: Versatile Text Embeddings Distilled from Large Language Models (2403.20327v1)

Published 29 Mar 2024 in cs.CL and cs.AI

Abstract: We present Gecko, a compact and versatile text embedding model. Gecko achieves strong retrieval performance by leveraging a key idea: distilling knowledge from LLMs into a retriever. Our two-step distillation process begins with generating diverse, synthetic paired data using an LLM. Next, we further refine the data quality by retrieving a set of candidate passages for each query, and relabeling the positive and hard negative passages using the same LLM. The effectiveness of our approach is demonstrated by the compactness of Gecko. On the Massive Text Embedding Benchmark (MTEB), Gecko with 256 embedding dimensions outperforms all existing entries with 768 embedding size. Gecko with 768 embedding dimensions achieves an average score of 66.31, competing with 7x larger models and 5x higher dimensional embeddings.


Summary

  • The paper introduces Gecko, a compact text embedding model built by distilling knowledge from large language models through a two-step synthetic data generation and refinement process.
  • The methodology uses an LLM to generate synthetic query-passage pairs and then to rank retrieved candidates, improving the selection of positive and hard negative passages.
  • On the MTEB benchmark, the 256-dimensional Gecko outperforms all prior 768-dimensional entries, and the 768-dimensional variant averages 66.31, demonstrating efficiency and strong performance across NLP tasks.

Gecko: Versatile Text Embeddings Distilled from LLMs

Introduction to Gecko

Jinhyuk Lee et al. present Gecko, a compact yet versatile text embedding model whose strength comes from knowledge distilled out of LLMs. A single Gecko model handles a broad spectrum of tasks, including document retrieval, sentence similarity, classification, and clustering. The key mechanism is a two-step process: an LLM first generates diverse synthetic query-passage pairs, and the same LLM then refines the data by re-identifying and relabeling the positive and hard negative passages for each query, improving data quality.
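Multi-task embedding models of this kind are typically steered by a short task instruction prepended to the raw text, so that one encoder can serve retrieval, similarity, classification, and clustering. The sketch below illustrates that pattern; the prompt format and the `encode` callable are assumptions for illustration, not Gecko's documented interface.

```python
from typing import Callable, List

def format_input(task: str, text: str) -> str:
    """Prepend a short task instruction so one encoder can serve many use cases."""
    return f"task: {task} | text: {text}"

def embed_for_task(encode: Callable[[str], List[float]],
                   task: str,
                   texts: List[str]) -> List[List[float]]:
    """Encode a batch of texts for a given task with any sentence encoder."""
    return [encode(format_input(task, t)) for t in texts]

# Example usage with a hypothetical `encode` callable that returns a vector:
# doc_vecs = embed_for_task(encode, "search result", corpus_passages)
# query_vecs = embed_for_task(encode, "search result", ["what is distillation?"])
```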

Distilling Knowledge from LLMs

Gecko distills the knowledge embedded in LLMs into a text embedding model through a two-step procedure. First, diverse synthetic query-passage pairs are generated with few-shot LLM prompts. Second, data quality is refined: for each generated query, a set of candidate passages is retrieved with an initial embedder, and the LLM re-ranks them, often identifying a more relevant positive passage than the one the query was originally generated from, along with informative hard negatives. This relabeling step sets the pipeline apart from the common practice of simply treating the seed passage as the positive target for its generated query.
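To make the two-step recipe concrete, the sketch below walks through one iteration of the loop. The helpers passed in (`generate_query`, `score_relevance`, `embed`) are hypothetical stand-ins for the LLM few-shot prompt, the LLM relevance scorer, and the initial retriever; the paper's actual ranking and hard-negative selection are more elaborate, so treat this as a minimal illustration rather than the released pipeline.

```python
from typing import Callable, List, Tuple

def build_training_pair(
    seed_passage: str,
    corpus: List[str],
    generate_query: Callable[[str], str],          # LLM: passage -> synthetic query
    score_relevance: Callable[[str, str], float],  # LLM: (query, passage) -> relevance score
    embed: Callable[[str], List[float]],           # initial embedder used for retrieval
    top_k: int = 20,
) -> Tuple[str, str, str]:
    """Return (query, positive, hard_negative) for one seed passage."""
    # Step 1: generate a synthetic query from the seed passage.
    query = generate_query(seed_passage)

    # Step 2a: retrieve candidate passages for the query with the initial embedder.
    q_vec = embed(query)
    def sim(passage: str) -> float:
        p_vec = embed(passage)
        return sum(a * b for a, b in zip(q_vec, p_vec))
    candidates = sorted(corpus, key=sim, reverse=True)[:top_k]

    # Step 2b: re-rank candidates with the LLM. The best-scored passage becomes
    # the positive (it may differ from the seed passage); a low-ranked candidate
    # serves as a hard negative (one simple choice among several possible).
    ranked = sorted(candidates, key=lambda p: score_relevance(query, p), reverse=True)
    positive, hard_negative = ranked[0], ranked[-1]
    return query, positive, hard_negative
```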

Unveiling Gecko's Efficacy

Gecko's compactness does not compromise its effectiveness. On the Massive Text Embedding Benchmark (MTEB), Gecko with 256 embedding dimensions surpasses all existing entries that use 768 dimensions. The 768-dimensional variant reaches an average MTEB score of 66.31, competing with models roughly 7x larger and with embeddings 5x higher-dimensional. These results reflect both Gecko's efficient design and the value of LLM-distilled training data for text embeddings.
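The strong results at 256 dimensions suggest that shorter vectors can be derived from a larger embedding, for example by truncation followed by re-normalization when the model is trained so that its leading dimensions carry most of the signal (as in Matryoshka-style training). The snippet below is an illustrative sketch under that assumption, not Gecko's released API.

```python
import numpy as np

def truncate_and_normalize(vectors: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` dimensions and L2-normalize for cosine similarity."""
    truncated = vectors[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

def cosine_scores(query_vecs: np.ndarray, doc_vecs: np.ndarray, dim: int) -> np.ndarray:
    """Score queries against documents at a chosen embedding width."""
    q = truncate_and_normalize(query_vecs, dim)
    d = truncate_and_normalize(doc_vecs, dim)
    return q @ d.T  # (num_queries, num_docs) similarity matrix

# Example: score 768-dimensional embeddings at both full and reduced width.
rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 768))   # placeholder query embeddings
docs = rng.normal(size=(10, 768))     # placeholder document embeddings
full_scores = cosine_scores(queries, docs, dim=768)
compact_scores = cosine_scores(queries, docs, dim=256)
```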

Implications and Future Directions

The introduction of Gecko marks a significant step forward for text embeddings built with the help of LLMs. By demonstrating that knowledge from LLMs can be effectively distilled into compact embedding models, the work opens new avenues for efficient, general-purpose text embeddings. Future research could further optimize the synthetic data generation process, refine the distillation and relabeling methods, and extend the approach to more languages and tasks. The model's compactness combined with its performance also makes high-quality text embeddings more practical in resource-constrained environments, broadening the accessibility of advanced NLP technology.

Conclusion

In summary, Gecko takes a novel approach to text embedding by harnessing LLM-generated and LLM-relabeled synthetic data, yielding a model that is both compact and versatile. Improving data quality through LLM-based identification of relevant positive and hard negative passages is a promising direction for future work on text embeddings. The results show strong performance across a wide range of NLP tasks and underscore the broader potential of LLMs for building efficient, general-purpose embedding models.
