We present Gecko, a compact and versatile text embedding model. Gecko achieves strong retrieval performance by leveraging a key idea: distilling knowledge from LLMs into a retriever. Our two-step distillation process begins with generating diverse, synthetic paired data using an LLM. Next, we further refine the data quality by retrieving a set of candidate passages for each query, and relabeling the positive and hard negative passages using the same LLM. The effectiveness of our approach is demonstrated by the compactness of Gecko. On the Massive Text Embedding Benchmark (MTEB), Gecko with 256 embedding dimensions outperforms all existing entries with 768 embedding dimensions. Gecko with 768 embedding dimensions achieves an average score of 66.31, competing with 7x larger models and 5x higher dimensional embeddings.
Gecko is a new text embedding model that uses knowledge from LLMs to perform a variety of tasks like document retrieval and sentence similarity.
It uses a two-step process: generating synthetic query-passage pairs with an LLM, then refining data quality by retrieving candidate passages and relabeling the most relevant positives and hard negatives.
Gecko outperforms larger models in the Massive Text Embedding Benchmark (MTEB), even with smaller embedding sizes, highlighting its efficiency and effectiveness.
The research suggests future exploration in optimizing data generation, refining distillation methods, and extending this approach to more languages and tasks.
The research introduced by Jinhyuk Lee et al. presents Gecko, a compact yet versatile text embedding model, notable for how it leverages knowledge from LLMs. By distilling that knowledge through a novel two-step process, Gecko performs well across a broad spectrum of tasks, including document retrieval, sentence similarity, classification, and clustering. The pivotal mechanism behind Gecko's efficacy is the generation and subsequent refinement of synthetic paired data using LLMs, with an emphasis on rediscovering and relabeling positive and hard negative passages to improve data quality.
Gecko distills the vast knowledge embedded within LLMs into a text embedding model through a two-step procedure. First, diverse synthetic query-passage pairs are generated from few-shot LLM prompts. Second, data quality is refined: for each generated query, candidate passages are retrieved from the corpus, and the LLM ranks these candidates, often identifying more relevant positive and hard negative passages than the originally generated pairs. This approach not only enhances the quality of the synthetic data but also redefines standard practice for identifying positive targets for generated queries.
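The two-step procedure above can be sketched in Python. This is a minimal, illustrative sketch, not Gecko's actual implementation: `llm_generate_query`, `llm_score`, and the `embed` retrieval scorer are hypothetical stand-ins for real LLM and embedder calls, stubbed here with simple word overlap so the sketch runs end to end.

```python
# Hedged sketch of the two-step distillation: (1) generate a synthetic
# query with an LLM, (2) retrieve candidates and relabel the positive
# and hard negative using LLM-based ranking. All LLM calls are stubs.

def llm_generate_query(passage: str) -> str:
    # Stand-in for an LLM prompted with few-shot examples to write a
    # query that the given passage could answer.
    return "query about " + passage.split()[0].lower()

def llm_score(query: str, passage: str) -> float:
    # Stand-in for an LLM relevance judgment (e.g. a graded score or
    # query likelihood); here, crude word overlap.
    return sum(w in passage.lower() for w in query.lower().split())

def make_training_example(seed_passage, corpus, embed, top_k=5):
    """Step 1: generate a synthetic query from a seed passage.
    Step 2: retrieve candidates and relabel positive / hard negative."""
    query = llm_generate_query(seed_passage)
    # Retrieve the nearest candidates with an embedder-style scorer
    # (any callable mapping (query, passage) -> similarity).
    candidates = sorted(corpus, key=lambda p: embed(query, p),
                        reverse=True)[:top_k]
    # Relabel with the LLM: the top-ranked candidate becomes the
    # positive, and a low-ranked retrieved candidate becomes the
    # hard negative.
    ranked = sorted(candidates, key=lambda p: llm_score(query, p),
                    reverse=True)
    positive = ranked[0]
    hard_negative = ranked[-1]
    return query, positive, hard_negative
```

Because the LLM reranks the retrieved candidates, the relabeled positive may differ from the seed passage the query was generated from, which is the data-quality refinement described above.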
The compactness of the Gecko model does not compromise its effectiveness. Impressively, it outperforms existing models with significantly larger embedding sizes on the Massive Text Embedding Benchmark (MTEB). Specifically, Gecko with 256 embedding dimensions sets a new standard by surpassing all entries with 768 embedding dimensions. Further extending its capabilities, the 768-dimensional variant of Gecko competes closely with models that are substantially larger in size and higher in embedding dimensions, achieving an average score of 66.31 on MTEB. This performance is a testament to Gecko's efficient design and the innovative use of LLMs in improving text embeddings.
The introduction of Gecko marks a significant step forward in the field of text embeddings and LLM utilization. By demonstrating that knowledge from LLMs can be effectively distilled into compact embedding models, this research opens new avenues for creating efficient, general-purpose text embeddings. The success of Gecko suggests future research could explore further optimization of synthetic data generation processes, refinement methods for distillation, and the potential extensibility of this approach to other languages and tasks. Additionally, the model's compactness combined with its performance highlights the potential for deploying high-quality text embeddings in resource-constrained environments, broadening the accessibility and applicability of advanced NLP technologies.
In summary, Gecko employs a novel approach to text embedding by harnessing the power of LLM-derived synthetic data, resulting in a model that is both compact and versatile. The method of enhancing data quality through the LLM-based identification of relevant passages presents a promising direction for future research in text embeddings and the utilization of LLMs. This research not only demonstrates Gecko's superior performance in a wide range of NLP tasks but also underscores the potential of LLMs in revolutionizing the development of efficient and general-purpose embedding models.