Gecko: Versatile Text Embeddings Distilled from Large Language Models (2403.20327v1)

Published 29 Mar 2024 in cs.CL and cs.AI

Abstract: We present Gecko, a compact and versatile text embedding model. Gecko achieves strong retrieval performance by leveraging a key idea: distilling knowledge from LLMs into a retriever. Our two-step distillation process begins with generating diverse, synthetic paired data using an LLM. Next, we further refine the data quality by retrieving a set of candidate passages for each query, and relabeling the positive and hard negative passages using the same LLM. The effectiveness of our approach is demonstrated by the compactness of Gecko. On the Massive Text Embedding Benchmark (MTEB), Gecko with 256 embedding dimensions outperforms all existing entries with 768 embedding size. Gecko with 768 embedding dimensions achieves an average score of 66.31, competing with 7x larger models and 5x higher dimensional embeddings.


Summary

  • The paper introduces Gecko, a compact text embedding model built by distilling knowledge from large language models through a two-step synthetic data generation and refinement process.
  • The methodology uses an LLM to generate synthetic query-passage pairs and then to rank retrieved candidates, improving the selection of positive and hard negative passages.
  • On the MTEB benchmark, the 256-dimensional Gecko outperforms all prior 768-dimensional entries, and the 768-dimensional variant averages 66.31, demonstrating efficiency and strong performance across NLP tasks.

Gecko: Versatile Text Embeddings Distilled from LLMs

Introduction to Gecko

Jinhyuk Lee et al. present Gecko, a compact yet versatile text embedding model whose strength comes from knowledge distilled out of LLMs. A single Gecko model handles a broad spectrum of tasks, including document retrieval, sentence similarity, classification, and clustering. The key mechanism is a two-step process: an LLM first generates diverse synthetic query-passage pairs, and the same LLM then refines the data by re-identifying and relabeling the positive and hard negative passages for each query, improving data quality.
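Multi-task embedding models of this kind are typically steered by a short task instruction prepended to the raw text, so that one encoder can serve retrieval, similarity, classification, and clustering. The sketch below illustrates that pattern; the prompt format and the `encode` callable are assumptions for illustration, not Gecko's documented interface.

```python
from typing import Callable, List

def format_input(task: str, text: str) -> str:
    """Prepend a short task instruction so one encoder can serve many use cases."""
    return f"task: {task} | text: {text}"

def embed_for_task(encode: Callable[[str], List[float]],
                   task: str,
                   texts: List[str]) -> List[List[float]]:
    """Encode a batch of texts for a given task with any sentence encoder."""
    return [encode(format_input(task, t)) for t in texts]

# Example usage with a hypothetical `encode` callable that returns a vector:
# doc_vecs = embed_for_task(encode, "search result", corpus_passages)
# query_vecs = embed_for_task(encode, "search result", ["what is distillation?"])
```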

Distilling Knowledge from LLMs

Gecko distills the knowledge embedded in LLMs into a text embedding model through a two-step procedure. First, diverse synthetic query-passage pairs are generated with few-shot LLM prompts. Second, data quality is refined: for each generated query, a set of candidate passages is retrieved with an initial embedder, and the LLM re-ranks them, often identifying a more relevant positive passage than the one the query was originally generated from, along with informative hard negatives. This relabeling step sets the pipeline apart from the common practice of simply treating the seed passage as the positive target for its generated query.
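To make the two-step recipe concrete, the sketch below walks through one iteration of the loop. The helpers passed in (`generate_query`, `score_relevance`, `embed`) are hypothetical stand-ins for the LLM few-shot prompt, the LLM relevance scorer, and the initial retriever; the paper's actual ranking and hard-negative selection are more elaborate, so treat this as a minimal illustration rather than the released pipeline.

```python
from typing import Callable, List, Tuple

def build_training_pair(
    seed_passage: str,
    corpus: List[str],
    generate_query: Callable[[str], str],          # LLM: passage -> synthetic query
    score_relevance: Callable[[str, str], float],  # LLM: (query, passage) -> relevance score
    embed: Callable[[str], List[float]],           # initial embedder used for retrieval
    top_k: int = 20,
) -> Tuple[str, str, str]:
    """Return (query, positive, hard_negative) for one seed passage."""
    # Step 1: generate a synthetic query from the seed passage.
    query = generate_query(seed_passage)

    # Step 2a: retrieve candidate passages for the query with the initial embedder.
    q_vec = embed(query)
    def sim(passage: str) -> float:
        p_vec = embed(passage)
        return sum(a * b for a, b in zip(q_vec, p_vec))
    candidates = sorted(corpus, key=sim, reverse=True)[:top_k]

    # Step 2b: re-rank candidates with the LLM. The best-scored passage becomes
    # the positive (it may differ from the seed passage); a low-ranked candidate
    # serves as a hard negative (one simple choice among several possible).
    ranked = sorted(candidates, key=lambda p: score_relevance(query, p), reverse=True)
    positive, hard_negative = ranked[0], ranked[-1]
    return query, positive, hard_negative
```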

Unveiling Gecko's Efficacy

Gecko's compactness does not compromise its effectiveness. On the Massive Text Embedding Benchmark (MTEB), Gecko with 256 embedding dimensions surpasses all existing entries that use 768 dimensions. The 768-dimensional variant reaches an average MTEB score of 66.31, competing with models roughly 7x larger and with embeddings 5x higher-dimensional. These results reflect both Gecko's efficient design and the value of LLM-distilled training data for text embeddings.
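The strong results at 256 dimensions suggest that shorter vectors can be derived from a larger embedding, for example by truncation followed by re-normalization when the model is trained so that its leading dimensions carry most of the signal (as in Matryoshka-style training). The snippet below is an illustrative sketch under that assumption, not Gecko's released API.

```python
import numpy as np

def truncate_and_normalize(vectors: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` dimensions and L2-normalize for cosine similarity."""
    truncated = vectors[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

def cosine_scores(query_vecs: np.ndarray, doc_vecs: np.ndarray, dim: int) -> np.ndarray:
    """Score queries against documents at a chosen embedding width."""
    q = truncate_and_normalize(query_vecs, dim)
    d = truncate_and_normalize(doc_vecs, dim)
    return q @ d.T  # (num_queries, num_docs) similarity matrix

# Example: score 768-dimensional embeddings at both full and reduced width.
rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 768))   # placeholder query embeddings
docs = rng.normal(size=(10, 768))     # placeholder document embeddings
full_scores = cosine_scores(queries, docs, dim=768)
compact_scores = cosine_scores(queries, docs, dim=256)
```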

Implications and Future Directions

The introduction of Gecko marks a significant step forward for text embeddings built with the help of LLMs. By demonstrating that knowledge from LLMs can be effectively distilled into compact embedding models, the work opens new avenues for efficient, general-purpose text embeddings. Future research could further optimize the synthetic data generation process, refine the distillation and relabeling methods, and extend the approach to more languages and tasks. The model's compactness combined with its performance also makes high-quality text embeddings more practical in resource-constrained environments, broadening the accessibility of advanced NLP technology.

Conclusion

In summary, Gecko takes a novel approach to text embedding by harnessing LLM-generated and LLM-relabeled synthetic data, yielding a model that is both compact and versatile. Improving data quality through LLM-based identification of relevant positive and hard negative passages is a promising direction for future work on text embeddings. The results show strong performance across a wide range of NLP tasks and underscore the broader potential of LLMs for building efficient, general-purpose embedding models.
