Gecko: Versatile Text Embeddings Distilled from Large Language Models

(2403.20327)
Published Mar 29, 2024 in cs.CL and cs.AI

Abstract

We present Gecko, a compact and versatile text embedding model. Gecko achieves strong retrieval performance by leveraging a key idea: distilling knowledge from LLMs into a retriever. Our two-step distillation process begins with generating diverse, synthetic paired data using an LLM. Next, we further refine the data quality by retrieving a set of candidate passages for each query, and relabeling the positive and hard negative passages using the same LLM. The effectiveness of our approach is demonstrated by the compactness of Gecko. On the Massive Text Embedding Benchmark (MTEB), Gecko with 256 embedding dimensions outperforms all existing entries with 768 embedding size. Gecko with 768 embedding dimensions achieves an average score of 66.31, competing with 7x larger models and 5x higher dimensional embeddings.

Overview

  • Gecko is a new text embedding model that uses knowledge from LLMs to perform a variety of tasks like document retrieval and sentence similarity.

  • It involves a two-step process: an LLM first generates diverse synthetic query–passage pairs, and the same LLM then refines data quality by retrieving candidate passages and relabeling the positives and hard negatives.

  • Gecko outperforms larger models in the Massive Text Embedding Benchmark (MTEB), even with smaller embedding sizes, highlighting its efficiency and effectiveness.

  • The research suggests future exploration in optimizing data generation, refining distillation methods, and extending this approach to more languages and tasks.

Gecko: Versatile Text Embeddings Distilled from LLMs

Introduction to Gecko

The research by Jinhyuk Lee et al. presents Gecko, a compact yet versatile text embedding model built by distilling knowledge from LLMs. This distillation lets a single embedding model perform well across a broad spectrum of tasks, including document retrieval, sentence similarity, classification, and clustering. The pivotal mechanism behind Gecko's efficacy is a two-step process: an LLM first generates synthetic query–passage pairs, and the same LLM is then used to re-identify and relabel the positive and hard negative passages, substantially improving the quality of the training data.
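As a concrete illustration of the first step, the sketch below shows how synthetic query–passage pairs might be produced from unlabeled passages. It is a minimal sketch, not the paper's implementation: the `llm_generate` callable, the few-shot prompt text, and the `Query:` output format are all hypothetical stand-ins introduced for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical few-shot prompt; the paper's actual prompts are not reproduced here.
FEW_SHOT_PROMPT = (
    "Given a passage, write a retrieval task description and a query for that task.\n"
    "Passage: {passage}\n"
    "Task:"
)

def generate_synthetic_pairs(
    passages: List[str],
    llm_generate: Callable[[str], str],  # hypothetical text-in/text-out LLM call
) -> List[Tuple[str, str, str]]:
    """Step 1: for each sampled passage, ask the LLM for a task description and
    a query, yielding (task, query, seed_passage) triples."""
    triples = []
    for passage in passages:
        output = llm_generate(FEW_SHOT_PROMPT.format(passage=passage))
        # Assumed output format: "<task description>\nQuery: <query>"
        task, _, query = output.strip().partition("\nQuery:")
        triples.append((task.strip(), query.strip(), passage))
    return triples
```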

Distilling Knowledge from LLMs

Gecko distills the broad knowledge embedded within LLMs into a text embedding model through a two-step procedure. First, diverse synthetic paired data is generated by prompting an LLM with few-shot examples: given a sampled passage, the LLM produces a task description and a query relevant to that task. Second, data quality is refined by retrieving a set of candidate passages for each generated query and having the same LLM rank them, identifying a positive passage that is often more relevant than the passage the query was generated from, along with a hard negative. This relabeling not only improves the quality of the synthetic data but also departs from the standard practice of treating the seed passage as the positive target for a generated query.
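The relabeling step can be pictured as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: `embed_query`, the pre-computed `corpus_embeddings`, and `llm_score` are hypothetical stand-ins, and the paper's combination of two LLM ranking signals via reciprocal rank fusion is collapsed here into a single scoring call.

```python
import numpy as np
from typing import Callable, List, Tuple

def relabel_pair(
    query: str,
    corpus: List[str],
    corpus_embeddings: np.ndarray,             # (num_passages, dim), unit-normalized
    embed_query: Callable[[str], np.ndarray],  # hypothetical initial embedder
    llm_score: Callable[[str, str], float],    # hypothetical LLM relevance scorer
    top_k: int = 20,
    negative_rank: int = 10,
) -> Tuple[str, str]:
    """Step 2: retrieve candidates for a generated query, let the LLM re-rank
    them, and return (positive, hard_negative). The top-scored candidate becomes
    the positive, which may differ from the passage the query was generated from;
    a lower-ranked candidate is kept as a hard negative (a simplified choice)."""
    query_vec = embed_query(query)
    similarities = corpus_embeddings @ query_vec
    candidates = np.argsort(-similarities)[:top_k]
    ranked = sorted(candidates, key=lambda i: llm_score(query, corpus[i]), reverse=True)
    positive = corpus[ranked[0]]
    hard_negative = corpus[ranked[min(negative_rank, len(ranked) - 1)]]
    return positive, hard_negative
```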

Unveiling Gecko's Efficacy

The compactness of the Gecko model does not compromise its effectiveness. On the Massive Text Embedding Benchmark (MTEB), it outperforms models with significantly larger embedding sizes: Gecko with 256 embedding dimensions surpasses all existing entries that use 768 dimensions. The 768-dimensional variant goes further, achieving an average MTEB score of 66.31 and competing with models roughly 7x larger and with embeddings 5x higher-dimensional. This performance is a testament to Gecko's efficient design and its use of LLMs to improve the training data for text embeddings.
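On the dimensionality point: serving both 256- and 768-dimensional embeddings from one model is commonly achieved with Matryoshka-style representation learning, where a leading prefix of the embedding vector is trained to be useful on its own. The sketch below shows only the inference-side truncation convention; it assumes the model was trained to make such prefixes meaningful and is not a description of Gecko's internals.

```python
import numpy as np

def truncate_embeddings(embeddings: np.ndarray, dim: int = 256) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length so that
    dot products remain cosine similarities. Only meaningful if the model was
    trained (e.g., Matryoshka-style) so that leading prefixes are useful alone."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

# Hypothetical usage: `full` is an (n, 768) matrix of unit-normalized embeddings.
# compact = truncate_embeddings(full, dim=256)  # (n, 256) index, cheaper to store and search
```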

Implications and Future Directions

The introduction of Gecko marks a significant step forward in the field of text embeddings and LLM utilization. By demonstrating that knowledge from LLMs can be effectively distilled into compact embedding models, this research opens new avenues for creating efficient, general-purpose text embeddings. The success of Gecko suggests future research could explore further optimization of synthetic data generation processes, refinement methods for distillation, and the potential extensibility of this approach to other languages and tasks. Additionally, the model's compactness combined with its performance highlights the potential for deploying high-quality text embeddings in resource-constrained environments, broadening the accessibility and applicability of advanced NLP technologies.

Conclusion

In summary, Gecko employs a novel approach to text embedding by harnessing the power of LLM-derived synthetic data, resulting in a model that is both compact and versatile. The method of enhancing data quality through the LLM-based identification of relevant passages presents a promising direction for future research in text embeddings and the utilization of LLMs. This research not only demonstrates Gecko's superior performance in a wide range of NLP tasks but also underscores the potential of LLMs in revolutionizing the development of efficient and general-purpose embedding models.
