Improving Text Embeddings with Large Language Models (2401.00368v3)

Published 31 Dec 2023 in cs.CL and cs.IR

Abstract: In this paper, we introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps. Unlike existing methods that often depend on multi-stage intermediate pre-training with billions of weakly-supervised text pairs, followed by fine-tuning with a few labeled datasets, our method does not require building complex training pipelines or relying on manually collected datasets that are often constrained by task diversity and language coverage. We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages. We then fine-tune open-source decoder-only LLMs on the synthetic data using standard contrastive loss. Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data. Furthermore, when fine-tuned with a mixture of synthetic and labeled data, our model sets new state-of-the-art results on the BEIR and MTEB benchmarks.

Introduction

Text embeddings are compact vector representations designed to capture the semantic essence of textual content, making them useful across a variety of natural language processing tasks, including information retrieval, machine translation, and semantic analysis; in retrieval especially, both efficiency and accuracy depend heavily on embedding quality. Traditional methods for learning text embeddings often involve complex pipelines with multi-stage training on large volumes of weakly labeled data, followed by fine-tuning on smaller, higher-quality datasets.
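
To make the retrieval use case concrete, here is a minimal sketch of embedding-based ranking by cosine similarity; the checkpoint name, prefixes, and passages are illustrative assumptions, not details taken from this summary.

```python
from sentence_transformers import SentenceTransformer

# Illustrative checkpoint name; any SentenceTransformer-compatible embedding model works here.
model = SentenceTransformer("intfloat/e5-mistral-7b-instruct")

query = "how do text embeddings support retrieval?"
passages = [
    "Dense vectors let us compare texts by cosine similarity.",
    "The weather in Lisbon is mild in spring.",
]

# Encode query and passages into unit-normalized vectors,
# so a dot product equals cosine similarity.
q_emb = model.encode(query, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)

scores = p_emb @ q_emb                 # one similarity score per passage
ranking = scores.argsort()[::-1]       # highest-scoring passage first
print([passages[i] for i in ranking])
```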

Novel Approach to Text Embeddings

In contrast to these multi-stage pipelines, this paper introduces a streamlined method that leverages LLMs to produce text embeddings with competitive performance across numerous tasks and languages, without requiring labeled training data. Synthetic data is generated with a two-step prompting strategy: an LLM first brainstorms a pool of candidate embedding tasks and then generates examples conditioned on each task, covering a wide range of languages and task types. Open-source decoder-only LLMs such as Mistral are then fine-tuned on this synthetic data with a standard contrastive loss, yielding robust results.
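
The following is a minimal sketch of the standard contrastive (InfoNCE) objective with in-batch negatives, as typically applied when fine-tuning a decoder-only backbone on (query, positive passage) pairs; the pooling choice and temperature value are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def last_token_pool(hidden_states, attention_mask):
    # Pool the hidden state of each sequence's final non-padded token
    # (a common choice for decoder-only backbones; assumes right padding).
    lengths = attention_mask.sum(dim=1) - 1                      # index of last real token
    return hidden_states[torch.arange(hidden_states.size(0)), lengths]

def info_nce_loss(q_emb, p_emb, temperature=0.05):
    # q_emb, p_emb: [batch, dim] embeddings of queries and their positive passages.
    # Every other passage in the batch serves as an in-batch negative.
    q = F.normalize(q_emb, dim=-1)
    p = F.normalize(p_emb, dim=-1)
    logits = q @ p.T / temperature                               # [batch, batch] similarities
    labels = torch.arange(q.size(0), device=q.device)            # positives lie on the diagonal
    return F.cross_entropy(logits, labels)
```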

Experiments and Findings

Experiments show that the fine-tuned Mistral-7B model is competitive with state-of-the-art embedding models on benchmarks such as BEIR and MTEB when trained on synthetic data alone. When a mixture of synthetic and labeled data is used, performance improves further, setting new state-of-the-art results on these benchmarks with fewer than 1k training steps. The model also shows promise for longer context lengths and multilingual representation, although the results point to a need for more diverse pre-training to better serve low-resource languages.
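
For context on how such models are typically scored, here is a minimal evaluation sketch using the open-source `mteb` package; the checkpoint and task names are illustrative assumptions rather than the paper's evaluation setup.

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Illustrative model and task selection; MTEB covers many more tasks and languages.
model = SentenceTransformer("intfloat/e5-mistral-7b-instruct")
evaluation = MTEB(tasks=["Banking77Classification", "SciFact"])

# Runs each task and writes per-task scores to the output folder.
results = evaluation.run(model, output_folder="results/embedding-eval")
print(results)
```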

Conclusion and Future Work

This paper underscores the potential to significantly enhance text embeddings by utilizing LLMs to generate synthetic data, thereby simplifying and expediting the training process. While high-resource languages benefit most from the approach, future research could expand the model's multilingual capabilities and efficiency, potentially even forgoing the reliance on proprietary LLMs for synthetic data generation.

Authors (6)
  1. Liang Wang
  2. Nan Yang
  3. Xiaolong Huang
  4. Linjun Yang
  5. Rangan Majumder
  6. Furu Wei