
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models (2405.17428v1)

Published 27 May 2024 in cs.CL, cs.AI, cs.IR, and cs.LG

Abstract: Decoder-only LLM-based embedding models are beginning to outperform BERT or T5-based embedding models in general-purpose text embedding tasks, including dense vector-based retrieval. In this work, we introduce the NV-Embed model with a variety of architectural designs and training procedures to significantly enhance the performance of LLM as a versatile embedding model, while maintaining its simplicity and reproducibility. For model architecture, we propose a latent attention layer to obtain pooled embeddings, which consistently improves retrieval and downstream task accuracy compared to mean pooling or using the last <EOS> token embedding from LLMs. To enhance representation learning, we remove the causal attention mask of LLMs during contrastive training. For model training, we introduce a two-stage contrastive instruction-tuning method. It first applies contrastive training with instructions on retrieval datasets, utilizing in-batch negatives and curated hard negative examples. At stage-2, it blends various non-retrieval datasets into instruction tuning, which not only enhances non-retrieval task accuracy but also improves retrieval performance. Combining these techniques, our NV-Embed model, using only publicly available data, has achieved a record-high score of 69.32, ranking No. 1 on the Massive Text Embedding Benchmark (MTEB) (as of May 24, 2024), with 56 tasks, encompassing retrieval, reranking, classification, clustering, and semantic textual similarity tasks. Notably, our model also attains the highest score of 59.36 on 15 retrieval tasks in the MTEB benchmark (also known as BEIR). We will open-source the model at: https://huggingface.co/nvidia/NV-Embed-v1.

The NV-Embed Model: Advancements in Decoder-Only LLMs for Text Embedding Tasks

This paper presents the NV-Embed model, which pushes the boundaries of decoder-only LLMs as versatile text embedding models. Current trends in text embedding have shown that decoder-only LLM-based models have begun to outperform traditional bidirectional models such as BERT and T5 in general-purpose text embedding tasks, including dense vector-based retrieval. NV-Embed incorporates these advancements with novel architectural and training modifications to achieve better performance while maintaining simplicity and reproducibility.

Architectural Innovations

A prominent feature of NV-Embed is the introduction of a latent attention layer specifically designed to extract pooled embeddings from sequences of tokens. In traditional approaches, embeddings are obtained using either mean pooling or the embedding of the last <EOS> token. Both of these methods have their limitations: mean pooling might dilute crucial semantic information distributed across the token sequence, whereas the last token embedding could suffer from recency bias. The latent attention layer proposed in NV-Embed mitigates these issues by employing a form of cross-attention where the hidden states serve as queries and the keys and values come from a trainable latent array. This setup enables the model to better capture and represent the complex structure of the input sequence.
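
To make this pooling step concrete, below is a minimal PyTorch sketch of latent-attention pooling in the spirit described above; the class name, default dimensions, and the use of `nn.MultiheadAttention` are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class LatentAttentionPooling(nn.Module):
    """Illustrative latent-attention pooling head (not the official NV-Embed code).

    The decoder's final hidden states act as queries, while a trainable latent
    array supplies the keys and values; the attended sequence is then averaged
    into a single embedding vector.
    """

    def __init__(self, hidden_dim: int = 4096, num_latents: int = 512, num_heads: int = 8):
        super().__init__()
        # Trainable latent array, shared across all inputs (sizes are assumptions).
        self.latents = nn.Parameter(torch.randn(num_latents, hidden_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, hidden_dim)
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) from the LLM's last layer.
        batch = hidden_states.size(0)
        latents = self.latents.unsqueeze(0).expand(batch, -1, -1)
        # Hidden states are the queries; the latent array provides keys and values.
        attended, _ = self.cross_attn(query=hidden_states, key=latents, value=latents)
        attended = attended + self.mlp(attended)
        # Mean-pool the attended tokens into one embedding per input sequence.
        return attended.mean(dim=1)
```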

Another significant architectural choice is the removal of the causal attention mask during contrastive training. While the causal mask in decoder-only LLMs is essential for preserving autoregressive properties in generation tasks, it limits the model's ability to fully utilize bidirectional context when learning embeddings. By simply eliminating the causal mask, NV-Embed harnesses the full potential of bidirectional attention, which enhances representation learning without the need for additional complex training phases, as seen in related works.
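
The effect of dropping the causal mask can be illustrated with a small PyTorch snippet; the tensor shapes and the use of `scaled_dot_product_attention` are purely illustrative and not drawn from the paper's training code.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: (batch, heads, seq_len, head_dim).
q = k = v = torch.randn(2, 8, 16, 64)

# Standard decoder-only behavior: each token attends only to earlier tokens.
causal_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# For contrastive embedding training, the causal mask is dropped so every token
# can attend to the full sequence (bidirectional attention).
bidirectional_out = F.scaled_dot_product_attention(q, k, v, is_causal=False)

print(causal_out.shape, bidirectional_out.shape)  # both torch.Size([2, 8, 16, 64])
```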

Training Enhancements

On the training side, NV-Embed employs a two-stage contrastive instruction-tuning approach. The first stage applies contrastive training with instructions on retrieval datasets, using in-batch negatives and curated hard negatives. The second stage blends various non-retrieval datasets into the instruction tuning; this stage is designed not only to improve non-retrieval tasks such as classification, clustering, and semantic textual similarity (STS) but also to yield additional gains in retrieval performance.
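
As a rough illustration of the stage-one objective, here is a hedged sketch of an InfoNCE-style contrastive loss that combines in-batch negatives with mined hard negatives; the function name, temperature value, and tensor shapes are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q_emb, pos_emb, hard_neg_emb, temperature=0.05):
    """InfoNCE-style loss with in-batch negatives plus curated hard negatives.

    q_emb:        (B, D) query embeddings
    pos_emb:      (B, D) embeddings of the positive passages
    hard_neg_emb: (B, K, D) embeddings of K hard negatives per query
    """
    q = F.normalize(q_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    n = F.normalize(hard_neg_emb, dim=-1)

    # In-batch similarities: the diagonal holds each query's positive, while
    # other rows' positives serve as additional negatives.
    in_batch = q @ p.t() / temperature                      # (B, B)
    # Similarities against each query's own mined hard negatives.
    hard = torch.einsum("bd,bkd->bk", q, n) / temperature   # (B, K)

    logits = torch.cat([in_batch, hard], dim=1)             # (B, B + K)
    labels = torch.arange(q.size(0), device=q.device)       # positive sits on the diagonal
    return F.cross_entropy(logits, labels)

# Toy usage with random embeddings (batch of 4, dim 256, 3 hard negatives each).
loss = contrastive_loss(torch.randn(4, 256), torch.randn(4, 256), torch.randn(4, 3, 256))
```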

This two-stage methodology is distinct from previous models, which often do not separate stages by task type or difficulty but instead apply a unified training strategy across all tasks. Through these design choices, NV-Embed achieves strong scores across diverse embedding benchmarks without relying on proprietary synthetic data, underscoring its reproducibility with entirely public datasets.

Empirical Results

The empirical evaluation demonstrates NV-Embed's competitive edge. It achieves a top score of 69.32 on the Massive Text Embedding Benchmark (MTEB), which comprises 56 tasks spanning retrieval, reranking, classification, clustering, and semantic textual similarity, surpassing prior leading models such as E5-Mistral-7B-Instruct and SFR-Embedding. Notably, NV-Embed also sets a new high of 59.36 on the 15 retrieval tasks of MTEB (also known as BEIR), indicative of its strength in dense vector-based retrieval.

Implications and Future Directions

The contributions of NV-Embed have significant implications for the field of text embeddings. Architecturally, the use of a latent attention layer could inspire further research into more effective pooling techniques for complex sequence representations. Meanwhile, the removal of the causal mask invites discussions on simplifying bidirectional and decoder-based architectures to enhance their effectiveness across diverse tasks.

Practically, NV-Embed's ability to achieve state-of-the-art performance using only publicly available data demystifies and democratizes high-performance text embeddings, making them accessible to a broader research community and to application domains that lack access to proprietary datasets.

Future directions could include exploring more sophisticated training strategies that blend different tasks dynamically based on model feedback or extending latent attention methods to other types of models and tasks. There is also potential to further optimize the latent attention mechanism and examine its adaptability in transformer variants beyond decoder-only models.

In conclusion, NV-Embed marks a significant step forward in the application of decoder-only LLMs for text embedding tasks. Through innovative architectural designs and strategic training methods, it sets new performance benchmarks while emphasizing simplicity and reproducibility, thereby broadening the scope and accessibility of advanced text embedding research.

Authors (7)
  1. Chankyu Lee (12 papers)
  2. Rajarshi Roy (55 papers)
  3. Mengyao Xu (5 papers)
  4. Jonathan Raiman (17 papers)
  5. Mohammad Shoeybi (60 papers)
  6. Bryan Catanzaro (123 papers)
  7. Wei Ping (51 papers)