Making Large Language Models A Better Foundation For Dense Retrieval (2312.15503v1)

Published 24 Dec 2023 in cs.CL

Abstract: Dense retrieval needs to learn discriminative text embeddings to represent the semantic relationship between query and document. It may benefit from the use of LLMs, given their strong capability for semantic understanding. However, LLMs are pre-trained on text generation tasks, whose working pattern is completely different from representing texts as embeddings. As a result, it is imperative to study how to adapt LLMs properly so that they can be effectively initialized as the backbone encoder for dense retrieval. In this paper, we propose a novel approach, called LLaRA (LLM adapted for dense RetrievAl), which works as a post-hoc adaptation of the LLM for the dense retrieval application. LLaRA consists of two pretext tasks: EBAE (Embedding-Based Auto-Encoding) and EBAR (Embedding-Based Auto-Regression), where the text embeddings from the LLM are used to reconstruct the tokens of the input sentence and predict the tokens of the next sentence, respectively. LLaRA turns out to be simple, lightweight, and highly effective. It is applied to adapt LLaMA-2-7B (base) on the Wikipedia corpus, where it substantially improves the model's fine-tuned performance on a variety of dense retrieval benchmarks, such as MS MARCO and BEIR. Our model and code will be made publicly available at the BGE repository.

Overview of "Making LLMs A Better Foundation For Dense Retrieval"

The paper, "Making LLMs A Better Foundation For Dense Retrieval," presents a methodology designed to refine LLMs for dense retrieval tasks. This approach, termed LLaRA (LLM adapted for dense Retrieval Application), emerges in response to challenges faced when directly applying LLMs, pre-trained on generative tasks, to dense retrieval scenarios.

Methodology: LLaRA

The central contribution of this work is the LLaRA framework, a post-hoc adaptation process for LLMs. LLaRA introduces two pretext tasks: Embedding-Based Auto-Encoding (EBAE) and Embedding-Based Auto-Regression (EBAR). Both tasks reshape the embeddings produced by the LLM, shifting their focus from the local semantics that next-token generation relies on to a global representation of the whole input, which better serves dense retrieval (a minimal sketch of both objectives follows the list below).

  • EBAE (Embedding-Based Auto-Encoding): This task prompts the LLM to generate embeddings that can accurately reconstruct the original input sentence, ensuring the embedding captures the entire semantic context.
  • EBAR (Embedding-Based Auto-Regression): Here, the LLM is tasked with using embeddings to predict the following sentence, aligning more closely with the semantic relationships required in retrieval tasks.
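
The overview gives no implementation details, so the following is only a minimal sketch of what the two objectives could look like in PyTorch. It assumes (these are assumptions, not statements from the source) that the sentence embedding is the final hidden state at an appended prompt position, and that each objective is scored as a bag-of-words cross-entropy over the vocabulary, so neither task needs autoregressive decoding; all function and tensor names are illustrative.

```python
# Minimal sketch of LLaRA-style pretext objectives (EBAE / EBAR).
# Assumptions not stated in this overview: the sentence embedding is the final
# hidden state at an appended prompt position, and each task is scored as a
# bag-of-words cross-entropy over the vocabulary (no autoregressive decoding).
import torch
import torch.nn.functional as F


def bag_of_words_loss(embedding: torch.Tensor,
                      target_ids: torch.Tensor,
                      vocab_proj: torch.nn.Linear) -> torch.Tensor:
    """How well a single embedding predicts the set of tokens in a target text."""
    log_probs = F.log_softmax(vocab_proj(embedding), dim=-1)   # [batch, vocab]
    targets = torch.zeros_like(log_probs)
    targets.scatter_(1, target_ids, 1.0)                       # multi-hot target tokens
    targets = targets / targets.sum(dim=-1, keepdim=True)
    return -(targets * log_probs).sum(dim=-1).mean()


def llara_loss(llm, input_ids, next_sent_ids, ebae_pos, ebar_pos, vocab_proj):
    """EBAE: the embedding reconstructs the input sentence's tokens.
    EBAR: the embedding predicts the next sentence's tokens."""
    hidden = llm(input_ids).last_hidden_state                  # [batch, seq, dim]
    rows = torch.arange(hidden.size(0))
    ebae_emb = hidden[rows, ebae_pos]                          # embedding at EBAE prompt position
    ebar_emb = hidden[rows, ebar_pos]                          # embedding at EBAR prompt position
    return (bag_of_words_loss(ebae_emb, input_ids, vocab_proj)
            + bag_of_words_loss(ebar_emb, next_sent_ids, vocab_proj))
```

A real implementation would also mask padding tokens and handle the prompt templates that produce the two embedding positions; the sketch abstracts those details into ebae_pos and ebar_pos.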

The framework utilizes these embeddings to perform retrieval tasks efficiently and accurately, without requiring additional decoding steps.
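
In practice this means the adapted LLM can be used as an ordinary bi-encoder: queries and documents are embedded independently and ranked by vector similarity, with no generation step. A minimal sketch, assuming a Hugging Face-style model whose last-token hidden state serves as the embedding; the checkpoint path, pooling choice, and cosine scoring are illustrative placeholders, not the paper's exact configuration.

```python
# Sketch of using an adapted LLM as a bi-encoder for dense retrieval.
# The checkpoint path is a placeholder; last-token pooling and cosine scoring
# are illustrative choices, not necessarily the paper's exact configuration.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "path/to/llara-adapted-llama"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME).eval()


@torch.no_grad()
def embed(texts: list[str]) -> torch.Tensor:
    # Assumes the tokenizer defines a pad token and pads on the right.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state              # [batch, seq, dim]
    # Last-token pooling: take the hidden state at each sequence's final real token.
    last = batch["attention_mask"].sum(dim=1) - 1
    emb = hidden[torch.arange(hidden.size(0)), last]
    return F.normalize(emb, dim=-1)


queries = embed(["what is dense retrieval?"])
docs = embed(["Dense retrieval maps queries and documents to vectors.",
              "A passage about an unrelated topic."])
scores = queries @ docs.T                                   # cosine similarities
ranking = scores.argsort(dim=-1, descending=True)
```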

Experimental Results

The paper reports significant improvements across multiple benchmarks after adaptation (the reported metrics are sketched in code after the list below):

  1. MS MARCO Passage Retrieval: LLaRA attains an MRR@10 of 43.1, outperforming several established methods, including those using knowledge distillation and large-scale models such as GTR-XXL.
  2. MS MARCO Document Retrieval: The methodology yields an MRR@100 of 47.5, outperforming prior approaches such as PROP and ANCE.
  3. BEIR Zero-Shot Retrieval Benchmark: In diverse retrieval scenarios, LLaRA demonstrates high generality and versatility, showcasing an average NDCG@10 of 56.1, surpassing various specialized models.
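
For reference, the metrics in the list above can be computed per query from a ranked result list as follows; this is a generic illustration, not the paper's evaluation code.

```python
# Generic definitions of the reported ranking metrics (illustrative, not the
# paper's evaluation code). `ranking` is the system's ordered list of document
# ids; `relevance` maps document id -> graded relevance label (0 = not relevant).
import math


def mrr_at_k(ranking, relevance, k=10):
    """Reciprocal rank of the first relevant document within the top k."""
    for rank, doc_id in enumerate(ranking[:k], start=1):
        if relevance.get(doc_id, 0) > 0:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranking, relevance, k=10):
    """Discounted cumulative gain of the top k, normalized by the ideal DCG."""
    dcg = sum((2 ** relevance.get(d, 0) - 1) / math.log2(i + 2)
              for i, d in enumerate(ranking[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg else 0.0
```

The benchmark scores are averages of these per-query values over the query set.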

Implications and Future Directions

The findings suggest that, with proper adaptation, LLMs can significantly enhance dense retrieval systems, making them more robust and versatile across tasks. Importantly, the adaptation itself requires no labeled data: the pretext tasks are trained on unlabeled text, yet they yield competitive performance improvements after fine-tuning.

From a practical standpoint, LLaRA's efficiency potentially lowers the compute and resource requirements typically associated with large-scale pre-training, enabling broader use in industry and academia. The approach could catalyze further research into optimizing LLMs for diverse NLP applications beyond their initial generative capabilities.

Future research might explore applying LLaRA to even larger LLMs or combining it with fine-tuning on larger labeled datasets, further boosting retrieval accuracy and broadening its application scope. The adaptation could also be extended to more specialized retrieval tasks, opening avenues for innovation in complex real-world scenarios.

In conclusion, by overcoming the intrinsic limitations of using LLMs for dense retrieval, this paper lays a foundation for more efficient and effective retrieval systems, suggesting a promising trajectory for both academic exploration and practical implementations in information retrieval contexts.

References (38)
  1. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  2. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113.
  3. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
  4. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186. Association for Computational Linguistics.
  5. Luyu Gao and Jamie Callan. 2021. Condenser: a pre-training architecture for dense retrieval. arXiv preprint arXiv:2104.08253.
  6. Luyu Gao and Jamie Callan. 2022. Unsupervised corpus aware language model pre-training for dense passage retrieval. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2843–2853, Dublin, Ireland.
  7. Coil: Revisit exact lexical match in information retrieval with contextualized inverted list. arXiv preprint arXiv:2104.07186.
  8. Efficiently teaching an effective dense retriever with balanced topic aware sampling. In SIGIR, pages 113–122.
  9. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
  10. Unsupervised dense information retrieval with contrastive learning. arXiv preprint arXiv:2112.09118.
  11. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906.
  12. Pyserini: A python toolkit for reproducible information retrieval research with sparse and dense representations. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2356–2362.
  13. Roberta: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
  14. Zheng Liu and Yingxia Shao. 2022. Retromae: Pre-training retrieval-oriented transformers via masked auto-encoder. arXiv preprint arXiv:2205.12035.
  15. Retromae-2: Duplex masked auto-encoder for pre-training retrieval-oriented language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2635–2648.
  16. Twinbert: Distilling knowledge to twin-structured compressed bert models for large-scale retrieval. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pages 2645–2652.
  17. Pre-train a discriminative text encoder for dense retrieval via contrastive span prediction. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 848–858.
  18. Prop: Pre-training with representative words prediction for ad-hoc retrieval. In Proceedings of the 14th ACM international conference on web search and data mining, pages 283–291.
  19. B-prop: bootstrapped pre-training with representative words prediction for ad-hoc retrieval. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1513–1522.
  20. Fine-tuning llama for multi-stage text retrieval. arXiv preprint arXiv:2310.08319.
  21. Niklas Muennighoff. 2022. Sgpt: Gpt sentence embeddings for semantic search. arXiv preprint arXiv:2202.08904.
  22. Text and code embeddings by contrastive pre-training. arXiv preprint arXiv:2201.10005.
  23. Ms marco: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268.
  24. Large dual encoders are generalizable retrievers. arXiv preprint arXiv:2112.07899.
  25. Multi-stage document ranking with bert. arXiv preprint arXiv:1910.14424.
  26. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.
  27. Rocketqav2: A joint training method for dense passage retrieval and passage re-ranking. arXiv preprint arXiv:2110.07367.
  28. Beir: A heterogenous benchmark for zero-shot evaluation of information retrieval models. arXiv preprint arXiv:2104.08663.
  29. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  30. Simlm: Pre-training with representation bottleneck for dense passage retrieval. arXiv preprint arXiv:2207.02578.
  31. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533.
  32. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.
  33. C-pack: Packaged resources to advance general chinese embedding. arXiv preprint arXiv:2309.07597.
  34. Approximate nearest neighbor negative contrastive learning for dense text retrieval. arXiv preprint arXiv:2007.00808.
  35. Optimizing dense retrieval model training with hard negatives. In SIGIR, pages 1503–1512.
  36. Language models are universal embedders. arXiv preprint arXiv:2310.08232.
  37. Simans: Simple ambiguous negatives sampling for dense text retrieval. arXiv preprint arXiv:2210.11773.
  38. Rankt5: Fine-tuning t5 for text ranking with ranking losses. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2308–2313.
Authors (4)
  1. Chaofan Li
  2. Zheng Liu
  3. Shitao Xiao
  4. Yingxia Shao