Surface-Based Retrieval Reduces Perplexity of Retrieval-Augmented Language Models (2305.16243v3)

Published 25 May 2023 in cs.CL

Abstract: Augmenting LLMs with a retrieval mechanism has been shown to significantly improve their performance while keeping the number of parameters low. Retrieval-augmented models commonly rely on a semantic retrieval mechanism based on the similarity between dense representations of the query chunk and potential neighbors. In this paper, we study the state-of-the-art Retro model and observe that its performance gain is better explained by surface-level similarities, such as token overlap. Inspired by this, we replace the semantic retrieval in Retro with a surface-level method based on BM25, obtaining a significant reduction in perplexity. As full BM25 retrieval can be computationally costly for large datasets, we also apply it in a re-ranking scenario, gaining part of the perplexity reduction with minimal computational overhead.

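The re-ranking scenario mentioned in the abstract can be illustrated with a short, self-contained sketch: a dense retriever proposes a small set of candidate neighbor chunks, and BM25 scores computed against the query chunk are used to re-order them, so only the cheap lexical scoring is applied to a handful of candidates rather than to the whole datastore. This is only an illustration of the idea, not the paper's implementation; the function name bm25_rerank, the whitespace tokenization, the default k1 and b values, and the candidate-set-only idf statistics are assumptions made for the sketch.

```python
import math
from collections import Counter


def bm25_rerank(query_chunk, candidate_chunks, k1=1.2, b=0.75):
    """Re-rank a dense retriever's candidate chunks by BM25 score against the query chunk.

    Returns the candidates sorted by descending surface-level (lexical-overlap) score.
    """
    # Whitespace tokenization stands in for a proper tokenizer/analyzer.
    docs = [c.lower().split() for c in candidate_chunks]
    query_terms = set(query_chunk.lower().split())

    n = len(docs)
    avgdl = sum(len(d) for d in docs) / max(n, 1)

    # Document frequencies computed over the candidate set only (an assumption for
    # self-containedness; a production re-ranker would use corpus-level statistics).
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}

    def idf(t):
        # Standard BM25 idf with +0.5 smoothing.
        return math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1.0)

    scores = []
    for d in docs:
        tf = Counter(d)
        dl = len(d)
        s = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            s += idf(t) * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * dl / avgdl))
        scores.append(s)

    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return [candidate_chunks[i] for i in order]


if __name__ == "__main__":
    # Toy usage: the candidate with the most token overlap is ranked first.
    query = "retrieval augmented language models"
    candidates = [
        "attention is all you need",
        "retrieval augmented language models reduce perplexity",
        "dense passage retrieval for open domain question answering",
    ]
    print(bm25_rerank(query, candidates))
```
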
Authors (4)
  1. Ehsan Doostmohammadi (11 papers)
  2. Tobias Norlund (6 papers)
  3. Marco Kuhlmann (13 papers)
  4. Richard Johansson (18 papers)
Citations (7)