Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study (2304.06762v3)

Published 13 Apr 2023 in cs.CL, cs.AI, cs.IR, and cs.LG

Abstract: Large decoder-only language models (LMs) can be largely improved in terms of perplexity by retrieval (e.g., RETRO), but its impact on text generation quality and downstream task accuracy is unclear. Thus, it is still an open question: shall we pretrain large autoregressive LMs with retrieval? To answer it, we perform a comprehensive study on a scalable pre-trained retrieval-augmented LM (i.e., RETRO) compared with standard GPT and retrieval-augmented GPT incorporated at fine-tuning or inference stages. We first provide the recipe to reproduce RETRO up to 9.5B parameters while retrieving a text corpus with 330B tokens. Based on that, we have the following novel findings: i) RETRO outperforms GPT on text generation with much less degeneration (i.e., repetition), moderately higher factual accuracy, and slightly lower toxicity with a nontoxic retrieval database. ii) On the LM Evaluation Harness benchmark, RETRO largely outperforms GPT on knowledge-intensive tasks, but is on par with GPT on other tasks. Furthermore, we introduce a simple variant of the model, RETRO++, which largely improves open-domain QA results of original RETRO (e.g., EM score +8.6 on Natural Question) and significantly outperforms retrieval-augmented GPT in both fine-tuning and zero-shot evaluation settings. Our findings highlight the promising direction of pretraining autoregressive LMs with retrieval as future foundation models. We release our code and model at: https://github.com/NVIDIA/Megatron-LM/blob/main/tools/retro/README.md

Pretraining Autoregressive LLMs with Retrieval: A Comprehensive Study

The paper "Shall We Pretrain Autoregressive LLMs with Retrieval? A Comprehensive Study" explores the impact and implications of integrating retrieval mechanisms into the pretraining of large autoregressive LLMs. The paper is presented in the context of current advancements in LLMs such as GPT and Retro, with a focus on evaluating the benefits of combining retrieval capabilities at different stages of LLM development.

Key Contributions

  1. Reproduction of Retro Models: The authors provide a recipe for reproducing and scaling Retro up to 9.5 billion parameters while retrieving from a text corpus of 330 billion tokens, enabling a controlled comparison against standard GPT models (a minimal sketch of the chunk-wise retrieval idea follows this list).
  2. Evaluation of Text Generation: Retro outperforms GPT on open-ended text generation, with much less degeneration (repetition), moderately higher factual accuracy, and slightly lower toxicity when the retrieval database is nontoxic. This suggests the retrieval component effectively supplements the model's parametric knowledge.
  3. Performance on Downstream Tasks: On the LM Evaluation Harness benchmark, Retro largely outperforms GPT on knowledge-intensive tasks and is on par with GPT on the remaining tasks, with the clearest gains on open-domain QA.
  4. Retro++ Innovation: A variant named Retro++ feeds the most relevant retrieved evidence directly into the decoder context, markedly improving open-domain question answering (e.g., +8.6 EM on Natural Questions over the original Retro) and outperforming retrieval-augmented GPT in both fine-tuning and zero-shot settings.

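To make the retrieval mechanism concrete, here is a minimal, self-contained sketch of Retro-style chunk-wise neighbor retrieval. It is an illustration only, assuming a hashed bag-of-words embedding and a tiny in-memory chunk list in place of the paper's frozen BERT encoder, Faiss index, and 330B-token corpus; it is not the released Megatron-LM implementation.

```python
# Toy sketch of RETRO-style chunk-wise retrieval; NOT the paper's Megatron-LM code.
import numpy as np

CHUNK_SIZE = 64  # RETRO retrieves neighbors for every 64-token input chunk
TOP_K = 2        # number of neighbor chunks per input chunk

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Hashed bag-of-words embedding (placeholder for a frozen BERT encoder)."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def retrieve_neighbors(input_tokens, db_chunks, db_emb):
    """For each CHUNK_SIZE-token slice of the input, return its TOP_K nearest database chunks."""
    neighbors = []
    for start in range(0, len(input_tokens), CHUNK_SIZE):
        query = " ".join(input_tokens[start:start + CHUNK_SIZE])
        scores = db_emb @ embed(query)          # cosine similarity (unit-norm vectors)
        best = np.argsort(scores)[::-1][:TOP_K]
        neighbors.append([db_chunks[i] for i in best])
    return neighbors  # in RETRO these are consumed via chunked cross-attention

db_chunks = [
    "the eiffel tower is located in paris france",
    "retrieval augmented models look up text from a large corpus",
    "autoregressive decoders predict the next token",
]
db_emb = np.stack([embed(c) for c in db_chunks])

tokens = "the model retrieves text from a large corpus while it decodes".split()
print(retrieve_neighbors(tokens, db_chunks, db_emb))
```

In the actual model, the retrieved neighbor chunks are not simply returned as strings but are encoded and attended to through chunked cross-attention in the decoder.
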
Numerical Results and Implications

  • Perplexity Reduction: Retro achieves lower perplexity than standard GPT across the model sizes studied, indicating better language modeling quality at a given parameter count once retrieval augmentation is added (perplexity itself is defined in the short example after these bullets).
  • Downstream Task Performance: Accuracy improves markedly on knowledge-intensive benchmarks, highlighting the utility of retrieval-augmented LMs on tasks that demand access to broad, explicit knowledge.

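For reference, perplexity is the exponential of the average negative log-likelihood per token, so lower is better. A minimal computation, using placeholder log-probabilities rather than values from the paper:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token)."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Placeholder per-token log-probabilities, for illustration only:
print(perplexity([-2.1, -0.7, -1.4, -0.3]))  # ~3.08
```
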
Theoretical and Practical Implications

The findings suggest that pretraining autoregressive LLMs with retrieval could become a standard recipe for future foundation models. The approach offloads part of the knowledge storage to an external database, reducing the pressure to grow parameter counts, and it allows the model's accessible knowledge to be refreshed without extensive retraining; a minimal sketch of such an index refresh follows the impact bullets below.

  • Theoretical Impact: Integrating retrieval into LLMs suggests a shift in how knowledge is managed, balancing internalized parametric knowledge against externally retrieved evidence.
  • Practical Impact: This method has potential applications in real-world scenarios where factual accuracy and information update frequency are crucial, such as in legal, medical, and educational domains.

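As a concrete illustration of refreshing the external knowledge store without retraining, the sketch below adds newly arrived documents to a Faiss index that the LM retrieves from. The `embed` function is a placeholder (deterministic random vectors), so the retrieval result here is arbitrary; a real system would reuse the frozen encoder that built the original index, as the Retro pipeline does with BERT chunk embeddings.

```python
# Minimal sketch: refresh the retrieval database without any gradient updates.
import faiss
import numpy as np

DIM = 256

def embed(texts):
    """Placeholder encoder: deterministic random unit vectors, one per text."""
    vecs = np.stack([
        np.random.default_rng(abs(hash(t)) % (2 ** 32)).standard_normal(DIM)
        for t in texts
    ]).astype("float32")
    faiss.normalize_L2(vecs)
    return vecs

# Index over the retrieval corpus that existed at pretraining time.
corpus = ["old document chunk 1", "old document chunk 2"]
index = faiss.IndexFlatIP(DIM)  # inner product == cosine on unit vectors
index.add(embed(corpus))

# Fresh documents arrive later: extend the index; the LM weights are untouched.
fresh = ["a new regulation published this year", "an updated clinical guideline"]
corpus.extend(fresh)
index.add(embed(fresh))

# Retrieval immediately reflects the new content (arbitrary here, given the
# placeholder encoder).
scores, ids = index.search(embed(["query about the new regulation"]), 2)
print([corpus[i] for i in ids[0]])
```
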
Speculation on Future Developments

Looking forward, this research opens avenues for testing even larger retrieval-augmented models and for studying whether dynamically refreshing the retrieved context during generation further improves quality. Real-time retrieval augmentation of this kind, sketched roughly below, could make LLMs more adaptive and contextually aware.

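A rough, hypothetical sketch of such an interleaved retrieve-and-generate loop is shown below; `lm_generate_step` and `retrieve` are stand-ins, not APIs from the paper's codebase.

```python
# Hypothetical sketch of "retrieve as you generate": refresh the retrieved
# evidence every CHUNK tokens so later chunks condition on up-to-date neighbors.
CHUNK = 64

def lm_generate_step(prompt: str, evidence: list[str]) -> str:
    """Placeholder: generate one token given the prompt and current evidence."""
    return "token"

def retrieve(query: str, k: int = 2) -> list[str]:
    """Placeholder: nearest-neighbor lookup over an external chunk database."""
    return [f"neighbor for: {query[-40:]}"] * k

def generate_with_dynamic_retrieval(prompt: str, max_tokens: int = 256) -> str:
    output = prompt
    evidence = retrieve(prompt)
    for step in range(max_tokens):
        if step > 0 and step % CHUNK == 0:
            # Re-query the database against the freshly generated text.
            evidence = retrieve(output[-500:])
        output += " " + lm_generate_step(output, evidence)
    return output

print(generate_with_dynamic_retrieval("Shall we pretrain with retrieval?", 8))
```
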
Overall, this paper provides a compelling case for the incorporation of retrieval mechanisms in the pretraining of autoregressive LLMs, illustrating both their current utility and potential for future advancements. The adaptability and resource efficiency offered by such models make them attractive candidates for a wide range of applications, signaling a promising direction for ongoing research in the field of AI and NLP.

Authors (12)
  1. Boxin Wang
  2. Wei Ping
  3. Peng Xu
  4. Lawrence McAfee
  5. Zihan Liu
  6. Mohammad Shoeybi
  7. Yi Dong
  8. Oleksii Kuchaiev
  9. Bo Li
  10. Chaowei Xiao
  11. Anima Anandkumar
  12. Bryan Catanzaro