
Fine Tuning vs. Retrieval Augmented Generation for Less Popular Knowledge (2403.01432v5)

Published 3 Mar 2024 in cs.CL

Abstract: Language models (LMs) memorize a vast amount of factual knowledge, exhibiting strong performance across diverse tasks and domains. However, it has been observed that the performance diminishes when dealing with less-popular or low-frequency concepts and entities, for example in domain-specific applications. The two prominent approaches to enhance the performance of LMs on low-frequency topics are: Retrieval Augmented Generation (RAG) and fine-tuning (FT) over synthetic data. This paper explores and evaluates the impact of RAG and FT on customizing LMs for handling low-frequency entities on question answering tasks. We conduct extensive experiments on twelve LMs of varying size and type and different fine-tuning, data augmentation, and retrieval models. Our findings indicate that while FT boosts the performance across entities of varying popularity, RAG surpasses FT by a large margin, particularly for the least popular factual knowledge. Additionally, the success of both RAG and FT approaches is amplified by improving retrieval and data augmentation techniques. Fine-tuning, while beneficial for small LMs, requires extensive resources. To address this issue, we propose the new Stimulus RAG approach, which surpasses the effectiveness of fine-tuning based approaches, thereby eliminating the need for the costly data augmentation and fine-tuning step for enriching LMs with less popular factual knowledge. The code is available at https://github.com/informagi/RAGvsFT.

Analyzing Fine-Tuning Versus Retrieval Augmented Generation for Handling Low-Frequency Knowledge in LLMs

LLMs have demonstrated notable success across a broad range of tasks due to their capacity to memorize vast quantities of factual information. Nonetheless, their performance can decline when dealing with low-frequency or domain-specific entities. Two key approaches to enhance model performance in these contexts are Retrieval Augmented Generation (RAG) and Fine-Tuning (FT). This paper scrutinizes the impact these methods have on improving LLMs when confronted with low-frequency entities during open-domain question answering tasks.
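
To make the RAG setup concrete, here is a minimal retrieve-then-read sketch of the kind of pipeline the paper evaluates. It assumes a toy in-memory corpus and BM25 retrieval via the rank_bm25 package; the generation step is left as a comment because the paper experiments with many different LMs, so none of this reflects the authors' exact implementation.

```python
# Minimal retrieve-then-read sketch (illustrative; the corpus, query, and
# prompt format are assumptions, not the paper's actual pipeline).
from rank_bm25 import BM25Okapi

corpus = [
    "Heidelberg is a city on the Neckar river in Germany.",
    "The Neckar is a major right tributary of the Rhine.",
    "Retrieval Augmented Generation conditions an LM on retrieved passages.",
]

# Build a BM25 index over whitespace-tokenized passages.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the top-k passages for the question."""
    return bm25.get_top_n(question.lower().split(), corpus, n=k)

def build_prompt(question: str, passages: list[str]) -> str:
    """Prepend retrieved evidence to the question, as in a standard RAG prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

question = "Which river does Heidelberg lie on?"
prompt = build_prompt(question, retrieve(question))
print(prompt)
# The prompt would then be passed to any causal LM (e.g. via transformers'
# text-generation pipeline); that call is omitted here.
```

Swapping the sparse retriever for a dense one only changes the retrieve() function; as the findings below note, the quality of whatever retriever fills this slot largely determines how much RAG (and retrieval-augmented FT) helps.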

Summary of Findings

The research indicates that fine-tuning enhances performance across entities of varying popularity, especially for entities at the extremes of the popularity distribution, while RAG consistently outperforms FT, particularly for the least popular factual knowledge. The efficacy of both strategies increases with advances in retrieval and data augmentation techniques. Specifically, the paper's key findings are:

  • Effective Strategies: RAG consistently outperforms FT, and combining RAG with a fine-tuned model yields further gains. This synergy, however, dissipates in larger models, whose stronger parametric memory reduces the added benefit of fine-tuning.
  • Fine-Tuning Variants: PEFT methods such as QLoRA deliver smaller improvements than full FT on their own. When combined with RAG, however, PEFT proves beneficial, as it better preserves the base model's reasoning capabilities (see the sketch after this list).
  • Synthetic Data Quality: The quality of synthetic data, rather than its sheer volume, is what drives performance. Prompt-based QA generation, for example, yielded stronger results than the end-to-end generation approach.
  • Model Size and Retrieval Techniques: Larger models, with stronger memorization capabilities, rely less on FT and RAG for less popular knowledge. Nonetheless, the performance of both RAG and FT remains closely tied to the retrieval system's accuracy.
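
As a rough illustration of the QLoRA-style PEFT setup referenced above, the sketch below combines 4-bit quantization with a LoRA adapter using the transformers, bitsandbytes, and peft libraries. The checkpoint name, rank, and target modules are illustrative assumptions, not the paper's configuration.

```python
# Sketch of a QLoRA-style parameter-efficient fine-tuning setup.
# Model name and hyperparameters are illustrative, not the paper's settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"  # assumed example checkpoint

# Load the base model in 4-bit NF4 precision (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Attach low-rank adapters; only these small matrices are trained.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections, a common choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights

# Training over synthetic QA pairs ("Question: ... Answer: ...") would then run
# with a standard supervised fine-tuning loop; omitted here.
```

Only the adapter weights are updated, which is what keeps this style of fine-tuning cheap enough to pair with RAG while leaving the base model's parameters, and hence its general reasoning ability, intact.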

Practical and Theoretical Implications

Practically, the findings underscore the significance of tailoring approaches based on model size and the specific type of knowledge being dealt with. Industries deploying LLMs in specialized domains may consider adopting hybrid strategies that capitalize on both RAG and FT, especially when working with smaller models.

Theoretically, the paper advances our understanding of how retrieval and fine-tuning intersect to improve model performance with infrequent knowledge. It highlights the importance of the quality of both synthetic data and retrieval, shifting focus from merely expanding data volume. This understanding points towards the potential development of even more specialized tuning techniques or hybrid models that can dynamically adapt based on the type of query or context.

Future Directions

Future research could explore the application of these methodologies to more complex QA tasks, such as multi-hop and conversational QA. Further investigation into the development of advanced QA generation techniques could improve the quality of synthetic data, potentially enabling more cost-effective and efficient fine-tuning.
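
To make the synthetic-data angle concrete, the sketch below shows one plausible form of the prompt-based QA generation highlighted in the findings: an instruction-tuned model is prompted to produce question-answer pairs grounded in a supporting passage. The model choice, prompt wording, and parsing are assumptions for illustration, not the paper's generation recipe.

```python
# Sketch of prompt-based synthetic QA generation from a passage.
# The model, prompt, and parsing are illustrative assumptions.
from transformers import pipeline

generator = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta")

passage = (
    "The Neckar is a river in Germany that flows through Tübingen, "
    "Stuttgart, and Heidelberg before joining the Rhine at Mannheim."
)

prompt = (
    "Write three question-answer pairs grounded in the passage below, "
    "one per line, in the form 'Q: ... A: ...'.\n\n"
    f"Passage: {passage}\n"
)

output = generator(prompt, max_new_tokens=200, do_sample=False)[0]["generated_text"]

# Keep only lines that match the expected 'Q: ... A: ...' pattern.
qa_pairs = [line.strip() for line in output.splitlines()
            if line.strip().startswith("Q:") and " A: " in line]
print(qa_pairs)
```

In practice, filtering the generated pairs (for example with answerability or round-trip consistency checks) is where most of the quality gains would come from, consistent with the finding that data quality matters more than volume.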

By offering insights into the nuanced impacts of RAG and FT, this paper contributes to the ongoing dialogue regarding the optimization of LLMs for domain-specific applications, potentially guiding future advances in the field of AI customization techniques.

Authors (3)
  1. Heydar Soudani (6 papers)
  2. Evangelos Kanoulas (79 papers)
  3. Faegheh Hasibi (27 papers)
Citations (11)