
Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs (2312.05934v3)

Published 10 Dec 2023 in cs.AI, cs.CL, and cs.LG

Abstract: LLMs encapsulate a vast amount of factual information within their pre-trained weights, as evidenced by their ability to answer diverse questions across different domains. However, this knowledge is inherently limited, relying heavily on the characteristics of the training data. Consequently, using external datasets to incorporate new information or refine the capabilities of LLMs on previously seen information poses a significant challenge. In this study, we compare two common approaches: unsupervised fine-tuning and retrieval-augmented generation (RAG). We evaluate both approaches on a variety of knowledge-intensive tasks across different topics. Our findings reveal that while unsupervised fine-tuning offers some improvement, RAG consistently outperforms it, both for existing knowledge encountered during training and entirely new knowledge. Moreover, we find that LLMs struggle to learn new factual information through unsupervised fine-tuning, and that exposing them to numerous variations of the same fact during training could alleviate this problem.

This paper investigates two prevalent methods for knowledge injection into LLMs: unsupervised fine-tuning and retrieval-augmented generation (RAG). Given the static and non-specific nature of an LLM's knowledge base derived from pre-training, the paper examines how these methods enhance domain-specific knowledge and update the factual information base of the models.

Purpose of Knowledge Injection

Knowledge injection is critical to improving the domain-specific expertise and factual accuracy of LLMs. The authors underscore the importance of distinguishing between previously encountered knowledge and entirely new facts. The former refers to facts within the model's training data, while the latter involves new information that the model has not been exposed to during training.

Methodologies Compared

  1. Unsupervised Fine-Tuning:
    • This method continues training a pre-trained model on additional domain-specific text without labeled examples. While it can improve performance on the target domain, it shows limited efficacy in learning new information.
    • One problem identified is that LLMs have difficulty assimilating new factual knowledge through unsupervised fine-tuning unless they encounter numerous variations of the same fact during training.
  2. Retrieval-Augmented Generation (RAG):
    • RAG integrates retrieval mechanisms that allow an LLM to access external knowledge bases dynamically, thereby enhancing the model’s ability to incorporate new information that wasn't present in the original training data.
    • The RAG method consistently outperforms fine-tuning, particularly in injecting new facts into the model. It avoids issues like catastrophic forgetting, where the model loses previously learned information due to the adaptation process.
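The retrieval step described above can be sketched as follows. This is a minimal, self-contained illustration, not the paper's implementation: a bag-of-words overlap score stands in for the dense embedding model the authors actually used, and the corpus, query, and function names are hypothetical.

```python
from collections import Counter

def tokens(text: str) -> Counter:
    """Lowercase bag-of-words with trailing punctuation stripped."""
    return Counter(t.strip(".,?!") for t in text.lower().split())

def score(query: str, doc: str) -> float:
    """Crude lexical relevance: token overlap, lightly length-normalised.
    A real RAG system would use embedding cosine similarity instead."""
    q, d = tokens(query), tokens(doc)
    return sum((q & d).values()) / (1 + sum(d.values()))

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the top-k corpus chunks by relevance score."""
    return sorted(corpus, key=lambda doc: score(query, doc), reverse=True)[:k]

def build_prompt(query: str, corpus: list[str], k: int = 2) -> str:
    """Prepend retrieved context to the question before calling the LLM."""
    context = "\n".join(retrieve(query, corpus, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

corpus = [
    "The Eiffel Tower is 330 metres tall.",
    "Photosynthesis converts light into chemical energy.",
    "The Eiffel Tower was completed in 1889.",
]
prompt = build_prompt("How tall is the Eiffel Tower?", corpus)
```

Because the external corpus is consulted at inference time rather than baked into the weights, updating the model's knowledge only requires updating the corpus, which is why RAG sidesteps catastrophic forgetting.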

Key Findings

  • Performance: RAG outperforms fine-tuning at improving an LLM’s knowledge, regardless of whether the information was previously encountered during training or entirely new.
  • Reliability: RAG is more robust in updating LLMs’ knowledge bases without degrading other model capabilities.
  • Challenges in Fine-Tuning: Unsupervised fine-tuning showed limited improvements and was less reliable for incorporating new factual information.
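The finding that fine-tuning benefits from many variations of the same fact suggests augmenting each fact with paraphrases before building the training corpus. A rough sketch is below; the paper used an LLM to generate paraphrases, whereas fixed templates stand in here so the example stays self-contained (the templates and function name are hypothetical).

```python
# Paraphrase-style augmentation: restate one fact in several surface forms
# so the model sees it repeatedly in different contexts during fine-tuning.
FACT_TEMPLATES = [
    "{subject} {relation} {object}.",
    "It is known that {subject} {relation} {object}.",
    "Recall that {subject} {relation} {object}.",
    "A relevant fact: {subject} {relation} {object}.",
]

def augment_fact(subject: str, relation: str, object_: str) -> list[str]:
    """Render one (subject, relation, object) fact as multiple text variants."""
    return [t.format(subject=subject, relation=relation, object=object_)
            for t in FACT_TEMPLATES]

training_corpus = augment_fact("The Amazon River", "flows into",
                               "the Atlantic Ocean")
```

Each variant carries the same information in a different surface form, which is the property the authors found helps unsupervised fine-tuning absorb new facts.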

Future Directions

The paper highlights potential areas for further investigation:

  • Optimization of RAG: Performance varied with the number of retrieved documents, suggesting a need for more efficient strategies for selecting relevant context.
  • Combined Techniques: The exploration of hybrid knowledge injection techniques, including supervised and reinforcement learning, could provide more comprehensive solutions.
  • Knowledge Representation: Further studies are needed to understand how LLMs internally represent knowledge, which could advance future improvements in knowledge injection methods.

By providing these insights, the paper makes significant contributions to understanding how to better inject knowledge into LLMs, thereby enhancing their functionality and adaptability for various domain-specific applications.

Authors (4)
  1. Oded Ovadia
  2. Menachem Brief
  3. Moshik Mishaeli
  4. Oren Elisha
Citations (86)