
Instruction-tuned Language Models are Better Knowledge Learners (2402.12847v2)

Published 20 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: In order for LLM-based assistants to effectively adapt to evolving information needs, it must be possible to update their factual knowledge through continued training on new data. The standard recipe for doing so involves continued pre-training on new documents followed by instruction-tuning on question-answer (QA) pairs. However, we find that LLMs trained with this recipe struggle to answer questions, even though the perplexity of documents is minimized. We found that QA pairs are generally straightforward, while documents are more complex, weaving many factual statements together in an intricate manner. Therefore, we hypothesize that it is beneficial to expose LLMs to QA pairs before continued pre-training on documents so that the process of encoding knowledge from complex documents takes into account how this knowledge is accessed through questions. Based on this, we propose pre-instruction-tuning (PIT), a method that instruction-tunes on questions prior to training on documents. This contrasts with standard instruction-tuning, which learns how to extract knowledge after training on documents. Extensive experiments and ablation studies demonstrate that pre-instruction-tuning significantly enhances the ability of LLMs to absorb knowledge from new documents, outperforming standard instruction-tuning by 17.8%.

Enhancing Knowledge Absorption in LLMs with Pre-Instruction-Tuning

Introduction to Pre-Instruction-Tuning

Recent advances in LLMs have demonstrated their capacity to store vast amounts of factual knowledge in their parameters. However, because this knowledge is static, it can quickly become outdated or insufficient for specialized demands. The conventional strategy for updating LLMs is continued pre-training on new documents followed by instruction-tuning on question-answer pairs. Despite this approach's popularity, our investigations reveal its limitations in effectively updating LLMs' knowledge. This paper introduces Pre-Instruction-Tuning (PIT), a strategy that reverses the conventional sequence by instruction-tuning LLMs on question-answer pairs prior to document pre-training. Our experiments, conducted with the Llama-2 models, show that PIT substantially enhances LLMs' ability to absorb knowledge, with significant improvements over standard instruction-tuning.

Methodology and Experiments

The initial phase of our research evaluated how well LLMs can expand their knowledge through the standard practice of document pre-training followed by instruction-tuning. We experimented extensively with the Llama-2 models on the specially curated Wiki2023 dataset, which comprises documents and associated question-answer pairs drawn from Wikipedia articles categorized under the year 2023. These experiments revealed a phenomenon we term the "perplexity curse": QA accuracy improves only marginally even after document perplexity has been minimized, highlighting the inefficacy of the standard approach at substantially improving knowledge absorption.
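To make the "perplexity curse" concrete, recall what continued pre-training actually optimizes: perplexity is the exponential of the average negative log-likelihood per token, so driving it toward 1 means the model assigns near-certain probability to the document's tokens. A minimal sketch of the quantity being minimized (the helper name is illustrative, not from the paper):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token.
    Continued pre-training on new documents drives this value down,
    yet low perplexity alone does not guarantee the knowledge can be
    retrieved via questions."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model assigning probability 0.5 to every token has perplexity 2;
# perfect (probability-1) predictions give perplexity 1.
print(perplexity([math.log(0.5)] * 4))
print(perplexity([0.0] * 3))
```

The paper's observation is that this number can be near its floor while question-answering accuracy on the same facts remains limited.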

To address these limitations, we proposed PIT, hypothesizing that exposing LLMs to the format in which knowledge is accessed (questions) before they learn to encode new information from documents would orient them toward more effective knowledge acquisition. The methodology involved experimenting with various training sequences (questions before the associated documents, and vice versa) to ascertain the optimal learning path. Our findings indicate a clear advantage to beginning the training sequence with question-answer pairs, solidifying the foundation of the PIT approach.
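The contrast between the two recipes reduces to the order of the training phases. The sketch below, with a hypothetical `build_schedule` helper, shows the ordering difference only; it omits the actual optimization details:

```python
def build_schedule(documents, qa_pairs, method):
    """Return the sequence of training phases for each recipe.

    'standard': continued pre-training on documents, then
                instruction-tuning on QA pairs.
    'pit':      instruction-tuning on QA pairs first
                (pre-instruction-tuning), then documents.
    """
    doc_phase = [("doc", d) for d in documents]
    qa_phase = [("qa", q) for q in qa_pairs]
    if method == "standard":
        return doc_phase + qa_phase
    if method == "pit":
        return qa_phase + doc_phase
    raise ValueError(f"unknown method: {method}")

docs = ["2023 Wikipedia article on some new topic"]
qas = [("Who founded the organization?", "Jane Doe")]
# Under PIT, the first training phase is QA; under the standard
# recipe, it is document pre-training.
print(build_schedule(docs, qas, "pit")[0][0])
print(build_schedule(docs, qas, "standard")[0][0])
```

The design intuition, per the paper, is that seeing how knowledge is queried first shapes how the model later encodes knowledge from complex documents.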

Results and Implications

Our comprehensive evaluation shows that PIT significantly surpasses standard instruction-tuning in enhancing LLMs' ability to absorb and retrieve knowledge from new documents. Specifically, models trained with PIT demonstrated a 17.8% improvement in QA accuracy over counterparts trained with the standard instruction-tuning process. Furthermore, PIT displayed promising generalization across document domains, indicating its potential applicability to a wide range of knowledge absorption and retrieval tasks.
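The reported gains are in closed-book QA accuracy. The paper does not specify its scoring code here, but an exact-match style metric with light normalization is a common choice for this kind of evaluation; the sketch below is an assumption of that form, not the authors' implementation:

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predicted answers that exactly match the reference
    after simple lowercasing and whitespace normalization."""
    def norm(s):
        return " ".join(s.lower().split())
    hits = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return hits / len(references)

preds = ["Paris", "london ", "Berlin"]
refs = ["Paris", "London", "Madrid"]
print(exact_match_accuracy(preds, refs))  # 2 of 3 match
```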

Future Prospects and Limitations

The encouraging outcomes from applying PIT highlight its potential as a pivotal methodology for continual learning and knowledge updating in LLMs. Future work could extend beyond Wikipedia-based datasets to varied data sources, broadening the applicability of the PIT approach to dynamically updating LLMs across diverse information domains. It is important, however, to acknowledge the current limitations: the dataset is drawn solely from Wikipedia articles, and the method targets factual knowledge retrieval specifically, so the gains may not directly translate to skills such as reasoning or comprehension.

Acknowledgements and Concluding Remarks

The contribution of various researchers and the feedback received throughout the investigation have been invaluable in shaping this paper. In conclusion, Pre-Instruction-Tuning emerges as a compelling strategy for enhancing the knowledge learning capabilities of LLMs, presenting a significant step forward in the field of generative AI and model training methodologies.

Authors: Zhengbao Jiang, Zhiqing Sun, Weijia Shi, Pedro Rodriguez, Chunting Zhou, Graham Neubig, Xi Victoria Lin, Wen-tau Yih, Srinivasan Iyer
Citations (18)