
Structure-aware Domain Knowledge Injection for Large Language Models (2407.16724v2)

Published 23 Jul 2024 in cs.CL

Abstract: This paper introduces a methodology, termed StructTuning, to efficiently transform foundation LLMs into domain specialists. It reduces the training corpus requirement to a mere 0.3% while achieving roughly 50% of the performance of traditional knowledge injection. The method is inspired by how human students learn: structured domain knowledge from textbooks is first assimilated and then applied to real-world challenges through specific exercises. Based on this, we propose a two-stage strategy for knowledge injection and alignment: Structure-aware Continual Pre-Training (SCPT) and Structure-aware Supervised Fine-Tuning (SSFT). In the SCPT phase, we automatically extract the domain knowledge taxonomy and reorganize the training corpora, enabling LLMs to effectively link textual segments to targeted knowledge points within the taxonomy. In the SSFT phase, we explicitly prompt models to elucidate the underlying knowledge structure in their outputs, leveraging the structured domain insight to address practical problems. Our method has undergone extensive evaluation across model architectures and scales, using closed-book question-answering tasks on the LongBench and MMedBench datasets. Remarkably, it demonstrates the potential to achieve improvements comparable to the state-of-the-art MMedLM2 on MMedBench while reducing training costs to 5%. This paves the way for scaling up StructTuning toward stronger domain-specific LLMs with comprehensive data utilization. Code is available at https://github.com/alibaba/struxgpt.
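
The abstract does not give implementation details, but the two-stage recipe can be illustrated with a small sketch. The Python below is only an assumption of how SCPT and SSFT training samples might be assembled from an extracted knowledge taxonomy; the names (KnowledgePoint, CorpusSegment, build_scpt_sample, build_ssft_sample) and the sample formats are hypothetical and not taken from the struxgpt repository.

```python
# Hypothetical sketch of StructTuning-style data construction (not the official code).
from dataclasses import dataclass
from typing import List


@dataclass
class KnowledgePoint:
    """A node in the extracted domain-knowledge taxonomy (assumed structure)."""
    path: List[str]   # e.g. ["Internal Medicine", "Endocrinology", "Type 2 Diabetes"]
    summary: str      # short description of the knowledge point


@dataclass
class CorpusSegment:
    """A text chunk from the domain corpus, linked to one knowledge point."""
    text: str
    point: KnowledgePoint


def build_scpt_sample(seg: CorpusSegment) -> str:
    """SCPT sample: prefix the raw segment with its position in the taxonomy,
    so continual pre-training associates the text with a knowledge point."""
    breadcrumb = " > ".join(seg.point.path)
    return f"[Knowledge point] {breadcrumb}\n[Content] {seg.text}"


def build_ssft_sample(question: str, answer: str, point: KnowledgePoint) -> dict:
    """SSFT sample: the target response first states the relevant knowledge
    structure, then answers the practical question."""
    breadcrumb = " > ".join(point.path)
    response = (
        f"Relevant knowledge: {breadcrumb} - {point.summary}\n"
        f"Answer: {answer}"
    )
    return {"instruction": question, "output": response}


if __name__ == "__main__":
    kp = KnowledgePoint(
        path=["Internal Medicine", "Endocrinology", "Type 2 Diabetes"],
        summary="First-line pharmacological management.",
    )
    seg = CorpusSegment(
        text="Metformin is typically recommended as initial therapy...",
        point=kp,
    )
    print(build_scpt_sample(seg))
    print(build_ssft_sample(
        "What is the usual first-line drug for type 2 diabetes?",
        "Metformin, unless contraindicated.",
        kp,
    ))
```

Under these assumptions, the SCPT text keeps the corpus content intact while making the taxonomy path explicit, and the SSFT pairs train the model to surface that structure before answering, mirroring the two phases described in the abstract.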

