Improving Domain Adaptation through Extended-Text Reading Comprehension (2401.07284v2)

Published 14 Jan 2024 in cs.CL

Abstract: To enhance the domain-specific capabilities of LLMs, continued pre-training on a domain-specific corpus is a prevalent method. Recent work demonstrates that adapting models using reading comprehension data formatted by regex-based patterns can significantly improve performance on domain-specific tasks. However, regex-based patterns are incapable of parsing raw corpora using domain-specific knowledge. Furthermore, extracting question-answer pairs directly from the corpus in predefined formats offers limited context. To address these limitations, we improve reading comprehension via an LLM and clustering. The LLM leverages domain knowledge within the corpus to refine the comprehension stage, while clustering supplies relevant knowledge by extending the context to enrich the reading stage. Additionally, our method incorporates parameter-efficient fine-tuning to improve the efficiency of domain adaptation. In comparison to AdaptLLM, our method achieves an improvement exceeding 5% on domain-specific tasks. Our code will be available at https://github.com/microsoft/LMOps.

Improving Domain Adaptation through Extended-Text Reading Comprehension

This paper advances domain adaptation for LLMs through extended-text reading comprehension. The researchers propose an approach that combines LLM-based data generation with clustering to address specific limitations of the regex-based patterns used by existing methods such as AdaptLLM.

Overview and Methodology

The paper aims to enhance the domain-specific capabilities of LLMs by refining the reading comprehension paradigm. Traditional approaches, such as AdaptLLM, rely heavily on regex-based patterns to transform corpora into structured question-answer formats. However, these patterns struggle with complex domain-specific knowledge extraction and provide limited context.

To overcome these challenges, the researchers developed a multifaceted approach:

  1. LLM-based Data Preprocessing: By employing models like ChatGPT, the approach generates high-quality question-answer pairs from the domain corpus, addressing the inadequacies of regex patterns by better capturing domain nuances. Additionally, the authors fine-tune a smaller LLM to preprocess extensive datasets efficiently, mitigating the cost of relying on API-based LLMs (a hedged sketch of this generation step follows the list).
  2. Length-based Clustering: The method enhances context comprehension by clustering similar documents, thus extending the input context. This is particularly beneficial in domains like biomedicine, where documents such as abstracts tend to be concise but require deeper contextual understanding (a clustering sketch also follows the list).
  3. Parameter-Efficient Fine-Tuning: The paper explores the use of LoRA for parameter-efficient fine-tuning, showing that with appropriate settings it can outperform traditional full fine-tuning, especially for embedding domain-specific knowledge.
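
The following is a minimal sketch of the LLM-based question-answer generation step. The prompt wording, the `call_llm` helper, and the JSON output format are illustrative assumptions rather than the authors' exact pipeline (the released code at https://github.com/microsoft/LMOps is the authoritative reference).

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical helper: send `prompt` to an instruction-tuned LLM (an API
    model such as ChatGPT, or a smaller fine-tuned model for bulk processing)
    and return the raw text completion."""
    raise NotImplementedError

QA_PROMPT = (
    "Read the domain-specific passage below and write question-answer pairs "
    "that test understanding of its key facts. Return a JSON list of objects "
    'with "question" and "answer" fields.\n\nPassage:\n{passage}\n'
)

def generate_qa_pairs(passage: str) -> list[dict]:
    """Turn one raw-corpus passage into reading-comprehension QA pairs."""
    completion = call_llm(QA_PROMPT.format(passage=passage))
    try:
        return json.loads(completion)   # expect: [{"question": ..., "answer": ...}, ...]
    except json.JSONDecodeError:
        return []                       # discard malformed generations

def build_training_example(passage: str) -> str:
    """Concatenate the passage (reading) with its QA pairs (comprehension),
    mirroring the reading-comprehension training format."""
    qa_text = "\n".join(
        f"Question: {qa['question']}\nAnswer: {qa['answer']}"
        for qa in generate_qa_pairs(passage)
    )
    return f"{passage}\n\n{qa_text}"
```

In practice, an API model can first produce a seed set with this kind of prompt, after which a smaller fine-tuned model takes over bulk preprocessing to keep costs down, as described in the first step above.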

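A minimal sketch of the length-based clustering step is shown below, assuming sentence-transformers embeddings, k-means, and a simple character budget; the embedding model, cluster count, and budget are placeholders rather than the paper's reported configuration.

```python
# Length-based clustering sketch: group semantically similar short documents and
# pack each group into longer training examples so that related knowledge appears
# in the same context window. Model name, cluster count, and character budget are
# assumptions for illustration.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def extend_context(documents: list[str], n_clusters: int = 64,
                   max_chars: int = 8000) -> list[str]:
    embedder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model
    embeddings = embedder.encode(documents)
    labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(embeddings)

    # Bucket documents by cluster label.
    clusters: dict[int, list[str]] = {}
    for doc, label in zip(documents, labels):
        clusters.setdefault(int(label), []).append(doc)

    # Concatenate cluster members until the length budget is reached, then start
    # a new extended example.
    extended = []
    for docs in clusters.values():
        buffer = ""
        for doc in docs:
            if buffer and len(buffer) + len(doc) > max_chars:
                extended.append(buffer)
                buffer = ""
            buffer = f"{buffer}\n\n{doc}" if buffer else doc
        if buffer:
            extended.append(buffer)
    return extended
```
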
Experimental Results

Empirical evaluations demonstrate substantial performance improvements over existing models: compared to AdaptLLM, the method achieves an improvement exceeding 5% on domain-specific tasks. Noteworthy results were obtained in the biomedicine and finance domains:

  • Biomedicine Domain: The approach yielded an average improvement in performance metrics across several datasets, including PubMedQA and BioMMLU.
  • Finance Domain: Similar performance gains were observed, with significant improvements on datasets like ConvFinQA and FPB.

These results emphasize the efficacy of integrating extended-text context through clustering and the strategic use of LLM-based preprocessing.

Implications and Future Directions

The methodological advancements presented in this paper have significant theoretical and practical implications:

  • Improved Domain Adaptation: The integration of LLMs for data preprocessing and extended context through clustering could redefine domain adaptation strategies, making them more robust and contextually aware.
  • Efficiency in Model Adaptation: By demonstrating the effectiveness of parameter-efficient tuning, the research provides a roadmap for more resource-efficient adaptation, potentially broadening access to advanced domain-adapted models; a minimal LoRA sketch follows this list.
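
As a concrete illustration of that parameter-efficient setup, here is a minimal LoRA sketch using the Hugging Face peft library. The base checkpoint, rank, scaling factor, and target modules are assumptions for illustration, not the configuration reported in the paper.

```python
# Minimal LoRA setup with the Hugging Face peft library. The base checkpoint and
# hyperparameters are illustrative assumptions, not the paper's reported settings.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed base model

lora_config = LoraConfig(
    r=32,                                # low-rank dimension (assumed)
    lora_alpha=64,                       # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"], # attention projections to adapt (assumed)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
# `model` can now be trained on the extended reading-comprehension corpus with a
# standard causal-LM training loop, updating only the LoRA adapters.
```

Because only the low-rank adapter matrices are updated, the memory and storage footprint of adapting a 7B-scale model drops substantially compared to full fine-tuning.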

Future research may further explore the scalability of these techniques to other domains, such as legal or industrial applications. Additionally, refining clustering algorithms to adjust context lengths dynamically, or integrating more capable LLMs for preprocessing, could offer further performance gains.

In conclusion, this paper makes a significant contribution by proposing an innovative approach to domain adaptation, leveraging enhanced reading comprehension techniques to improve model performance in domain-specific tasks. The findings open new avenues for efficient model training and adaptation in complex domains, offering both practical benefits and theoretical insights.

References (30)
  1. DISC-FinLLM: A Chinese financial large language model based on multiple experts fine-tuning. arXiv preprint arXiv:2310.15205.
  2. ConvFinQA: Exploring the chain of numerical reasoning in conversational finance question answering. arXiv preprint arXiv:2210.03849.
  3. Adapting large language models via reading comprehension. arXiv preprint arXiv:2309.09530.
  4. Knowledge neurons in pretrained transformers. arXiv preprint arXiv:2104.08696.
  5. Franck Dernoncourt and Ji Young Lee. 2017. PubMed 200k RCT: A dataset for sequential sentence classification in medical abstracts. arXiv preprint arXiv:1710.06071.
  6. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.
  7. Don't stop pretraining: Adapt language models to domains and tasks. arXiv preprint arXiv:2004.10964.
  8. MedAlpaca: An open-source collection of medical conversational AI models and training data. arXiv preprint arXiv:2304.08247.
  9. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
  10. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
  11. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421.
  12. PubMedQA: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2567–2577.
  13. The inductive bias of in-context learning: Rethinking pretraining example design. arXiv preprint arXiv:2110.04541.
  14. ChipNeMo: Domain-adapted LLMs for chip design. arXiv preprint arXiv:2311.00176.
  15. FinGPT: Democratizing internet-scale data for financial large language models. arXiv preprint arXiv:2307.10485.
  16. Kai Lu. 2023. Can ChatGPT help college instructors generate high-quality quiz questions? Human Interaction and Emerging Technologies (IHIET-AI 2023): Artificial Intelligence and Future Applications, 70(70).
  17. WWW'18 open challenge: Financial opinion mining and question answering. In Companion Proceedings of The Web Conference 2018, pages 1941–1942.
  18. Good debt or bad debt: Detecting semantic orientations in economic texts. Journal of the Association for Information Science and Technology, 65(4):782–796.
  19. Effective transfer learning for identifying similar questions: Matching user questions to COVID-19 FAQs. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3458–3465.
  20. Orca: Progressive learning from complex explanation traces of GPT-4. arXiv preprint arXiv:2306.02707.
  21. MS MARCO: A human generated machine reading comprehension dataset. choice, 2640:660.
  22. Andrew M. Olney. 2023. Generating multiple choice questions from a textbook: LLMs match human performance on most metrics. In AIED Workshops.
  23. Domain adaption of named entity recognition to support credit risk assessment. In Proceedings of the Australasian Language Technology Association Workshop 2015, pages 84–90, Parramatta, Australia.
  24. In-context pretraining: Language modeling beyond document boundaries. arXiv preprint arXiv:2310.10638.
  25. Ankur Sinha and Tanmay Khandait. 2021. Impact of news on the commodity market: Dataset and results. In Advances in Information and Communication: Proceedings of the 2021 Future of Information and Communication Conference (FICC), Volume 2, pages 589–601. Springer.
  26. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  27. BloombergGPT: A large language model for finance. arXiv preprint arXiv:2303.17564.
  28. WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244.
  29. Retrieve anything to augment large language models. arXiv preprint arXiv:2310.07554.
  30. LIMA: Less is more for alignment. arXiv preprint arXiv:2305.11206.
Authors (11)
  1. Ting Jiang
  2. Shaohan Huang
  3. Shengyue Luo
  4. Zihan Zhang
  5. Haizhen Huang
  6. Furu Wei
  7. Weiwei Deng
  8. Feng Sun
  9. Qi Zhang
  10. Deqing Wang
  11. Fuzhen Zhuang