
Adapting Large Language Models to Domains via Reading Comprehension (2309.09530v4)

Published 18 Sep 2023 in cs.CL
Abstract: We explore how continued pre-training on domain-specific corpora influences LLMs, revealing that training on the raw corpora endows the model with domain knowledge, but drastically hurts its prompting ability for question answering. Taking inspiration from human learning via reading comprehension--practice after reading improves the ability to answer questions based on the learned knowledge--we propose a simple method for transforming raw corpora into reading comprehension texts. Each raw text is enriched with a series of tasks related to its content. Our method, highly scalable and applicable to any pre-training corpora, consistently enhances performance across various tasks in three different domains: biomedicine, finance, and law. Notably, our 7B LLM achieves competitive performance with domain-specific models of much larger scales, such as BloombergGPT-50B. Furthermore, we demonstrate that domain-specific reading comprehension texts can improve the model's performance even on general benchmarks, showing the potential to develop a general model across even more domains. Our model, code, and data are available at https://github.com/microsoft/LMOps.

Adapting LLMs via Reading Comprehension

The paper presents a novel approach to improving domain-specific capabilities of LLMs by transforming raw corpora into reading comprehension texts. By aligning the learning process of models more closely with human learning strategies, this method significantly enhances both domain knowledge acquisition and prompting ability, suggesting promising implications for developing more generalized and competent LLMs.

Key Contributions and Methodology

  1. Problem Identification: The authors identify a critical trade-off in continued pre-training on domain-specific corpora: while it imparts domain knowledge, it simultaneously degrades the model's ability to answer questions via prompting. They attribute this to the limited diversity of domain-specific corpora relative to general pre-training data.
  2. Reading Comprehension Approach: Inspired by human learning, the authors introduce a method where raw corpora are converted into reading comprehension texts, which are then used for further training. This involves appending tasks, such as summarization, word-to-text generation, and natural language inference, to the original texts—effectively simulating a comprehension practice for models.
  3. Integration with General Instructions: The reading comprehension tasks are augmented with general instructions, helping to cover a diverse range of input-output patterns and improve the prompting ability of the models.
  4. Experimental Validation: The authors conduct experiments across three domains: biomedicine, finance, and law. Notably, a 7-billion parameter model using this method achieves competitive performance with much larger, domain-specific models like BloombergGPT-50B.
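
The core transformation can be illustrated with a minimal sketch. The paper's actual method mines task pairs from the corpus itself using regex patterns over naturally occurring phrases; the templates and keyword heuristics below are simplified assumptions for illustration only.

```python
import re

def build_comprehension_text(raw_text: str) -> str:
    """Append simple comprehension-style tasks to a raw document.

    A minimal sketch of the idea: enrich each raw text with
    question-answer exercises grounded in its own content. The
    task templates and heuristics here are illustrative, not the
    paper's regex-mined patterns.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", raw_text) if s.strip()]
    tasks = []
    if sentences:
        # Summarization-style task, using the first sentence as a proxy summary.
        tasks.append(
            f"Question: What is the text above mainly about?\nAnswer: {sentences[0]}"
        )
        # Word-to-text task: ask the model to produce a sentence reusing
        # salient (here, simply long) words from the document.
        keywords = [w.strip(".,;") for w in sentences[-1].split() if len(w) > 7][:3]
        if keywords:
            tasks.append(
                "Question: Write a sentence containing the words "
                f"{', '.join(keywords)}.\nAnswer: {sentences[-1]}"
            )
    return raw_text + "\n\n" + "\n\n".join(tasks)

doc = ("Continued pre-training on raw domain corpora adds knowledge. "
       "However, it can degrade prompting ability substantially.")
print(build_comprehension_text(doc))
```

In the paper's pipeline, documents transformed this way are then interleaved with general instruction data before continued pre-training, which is what restores the prompting ability that raw-corpus training erodes.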

Experimental Results

  • The proposed method consistently outperformed baseline models in domain-specific tasks across biomedicine, finance, and law.
  • The introduction of reading comprehension tasks mitigated the drop in prompting ability seen in models trained solely on raw domain-specific data.
  • Significant improvements were noted in both fine-tuning and zero-shot prompting evaluations, suggesting effective domain knowledge acquisition.

Implications and Future Directions

The paper's approach underscores the importance of task diversity in leveraging large-scale corpora for LLM adaptation. By infusing domain-specific training with general task-generation strategies, the method achieves a balanced enhancement of domain knowledge and model robustness. This opens avenues for deploying LLMs across multiple highly specialized fields without the prohibitive computational costs of developing models from scratch.

Future developments could explore:

  • Scalability to other domains and various model sizes.
  • Integration with reinforcement learning from human feedback (RLHF) for further alignment with human evaluative standards.
  • Application in real-world environments where generalization capability across diverse, unseen scenarios is critical.

Conclusion

Overall, this paper proposes a feasible and scalable method for domain adaptation of LLMs. The integration of reading comprehension tasks presents a promising pathway towards developing more versatile AI systems capable of excelling across a wide spectrum of specialized tasks. This work contributes a significant methodological innovation to the field of natural language processing, enhancing both theoretical insights and practical applications.

Authors (3)
  1. Daixuan Cheng (8 papers)
  2. Shaohan Huang (79 papers)
  3. Furu Wei (291 papers)
Citations (35)