Language Model Adaptation to Specialized Domains through Selective Masking based on Genre and Topical Characteristics (2402.12036v2)
Abstract: Recent advances in pre-trained language modeling have facilitated significant progress across various NLP tasks. Word masking during model training constitutes a pivotal component of language modeling in architectures like BERT. However, the prevalent method of word masking relies on random selection, potentially disregarding domain-specific linguistic attributes. In this article, we introduce an innovative masking approach leveraging genre and topicality information to tailor language models to specialized domains. Our method incorporates a ranking process that prioritizes words based on their significance, subsequently guiding the masking procedure. Experiments conducted using continual pre-training within the legal domain have underscored the efficacy of our approach on the LegalGLUE benchmark in the English language. Pre-trained language models and code are freely available for use.
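As a minimal sketch of the idea described above, the snippet below replaces uniform random masking with importance-guided masking: tokens are ranked by a per-word significance score and the top-ranked positions within the masking budget are masked. The scoring shown here (a hypothetical dictionary of TF-IDF-like weights) and the names `selective_mask` and `MASK_RATIO` are illustrative assumptions, not the paper's exact genre- and topicality-based ranking.

```python
import random

MASK_TOKEN = "[MASK]"   # BERT-style mask token
MASK_RATIO = 0.15       # standard masking budget

def selective_mask(tokens, importance, mask_ratio=MASK_RATIO, mask_token=MASK_TOKEN):
    """Mask the highest-ranked tokens instead of a uniform random sample.

    `importance` maps each token to a significance score; how that score is
    computed (e.g. TF-IDF-style statistics over a domain corpus) is an
    assumption here, not the paper's exact genre/topicality formula.
    """
    n_to_mask = max(1, int(len(tokens) * mask_ratio))
    # Rank token positions by descending importance; break ties randomly so
    # repeated epochs do not always mask the identical positions.
    ranked = sorted(range(len(tokens)),
                    key=lambda i: (-importance.get(tokens[i], 0.0), random.random()))
    to_mask = set(ranked[:n_to_mask])
    masked = [mask_token if i in to_mask else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in to_mask else None for i, tok in enumerate(tokens)]
    return masked, labels

# Example: domain-salient words ("plaintiff", "statute") outrank generic ones.
tokens = "the plaintiff filed a motion under the statute".split()
scores = {"plaintiff": 8.2, "statute": 7.9, "motion": 5.1, "filed": 2.3}
print(selective_mask(tokens, scores))
```

With a larger budget the same ranking simply extends further down the list, so generic function words are only masked once the domain-salient vocabulary is exhausted.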
- Linguistically informed masking for representation learning in the patent domain. In 2nd Workshop on Patent Text Mining and Semantic Technologies (PatentSemTech2021), New York, NY, USA. Association for Computing Machinery.
- UniLMv2: Pseudo-masked language models for unified language model pre-training.
- Douglas Biber and Susan Conrad. 2019. Register, Genre, and Style. Cambridge University Press.
- LEGAL-BERT: The Muppets straight out of Law School. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2898–2904, Online. Association for Computational Linguistics.
- LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15513–15535, Toronto, Canada. Association for Computational Linguistics.
- LexGLUE: A Benchmark Dataset for Legal Language Understanding in English. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4310–4330, Dublin, Ireland. Association for Computational Linguistics.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- REALM: Retrieval-augmented language model pre-training.
- Nicolas Hernandez and Brigitte Grau. 2003. Automatic extraction of meta-descriptors for text description. In International Conference on Recent Advances In Natural Language Processing (RANLP), Borovets, Bulgaria.
- Ken Hyland. 1998. Persuasion and context: The pragmatics of academic metadiscourse. Journal of pragmatics, 30(4):437–455.
- Karen Spärck Jones. 1972. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28:11–21.
- SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77.
- Continual pre-training of language models. In The Eleventh International Conference on Learning Representations.
- DrBERT: A robust pre-trained model in French for biomedical and clinical domains. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16207–16221, Toronto, Canada. Association for Computational Linguistics.
- PMI-Masking: Principled masking of correlated spans.
- Yian Li and Hai Zhao. 2021. Pre-training universal language representation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5122–5133, Online. Association for Computational Linguistics.
- RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
- ERNIE: Enhanced Representation through Knowledge Integration. arXiv:1904.09223.
- Pretrained language model in continual learning: A comparative study. In International Conference on Learning Representations.
- Learning Better Masking for Better Language Model Pre-training. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7255–7267, Toronto, Canada. Association for Computational Linguistics.
- OPT: Open pre-trained transformer language models.