Language Model Adaptation to Specialized Domains through Selective Masking based on Genre and Topical Characteristics (2402.12036v2)

Published 19 Feb 2024 in cs.CL

Abstract: Recent advances in pre-trained language modeling have facilitated significant progress across various NLP tasks. Word masking during model training constitutes a pivotal component of language modeling in architectures like BERT. However, the prevalent method of word masking relies on random selection, potentially disregarding domain-specific linguistic attributes. In this article, we introduce an innovative masking approach leveraging genre and topicality information to tailor language models to specialized domains. Our method incorporates a ranking process that prioritizes words based on their significance, subsequently guiding the masking procedure. Experiments conducted using continual pre-training within the legal domain have underscored the efficacy of our approach on the LegalGLUE benchmark in the English language. Pre-trained language models and code are freely available for use.
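
As a rough illustration of the selective-masking idea described in the abstract, the sketch below ranks words by a TF-IDF-style significance score and masks the top-ranked fraction of positions instead of sampling them uniformly at random. This is a minimal, hypothetical reconstruction, not the authors' released code: the choice of TF-IDF as the significance measure, the 15% masking ratio, and all function names are assumptions.

```python
# Minimal sketch of significance-guided masking (assumed, not the authors'
# implementation): rank words by a TF-IDF-style score and mask the
# highest-ranked ones instead of choosing mask positions at random.
import math
import random
from collections import Counter

MASK_TOKEN = "[MASK]"

def tfidf_scores(docs):
    """Compute a simple TF-IDF score per word for each document in a small corpus."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            w: (tf[w] / len(doc)) * math.log(n_docs / df[w])
            for w in tf
        })
    return scores

def selective_mask(tokens, word_scores, mask_ratio=0.15, seed=0):
    """Mask the `mask_ratio` fraction of tokens with the highest significance scores.

    Ties are broken randomly so repeated epochs do not always mask
    exactly the same positions.
    """
    rng = random.Random(seed)
    n_mask = max(1, int(round(mask_ratio * len(tokens))))
    ranked = sorted(
        range(len(tokens)),
        key=lambda i: (word_scores.get(tokens[i], 0.0), rng.random()),
        reverse=True,
    )
    to_mask = set(ranked[:n_mask])
    return [MASK_TOKEN if i in to_mask else t for i, t in enumerate(tokens)]

if __name__ == "__main__":
    corpus = [
        "the court dismissed the appeal for lack of jurisdiction".split(),
        "the defendant filed a motion to dismiss the complaint".split(),
    ]
    scores = tfidf_scores(corpus)
    print(selective_mask(corpus[0], scores[0]))
```

In a continual pre-training setup, a scheme along these lines would replace the random mask-position sampling in the masked-language-modeling data collator while leaving the rest of the MLM objective unchanged.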

