
Improving Pre-trained Language Model Sensitivity via Mask Specific losses: A case study on Biomedical NER (2403.18025v2)

Published 26 Mar 2024 in cs.CL, cs.AI, cs.IR, and cs.LG

Abstract: Adapting language models (LMs) to novel domains is often achieved by fine-tuning a pre-trained LM (PLM) on domain-specific data. Fine-tuning introduces new knowledge into an LM, enabling it to comprehend and efficiently perform a target domain task. Fine-tuning can, however, be inadvertently insensitive if it ignores the wide array of disparities (e.g., in word meaning) between source and target domains. For instance, words such as chronic and pressure may be treated lightly in social conversations; clinically, however, these words are usually an expression of concern. To address insensitive fine-tuning, we propose Mask Specific Language Modeling (MSLM), an approach that efficiently acquires target domain knowledge by appropriately weighting the importance of domain-specific terms (DS-terms) during fine-tuning. MSLM jointly masks DS-terms and generic words, then learns mask-specific losses by ensuring LMs incur larger penalties for inaccurately predicting DS-terms than for generic words. Our analysis shows that MSLM improves LMs' sensitivity to and detection of DS-terms. We empirically show that the optimal masking rate depends not only on the LM but also on the dataset and the length of sequences. Our proposed masking strategy outperforms advanced masking strategies such as span- and PMI-based masking.
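The abstract describes MSLM as a weighted masked-LM objective: DS-terms and generic words are masked jointly, and mispredicting a masked DS-term is penalized more than mispredicting a masked generic word. The PyTorch sketch below illustrates that idea only; the function name `mask_specific_loss`, the fixed scalar weights `ds_weight`/`generic_weight`, and the `is_ds_term` indicator are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def mask_specific_loss(logits, labels, is_ds_term, ds_weight=2.0, generic_weight=1.0):
    """Masked-LM loss with heavier penalties for masked domain-specific (DS) terms.

    logits:     (batch, seq_len, vocab_size) prediction scores
    labels:     (batch, seq_len) target token ids at masked positions, -100 elsewhere
    is_ds_term: (batch, seq_len) bool tensor, True where the token is a DS-term
    """
    vocab_size = logits.size(-1)
    flat_labels = labels.view(-1)
    # Per-token cross-entropy; positions labeled -100 (unmasked) contribute 0 loss
    per_token = F.cross_entropy(
        logits.view(-1, vocab_size), flat_labels,
        ignore_index=-100, reduction="none",
    )
    # Assign a larger weight to masked DS-terms than to masked generic words
    weights = torch.where(
        is_ds_term.view(-1),
        torch.full_like(per_token, ds_weight),
        torch.full_like(per_token, generic_weight),
    )
    weights = weights * (flat_labels != -100).float()  # keep only masked positions
    return (per_token * weights).sum() / weights.sum().clamp(min=1.0)
```

In a typical Hugging Face masked-LM setup, `logits` would come from `model(**batch).logits`, and `is_ds_term` could be derived by matching tokens against entity annotations or a domain lexicon; the paper itself specifies how DS-terms are identified and how the mask-specific penalties are set.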

Authors (5)
  1. Micheal Abaho (3 papers)
  2. Danushka Bollegala (84 papers)
  3. Gary Leeming (1 paper)
  4. Dan Joyce (2 papers)
  5. Iain E Buchan (1 paper)
