Adaptation of Biomedical and Clinical Pretrained Models to French Long Documents: A Comparative Study (2402.16689v1)
Abstract: Recently, pretrained language models based on BERT have been introduced for the French biomedical domain. Although these models have achieved state-of-the-art results on biomedical and clinical NLP tasks, they are constrained by a limited input sequence length of 512 tokens, which poses challenges when applied to clinical notes. In this paper, we present a comparative study of three adaptation strategies for long-sequence models, leveraging the Longformer architecture. We evaluated these models on 16 downstream tasks spanning both biomedical and clinical domains. Our findings reveal that further pre-training an English clinical model on French biomedical texts can outperform both converting a French biomedical BERT to the Longformer architecture and pre-training a French biomedical Longformer from scratch. The results underscore that long-sequence French biomedical models improve performance across most downstream tasks regardless of sequence length, but BERT-based models remain the most efficient for named entity recognition tasks.
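One of the three strategies compared above, converting an existing French biomedical BERT to the Longformer architecture, typically reuses the pretrained weights and tiles the 512-position embedding table up to the new maximum length (e.g., 4096). The sketch below illustrates only that position-embedding step with plain Python lists; the function name and toy dimensions are illustrative assumptions, not code from the paper.

```python
def extend_position_embeddings(short_emb, new_max_len):
    """Tile a short position-embedding table (e.g. 512 rows from a BERT
    checkpoint) so it covers new_max_len positions (e.g. 4096 for a
    Longformer-style model). Each vector is simply reused cyclically;
    the copies are then fine-tuned during further pre-training."""
    old_len = len(short_emb)
    return [short_emb[i % old_len] for i in range(new_max_len)]

# Toy example: 512 positions, 4-dimensional embeddings.
emb512 = [[float(i)] * 4 for i in range(512)]
emb4096 = extend_position_embeddings(emb512, 4096)

assert len(emb4096) == 4096
assert emb4096[512] == emb512[0]   # second tile repeats the first
```

In practice the copied table is loaded back into the long-sequence model's embedding layer and the self-attention is swapped for sliding-window attention; the tiling above is the piece that lets the 512-token checkpoint initialize a longer-context model.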
- Adrien Bazoge
- Emmanuel Morin
- Beatrice Daille
- Pierre-Antoine Gourraud