How Important Is Tokenization in French Medical Masked Language Models? (2402.15010v2)
Abstract: Subword tokenization has become the prevailing standard in NLP in recent years, primarily due to the widespread use of pre-trained LLMs. This shift began with Byte-Pair Encoding (BPE) and was later followed by the adoption of SentencePiece and WordPiece. While subword tokenization consistently outperforms character- and word-level tokenization, the precise factors contributing to its success remain unclear. Key aspects such as the optimal segmentation granularity for diverse tasks and languages, the influence of data sources on tokenizers, and the role of morphological information in Indo-European languages remain insufficiently explored. This is particularly pertinent for biomedical terminology, which is characterized by specific rules governing morpheme combinations. Despite the agglutinative nature of biomedical terminology, existing LLMs do not explicitly incorporate this knowledge, leading to inconsistent tokenization strategies for common terms. In this paper, we investigate the complexities of subword tokenization in the French biomedical domain across a variety of NLP tasks and pinpoint areas where further improvements can be made. We analyze classical tokenization algorithms, including BPE and SentencePiece, and introduce an original tokenization strategy that integrates morpheme-enriched word segmentation into existing tokenization methods.
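The abstract contrasts frequency-driven subword segmentation (BPE) with morpheme-aware splits of biomedical terms. As a rough illustration of the mechanism at issue, here is a minimal toy BPE learner in the style of Gage (1994) and Sennrich et al. (2016); the corpus, merge count, and end-of-word marker are invented for the example and this is not the paper's proposed method:

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge rules from a list of words by repeatedly
    merging the most frequent adjacent symbol pair."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    vocab = Counter(tuple(word) + ("</w>",) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the vocabulary with the chosen pair merged.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

def segment(word, merges):
    """Segment a new word by applying the learned merges in order."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols
```

Because the merge order is driven purely by corpus frequency, a term such as "gastrite" may end up split somewhere other than the morpheme boundary gastr|ite, depending on what co-occurs in the training data; this is the kind of inconsistency for common biomedical terms that the paper examines.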
Authors: Yanis Labrak, Adrien Bazoge, Beatrice Daille, Mickael Rouvier, Richard Dufour