Medical mT5: An Open-Source Multilingual Text-to-Text LLM for The Medical Domain (2404.07613v1)

Published 11 Apr 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Research on language technology for the development of medical applications is currently a hot topic in Natural Language Understanding and Generation. Thus, a number of LLMs have recently been adapted to the medical domain, so that they can be used as a tool for mediating in human-AI interaction. While these LLMs display competitive performance on automated medical texts benchmarks, they have been pre-trained and evaluated with a focus on a single language (English mostly). This is particularly true of text-to-text models, which typically require large amounts of domain-specific pre-training data, often not easily accessible for many languages. In this paper, we address these shortcomings by compiling, to the best of our knowledge, the largest multilingual corpus for the medical domain in four languages, namely English, French, Italian and Spanish. This new corpus has been used to train Medical mT5, the first open-source text-to-text multilingual model for the medical domain. Additionally, we present two new evaluation benchmarks for all four languages with the aim of facilitating multilingual research in this domain. A comprehensive evaluation shows that Medical mT5 outperforms both encoders and similarly sized text-to-text models for the Spanish, French, and Italian benchmarks, while being competitive with current state-of-the-art LLMs in English.

Elevating Medical NLP with a Multilingual LLM: Insights from Medical mT5 Development

Introduction

Recent advances in AI and NLP have significantly improved the capabilities of LLMs in various domains, including medicine. However, most of these developments have been confined to English, leaving a notable gap in resources and tools for non-English medical texts. Addressing this imbalance, the paper presents Medical mT5, a pioneering open-source multilingual text-to-text model adapted to the medical domain through continued pre-training on data in English, Spanish, French, and Italian. The model is an encoder-decoder built on the mT5 architecture and achieves state-of-the-art performance in multilingual sequence labeling for the medical domain.
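
Because Medical mT5 keeps the standard text-to-text interface of mT5, it can be queried like any encoder-decoder checkpoint in the Hugging Face transformers library. The sketch below is illustrative only: the checkpoint identifier and the prompt are assumptions rather than details taken from the paper, so the exact released model name should be checked in the authors' repository.

```python
# Minimal sketch of querying an mT5-style text-to-text model with Hugging Face
# transformers. The checkpoint name is an assumed placeholder, not confirmed by
# the paper; substitute the identifier published in the authors' repository.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "HiTZ/Medical-mT5-large"  # assumed Hub id for the 770M-parameter variant
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Text-to-text formulation: both the task input and the expected output are strings.
prompt = "Label the disease mentions in: El paciente presenta diabetes tipo 2."
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```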

Multilingual Corpus Compilation

The foundation of Medical mT5's success lies in the assembly of a diverse and extensive multilingual corpus tailored to the medical domain. The corpus, described by the authors as the largest of its kind, encompasses 3 billion words across the four languages. It blends data from various sources, including clinical trials, PubMed articles, and medical instructions, ensuring broad coverage of the medical lexicon. Beyond enabling the training of Medical mT5, the corpus sets a new benchmark for multilingual medical NLP research; a rough sketch of what such a compilation step involves follows below.
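
The snippet below is a hypothetical illustration of the compilation step, not the authors' pipeline: it concatenates per-language source files into one training corpus and tracks word counts per language. All file names are placeholders.

```python
# Hypothetical sketch of compiling a multilingual medical corpus: concatenate
# per-language source files and report word counts. File names are placeholders,
# not the datasets used in the paper.
from pathlib import Path

sources = {
    "en": ["pubmed_abstracts_en.txt", "clinical_trials_en.txt"],
    "es": ["clinical_cases_es.txt"],
    "fr": ["medical_docs_fr.txt"],
    "it": ["medical_docs_it.txt"],
}

word_counts = {}
with open("medical_corpus.txt", "w", encoding="utf-8") as out:
    for lang, files in sources.items():
        count = 0
        for name in files:
            path = Path(name)
            if not path.exists():   # skip sources that are absent in this sketch
                continue
            text = path.read_text(encoding="utf-8")
            count += len(text.split())
            out.write(text + "\n")
        word_counts[lang] = count

print(word_counts)  # per-language word totals for the compiled corpus
```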

Medical mT5 Model Development

Building upon the mT5 framework, Medical mT5 underwent continued pre-training on the assembled multilingual medical corpus, adapting the model to recognize and process medical terminology consistently across the covered languages. Two versions were released, one with 770M parameters and another with 3B parameters, to cater to different computational budgets and use cases. Both sizes keep hardware requirements for training and inference comparatively low, making the model accessible to a broad range of researchers and practitioners.
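
Continued pre-training reuses the same denoising objective as mT5 itself: spans of the input are replaced by sentinel tokens and the decoder reconstructs them. The following is a deliberately simplified, single-span sketch of that objective, using the public google/mt5-small checkpoint as a stand-in; real preprocessing masks multiple spans (typically around 15% of tokens in T5-style training) and streams the full corpus.

```python
# Simplified sketch of one T5-style span-corruption training step, as used for
# continued pre-training of encoder-decoder models like mT5. Only a single
# two-word span is masked here; real preprocessing masks multiple spans and
# iterates over the whole multilingual corpus.
import random
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")  # public stand-in checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

def corrupt(text: str):
    """Mask one contiguous two-word span with a sentinel token."""
    words = text.split()
    i = random.randrange(len(words) - 1)
    span = " ".join(words[i:i + 2])
    source = " ".join(words[:i] + ["<extra_id_0>"] + words[i + 2:])
    target = f"<extra_id_0> {span} <extra_id_1>"
    return source, target

source, target = corrupt("The patient was prescribed metformin for type 2 diabetes.")
inputs = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids
loss = model(**inputs, labels=labels).loss  # denoising loss for this single example
loss.backward()  # an optimizer step would follow in a real training loop
```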

Benchmark Creation and Evaluation

To effectively gauge Medical mT5's performance, the research contributes two new multilingual datasets for sequence labeling and generative question answering in the medical domain. These benchmarks challenge the model across multiple tasks, including Argument Mining and Abstractive Question Answering, enabling a rigorous and comprehensive evaluation. Medical mT5 surpassed similarly sized models on the Spanish, French, and Italian benchmarks and achieved competitive results in English, showcasing its robustness and versatility across languages.
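
For the sequence-labeling side of such an evaluation, entity-level F1 (for instance via the seqeval package) is the standard way to score predictions against gold BIO tags. The snippet below illustrates only the scoring step, on toy Argument Mining-style tags rather than the paper's actual data or results.

```python
# Illustrative scoring step for a sequence-labeling benchmark: entity-level F1
# computed with seqeval on BIO tags. The tag sequences are toy examples with
# Argument Mining-style classes, not data or results from the paper.
from seqeval.metrics import f1_score, classification_report

gold = [["O", "B-Claim", "I-Claim", "O", "B-Premise", "I-Premise", "I-Premise"]]
pred = [["O", "B-Claim", "I-Claim", "O", "B-Premise", "I-Premise", "O"]]

print(f"entity-level F1: {f1_score(gold, pred):.3f}")
print(classification_report(gold, pred))
```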

Implications and Future Directions

The implications of this research extend far beyond its immediate achievements. Medical mT5 paves the way for more inclusive and equitable medical NLP applications, breaking the English-centric mold that has dominated the field. It highlights the importance of developing multilingual tools that can support medical professionals and patients across diverse linguistic backgrounds. Looking ahead, this work could inspire further efforts to expand the corpus to include more languages and refine the model to tackle a broader range of medical NLP tasks.

Conclusion

The development of Medical mT5 marks a significant step forward in multilingual NLP for the medical domain. By leveraging a vast multilingual medical corpus, this model not only achieves state-of-the-art results in sequence labeling and question answering but also demonstrates the feasibility and importance of extending NLP research and applications to non-English languages in the medical field. Future research will undoubtedly build on this foundation, further enhancing the capabilities of NLP technologies to serve global medical communities.

Authors (13)
  1. Iker García-Ferrero (14 papers)
  2. Rodrigo Agerri (41 papers)
  3. Aitziber Atutxa Salazar (2 papers)
  4. Elena Cabrio (11 papers)
  5. Iker de la Iglesia (5 papers)
  6. Alberto Lavelli (6 papers)
  7. Bernardo Magnini (15 papers)
  8. Benjamin Molinet (1 paper)
  9. Johana Ramirez-Romero (1 paper)
  10. German Rigau (30 papers)
  11. Jose Maria Villa-Gonzalez (2 papers)
  12. Serena Villata (12 papers)
  13. Andrea Zaninello (3 papers)
Citations (11)