BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains (2402.10373v3)

Published 15 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs have demonstrated remarkable versatility in recent years, offering potential applications across specialized domains such as healthcare and medicine. Despite the availability of various open-source LLMs tailored for health contexts, adapting general-purpose LLMs to the medical domain presents significant challenges. In this paper, we introduce BioMistral, an open-source LLM tailored for the biomedical domain, utilizing Mistral as its foundation model and further pre-trained on PubMed Central. We conduct a comprehensive evaluation of BioMistral on a benchmark comprising 10 established medical question-answering (QA) tasks in English. We also explore lightweight models obtained through quantization and model merging approaches. Our results demonstrate BioMistral's superior performance compared to existing open-source medical models and its competitive edge against proprietary counterparts. Finally, to address the limited availability of data beyond English and to assess the multilingual generalization of medical LLMs, we automatically translated and evaluated this benchmark into 7 other languages. This marks the first large-scale multilingual evaluation of LLMs in the medical domain. Datasets, multilingual evaluation benchmarks, scripts, and all the models obtained during our experiments are freely released.

Enhancing Medical Domain Understanding with BioMistral: Open-Source Pretrained LLMs

Introduction to BioMistral

The paper presents BioMistral, a collection of open-source pretrained LLMs optimized for applications within the medical domain. Built on the Mistral foundation model and further pre-trained on PubMed Central, BioMistral represents a significant step towards making robust, domain-specific NLP capabilities more accessible to researchers and practitioners in healthcare and medicine.

Distinctive Features of BioMistral

BioMistral introduces several innovations and improvements over existing medical LLMs:

  • Tailored Domain Optimization: Further pre-training Mistral on a curated subset of PubMed Central yields superior performance across a wide array of medical QA tasks.
  • Multilingual Evaluation: The benchmark of 10 medical QA tasks is translated into seven additional languages, enabling an assessment of the multilingual efficacy of medical LLMs at a scale previously unexplored.
  • Efficiency through Quantization: Through quantization and model merging techniques, the BioMistral models retain strong performance while remaining light enough to deploy on consumer-grade hardware (a loading sketch follows this list).
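
As an illustration of the deployment point above, below is a minimal sketch of loading a quantized 7B checkpoint on consumer-grade hardware through the Hugging Face transformers API. The 4-bit bitsandbytes configuration is a stand-in assumption on my part; the paper itself evaluates AWQ among other quantization schemes, and the repo id is assumed for illustration.

```python
# Illustrative only: loading a BioMistral checkpoint in 4-bit so it fits on a
# consumer-grade GPU. The repo id and quantization settings are assumptions,
# not taken from the paper (which studies AWQ among other schemes).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "BioMistral/BioMistral-7B"  # assumed Hugging Face Hub repo id

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit NF4 weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

prompt = "Question: What is the first-line treatment for type 2 diabetes?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```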

Comprehensive Evaluation

BioMistral underwent a rigorous evaluation on a benchmark comprising 10 medical QA tasks. It demonstrated statistically significant improvements over other open-source medical models and remains competitive with proprietary models. In multilingual settings, performance drops noticeably compared to English, yet the BioMistral models still outperform existing open-source alternatives, underscoring their robustness across linguistic boundaries.

The Mechanics of Model Adaptation

The adaptation method further pre-trains Mistral on a corpus drawn from the PMC Open Access Subset to embed biomedical specificity into BioMistral. This process, aimed at deepening the model's understanding of complex medical contexts, uses the AdamW optimizer for training while retaining Mistral's architectural features such as Grouped-Query Attention.
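
As a rough illustration of the setup described above (not the authors' actual training code), the following sketch continues pre-training a Mistral checkpoint on biomedical text with AdamW; Grouped-Query Attention requires no extra code here because it is part of Mistral's architecture. The base checkpoint, hyperparameters, and data iterator are illustrative assumptions.

```python
# A minimal sketch (not the authors' training code) of further pre-training a
# Mistral checkpoint on biomedical text with AdamW. The base checkpoint,
# learning rate, and data iterator are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "mistralai/Mistral-7B-Instruct-v0.1"   # assumed base checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16).to(device)
model.train()

def pmc_batches():
    """Stand-in for an iterator over tokenized PMC Open Access passages,
    already chunked to the model's context length."""
    text = "Placeholder biomedical passage from the PMC Open Access Subset."
    enc = tokenizer(text, return_tensors="pt").to(device)
    yield {"input_ids": enc["input_ids"], "labels": enc["input_ids"].clone()}

# AdamW (decoupled weight decay), as is standard for causal LM pre-training.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

for batch in pmc_batches():
    loss = model(**batch).loss        # next-token prediction loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```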

Model Merging and Quantization Strategies

Model merging experiments using techniques such as SLERP and TIES indicated that combining specialized and general-domain models can improve performance and generalization. Furthermore, experiments with activation-aware weight quantization (AWQ) and other strategies underscore the potential for deploying BioMistral on devices with limited computational resources without significant loss in performance.
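
To make the merging idea concrete, here is a toy sketch of SLERP applied parameter by parameter to two checkpoints with identical architectures. It illustrates the spherical interpolation itself rather than the exact merging pipeline used in the paper; the fallback for nearly parallel weights and the tensor flattening are assumptions of this sketch.

```python
# A minimal sketch of SLERP model merging on matched parameter tensors, assuming
# both checkpoints share the same architecture and state-dict keys. This is an
# illustration of the interpolation, not the paper's exact merging tooling.
import torch

def slerp(p: torch.Tensor, q: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
    """Spherical linear interpolation between two weight tensors."""
    p_flat, q_flat = p.flatten().float(), q.flatten().float()
    p_unit = p_flat / (p_flat.norm() + eps)
    q_unit = q_flat / (q_flat.norm() + eps)
    cos_theta = torch.clamp(torch.dot(p_unit, q_unit), -1.0, 1.0)
    theta = torch.acos(cos_theta)
    if theta.abs() < 1e-4:                       # nearly parallel: fall back to LERP
        merged = (1 - t) * p_flat + t * q_flat
    else:
        merged = (torch.sin((1 - t) * theta) * p_flat
                  + torch.sin(t * theta) * q_flat) / torch.sin(theta)
    return merged.reshape(p.shape).to(p.dtype)

def merge_state_dicts(general_sd: dict, medical_sd: dict, t: float = 0.5) -> dict:
    """Interpolate every shared parameter between a general and a domain model."""
    return {name: slerp(general_sd[name], medical_sd[name], t) for name in general_sd}
```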

Practical Implications and Future Prospects

BioMistral holds promise for a variety of applications in healthcare and medicine, from enhancing medical literature search capabilities to facilitating patient care through improved understanding of medical queries. Its open-source nature invites further experimentation and adaptation by the global research community. The work paves the way for future developments, particularly in advancing model calibration, reliability, and multilingual capabilities, as well as exploring domain-specific adaptations beyond the sphere of medicine.

Key Contributions

  • Domain-Specific Pretraining: Leveraging PubMed Central to train Mistral model variants tailored for the biomedical domain.
  • Multilingual Benchmark Creation: Extending the evaluation of medical LLMs to additional languages.
  • Advanced Model Quantization: Applying quantization techniques that reduce memory and compute requirements without significant loss in accuracy.

Conclusion

BioMistral represents a significant advancement in the development of domain-specific LLMs for the biomedical field, showing marked improvements over existing models across a range of metrics. By combining the foundational strengths of Mistral with advanced pre-training and model optimization techniques, BioMistral emerges as a powerful tool for researchers and practitioners working at the intersection of AI and healthcare. The open-source release of datasets, benchmarks, and models underlines the authors' commitment to transparency and collaboration in advancing the state of the art in medical NLP.

Authors
  1. Yanis Labrak
  2. Adrien Bazoge
  3. Emmanuel Morin
  4. Pierre-Antoine Gourraud
  5. Mickael Rouvier
  6. Richard Dufour