Generalist embedding models are better at short-context clinical semantic search than specialized embedding models (2401.01943v2)

Published 3 Jan 2024 in cs.CL and cs.AI

Abstract: The increasing use of tools and solutions based on LLMs for various tasks in the medical domain has become a prominent trend. Their use in this highly critical and sensitive domain has thus raised important questions about their robustness, especially in response to variations in input, and the reliability of the generated outputs. This study addresses these questions by constructing a textual dataset based on the ICD-10-CM code descriptions, widely used in US hospitals and containing many clinical terms, and their easily reproducible rephrasing. We then benchmarked existing embedding models, either generalist or specialized in the clinical domain, in a semantic search task where the goal was to correctly match the rephrased text to the original description. Our results showed that generalist models performed better than clinical models, suggesting that existing clinical specialized models are more sensitive to small changes in input that confuse them. The highlighted problem of specialized models may be due to the fact that they have not been trained on sufficient data, and in particular on datasets that are not diverse enough to have a reliable global language understanding, which is still necessary for accurate handling of medical documents.

Introduction

In medical informatics, embedding models are fundamental tools for semantic search, a task central to retrieving clinical information from large document collections. Such models convert text into numerical vectors that can be compared to find the most similar pieces of content. This study compares generalist embedding models with models specialized for the clinical domain, examining their performance in a semantic search task over diagnostic descriptions drawn from ICD-10-CM codes.
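
To make that setup concrete, below is a minimal sketch of embedding-based semantic search using the sentence-transformers library. The model name and example strings are illustrative placeholders, not the paper's exact models or data.

```python
# Minimal sketch of embedding-based semantic search (illustrative only;
# the model and strings are placeholders, not the paper's configuration).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example generalist model

descriptions = [
    "Type 2 diabetes mellitus without complications",
    "Essential (primary) hypertension",
]
query = "patient presents with uncomplicated type II diabetes"

desc_emb = model.encode(descriptions, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

# Cosine similarity between the query vector and every description vector;
# the highest-scoring description is returned as the match.
scores = util.cos_sim(query_emb, desc_emb)[0]
best = int(scores.argmax())
print(descriptions[best], float(scores[best]))
```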

Methodology and Dataset

The ICD-10-CM codes, a cornerstone of diagnosis coding in U.S. hospital systems, provided the foundation for this study. The dataset consists of 100 ICD-10-CM codes, each paired with its main description and ten reformulations intended to simulate the varied wording found in genuine medical documents. The reformulations were generated with ChatGPT 3.5 Turbo, prompted to deliberately diverge from the original descriptions. The selected models were then evaluated by using these rephrasings as queries in a semantic search task and matching them to the appropriate ICD-10-CM code description.
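
As a hedged sketch of how such a benchmark can be scored (the dataset below is a toy stand-in, not the actual ICD-10-CM data), each original description is embedded once, every rephrasing is used as a query, and a retrieval counts as correct when the nearest description belongs to the rephrasing's own code.

```python
# Toy sketch of the matching task: rephrasings act as queries that must
# retrieve their own code's official description. Data is hypothetical.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model

# {code: (official description, [rephrasings])}
dataset = {
    "E11.9": ("Type 2 diabetes mellitus without complications",
              ["uncomplicated type II diabetes", "T2DM, no complications noted"]),
    "I10": ("Essential (primary) hypertension",
            ["primary high blood pressure", "essential HTN"]),
}

codes = list(dataset)
desc_emb = model.encode([dataset[c][0] for c in codes], convert_to_tensor=True)

hits = total = 0
for code, (_, rephrasings) in dataset.items():
    query_emb = model.encode(rephrasings, convert_to_tensor=True)
    # For every rephrasing, pick the description with the highest cosine similarity.
    nearest = util.cos_sim(query_emb, desc_emb).argmax(dim=1)
    hits += sum(codes[i] == code for i in nearest.tolist())
    total += len(rephrasings)

print(f"exact matching rate: {hits / total:.1%}")
```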

Two conditions governed model selection: CPU-only operability, to keep the setup widely accessible and cost-effective, and availability as free, commonly used models from established repositories.
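
In practice these constraints amount to loading freely available checkpoints from a public hub and pinning inference to the CPU; the sketch below shows one way to do this (the model identifier is an example, not a model from the paper's list).

```python
# Sketch: load a free, commonly used embedding model and force CPU execution,
# reflecting the accessibility constraint above. The model name is an example.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")
vector = model.encode("Acute upper respiratory infection, unspecified")
print(vector.shape)  # (384,) for this particular model
```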

Results

Generalist models such as jina-embeddings-v2-base-en outperformed their specialized counterparts by significant margins on exact matching, category matching, and character error rate (CER). The leading generalist model reached an exact matching rate of 84.0%, well above the best-performing specialized model, ClinicalBERT, at 64.4%. The outcome paints a nuanced picture: although clinical embedding models are honed for medical terminology, the generalist models, with their exposure to a broader linguistic landscape, proved more resilient to variations in clinical text.
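
A hedged sketch of how the three reported metrics could be computed is shown below. "Category" is assumed here to mean the three-character ICD-10-CM category prefix (e.g. "E11" for "E11.9"), which may differ from the paper's exact definition, and CER is taken as Levenshtein distance divided by reference length.

```python
# Sketch of the evaluation metrics (assumed definitions, see note above).
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings via standard dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def score(predictions, references):
    """Both arguments are lists of (code, description) pairs, aligned by query."""
    n = len(references)
    exact = sum(p[0] == r[0] for p, r in zip(predictions, references)) / n
    category = sum(p[0][:3] == r[0][:3] for p, r in zip(predictions, references)) / n
    cer = sum(levenshtein(p[1], r[1]) / len(r[1])
              for p, r in zip(predictions, references)) / n
    return {"exact_match": exact, "category_match": category, "cer": cer}
```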

Conclusion and Implications

This head-to-head comparison positions generalist models as more adept at short-context clinical semantic search than their specialized counterparts. The breadth of their training data, including non-medical content, appears to give them greater versatility in handling the nuanced language found in healthcare settings. The findings align with ongoing discussions of LLM utility in clinical applications, suggesting that for certain tasks a robust general language understanding may be more valuable than specialized knowledge. Future research may explore wider or deeper contexts, for instance full-length medical documents, or benchmark newer, more advanced models. The study suggests that refining LLMs for medical use may depend on their ability to navigate a diverse range of human language.

Authors (6)
  1. Jean-Baptiste Excoffier
  2. Tom Roehr
  3. Alexei Figueroa
  4. Keno Bressem
  5. Matthieu Ortala
  6. Jens-Michalis Papaioannou