Evaluating LLM-Generated Multimodal Diagnosis from Medical Images and Symptom Analysis (2402.01730v1)
Abstract: Large language models (LLMs) are a rapidly evolving, state-of-the-art Artificial Intelligence technology that promises to aid in medical diagnosis. However, the correctness and accuracy of their responses have not yet been properly evaluated. In this work, we propose an LLM evaluation paradigm that consists of two independent steps: (1) multimodal LLM evaluation via structured interactions and (2) follow-up, domain-specific analysis based on data extracted from those interactions. Using this paradigm, we (1) evaluate the correctness and accuracy of LLM-generated medical diagnoses with publicly available multimodal multiple-choice questions (MCQs) in the domain of Pathology and (2) proceed to a systematic and comprehensive analysis of the extracted results. We used GPT-4-Vision-Preview as the LLM to respond to complex medical questions consisting of both images and text, and we explored a wide range of diseases, conditions, chemical compounds, and related entity types that are included in the vast knowledge domain of Pathology. GPT-4-Vision-Preview performed quite well, giving the correct diagnosis in approximately 84% of cases. We then analyzed these findings further using Image Metadata Analysis, Named Entity Recognition, and Knowledge Graphs. This analysis revealed weaknesses of GPT-4-Vision-Preview on specific knowledge paths, leading to a better understanding of its shortcomings in specific areas. Our methodology and findings are not limited to GPT-4-Vision-Preview: a similar approach can be followed to evaluate the usefulness and accuracy of other LLMs and thus improve their use through further optimization.
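The first step described in the abstract, scoring multiple-choice answers and then looking for weak areas, can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `score_mcq`, the record fields (`topic`, `predicted`, `correct`), and the toy records are all hypothetical and are used only to show how overall accuracy and a per-topic breakdown might be computed.

```python
from collections import defaultdict


def score_mcq(records):
    """Return (overall_accuracy, per_topic_accuracy) for MCQ answers.

    Each record is a dict with hypothetical keys:
      'topic'     - a label for the knowledge area of the question
      'predicted' - the option letter chosen by the model
      'correct'   - the option letter from the answer key
    """
    totals = defaultdict(int)  # questions seen per topic
    hits = defaultdict(int)    # correct answers per topic
    for rec in records:
        topic = rec["topic"]
        totals[topic] += 1
        # Compare option letters case-insensitively, ignoring whitespace.
        if rec["predicted"].strip().upper() == rec["correct"].strip().upper():
            hits[topic] += 1

    n_questions = sum(totals.values())
    if n_questions == 0:
        raise ValueError("no records to score")

    overall = sum(hits.values()) / n_questions
    per_topic = {topic: hits[topic] / totals[topic] for topic in totals}
    return overall, per_topic


# Toy records for illustration only; these are not results from the paper.
toy_records = [
    {"topic": "cardiovascular", "predicted": "A", "correct": "A"},
    {"topic": "cardiovascular", "predicted": "b", "correct": "B"},
    {"topic": "renal", "predicted": "C", "correct": "D"},
    {"topic": "renal", "predicted": "D", "correct": "D"},
]

overall, per_topic = score_mcq(toy_records)
print(f"overall accuracy: {overall:.2f}")
for topic, acc in sorted(per_topic.items()):
    print(f"{topic}: {acc:.2f}")
```

Grouping by a topic label is only a simple stand-in for the knowledge-path analysis the abstract mentions; the paper's follow-up step relies on Named Entity Recognition and Knowledge Graphs, which this sketch does not attempt.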
- Dimitrios P. Panagoulias
- Maria Virvou
- George A. Tsihrintzis