
How Well Do Multi-modal LLMs Interpret CT Scans? An Auto-Evaluation Framework for Analyses (2403.05680v2)

Published 8 Mar 2024 in cs.AI, cs.CL, and cs.CV

Abstract: Automatically interpreting CT scans can ease the workload of radiologists. However, this is challenging, mainly due to the scarcity of adequate datasets and reference standards for evaluation. This study aims to bridge this gap by introducing a novel evaluation framework named "GPTRadScore". This framework assesses the capabilities of multi-modal LLMs, such as GPT-4 with Vision (GPT-4V), Gemini Pro Vision, LLaVA-Med, and RadFM, in generating descriptions for prospectively identified findings. By employing a decomposition technique based on GPT-4, GPTRadScore compares these generated descriptions with gold-standard report sentences, analyzing their accuracy in terms of body part, location, and type of finding. Evaluations demonstrated a high correlation with clinician assessments and highlighted its advantages over traditional metrics such as BLEU, METEOR, and ROUGE. Furthermore, to support future studies, we plan to release a benchmark dataset annotated by clinicians. Using GPTRadScore, we found that while GPT-4V and Gemini Pro Vision fared better than the other models, their performance still left significant room for improvement, primarily due to limitations in the datasets used to train these models. To demonstrate this potential, RadFM was fine-tuned, resulting in significant accuracy improvements: location accuracy rose from 3.41% to 12.8%, body part accuracy from 29.12% to 53%, and type accuracy from 9.24% to 30%, thereby validating our hypothesis.
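The core scoring idea described in the abstract — decompose each finding description into body part, location, and type, then compare the generated description against the gold-standard sentence attribute by attribute — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the paper performs the decomposition step with GPT-4, which is mocked here with pre-decomposed dictionaries, and all function and field names are assumptions.

```python
# Hedged sketch of the GPTRadScore comparison step. In the paper, a
# GPT-4-based decomposition turns each free-text finding description
# into (body part, location, type) attributes; here the decomposition
# is assumed to have already happened, and we only score the match.

def attribute_accuracy(gold_findings, generated_findings):
    """Per-attribute accuracy over paired (gold, generated) findings.

    Each finding is a dict with illustrative keys
    'body_part', 'location', and 'type'.
    """
    attrs = ("body_part", "location", "type")
    correct = {a: 0 for a in attrs}
    for gold, gen in zip(gold_findings, generated_findings):
        for a in attrs:
            # Naive exact match after lowercasing; the real framework
            # judges equivalence with GPT-4 rather than string equality.
            if gold[a].lower() == gen[a].lower():
                correct[a] += 1
    n = len(gold_findings)
    return {a: correct[a] / n for a in attrs}

# Toy example with two paired findings (illustrative data only).
gold = [
    {"body_part": "liver", "location": "right lobe", "type": "lesion"},
    {"body_part": "lung", "location": "left upper lobe", "type": "nodule"},
]
generated = [
    {"body_part": "liver", "location": "left lobe", "type": "lesion"},
    {"body_part": "lung", "location": "left upper lobe", "type": "mass"},
]

print(attribute_accuracy(gold, generated))
# → {'body_part': 1.0, 'location': 0.5, 'type': 0.5}
```

Reporting accuracy per attribute rather than a single aggregate is what lets the framework localize errors (e.g. correct body part but wrong lesion type), which string-overlap metrics like BLEU or ROUGE cannot do.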

Authors (10)
  1. Qingqing Zhu
  2. Benjamin Hou
  3. Tejas S. Mathai
  4. Pritam Mukherjee
  5. Qiao Jin
  6. Xiuying Chen
  7. Zhizheng Wang
  8. Ruida Cheng
  9. Ronald M. Summers
  10. Zhiyong Lu
