Exploring Multimodal Large Language Models for Radiology Report Error-checking (2312.13103v2)
Abstract: This paper proposes one of the first clinical applications of multimodal LLMs as an assistant for radiologists to check errors in their reports. We created an evaluation dataset from real-world radiology datasets (including X-rays and CT scans). A subset of the original reports was modified to contain synthetic errors by introducing three types of mistakes: "insert", "remove", and "substitute". The evaluation had two difficulty levels: SIMPLE for binary error-checking and COMPLEX for identifying error types. At the SIMPLE level, our fine-tuned model improved performance by 47.4% and 25.4% on the MIMIC-CXR and IU X-ray data, respectively. This performance boost carried over to an unseen modality, CT scans, where the model performed 19.46% better than the baseline model. The model also surpassed the domain expert's accuracy on the MIMIC-CXR dataset by 1.67%. Notably, on the subset (N=21) of the test set where a clinician did not reach the correct conclusion, the LLaVA ensemble model correctly identified 71.4% of these cases. However, all models performed poorly at identifying mistake types, underscoring the difficulty of the COMPLEX level. This study marks a promising step toward utilizing multimodal LLMs to enhance diagnostic accuracy in radiology. The ensemble model demonstrated comparable performance to clinicians, even capturing errors overlooked by humans.
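The abstract describes corrupting a subset of reports with "insert", "remove", and "substitute" errors. The sketch below shows one plausible way such sentence-level error injection could be implemented; the sentence splitting, the distractor pool, and the sampling strategy are illustrative assumptions, not the authors' actual pipeline.

```python
# Hypothetical sketch of synthetic error injection for report error-checking.
# Details (sentence splitting, distractor pool, sampling) are assumptions.
import random

def split_sentences(report: str) -> list[str]:
    """Naive period-based splitter; a real pipeline would use a clinical sentence tokenizer."""
    return [s.strip() for s in report.split(".") if s.strip()]

def inject_error(report: str, distractor_pool: list[str],
                 error_type: str, rng: random.Random) -> str:
    """Return a corrupted copy of `report` using one of three error types."""
    sents = split_sentences(report)
    if error_type == "insert":
        # Add an unrelated sentence drawn from other reports.
        sents.insert(rng.randint(0, len(sents)), rng.choice(distractor_pool))
    elif error_type == "remove":
        # Drop one sentence, simulating an omitted finding.
        if len(sents) > 1:
            sents.pop(rng.randrange(len(sents)))
    elif error_type == "substitute":
        # Replace one sentence with a distractor, simulating a wrong finding.
        sents[rng.randrange(len(sents))] = rng.choice(distractor_pool)
    else:
        raise ValueError(f"unknown error type: {error_type}")
    return ". ".join(sents) + "."

# Usage example with a toy report and distractor sentences.
rng = random.Random(0)
clean = "The lungs are clear. No pleural effusion. Heart size is normal."
pool = ["There is a large right pneumothorax", "Moderate cardiomegaly is present"]
for et in ("insert", "remove", "substitute"):
    print(et, "->", inject_error(clean, pool, et, rng))
```

At the SIMPLE level a model would then judge whether a report (clean or corrupted) contains an error; at the COMPLEX level it would additionally name which of the three error types was introduced.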
- Jinge Wu (18 papers)
- Yunsoo Kim (12 papers)
- Eva C. Keller (1 paper)
- Jamie Chow (1 paper)
- Adam P. Levine (2 papers)
- Nikolas Pontikos (1 paper)
- Zina Ibrahim (17 papers)
- Paul Taylor (10 papers)
- Michelle C. Williams (25 papers)
- Honghan Wu (33 papers)