A Vision-Language Foundation Model to Enhance Efficiency of Chest X-ray Interpretation (2401.12208v2)

Published 22 Jan 2024 in cs.CV and cs.CL

Abstract: Over 1.4 billion chest X-rays (CXRs) are performed annually due to their cost-effectiveness as an initial diagnostic test. This scale of radiological studies provides a significant opportunity to streamline CXR interpretation and documentation. While foundation models are a promising solution, the lack of publicly available large-scale datasets and benchmarks inhibits their iterative development and real-world evaluation. To overcome these challenges, we constructed a large-scale dataset (CheXinstruct), which we utilized to train a vision-language foundation model (CheXagent). We systematically demonstrated competitive performance across eight distinct task types on our novel evaluation benchmark (CheXbench). Beyond technical validation, we assessed the real-world utility of CheXagent in directly drafting radiology reports. Our clinical assessment with eight radiologists revealed a 36% time saving for residents using CheXagent-drafted reports, while attending radiologists showed no significant time difference editing resident-drafted or CheXagent-drafted reports. The CheXagent-drafted reports improved the writing efficiency of both radiology residents and attending radiologists in 81% and 61% of cases, respectively, without loss of quality. Overall, we demonstrate that CheXagent can effectively perform a variety of CXR interpretation tasks and holds potential to assist radiologists in routine clinical workflows.

Introduction

Automated chest X-ray (CXR) interpretation has advanced significantly with the advent of vision-language foundation models (FMs). Despite this progress, CXR interpretation remains challenging for three primary reasons: the scarcity of large-scale vision-language medical datasets, the limitations of current vision and language encoders in medical contexts, and the lack of comprehensive evaluation frameworks for benchmarking the CXR interpretation abilities of FMs.

Foundation Model for CXR Interpretation

The CheXagent project directly addresses these challenges. At its core is an instruction-tuned FM for CXR interpretation, trained on a foundation dataset named CheXinstruct. CheXinstruct consists of instruction-image-answer triplets drawn from diverse sources and is designed to enhance an FM's capacity to understand and interpret CXRs. CheXagent itself integrates three components: a clinical LLM trained to parse radiology reports, a vision encoder that represents CXR images, and a bridging network that unites the vision and language modalities.
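
As a rough illustration of this three-component design, the sketch below wires a generic vision encoder, a linear projection bridge, and a decoder-only language model together in PyTorch. The class, argument names, and the `inputs_embeds` interface are assumptions for illustration; the actual CheXagent components (its clinical LLM, CXR vision encoder, and bridging network) are more involved.

```python
import torch
import torch.nn as nn


class VisionLanguageCXRModel(nn.Module):
    """Toy three-component vision-language model: a vision encoder for CXR
    images, a bridging network that projects visual features into the
    language model's embedding space, and a decoder-only language model."""

    def __init__(self, vision_encoder: nn.Module, language_model: nn.Module,
                 vision_dim: int, text_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder           # e.g. a ViT-style image encoder
        self.bridge = nn.Linear(vision_dim, text_dim)  # maps visual tokens into the LLM embedding space
        self.language_model = language_model           # clinical LLM (decoder-only)

    def forward(self, images: torch.Tensor, instruction_embeds: torch.Tensor):
        # Encode the CXR into a sequence of visual tokens: (batch, n_tokens, vision_dim).
        visual_tokens = self.vision_encoder(images)
        # Project visual tokens into the language model's embedding space.
        visual_embeds = self.bridge(visual_tokens)
        # Prepend projected visual tokens to the embedded instruction and let the
        # LLM attend over both (assumes a Hugging Face-style `inputs_embeds` argument).
        inputs = torch.cat([visual_embeds, instruction_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs)
```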

CheXbench: Systematic Evaluation

With CheXbench, the research team offers a systematic benchmark for evaluating an FM's efficacy across eight clinically relevant CXR tasks spanning image perception and textual understanding, including view classification, disease identification, and visual question answering. This evaluation shows CheXagent outperforming existing general-domain and medical-domain FMs, underscoring its robust capability in CXR interpretation.
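
For intuition on how multiple-choice tasks of this kind can be scored, the sketch below evaluates a model on benchmark examples and reports per-task accuracy. The `model.score(...)` interface and the example fields are hypothetical and do not reflect the CheXbench API.

```python
from collections import defaultdict


def evaluate_multiple_choice(model, benchmark):
    """Score multiple-choice examples and report accuracy per task.

    `model.score(image, question, option)` is a hypothetical interface that
    returns a likelihood for one candidate answer; each benchmark example is
    assumed to be a dict with "task", "image", "question", "options", "answer".
    """
    correct, total = defaultdict(int), defaultdict(int)
    for example in benchmark:
        scores = [model.score(example["image"], example["question"], option)
                  for option in example["options"]]
        # Pick the highest-scoring candidate as the model's prediction.
        prediction = example["options"][scores.index(max(scores))]
        total[example["task"]] += 1
        correct[example["task"]] += int(prediction == example["answer"])
    return {task: correct[task] / total[task] for task in total}
```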

Fairness and Future Directions

An important aspect of model development, especially in healthcare, is ensuring equitable performance across demographic groups. The CheXagent team conducted an in-depth fairness evaluation across factors such as sex, race, and age, highlighting areas where performance disparities exist. This analysis informs further improvements and supports transparent, equitable application of the model.
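
The subgroup breakdown behind such a fairness evaluation can be summarized in a few lines of analysis code. The sketch below assumes a per-example results table with hypothetical column names and reports accuracy per demographic subgroup; it is not the authors' evaluation pipeline.

```python
import pandas as pd


def subgroup_accuracy(results: pd.DataFrame, group_col: str) -> pd.Series:
    """Report accuracy per demographic subgroup (e.g. group_col = "sex",
    "race", or an age bin). Column names ("prediction", "label") are
    illustrative assumptions, not the authors' schema."""
    return (results.assign(correct=results["prediction"] == results["label"])
                   .groupby(group_col)["correct"]
                   .mean())


# Hypothetical usage:
# df = pd.DataFrame({"prediction": preds, "label": labels, "sex": sexes})
# print(subgroup_accuracy(df, "sex"))
```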

In summary, the development and assessment of CheXagent represent a significant step toward a capable FM for radiology. The combination of CheXinstruct and CheXbench enables the training and evaluation of a model that demonstrates substantial improvements in CXR interpretation, supported by expert radiologist evaluations and fairness assessments. With these tools and datasets publicly available, the work paves the way for further advances in AI-assisted medical image interpretation.

Authors (23)
  1. Zhihong Chen (63 papers)
  2. Maya Varma (17 papers)
  3. Jean-Benoit Delbrouck (29 papers)
  4. Magdalini Paschali (27 papers)
  5. Louis Blankemeier (16 papers)
  6. Dave Van Veen (11 papers)
  7. Jeya Maria Jose Valanarasu (31 papers)
  8. Alaa Youssef (7 papers)
  9. Joseph Paul Cohen (50 papers)
  10. Eduardo Pontes Reis (8 papers)
  11. Emily B. Tsai (1 paper)
  12. Andrew Johnston (10 papers)
  13. Cameron Olsen (2 papers)
  14. Tanishq Mathew Abraham (6 papers)
  15. Sergios Gatidis (35 papers)
  16. Akshay S. Chaudhari (28 papers)
  17. Justin Xu (7 papers)
  18. Christian Bluethgen (20 papers)
  19. Stephan Altmayer (1 paper)
  20. Mohamed Siddig Eltayeb Muneer (1 paper)
Citations (40)