MAIRA-2: Grounded Radiology Report Generation (2406.04449v2)

Published 6 Jun 2024 in cs.CL and cs.CV

Abstract: Radiology reporting is a complex task requiring detailed medical image understanding and precise language generation, for which generative multimodal models offer a promising solution. However, to impact clinical practice, models must achieve a high level of both verifiable performance and utility. We augment the utility of automated report generation by incorporating localisation of individual findings on the image - a task we call grounded report generation - and enhance performance by incorporating realistic reporting context as inputs. We design a novel evaluation framework (RadFact) leveraging the logical inference capabilities of LLMs to quantify report correctness and completeness at the level of individual sentences, while supporting the new task of grounded reporting. We develop MAIRA-2, a large radiology-specific multimodal model designed to generate chest X-ray reports with and without grounding. MAIRA-2 achieves state of the art on existing report generation benchmarks and establishes the novel task of grounded report generation.

Overview of the Paper "MAIRA-2: Grounded Radiology Report Generation"

The paper "MAIRA-2: Grounded Radiology Report Generation" presents significant advancements in the field of AI-driven radiology by introducing a new approach for generating detailed, grounded radiology reports. Building on the complexity inherent in radiology reporting, the authors propose a model capable of not only generating free-text radiology reports but also localizing individual findings directly within the image, a task they term as "grounded report generation."

Key Contributions

  1. Grounded Report Generation: The paper introduces the task of grounded report generation. Unlike previous methods that produce only free-text reports, this task requires the model to also localize individual findings within the image by generating bounding boxes for each described finding, allowing precise localization and thereby increasing the transparency and utility of automated reports (a minimal sketch of this output format appears after this list).
  2. Evaluation Framework - RadFact: To evaluate grounded reports, the authors develop a novel evaluation framework named RadFact. It leverages the logical inference capabilities of LLMs to assess the factual accuracy of generated sentences and the correctness of their spatial localizations. By considering both logical entailment and spatial entailment of findings, RadFact offers a comprehensive evaluation of text and grounding quality (a simplified version of this scoring is sketched after this list).
  3. MAIRA-2 Model: The proposed model, MAIRA-2, combines a radiology-specific image encoder with an LLM to perform grounded report generation. MAIRA-2 goes beyond previous approaches by conditioning on additional inputs: the current frontal image, the current lateral image, the prior frontal image, the prior report, and specific sections of the current report (Indication, Technique, and Comparison). This comprehensive set of inputs allows the model to generate more accurate and clinically relevant reports (the first sketch after this list also illustrates this input bundle).
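
To make the grounded output format and the expanded input set concrete, here is a minimal, hypothetical sketch of how they might be represented in code. The class names and fields below are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Tuple

# Bounding box as (x_min, y_min, x_max, y_max), normalised to [0, 1] (an assumption).
Box = Tuple[float, float, float, float]

@dataclass
class GroundedFinding:
    """One finding sentence, optionally localised with bounding boxes on the current frontal image."""
    text: str
    boxes: List[Box] = field(default_factory=list)  # empty if the finding is not groundable

@dataclass
class StudyInputs:
    """The reporting context MAIRA-2 conditions on; optional fields may be absent for a given study."""
    current_frontal: Any                    # current frontal chest X-ray (image array/tensor)
    current_lateral: Optional[Any] = None   # current lateral view, if acquired
    prior_frontal: Optional[Any] = None     # most recent prior frontal image
    prior_report: Optional[str] = None      # report accompanying the prior study
    indication: Optional[str] = None        # Indication section of the current report
    technique: Optional[str] = None         # Technique section
    comparison: Optional[str] = None        # Comparison section

# A grounded report is then a list of finding sentences, each with zero or more boxes,
# rather than a single free-text string.
GroundedReport = List[GroundedFinding]
```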

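Following on from the RadFact description above, the sketch below illustrates one way sentence-level logical precision and recall could be computed with an LLM entailment judge. `llm_entails` is a hypothetical callable standing in for the LLM-based checker described in the paper, and the spatial check is omitted.

```python
from typing import Callable, List

def radfact_style_scores(
    generated: List[str],
    reference: List[str],
    llm_entails: Callable[[str, str], bool],  # hypothetical judge: llm_entails(premise, hypothesis) -> bool
) -> dict:
    """Sentence-level logical precision/recall in the spirit of RadFact (text only, no spatial check)."""
    ref_text = " ".join(reference)
    gen_text = " ".join(generated)

    # Precision: how many generated sentences are supported by the reference report?
    precision = (
        sum(llm_entails(ref_text, s) for s in generated) / len(generated) if generated else 0.0
    )
    # Recall: how many reference sentences are covered by the generated report?
    recall = (
        sum(llm_entails(gen_text, s) for s in reference) / len(reference) if reference else 0.0
    )
    return {"logical_precision": precision, "logical_recall": recall}
```

In the full framework, boxes attached to logically entailed sentences would additionally be checked for spatial entailment against the reference boxes, which this sketch leaves out.
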
Quantitative Results and Comparisons

MAIRA-2 establishes new benchmarks in radiology report generation. On tasks without grounding, the model sets a new state of the art in findings generation on the MIMIC-CXR dataset. Leveraging the broader set of inputs, including historical images and reports, significantly enhances the quality of generated reports and reduces hallucinations. The model demonstrates superior performance across various metrics:

  • RadFact Logical Precision: The logical precision for MAIRA-2 reaches 52.5%-55.6%, indicating the model's ability to generate findings that are logically consistent with the reference texts.
  • Clinical Metrics: The model achieves substantial improvements in RadGraph-F1 and CheXbert vector similarity, indicating enhanced clinical accuracy of the generated reports.
  • Phrase Grounding: On the test split of the MS-CXR dataset, MAIRA-2 achieves competitive mean intersection-over-union (mIoU) scores, demonstrating its ability to generate precise bounding boxes for findings (a minimal IoU computation is sketched below).
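
For context on the phrase-grounding numbers, the following is a minimal sketch of an IoU / mean-IoU computation over axis-aligned boxes. It assumes a one-to-one pairing of predicted and reference boxes and is not the paper's evaluation code.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned bounding boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mean_iou(predicted: List[Box], reference: List[Box]) -> float:
    """Mean IoU over paired predicted/reference boxes (assumes a one-to-one pairing)."""
    pairs = list(zip(predicted, reference))
    return sum(iou(p, r) for p, r in pairs) / len(pairs) if pairs else 0.0
```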

Implications and Future Directions

The implications of this research extend across both practical and theoretical dimensions in the field of AI and radiology. Practically, the ability to generate grounded reports can significantly improve the efficiency and accuracy of radiology workflows. Grounded reports offer clarity in image interpretation, assist in diagnosis, and support non-radiologist clinicians in understanding radiological findings through visual localization.

From a theoretical perspective, the integration of comprehensive inputs and the use of LLMs for logical entailment verification exemplifies the synergy between multimodal learning and natural language processing. The RadFact framework represents a significant step forward in evaluating complex AI outputs, enabling more nuanced and accurate assessments of model performance.

Future developments in this area may include refining the grounding precision by exploring different model architectures or training paradigms that can better utilize the rich contextual information provided by prior studies and multiple view images. Additionally, extending the grounded reporting task to other imaging modalities and exploring its impact in real-world clinical settings can offer deeper insights and further advancements in automated medical reporting.

Conclusion

The introduction of MAIRA-2 and the concept of grounded report generation represent a noteworthy advancement in AI for radiology. By addressing both text generation and spatial localization, the authors set a new benchmark in the field, demonstrating the feasibility and utility of comprehensive, grounded radiology reporting. The RadFact framework ensures rigorous evaluation, paving the way for future innovations and clinical applications.

Authors (21)
  1. Shruthi Bannur (15 papers)
  2. Kenza Bouzid (9 papers)
  3. Daniel C. Castro (28 papers)
  4. Anton Schwaighofer (13 papers)
  5. Sam Bond-Taylor (10 papers)
  6. Maximilian Ilse (11 papers)
  7. Fernando Pérez-García (16 papers)
  8. Valentina Salvatelli (19 papers)
  9. Harshita Sharma (13 papers)
  10. Felix Meissen (17 papers)
  11. Mercy Ranjit (9 papers)
  12. Shaury Srivastav (5 papers)
  13. Julia Gong (6 papers)
  14. Fabian Falck (20 papers)
  15. Ozan Oktay (34 papers)
  16. Anja Thieme (7 papers)
  17. Matthew P. Lungren (43 papers)
  18. Maria Teodora Wetscherek (6 papers)
  19. Javier Alvarez-Valle (19 papers)
  20. Stephanie L. Hyland (20 papers)
Citations (17)