Preference Fine-Tuning for Factuality in Chest X-Ray Interpretation Models Without Human Feedback
The paper "Preference Fine-Tuning for Factuality in Chest X-Ray Interpretation Models Without Human Feedback" presents a methodical approach to enhancing vision-LLMs (VLMs) for chest X-ray (CXR) report generation by preference fine-tuning without requiring direct radiologist feedback. This work addresses a pertinent challenge in radiology: the balance between the high demand for accurate automated interpretation and the constraints of limited expert availability for model feedback.
Context and Approach
Radiology has rapidly adopted automated approaches owing to the frequency and complexity of imaging. Chest X-rays, a fundamental diagnostic tool, contribute heavily to radiologist workload because of their sheer volume and the critical need for timely, accurate interpretation. Existing VLMs, trained primarily with supervised fine-tuning (SFT), show promise but remain prone to hallucinations: erroneous content not grounded in the image. Drawing on methods emerging in general vision-language research, preference fine-tuning offers a remedy by aligning model outputs to predefined standards without extensive human input.
The authors propose using publicly available CXR datasets with an LLM-as-a-Judge mechanism to automate preference alignment. This sidesteps the usual need for costly radiologist feedback by relying on a scalable, automated evaluation performed by a pretrained LLM specialized for this task.
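To make the mechanism concrete, the sketch below shows one way such an automated preference-collection loop could be wired up. It is a minimal illustration, not the authors' implementation; `generate_candidates` and `green_score` are hypothetical stand-ins for the VLM's report sampler and the GREEN judge.

```python
# Minimal sketch of automated preference-pair construction with an
# LLM-as-a-Judge. Both helpers are hypothetical stand-ins:
#   generate_candidates(image, num_samples) -> list of draft reports
#   green_score(report, reference)          -> factuality score (higher is better)

def build_preference_pair(image, reference_report,
                          generate_candidates, green_score, n=4):
    """Sample n candidate reports, rank them by judge score, and keep
    the best/worst pair as (chosen, rejected) preference data."""
    candidates = generate_candidates(image, num_samples=n)
    ranked = sorted(candidates,
                    key=lambda report: green_score(report, reference_report))
    rejected, chosen = ranked[0], ranked[-1]
    return {"prompt": image, "chosen": chosen, "rejected": rejected}
```

Because the judge scores each draft against the reference report, the pipeline scales with dataset size rather than with radiologist hours.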
Key Contributions and Results
The paper advances several pivotal areas within medical imaging AI:
- Automated Preference Data Collection: Publicly available datasets with reference reports are paired with an LLM-as-a-Judge, the GREEN metric, to score the factuality of generated CXR reports (as sketched above). This yields high-quality preference datasets in a scalable manner.
- Evaluation of Direct Alignment Algorithms (DAAs): Five representative DAAs (DPO, KTO, IPO, SimPO, and ORPO) are systematically evaluated. All deliver substantial improvements over SFT baselines, with up to a 57.4% improvement in GREEN scores on the MIMIC-CXR and CheXpert Plus datasets; two of these objectives are written out in the sketch following this list.
- Addressing Reward Overoptimization: The authors examine report-length exploitation, a verbosity bias symptomatic of reward overoptimization. DPO in particular lengthens reports considerably, underscoring the need for explicit length regularization to preserve practical usability.
- Assessment of Alignment Tax: Evaluation on six additional diverse tasks reveals no significant degradation, mitigating the concern that preference alignment taxes performance on capabilities unrelated to report generation.
- Clinical Input: A reader study with radiologists indicates a preference for less verbose outputs and for models that align closely with clinical utility, as reflected in ORPO's win rate of 0.62 over SFT.
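For concreteness, the sketch below writes out two of the evaluated objectives as defined in the original DPO and SimPO papers; it is not a reproduction of the authors' training code. Sequence-level log-probabilities (`pi_*`, `ref_*`) and response token counts (`len_*`) are assumed to be precomputed tensors, and the hyperparameter defaults are illustrative.

```python
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO: the implicit reward is the policy/reference log-prob ratio.
    Note there is no normalization by response length."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()

def simpo_loss(pi_chosen, pi_rejected, len_chosen, len_rejected,
               beta=2.0, gamma=0.5):
    """SimPO: reference-free, with per-token (length-normalized)
    log-probs and a target margin gamma, which dampens the incentive
    to pad responses with extra text."""
    margin = (beta * pi_chosen / len_chosen
              - beta * pi_rejected / len_rejected - gamma)
    return -F.logsigmoid(margin).mean()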
Implications and Future Directions
The paper's methodology holds substantial promise for developing AI in high-stakes, low-data medical domains. Automated preference fine-tuning brings VLM outputs closer to the accuracy demanded in clinical settings without a logistical dependency on extensive expert annotation.
The implications are twofold: practically, the method can improve clinical reporting efficiency, potentially easing workforce constraints; theoretically, it pushes the boundary of feasible automation in complex domains such as healthcare with minimal human oversight. Future work might optimize DAAs further to manage verbosity and assess the framework's adaptability to other imaging modalities and clinical tasks.
In summary, this research underscores the productive intersection of AI and healthcare, offering actionable insights for improving factual accuracy in automated medical interpretation systems, a capability paramount for advancing AI-driven radiology.