
Preference Fine-Tuning for Factuality in Chest X-Ray Interpretation Models Without Human Feedback (2410.07025v1)

Published 9 Oct 2024 in cs.CV and cs.CL

Abstract: Radiologists play a crucial role by translating medical images into medical reports. However, the field faces staffing shortages and increasing workloads. While automated approaches using vision-language models (VLMs) show promise as assistants, they require exceptionally high accuracy. Most current VLMs in radiology rely solely on supervised fine-tuning (SFT). Meanwhile, in the general domain, additional preference fine-tuning has become standard practice. The challenge in radiology lies in the prohibitive cost of obtaining radiologist feedback. We propose a scalable automated preference alignment technique for VLMs in radiology, focusing on chest X-ray (CXR) report generation. Our method leverages publicly available datasets with an LLM-as-a-Judge mechanism, eliminating the need for additional expert radiologist feedback. We evaluate and benchmark five direct alignment algorithms (DAAs). Our results show up to a 57.4% improvement in average GREEN scores, an LLM-based metric for evaluating CXR reports, and a 9.2% increase in the average across six metrics (domain-specific and general), compared to the SFT baseline. We study reward overoptimization via length exploitation, with reports lengthening by up to 3.2x. To assess a potential alignment tax, we benchmark on six additional diverse tasks, finding no significant degradations. A reader study involving four board-certified radiologists indicates win rates of up to 0.62 over the SFT baseline, while significantly penalizing verbosity. Our analysis provides actionable insights for the development of VLMs in high-stakes fields like radiology.

Preference Fine-Tuning for Factuality in Chest X-Ray Interpretation Models Without Human Feedback

The paper "Preference Fine-Tuning for Factuality in Chest X-Ray Interpretation Models Without Human Feedback" presents a methodical approach to enhancing vision-LLMs (VLMs) for chest X-ray (CXR) report generation by preference fine-tuning without requiring direct radiologist feedback. This work addresses a pertinent challenge in radiology: the balance between the high demand for accurate automated interpretation and the constraints of limited expert availability for model feedback.

Context and Approach

Radiology has rapidly adopted automated approaches because of the volume and complexity of medical imaging. Chest X-rays, a fundamental diagnostic tool, weigh heavily on radiologist workload given their sheer volume and the need for timely, accurate interpretation. Existing VLMs, trained primarily with supervised fine-tuning (SFT), show promise but struggle with hallucinations: generated content not grounded in the image. Drawing on methods emerging in general vision-language research, preference fine-tuning offers a remedy by aligning model outputs with preferred responses without extensive human input.

The authors propose combining publicly available CXR datasets with an LLM-as-a-Judge mechanism to automate preference alignment. This circumvents the usual need for costly radiologist feedback by scoring model outputs automatically with a pretrained LLM built for report evaluation.
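
The mechanics are straightforward to sketch. Below is a minimal, hypothetical illustration of how preference pairs can be assembled by ranking sampled candidate reports with a scalar judge; the function names, the toy judge, and the best-versus-worst pairing rule are assumptions for illustration, standing in for the paper's GREEN-based pipeline rather than reproducing it.

```python
from typing import Callable, List

def build_preference_pair(
    candidates: List[str],
    reference_report: str,
    judge: Callable[[str, str], float],
) -> dict:
    """Turn sampled reports into a (chosen, rejected) pair via a scalar judge.

    `judge(reference, candidate)` is any factuality scorer; in the paper's
    setting it would be the GREEN metric, applied against the reference
    report from a public dataset, so no new expert annotation is needed.
    """
    # Rank every sampled report against the reference report.
    scored = sorted(candidates, key=lambda c: judge(reference_report, c))
    # Worst-scoring report becomes "rejected", best-scoring becomes "chosen".
    return {"chosen": scored[-1], "rejected": scored[0]}

# Toy usage with a stand-in judge: token-set overlap with the reference.
def toy_judge(reference: str, candidate: str) -> float:
    ref, cand = set(reference.split()), set(candidate.split())
    return len(ref & cand) / max(len(ref | cand), 1)

pair = build_preference_pair(
    candidates=["No acute findings.", "Large left pleural effusion."],
    reference_report="No acute cardiopulmonary findings.",
    judge=toy_judge,
)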

Key Contributions and Results

The paper advances several pivotal areas within medical imaging AI:

  1. Automated Preference Data Collection: Publicly available datasets with reference reports are used to implement an LLM-as-a-Judge mechanism, specifically the GREEN metric, to assess the factuality of generated CXR reports. This yields high-quality preference data at scale.
  2. Evaluation of Direct Alignment Algorithms (DAAs): Five representative DAAs (DPO, KTO, IPO, SimPO, and ORPO) are systematically evaluated; a minimal sketch of the DPO objective follows this list. The findings show significant improvements over the SFT baseline, notably up to a 57.4% gain in GREEN scores on the MIMIC-CXR and CheXpert Plus datasets.
  3. Addressing Reward Overoptimization: The authors examine report length exploitation, observing a verbosity bias associated with reward overoptimization. DPO in particular lengthens reports considerably (up to 3.2x), underscoring the need for explicit regularization to maintain practical usability.
  4. Assessment of Alignment Tax: The paper finds no significant degradation across six additional diverse tasks, easing concerns that alignment might harm performance on unrelated capabilities.
  5. Clinical Validation: A reader study involving four board-certified radiologists shows a preference for less verbose outputs and favors models that align with clinical utility, as reflected in ORPO's win rate of 0.62 over the SFT baseline.
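
To make item 2 concrete, here is a minimal sketch of the DPO objective, the simplest of the five DAAs; the `beta` value and the use of summed token log-probabilities are common-practice assumptions, not necessarily the paper's exact training configuration.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen: torch.Tensor,
             policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor,
             ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO objective over a batch of preference pairs.

    Inputs are per-example summed token log-probabilities of the chosen
    and rejected reports under the policy and the frozen SFT reference.
    """
    # Implicit rewards: beta-scaled log-probability ratios vs. the reference.
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    # Maximize the reward margin of the chosen report over the rejected one.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

Because summed log-probabilities grow with sequence length, the margin in this loss can be inflated simply by producing longer outputs, which is one mechanism behind the length exploitation in item 3; SimPO, by contrast, length-normalizes the log-probabilities and drops the reference model.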

Implications and Future Directions

The paper's methodology holds substantial promise for the development of AI in high-stakes, low-data medical domains. Automated preference fine-tuning aligns VLMs closer to the accuracy demanded in clinical settings without the logistical dependency on extensive expert annotations.

The implications are twofold. Practically, this method can improve clinical reporting efficiency, potentially alleviating workforce constraints. Theoretically, it pushes the boundary of feasible automation in complex domains like healthcare under minimal human oversight. Future work might explore regularizing DAAs to manage verbosity and assess the framework's adaptability to other imaging modalities and clinical tasks.

In summary, this research underscores the vital intersection of AI and healthcare, offering actionable insights for improving factual accuracy in automated medical interpretation systems, which is paramount for advancing AI-driven radiology.

Authors (11)
  1. Dennis Hein
  2. Zhihong Chen
  3. Sophie Ostmeier
  4. Justin Xu
  5. Maya Varma
  6. Eduardo Pontes Reis
  7. Arne Edward Michalson
  8. Christian Bluethgen
  9. Hyun Joo Shin
  10. Curtis Langlotz
  11. Akshay S Chaudhari