- The paper introduces MedFuzz, an adversarial testing framework that systematically exposes LLM vulnerabilities in medical question answering.
- It uses an attacker LLM to iteratively inject bias-inducing patient characteristics into benchmark questions, revealing significant accuracy declines in models such as GPT-4.
- The study underscores the need for robust evaluation methods to ensure reliable LLM performance in real-world clinical decision support.
Overview of MedFuzz: Exploring the Robustness of LLMs in Medical Question Answering
The paper "MedFuzz: Exploring the Robustness of LLMs in Medical Question Answering," authored by researchers from Microsoft Research, MIT, Helivan Research, Johns Hopkins University, and others, critically examines the limitations of LLMs in medical question-answering by introducing MedFuzz, an adversarial testing method. The primary goal of MedFuzz is to evaluate whether the high benchmark performances of LLMs generalize to more realistic clinical environments, where the assumptions underlying benchmark datasets may not hold.
Research Context and Motivation
The success of LLMs in achieving near-human performance on medical question-answering benchmarks such as MedQA, which is built from USMLE-style questions, has sparked interest in their potential use in clinical decision support. However, benchmarks often simplify complex real-world scenarios into structured multiple-choice formats that may not fully capture the nuanced and unpredictable nature of clinical practice. The paper argues that high accuracy on these benchmarks may not translate directly into effective and reliable clinical performance.
Approach and Methodology
MedFuzz is an adversarial testing technique inspired by fuzzing in software testing, in which unexpected or malformed inputs are fed to a system to uncover vulnerabilities. MedFuzz specifically violates assumptions made by medical question-answering benchmarks, modifying questions in ways that would leave the correct answer obvious to a human expert yet trick the LLM into an erroneous response. The method focuses in particular on how performance deteriorates when benchmark questions are altered with misleading patient characteristics that appeal to social biases and stereotypes.
The process involves:
- Targeting Assumptions: Identifying benchmark assumptions that do not generalize well to clinical settings.
- Iterative Attacks: Using an attacker LLM to iteratively modify benchmark questions, introducing bias-driven distractors while keeping the correct answer unchanged (a sketch of this loop, and of the permutation test from the final step, follows this list).
- Performance Evaluation: Comparing the target LLM's performance on original vs. modified questions to assess robustness.
- Significance Testing: Employing permutation tests to statistically validate the significance of individual successful attacks.
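A minimal sketch of the attack loop and of such a permutation test is given below. The helper names (`attacker_rewrite`, `target_answer`), the turn budget, and the repeated-sampling framing of the significance test are assumptions made for illustration; they are not the paper's actual implementation.

```python
import random

# Hypothetical helpers -- stand-ins for real chat-completion calls, not the paper's code.

def attacker_rewrite(question: str, failed_attempts: list[str]) -> str:
    """Ask the attacker LLM to inject clinically irrelevant but potentially
    biasing patient characteristics, leaving the correct answer unchanged."""
    raise NotImplementedError  # would wrap an LLM API call in a real setup


def target_answer(question: str, options: dict[str, str]) -> str:
    """Ask the target LLM to choose one of the multiple-choice options."""
    raise NotImplementedError  # would wrap an LLM API call in a real setup


def medfuzz_attack(question, options, correct_option, max_turns=5):
    """Iteratively fuzz one benchmark item until the target LLM answers
    incorrectly or the turn budget runs out."""
    fuzzed, history = question, []
    for _ in range(max_turns):
        fuzzed = attacker_rewrite(fuzzed, history)
        answer = target_answer(fuzzed, options)
        if answer != correct_option:            # successful attack
            return {"question": fuzzed, "answer": answer, "success": True}
        history.append(fuzzed)                  # let the attacker learn from failures
    return {"question": fuzzed, "answer": answer, "success": False}


def permutation_test(correct_original, correct_fuzzed, n_perm=10_000, seed=0):
    """One-sided permutation test on 0/1 correctness indicators gathered by
    repeatedly sampling the target LLM on the original and fuzzed question.
    Returns the p-value for the observed accuracy drop arising by chance."""
    rng = random.Random(seed)
    observed = (sum(correct_original) / len(correct_original)
                - sum(correct_fuzzed) / len(correct_fuzzed))
    pooled = list(correct_original) + list(correct_fuzzed)
    n = len(correct_original)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = sum(pooled[:n]) / n - sum(pooled[n:]) / (len(pooled) - n)
        if diff >= observed:
            extreme += 1
    return extreme / n_perm
```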
Experimental Setup and Results
The experiments were conducted on the MedQA dataset, focusing on GPT-4 and its predecessor GPT-3.5 as target LLMs. The attacker LLM was GPT-4, which iteratively modified benchmark questions by introducing additional patient characteristics irrelevant to clinical decision-making but likely to mislead the LLM due to social biases.
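To make the original-versus-fuzzed comparison concrete, a small sketch follows; the record layout and field names are assumptions for illustration, not the paper's evaluation harness.

```python
def accuracy(records, answer_key):
    """records: dicts holding the correct option and the model's answers
    under 'answer_original' and 'answer_fuzzed'."""
    return sum(r[answer_key] == r["correct_option"] for r in records) / len(records)


def report_accuracy_drop(records):
    """Print accuracy on the original items, on their MedFuzzed versions,
    and the resulting drop."""
    acc_orig = accuracy(records, "answer_original")
    acc_fuzz = accuracy(records, "answer_fuzzed")
    print(f"original: {acc_orig:.1%}  fuzzed: {acc_fuzz:.1%}  drop: {acc_orig - acc_fuzz:.1%}")


# Toy example with two items:
report_accuracy_drop([
    {"correct_option": "B", "answer_original": "B", "answer_fuzzed": "C"},
    {"correct_option": "A", "answer_original": "A", "answer_fuzzed": "A"},
])  # original: 100.0%  fuzzed: 50.0%  drop: 50.0%
```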
Key findings include:
- Accuracy Decline: The LLMs' accuracy significantly declined when evaluated on MedFuzzed data, highlighting the impact of violated assumptions on performance.
- Faithfulness of Explanations: A notable proportion of the LLMs' chain-of-thought explanations did not mention the misleading modifications, raising concerns about the reliability of LLM-generated rationales in clinical contexts (a crude detection heuristic is sketched after this list).
- Case Studies: Detailed case studies provided illustrative examples in which added patient characteristics such as race, socioeconomic status, and criminal record swayed the LLM's reasoning and led to incorrect answers.
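One crude way to screen for the faithfulness issue noted above is a keyword match between the injected details and the model's chain of thought. The sketch below is only an assumed heuristic, not the paper's analysis method, and paraphrased mentions would slip through it.

```python
def mentions_injected_details(chain_of_thought: str, injected_terms: list[str]) -> bool:
    """Heuristic faithfulness screen: does the rationale mention any of the
    misleading details the attacker injected?  Keyword matching misses
    paraphrases, so treat this as a first-pass filter only."""
    cot = chain_of_thought.lower()
    return any(term.lower() in cot for term in injected_terms)


# The rationale below never refers to the injected socioeconomic details.
cot = "The combination of fever, tachycardia, and productive cough points to option C."
print(mentions_injected_details(cot, ["unemployed", "history of incarceration"]))  # False
```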
Implications and Future Work
The paper has both practical and theoretical implications:
- Practical Implications: The findings stress that benchmark performance alone is not a sufficient basis for deploying LLMs in clinical settings. They underscore the need for robust evaluation frameworks that include adversarial testing to uncover and mitigate potential biases and failure modes.
- Theoretical Implications: The paper contributes to the understanding of generalization limits in LLMs, particularly in high-stakes domains like medicine where biases can have severe consequences.
Future research directions include extending MedFuzz to other domains and fine-tuning models to improve robustness against adversarial attacks. The methodology could also be adapted to evaluate LLMs on other professional exams and on the real-world applications those exams are meant to certify.
Conclusion
Overall, "MedFuzz: Exploring the Robustness of LLMs in Medical Question Answering" provides a comprehensive adversarial framework to test and illustrate the gaps in the generalizability of LLMs for clinical decision support. By introducing MedFuzz, the authors highlight the necessity of rigorous evaluation beyond traditional benchmarks, enabling the development of more reliable and ethically sound AI systems in healthcare and beyond.