- The paper demonstrates that adding external evidence to ChatGPT prompts reduces the accuracy of health responses from 80% to 63%.
- It employs two experimental paradigms—question-only and evidence-biased—to uncover the influence of prompt formulation on answer correctness.
- The study underscores the need for improved evidence integration strategies in retrieval-augmented systems for reliable health information.
The paper examines how prompt-provided knowledge interacts with the knowledge stored in a generative pre-trained LLM, specifically ChatGPT (Chat Generative Pre-trained Transformer), in the context of health information question answering. The paper focuses on questions about treatment efficacy derived from the TREC Health Misinformation track and is organized around two primary research questions.
Methodological Approach
The investigation is split into two experimental paradigms:
- RQ1 – General Effectiveness:
- ChatGPT is prompted with only the natural language question.
- The questions, drawn from 100 topics, ask whether treatment X positively influences condition Y using a simplified "Yes/No" prompt format (both prompt conditions are sketched after this list).
- ChatGPT’s responses, accompanied by an explanation, are compared against a ground truth based on current medical practice.
- The analysis shows an overall accuracy of 80%, with similar error rates for both positive and negative stances.
- RQ2 – Evidence Biased Effectiveness:
- In this scenario, ChatGPT is given an additional segment in the prompt: a passage taken from web search results.
- The additional evidence is categorized into two types: supporting evidence (judged as “Supportive”) and contrary evidence (judged as “Dissuades”).
- For each TREC Health Misinformation topic (35 topics with document-level assessments were used), up to three passages per evidence type were integrated into the prompt alongside the original question.
- Notably, while the prompt requests a binary “Yes/No” answer, ChatGPT sometimes issues a variant answer that requires manual interpretation. The final evaluation, aggregating 177 responses, indicates a reduced accuracy of 63%.
- An analysis of answer transitions shows that when ChatGPT flips its answer upon receiving evidence, the switch is typically detrimental to correctness. In several instances, answers that were correct in the question-only condition are overturned when conflicting evidence is provided.
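The paper's exact prompt wording is not reproduced in this summary, so the following is a minimal sketch of the two prompt conditions; the template text, field names, and the `build_prompt` helper are illustrative assumptions rather than the authors' actual prompts.

```python
# Illustrative sketch of the two prompt conditions (question-only vs. evidence-biased).
# The wording below is assumed for illustration and is not the paper's exact prompt.

QUESTION_ONLY_TEMPLATE = (
    "Answer with Yes or No, then give a short explanation.\n"
    "Question: Does {treatment} help to treat {condition}?"
)

EVIDENCE_BIASED_TEMPLATE = (
    "Answer with Yes or No, then give a short explanation.\n"
    "Consider the following web passage when answering.\n"
    "Passage: {passage}\n"
    "Question: Does {treatment} help to treat {condition}?"
)


def build_prompt(treatment: str, condition: str, passage: str | None = None) -> str:
    """Return a question-only prompt, or an evidence-biased one if a passage is supplied."""
    if passage is None:
        return QUESTION_ONLY_TEMPLATE.format(treatment=treatment, condition=condition)
    return EVIDENCE_BIASED_TEMPLATE.format(
        treatment=treatment, condition=condition, passage=passage
    )


# Hypothetical topic in the style of the TREC Health Misinformation track:
print(build_prompt("vitamin C", "the common cold"))
```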
Technical Observations
- Prompt Integration and Stochastic Generation:
- The paper underscores the significant impact of prompt formulation on the model’s output. Evidence provided via the prompt can override the inherent knowledge encoded during pre-training, leading to answer flips.
- Detailed tokenization considerations are provided; for instance, the evidence passages are trimmed to a maximum of 2,200 tokens (using NLTK’s word_tokenize) to fit within ChatGPT’s input constraints. A minimal sketch of this trimming step follows this list.
- Numerical Findings:
- Accuracy drops from 80% (question-only) to 63% (evidence-biased), illustrating that additional, possibly biased, external evidence can markedly affect model performance.
- The Sankey diagram presented in the paper traces how responses shift between the two prompting conditions; roughly half of the errors originate from cases where the evidence steers correct question-only responses into incorrect answers.
- Implications for Retrieve-Then-Generate Pipelines:
- The findings carry a direct implication for systems that retrieve external evidence to augment model answers: particularly in the health domain, the correctness and reliability of the retrieved evidence strongly determine the quality of the final output.
- Because prompt-supplied evidence can override answers the model would otherwise get right, better strategies are needed to calibrate how prompt knowledge is integrated in such pipelines.
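As a concrete illustration of the trimming step noted above, the sketch below cuts an evidence passage to at most 2,200 NLTK word tokens; the 2,200 limit is the figure reported in the paper, while the rejoin-with-spaces detokenization (and the `trim_passage` helper itself) is an assumption for illustration. Note that NLTK word tokens differ from ChatGPT's subword tokens, so the limit acts as a conservative proxy for the model's input constraint.

```python
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # tokenizer data required by word_tokenize

MAX_TOKENS = 2200  # limit reported in the paper (NLTK word tokens, not model subwords)


def trim_passage(passage: str, max_tokens: int = MAX_TOKENS) -> str:
    """Trim an evidence passage to at most max_tokens NLTK word tokens.

    Rejoining tokens with spaces does not restore the original spacing
    around punctuation, but it is adequate for building a prompt.
    """
    tokens = word_tokenize(passage)
    if len(tokens) <= max_tokens:
        return passage
    return " ".join(tokens[:max_tokens])
```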
Limitations and Future Work
- The authors acknowledge the stochastic nature of ChatGPT’s outputs but do not analyze variability across multiple runs of the same questions; a simple repeated-run check is sketched after this list.
- The paper refrains from an in-depth analysis of the model-provided explanations, particularly regarding whether the factual claims are accurate or hallucinatory—a concern noted in related literature on hallucinations in natural language generation.
- Multi-turn conversation capabilities were not leveraged even though such interactions might allow for iterative correction or clarification, which could mitigate some of the negative impacts of evidence bias.
- Further work is recommended to analyze the attributes of evidence passages that most strongly influence answer flipping and to explore alternative prompt formulations that could balance model knowledge and external evidence more effectively.
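The stochasticity limitation above could be probed with a repeated-sampling check such as the one sketched below. This is an illustration only: the `ask` callable stands in for whatever API client is used (the paper does not describe one), and the Yes/No normalisation mirrors, but does not reproduce, the manual interpretation step described for RQ2.

```python
from collections import Counter
from typing import Callable


def answer_stability(ask: Callable[[str], str], prompt: str, runs: int = 10) -> Counter:
    """Issue the same prompt several times and tally the normalised answers.

    `ask` is a placeholder for a model API call returning raw response text.
    """
    def normalise(text: str) -> str:
        head = text.strip().lower()
        if head.startswith("yes"):
            return "Yes"
        if head.startswith("no"):
            return "No"
        return "Other"  # variant answers that would need manual interpretation

    return Counter(normalise(ask(prompt)) for _ in range(runs))
```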
Conclusion
The paper provides a meticulous empirical study of how prompt-provided evidence modulates ChatGPT responses in a health information retrieval context. The core contribution is the finding that external evidence, even when well structured, can adversely affect answer correctness by overriding the model’s internal knowledge, reducing accuracy from 80% to 63%. This insight is particularly pertinent for the design of reliable and robust question-answering systems in domains where information accuracy is paramount.