- The paper demonstrates that adding external evidence to ChatGPT prompts reduces the accuracy of health responses from 80% to 63%.
- It employs two experimental paradigms—question-only and evidence-biased—to uncover the influence of prompt formulation on answer correctness.
- The study underscores the need for improved evidence integration strategies in retrieval-augmented systems for reliable health information.
The paper examines how prompt-provided knowledge interacts with the knowledge stored in a generative pre-trained LLM, specifically ChatGPT (Chat Generative Pre-trained Transformer), in the context of health information question answering. The paper focuses on questions about treatment efficacy derived from the TREC Health Misinformation track and is organized around two primary research questions.
Methodological Approach
The investigation is split into two experimental paradigms:
- RQ1 – General Effectiveness:
- ChatGPT is prompted with only the natural language question.
- The questions, drawn from 100 topics, ask whether treatment X positively influences condition Y using a simplified "Yes/No" prompt format (both prompt conditions are sketched after this list).
- ChatGPT’s responses, accompanied by an explanation, are compared against a ground truth based on current medical practice.
- The analysis shows an overall accuracy of 80%, with similar error rates for both positive and negative stances.
- RQ2 – Evidence Biased Effectiveness:
- In this scenario, ChatGPT is given an additional segment in the prompt: a passage taken from web search results.
- The additional evidence is categorized into two types: supporting evidence (judged as “Supportive”) and contrary evidence (judged as “Dissuades”).
- For each TREC Health Misinformation topic (35 topics with document-level assessments were used), up to three passages per evidence type were integrated into the prompt alongside the original question.
- Notably, while the prompt requests a binary “Yes/No” answer, ChatGPT sometimes issues a variant answer that requires manual interpretation. The final evaluation, aggregating 177 responses, indicates a reduced accuracy of 63%.
- An analysis of answer transitions shows that when ChatGPT flips its answer upon receiving evidence, the switch is typically detrimental to correctness. In several instances, answers that were correct in the question-only condition are overturned when conflicting evidence is provided.
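The paper's exact prompt wording is not reproduced in this summary, so the following is a minimal sketch of the two prompt conditions; the template text, field names, and the `build_prompt` helper are illustrative assumptions rather than the authors' actual prompts.

```python
# Illustrative sketch of the two prompt conditions (question-only vs. evidence-biased).
# The wording below is assumed for illustration and is not the paper's exact prompt.

QUESTION_ONLY_TEMPLATE = (
    "Answer with Yes or No, then give a short explanation.\n"
    "Question: Does {treatment} help to treat {condition}?"
)

EVIDENCE_BIASED_TEMPLATE = (
    "Answer with Yes or No, then give a short explanation.\n"
    "Consider the following web passage when answering.\n"
    "Passage: {passage}\n"
    "Question: Does {treatment} help to treat {condition}?"
)


def build_prompt(treatment: str, condition: str, passage: str | None = None) -> str:
    """Return a question-only prompt, or an evidence-biased one if a passage is supplied."""
    if passage is None:
        return QUESTION_ONLY_TEMPLATE.format(treatment=treatment, condition=condition)
    return EVIDENCE_BIASED_TEMPLATE.format(
        treatment=treatment, condition=condition, passage=passage
    )


# Hypothetical topic in the style of the TREC Health Misinformation track:
print(build_prompt("vitamin C", "the common cold"))
```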
Technical Observations
- Prompt Integration and Stochastic Generation:
- The paper underscores the significant impact of prompt formulation on the model’s output. Evidence provided via the prompt can override the inherent knowledge encoded during pre-training, leading to answer flips.
- Detailed tokenization considerations are provided; for instance, the evidence passages are trimmed to a maximum of 2,200 tokens (using NLTK’s word_tokenize) to fit within ChatGPT’s input constraints. A minimal sketch of this trimming step follows this list.
- Numerical Findings:
- Accuracy drops from 80% (question-only) to 63% (evidence-biased), illustrating that additional, possibly biased, external evidence can markedly affect model performance.
- The Sankey diagram presented in the paper traces how responses shift between the two prompting conditions; roughly half of the errors originate from cases where the evidence steers correct question-only responses into incorrect answers.
- Implications for Retrieve-Then-Generate Pipelines:
- The findings carry a direct implication for systems that retrieve external evidence to augment model answers: particularly in the health domain, the correctness and reliability of the retrieved evidence strongly determine the quality of the final output.
- Because prompt-supplied evidence can override answers the model would otherwise get right, better strategies are needed to calibrate how prompt knowledge is integrated in such pipelines.
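As a concrete illustration of the trimming step noted above, the sketch below cuts an evidence passage to at most 2,200 NLTK word tokens; the 2,200 limit is the figure reported in the paper, while the rejoin-with-spaces detokenization (and the `trim_passage` helper itself) is an assumption for illustration. Note that NLTK word tokens differ from ChatGPT's subword tokens, so the limit acts as a conservative proxy for the model's input constraint.

```python
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # tokenizer data required by word_tokenize

MAX_TOKENS = 2200  # limit reported in the paper (NLTK word tokens, not model subwords)


def trim_passage(passage: str, max_tokens: int = MAX_TOKENS) -> str:
    """Trim an evidence passage to at most max_tokens NLTK word tokens.

    Rejoining tokens with spaces does not restore the original spacing
    around punctuation, but it is adequate for building a prompt.
    """
    tokens = word_tokenize(passage)
    if len(tokens) <= max_tokens:
        return passage
    return " ".join(tokens[:max_tokens])
```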
Limitations and Future Work
- The authors acknowledge the stochastic nature of ChatGPT’s outputs but do not analyze variability across multiple runs of the same questions; a simple repeated-run check is sketched after this list.
- The paper refrains from an in-depth analysis of the model-provided explanations, particularly regarding whether the factual claims are accurate or hallucinatory—a concern noted in related literature on hallucinations in natural language generation.
- Multi-turn conversation capabilities were not leveraged even though such interactions might allow for iterative correction or clarification, which could mitigate some of the negative impacts of evidence bias.
- Further work is recommended to analyze the attributes of evidence passages that most strongly influence answer flipping and to explore alternative prompt formulations that could balance model knowledge and external evidence more effectively.
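The stochasticity limitation above could be probed with a repeated-sampling check such as the one sketched below. This is an illustration only: the `ask` callable stands in for whatever API client is used (the paper does not describe one), and the Yes/No normalisation mirrors, but does not reproduce, the manual interpretation step described for RQ2.

```python
from collections import Counter
from typing import Callable


def answer_stability(ask: Callable[[str], str], prompt: str, runs: int = 10) -> Counter:
    """Issue the same prompt several times and tally the normalised answers.

    `ask` is a placeholder for a model API call returning raw response text.
    """
    def normalise(text: str) -> str:
        head = text.strip().lower()
        if head.startswith("yes"):
            return "Yes"
        if head.startswith("no"):
            return "No"
        return "Other"  # variant answers that would need manual interpretation

    return Counter(normalise(ask(prompt)) for _ in range(runs))
```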
Conclusion
The paper provides a meticulous empirical study of how prompt-provided evidence modulates ChatGPT responses in a health information retrieval context. The core contribution is the finding that external evidence, even when well structured, can adversely affect answer correctness by overriding the model’s internal knowledge, reducing accuracy from 80% to 63%. This insight is particularly pertinent for the design of reliable and robust question-answering systems in domains where information accuracy is paramount.