
What Evidence Do Language Models Find Convincing? (2402.11782v2)

Published 19 Feb 2024 in cs.CL and cs.LG

Abstract: Retrieval-augmented LLMs are being increasingly tasked with subjective, contentious, and conflicting queries such as "is aspartame linked to cancer". To resolve these ambiguous queries, one must search through a large range of websites and consider "which, if any, of this evidence do I find convincing?". In this work, we study how LLMs answer this question. In particular, we construct ConflictingQA, a dataset that pairs controversial queries with a series of real-world evidence documents that contain different facts (e.g., quantitative results), argument styles (e.g., appeals to authority), and answers (Yes or No). We use this dataset to perform sensitivity and counterfactual analyses to explore which text features most affect LLM predictions. Overall, we find that current models rely heavily on the relevance of a website to the query, while largely ignoring stylistic features that humans find important such as whether a text contains scientific references or is written with a neutral tone. Taken together, these results highlight the importance of RAG corpus quality (e.g., the need to filter misinformation), and possibly even a shift in how LLMs are trained to better align with human judgements.


Summary

  • The paper demonstrates that language models prioritize evidence relevance over stylistic features, diverging from human judgment.
  • It introduces the ConflictingQA dataset, which pairs contentious queries with real-world evidence documents and measures each document's win-rate with the LLM.
  • Perturbation experiments show that increasing a document's relevance to the query substantially raises its win-rate, while stylistic edits have little or negative effect, pointing to where training should be refined.

Evaluating the Convincingness of Evidence in LLMs through ConflictingQA

Introduction to ConflictingQA

In the domain of retrieval-augmented LLMs, the ability to discern and select convincing evidence amid conflicting information is paramount. This capability becomes especially significant for contentious and subjective queries, such as the potential link between aspartame and cancer. The paper introduces ConflictingQA, a dataset designed to probe the criteria LLMs use to judge the convincingness of evidence documents. ConflictingQA pairs controversial queries with real-world evidence documents featuring diverse facts, argument styles, and conflicting answers, enabling a systematic analysis of which text features LLMs actually treat as convincing.
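As a rough illustration, each ConflictingQA example can be pictured as a contentious query attached to evidence paragraphs that take opposing stances. The record layout below is a hypothetical sketch for exposition; the field names are assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EvidenceDoc:
    """One real-world evidence paragraph retrieved for a query (hypothetical fields)."""
    url: str     # source website
    text: str    # paragraph content
    stance: str  # "Yes" or "No": the answer this paragraph supports

@dataclass
class ConflictingQAExample:
    """A contentious yes/no query with conflicting evidence (hypothetical fields)."""
    query: str                   # e.g., "Is aspartame linked to cancer?"
    evidence: List[EvidenceDoc]  # documents supporting both Yes and No
```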

Methodology

The creation of ConflictingQA involved several steps:

  • Generating Contentious Questions: Leveraging GPT-4, the authors generated contentious questions spanning a diverse range of topics, each constrained to a binary (Yes or No) answer format for simplicity.
  • Collecting Evidence Paragraphs: Real-world evidence paragraphs supporting both affirmative and negative responses to the generated questions were collected using the Google Search API. This process aimed to mirror the operational setting of retrieval-augmented LLMs.
  • Evaluating Convincingness: The dataset facilitated an evaluation of the degree to which an LLM's predictions align with the stance of a given evidence document, termed the document's win-rate (a rough sketch follows this list).
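The win-rate evaluation can be sketched roughly as below, under the assumption of a pairwise setup in which the model sees a document alongside one arguing the opposite answer and is asked the yes/no query. The prompt wording and the query_llm placeholder are illustrative assumptions, not the paper's exact implementation.

```python
from typing import List

def query_llm(prompt: str) -> str:
    """Placeholder for a chat-model call; expected to return 'Yes' or 'No'."""
    raise NotImplementedError  # wire this to whichever model is being evaluated

def win_rate(query: str, doc_text: str, doc_stance: str,
             opposing_texts: List[str]) -> float:
    """Fraction of pairings in which the model's answer matches doc_stance.

    Each pairing shows the document together with one paragraph arguing the
    opposite answer, then asks the yes/no query.
    """
    wins = 0
    for other in opposing_texts:
        prompt = (
            f"Question: {query}\n\n"
            f"Evidence A: {doc_text}\n\n"
            f"Evidence B: {other}\n\n"
            "Based only on the evidence above, answer Yes or No."
        )
        answer = query_llm(prompt).strip().lower()
        if answer.startswith(doc_stance.lower()):
            wins += 1
    return wins / max(len(opposing_texts), 1)
```

A document that most pairings side with gets a win-rate near 1; a document the model consistently overrules gets a win-rate near 0.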

Findings and Analysis

Sensitivity and Counterfactual Analyses

Sensitivity and counterfactual analyses were conducted to ascertain the impact of various text features on LLM predictions. These analyses highlighted that:

  • Relevance Over Stylistic Features: LLMs heavily favored the relevance of website evidence to the query over stylistic features such as neutrality of tone or inclusion of scientific references, which tend to be prioritized by humans.
  • Impact of Perturbations: Perturbations that increased a document's relevance to the query significantly improved its win-rate, while stylistic adjustments had neutral or negative effects.

These findings underscore a misalignment between LLM perceptions of convincingness and human judgment, emphasizing the tendency of LLMs to prioritize relevance.
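A minimal sketch of such a counterfactual perturbation, reusing the win_rate helper from the earlier sketch, could look like the following. The rewrite functions are illustrative assumptions; the paper's actual perturbations may be constructed differently (for instance, via LLM rewriting of the paragraphs).

```python
def add_query_terms(doc_text: str, query: str) -> str:
    """Relevance-oriented edit: open with a sentence restating the query's topic."""
    return f"This article directly addresses the question: {query.rstrip('?')}. " + doc_text

def add_reference_style(doc_text: str, query: str) -> str:
    """Stylistic edit: append a citation-like sentence (the query is unused here)."""
    return doc_text + " These findings are consistent with peer-reviewed studies [1, 2]."

def win_rate_delta(query: str, doc_text: str, doc_stance: str,
                   opposing_texts, perturb) -> float:
    """Change in a document's win-rate after applying a perturbation."""
    before = win_rate(query, doc_text, doc_stance, opposing_texts)
    after = win_rate(query, perturb(doc_text, query), doc_stance, opposing_texts)
    return after - before
```

Under the paper's findings, one would expect relevance-style edits like add_query_terms to yield positive deltas far more often than stylistic edits like add_reference_style.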

Theoretical and Practical Implications

From a theoretical perspective, this research illuminates how LLMs process and evaluate ambiguous evidence in ways that diverge from human reasoning. Practically, the results call for closer attention to RAG corpus quality, for example mechanisms that filter out misinformation, and possibly for changes in how LLMs are trained so that their judgments align more closely with human ones. The insights into what LLMs find convincing also bear on countering misinformation, on optimizing content for search and generative engines, and on the ethics of AI-generated content.

Future Directions

This paper opens several avenues for future research in LLMs and generative AI, including:

  • Integrating Diverse Information Forms: Exploring the impact of incorporating metadata and visual content on LLM judgments.
  • Addressing Synthetic Texts: Considering the burgeoning volume of LLM-generated content on the web, understanding its influence on LLM evaluations of convincingness is crucial.
  • Ethical and Societal Implications: Delving deeper into the broader repercussions of how LLMs interpret and generate content, with an eye towards more ethically aligned methodologies.

Conclusion

In conclusion, the paper presents a critical examination of how LLMs appraise the convincingness of evidence, highlighting a preference for relevance over the stylistic cues that humans weigh heavily. By leveraging the ConflictingQA dataset, the authors make the case for higher-quality retrieval corpora and for training adjustments that bring LLM judgments closer to human evaluations. The work both refines our understanding of how retrieval-augmented LLMs operate and prompts reflection on the ethical dimensions of AI systems that arbitrate contentious information.
