
What Evidence Do Language Models Find Convincing? (2402.11782v2)

Published 19 Feb 2024 in cs.CL and cs.LG

Abstract: Retrieval-augmented LLMs are being increasingly tasked with subjective, contentious, and conflicting queries such as "is aspartame linked to cancer". To resolve these ambiguous queries, one must search through a large range of websites and consider "which, if any, of this evidence do I find convincing?". In this work, we study how LLMs answer this question. In particular, we construct ConflictingQA, a dataset that pairs controversial queries with a series of real-world evidence documents that contain different facts (e.g., quantitative results), argument styles (e.g., appeals to authority), and answers (Yes or No). We use this dataset to perform sensitivity and counterfactual analyses to explore which text features most affect LLM predictions. Overall, we find that current models rely heavily on the relevance of a website to the query, while largely ignoring stylistic features that humans find important such as whether a text contains scientific references or is written with a neutral tone. Taken together, these results highlight the importance of RAG corpus quality (e.g., the need to filter misinformation), and possibly even a shift in how LLMs are trained to better align with human judgements.


Summary

  • The paper demonstrates that language models prioritize evidence relevance over stylistic features, diverging from human judgment.
  • It introduces the ConflictingQA dataset, which pairs contentious queries with real-world evidence documents and measures each document's win-rate with the LLM.
  • Perturbation experiments show that increasing a document's relevance to the query substantially raises its win-rate, while stylistic edits have little or negative effect, pointing to where training should be refined.

Evaluating the Convincingness of Evidence in LLMs through ConflictingQA

Introduction to ConflictingQA

In the domain of retrieval-augmented LLMs, the ability to discern and select convincing evidence amid conflicting information is paramount. This capability becomes especially significant for contentious and subjective queries, such as the potential link between aspartame and cancer. The paper introduces ConflictingQA, a dataset designed to probe the criteria LLMs use to judge the convincingness of evidence documents. ConflictingQA pairs controversial queries with real-world evidence documents featuring diverse facts, argument styles, and conflicting answers, enabling a systematic analysis of which text features LLMs actually treat as convincing.
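As a rough illustration, each ConflictingQA example can be pictured as a contentious query attached to evidence paragraphs that take opposing stances. The record layout below is a hypothetical sketch for exposition; the field names are assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EvidenceDoc:
    """One real-world evidence paragraph retrieved for a query (hypothetical fields)."""
    url: str     # source website
    text: str    # paragraph content
    stance: str  # "Yes" or "No": the answer this paragraph supports

@dataclass
class ConflictingQAExample:
    """A contentious yes/no query with conflicting evidence (hypothetical fields)."""
    query: str                   # e.g., "Is aspartame linked to cancer?"
    evidence: List[EvidenceDoc]  # documents supporting both Yes and No
```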

Methodology

The creation of ConflictingQA involved several steps:

  • Generating Contentious Questions: Leveraging GPT-4, the authors generated contentious questions spanning a diverse range of topics, each constrained to a binary (Yes or No) answer format for simplicity.
  • Collecting Evidence Paragraphs: Real-world evidence paragraphs supporting both affirmative and negative responses to the generated questions were collected using the Google Search API. This process aimed to mirror the operational setting of retrieval-augmented LLMs.
  • Evaluating Convincingness: The dataset facilitated an evaluation of the degree to which an LLM's predictions align with the stance of a given evidence document, termed the document's win-rate (a rough sketch follows this list).
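The win-rate evaluation can be sketched roughly as below, under the assumption of a pairwise setup in which the model sees a document alongside one arguing the opposite answer and is asked the yes/no query. The prompt wording and the query_llm placeholder are illustrative assumptions, not the paper's exact implementation.

```python
from typing import List

def query_llm(prompt: str) -> str:
    """Placeholder for a chat-model call; expected to return 'Yes' or 'No'."""
    raise NotImplementedError  # wire this to whichever model is being evaluated

def win_rate(query: str, doc_text: str, doc_stance: str,
             opposing_texts: List[str]) -> float:
    """Fraction of pairings in which the model's answer matches doc_stance.

    Each pairing shows the document together with one paragraph arguing the
    opposite answer, then asks the yes/no query.
    """
    wins = 0
    for other in opposing_texts:
        prompt = (
            f"Question: {query}\n\n"
            f"Evidence A: {doc_text}\n\n"
            f"Evidence B: {other}\n\n"
            "Based only on the evidence above, answer Yes or No."
        )
        answer = query_llm(prompt).strip().lower()
        if answer.startswith(doc_stance.lower()):
            wins += 1
    return wins / max(len(opposing_texts), 1)
```

A document that most pairings side with gets a win-rate near 1; a document the model consistently overrules gets a win-rate near 0.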

Findings and Analysis

Sensitivity and Counterfactual Analyses

Sensitivity and counterfactual analyses were conducted to ascertain the impact of various text features on LLM predictions. These analyses highlighted that:

  • Relevance Over Stylistic Features: LLMs heavily favored the relevance of website evidence to the query over stylistic features such as neutrality of tone or inclusion of scientific references, which tend to be prioritized by humans.
  • Impact of Perturbations: Perturbations that increased a document's relevance to the query significantly improved its win-rate, while stylistic adjustments had neutral or negative effects.

These findings underscore a misalignment between LLM perceptions of convincingness and human judgment, emphasizing the tendency of LLMs to prioritize relevance.
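A minimal sketch of such a counterfactual perturbation, reusing the win_rate helper from the earlier sketch, could look like the following. The rewrite functions are illustrative assumptions; the paper's actual perturbations may be constructed differently (for instance, via LLM rewriting of the paragraphs).

```python
def add_query_terms(doc_text: str, query: str) -> str:
    """Relevance-oriented edit: open with a sentence restating the query's topic."""
    return f"This article directly addresses the question: {query.rstrip('?')}. " + doc_text

def add_reference_style(doc_text: str, query: str) -> str:
    """Stylistic edit: append a citation-like sentence (the query is unused here)."""
    return doc_text + " These findings are consistent with peer-reviewed studies [1, 2]."

def win_rate_delta(query: str, doc_text: str, doc_stance: str,
                   opposing_texts, perturb) -> float:
    """Change in a document's win-rate after applying a perturbation."""
    before = win_rate(query, doc_text, doc_stance, opposing_texts)
    after = win_rate(query, perturb(doc_text, query), doc_stance, opposing_texts)
    return after - before
```

Under the paper's findings, one would expect relevance-style edits like add_query_terms to yield positive deltas far more often than stylistic edits like add_reference_style.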

Theoretical and Practical Implications

From a theoretical perspective, this research illuminates how LLMs process and evaluate ambiguous evidence in ways that diverge from human reasoning. Practically, the results call for closer attention to RAG corpus quality, for example mechanisms that filter out misinformation, and possibly for changes in how LLMs are trained so that their judgments align more closely with human ones. The insights into what LLMs find convincing also bear on countering misinformation, on optimizing content for search and generative engines, and on the ethics of AI-generated content.

Future Directions

This paper opens several avenues for future research in LLMs and generative AI, including:

  • Integrating Diverse Information Forms: Exploring the impact of incorporating metadata and visual content on LLM judgments.
  • Addressing Synthetic Texts: Considering the burgeoning volume of LLM-generated content on the web, understanding its influence on LLM evaluations of convincingness is crucial.
  • Ethical and Societal Implications: Delving deeper into the broader repercussions of how LLMs interpret and generate content, with an eye towards more ethically aligned methodologies.

Conclusion

In conclusion, the paper presents a critical examination of how LLMs appraise the convincingness of evidence, highlighting a preference for relevance over the stylistic cues that humans weigh heavily. By leveraging the ConflictingQA dataset, the authors make the case for higher-quality retrieval corpora and for training adjustments that bring LLM judgments closer to human evaluations. The work both refines our understanding of how retrieval-augmented LLMs operate and prompts reflection on the ethical dimensions of AI systems that arbitrate contentious information.
