
Evaluating Verifiability in Generative Search Engines (2304.09848v2)

Published 19 Apr 2023 in cs.CL and cs.IR

Abstract: Generative search engines directly generate responses to user queries, along with in-line citations. A prerequisite trait of a trustworthy generative search engine is verifiability, i.e., systems should cite comprehensively (high citation recall; all statements are fully supported by citations) and accurately (high citation precision; every cite supports its associated statement). We conduct human evaluation to audit four popular generative search engines -- Bing Chat, NeevaAI, perplexity.ai, and YouChat -- across a diverse set of queries from a variety of sources (e.g., historical Google user queries, dynamically-collected open-ended questions on Reddit, etc.). We find that responses from existing generative search engines are fluent and appear informative, but frequently contain unsupported statements and inaccurate citations: on average, a mere 51.5% of generated sentences are fully supported by citations and only 74.5% of citations support their associated sentence. We believe that these results are concerningly low for systems that may serve as a primary tool for information-seeking users, especially given their facade of trustworthiness. We hope that our results further motivate the development of trustworthy generative search engines and help researchers and users better understand the shortcomings of existing commercial systems.

Evaluating Verifiability in Generative Search Engines

This paper presents a comprehensive evaluation of the verifiability of popular generative search engines, specifically Bing Chat, NeevaAI, perplexity.ai, and YouChat. The researchers investigate the extent to which these systems provide responses that are fully supported by citations, a prerequisite for their reliability as information sources. The paper employs a detailed human evaluation protocol across a diverse array of queries drawn from various sources, such as historical Google user queries and open-ended Reddit questions, to measure each system's citation recall and precision.

The core premise of this paper is that verifiability is an essential characteristic of trustworthy generative search systems, where verifiability is defined as the capability of the system to provide citations that comprehensively and accurately support every generated statement. This entails a high citation recall, where all statements are supported by citations, and high citation precision, ensuring that all citations are relevant to their corresponding statements.
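The two metrics can be made concrete with a small sketch. The data structure and field names below are hypothetical (the paper's annotations are collected from human raters, not this schema), but the arithmetic matches the definitions above: recall is the fraction of generated statements that are fully supported, and precision is the fraction of all emitted citations that support their associated statement.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical per-sentence annotation record (illustrative schema,
# not the paper's actual annotation format).
@dataclass
class Sentence:
    citations: List[str] = field(default_factory=list)   # citations attached to the sentence
    supporting: List[str] = field(default_factory=list)  # the subset judged to support it
    fully_supported: bool = False                         # sentence-level judgment

def citation_recall(sentences: List[Sentence]) -> float:
    """Fraction of generated sentences that are fully supported by citations."""
    if not sentences:
        return 0.0
    return sum(s.fully_supported for s in sentences) / len(sentences)

def citation_precision(sentences: List[Sentence]) -> float:
    """Fraction of all emitted citations that support their associated sentence."""
    total = sum(len(s.citations) for s in sentences)
    good = sum(len(s.supporting) for s in sentences)
    return good / total if total else 0.0

# Example: two sentences, three citations total, one of which is supportive.
response = [
    Sentence(citations=["[1]", "[2]"], supporting=["[1]"], fully_supported=True),
    Sentence(citations=["[3]"], supporting=[], fully_supported=False),
]
print(citation_recall(response))     # 0.5
print(citation_precision(response))  # ~0.333
```

Note that the two metrics can diverge sharply: a system that cites rarely but accurately scores high precision with low recall, while one that scatters citations everywhere can do the reverse.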

The researchers conducted their evaluations with 1450 queries per system, measuring several dimensions, including fluency, perceived utility, citation recall, and citation precision. Notably, the findings reveal that while the evaluated generative search engines often produce responses that are fluent and appear informative, they frequently contain unsupported statements and citations, rendering them potentially misleading.

Statistical results indicate that, on average, only 51.5% of generated statements are fully supported by citations, while citation precision averages 74.5%. The gap between perceived trustworthiness (driven by high fluency scores) and the actual accuracy of citations poses an inherent misinformation risk for users who do not verify citations proactively. Moreover, perceived utility and citation precision are inversely correlated: lower precision often accompanies higher perceived utility, possibly because some systems paraphrase and deviate slightly from source content.

The investigation into existing systems highlights differences in citation recall and precision across the four evaluated search engines and explores potential causes for these discrepancies. For instance, Bing Chat shows the highest citation precision but at the expense of perceived utility, likely because of its tendency to closely follow the cited articles. Conversely, YouChat displays the lowest citation precision, but its responses are perceived as more useful, indicating a greater degree of abstraction and creative synthesis.

The implications of this work are significant for both researchers and practitioners developing and deploying generative models. From a theoretical perspective, the paper clarifies the trade-offs involved in designing systems that balance factuality, relevance, and user engagement. In practical terms, these insights underscore the need for citation mechanisms that improve accuracy while preserving perceived utility, rather than trading one for the other.

The analysis exposes the fine line between the effective distribution of correct information and the perpetual risk of misinformation, invoking a call to action for further research. Future developments could focus on nuanced strategies to mitigate the identified shortcomings, such as integrating more robust sourcing techniques or developing metrics that more precisely capture the intricacies of meaningful content generation coupled with reliable references.

In summary, the paper delivers an in-depth assessment of current generative search engines and their handling of verifiable content, offering a foundation for improved design methodologies in generative information retrieval systems. It emphasizes the urgency for the academic and industrial sectors to collaborate on creating solutions that ensure generative models become dependable, trustworthy sources of information in the digital era.

Authors (3)
  1. Nelson F. Liu (19 papers)
  2. Tianyi Zhang (262 papers)
  3. Percy Liang (239 papers)
Citations (207)