- The paper introduces a systematic methodology to assess how chat assistants use web search for credible, grounded responses.
- It analyzes source credibility using metrics such as Credibility Rate (CR) and Non-Credibility Rate (NCR) across multiple assistants.
- Findings reveal that user framing and topic sensitivity significantly affect evidence reliability, with Perplexity demonstrating superior performance.
Assessing Web Search Credibility and Response Groundedness in Chat Assistants
Introduction and Motivation
The integration of web search into LLM chat assistants has introduced new opportunities and risks in information-seeking applications. While retrieval-augmented generation enables assistants to ground responses in up-to-date external evidence, it also exposes them to the risk of amplifying misinformation, especially when retrieved sources are of low credibility. This paper presents a systematic methodology for evaluating the credibility of sources cited by chat assistants and the groundedness of their responses, focusing on high-stakes, misinformation-prone domains such as health, climate change, politics, and geopolitics.
Figure 1: The evaluation methodology: claims are posed from Fact-Checker or Claim Believer perspectives, assistants generate web-augmented responses, and cited sources are analyzed for credibility and groundedness.
Methodology
The evaluation pipeline consists of three main stages:
- Data Collection: 100 claims were curated across five misinformation-prone topics. Each claim was presented to four chat assistants (GPT-4o, GPT-5, Perplexity, Qwen Chat) using prompt templates simulating two user roles: Fact-Checker (verification-seeking) and Claim Believer (confirmation-seeking). This dual framing captures the effect of user presuppositions on retrieval and response behavior.
- Source Credibility Analysis: Cited domains in assistant responses were classified using Media Bias/Fact Check (MBFC) ratings and fact-checking organization lists. Metrics include Credibility Rate (CR) and Non-Credibility Rate (NCR), quantifying reliance on trustworthy versus low-credibility sources (a sketch of these rates follows this list).
- Groundedness Evaluation: Responses were decomposed into atomic factual units, which were then checked for support in the cited sources using a modified VERIFY pipeline. Groundedness was measured both overall and with respect to source credibility, distinguishing between claims supported by credible and non-credible evidence.
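To make the credibility metrics concrete, here is a minimal sketch, assuming each cited domain has already been labeled credible, non-credible, or unrated (e.g., via an MBFC-style lookup). The label names and the choice to count unrated domains in the denominator are assumptions; the paper defines the precise formulas.

```python
from collections import Counter

# Hypothetical labels from an MBFC-style domain lookup; "unrated" covers
# domains without an available rating. Label names are assumptions.
def credibility_rates(citation_labels: list[str]) -> dict[str, float]:
    """Compute Credibility Rate (CR) and Non-Credibility Rate (NCR) as
    fractions of all cited domains in a batch of responses."""
    counts = Counter(citation_labels)
    total = sum(counts.values())
    if total == 0:
        return {"CR": 0.0, "NCR": 0.0}
    return {
        "CR": counts["credible"] / total,
        "NCR": counts["non_credible"] / total,
    }

# Example: 7 credible, 2 non-credible, and 1 unrated citation.
print(credibility_rates(["credible"] * 7 + ["non_credible"] * 2 + ["unrated"]))
# -> {'CR': 0.7, 'NCR': 0.2}
```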
Source Credibility: Comparative Analysis
The assistants exhibit distinct retrieval and citation behaviors.
Topic-level analysis reveals that assistants are most vulnerable to unreliable evidence in domains saturated by disinformation (e.g., Russia-Ukraine War, climate change). User framing also matters: Claim Believer prompts slightly increase NCR, particularly for OpenAI models, indicating that presuppositional queries can bias retrieval toward lower-credibility sources.
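For illustration, hypothetical prompt templates along the lines of the two user roles might look as follows; the exact wording used in the study is not reproduced here.

```python
# Hypothetical prompt templates for the two simulated user roles; the
# study's actual wording is not reproduced here.
FACT_CHECKER_TEMPLATE = (
    "I came across the following claim and would like to verify whether it "
    "is accurate. Claim: {claim}"
)
CLAIM_BELIEVER_TEMPLATE = (
    "I know the following claim is true and would like to learn more about "
    "it. Claim: {claim}"
)

claim = "Example claim text."
print(FACT_CHECKER_TEMPLATE.format(claim=claim))
print(CLAIM_BELIEVER_TEMPLATE.format(claim=claim))
```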
Groundedness: Factual Support and Source Quality
All assistants demonstrate high overall groundedness, with most atomic facts in responses supported by cited evidence. However, credible groundedness, that is, support specifically by high-credibility sources, varies across assistants.
Response-level analysis shows that all assistants can appear well-grounded even when relying on marginal or unverifiable evidence, underscoring the importance of distinguishing between overall and credible grounding.
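A minimal sketch of how overall and credible groundedness could be separated, assuming each atomic fact has already been checked for support (the paper uses a modified VERIFY pipeline for this step) and each supporting source labeled for credibility:

```python
from dataclasses import dataclass

@dataclass
class AtomicFact:
    text: str
    supported: bool              # supported by at least one cited source
    supported_by_credible: bool  # supported by at least one credible source

def groundedness_rates(facts: list[AtomicFact]) -> dict[str, float]:
    """Overall groundedness: share of atomic facts supported by any cited
    source. Credible groundedness: share supported by a credible source."""
    if not facts:
        return {"overall": 0.0, "credible": 0.0}
    n = len(facts)
    return {
        "overall": sum(f.supported for f in facts) / n,
        "credible": sum(f.supported_by_credible for f in facts) / n,
    }

facts = [
    AtomicFact("Fact A", supported=True, supported_by_credible=True),
    AtomicFact("Fact B", supported=True, supported_by_credible=False),
    AtomicFact("Fact C", supported=False, supported_by_credible=False),
]
print(groundedness_rates(facts))  # -> {'overall': 0.67, 'credible': 0.33} (approx.)
```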
Interface and Retrieval Behavior
The paper also documents differences in interface design and citation presentation, which affect the mapping between response segments and sources:
- GPT-4o provides explicit highlighting of cited spans (Figure 4), facilitating fine-grained grounding analysis.
- GPT-5 introduces an automatic "thinking mode" (Figure 5), which, when activated, increases CR and reduces NCR, suggesting that explicit reasoning steps can improve retrieval selectivity.
- Perplexity and Qwen Chat require HTML-based inference to associate citations with response segments (Figures 6–8).
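As a rough illustration of what such HTML-based inference involves, the sketch below pulls numbered citation markers and their target URLs out of a saved response page. The markup it assumes (plain anchor tags with numeric text) is an illustration only; each interface uses its own, frequently changing HTML.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

def extract_citations(response_html: str) -> list[dict]:
    """Associate citation markers in a saved response page with their target
    URLs. The markup assumed here (plain <a> tags with numeric text) is an
    illustration; real interfaces use their own, frequently changing HTML."""
    soup = BeautifulSoup(response_html, "html.parser")
    citations = []
    for anchor in soup.find_all("a", href=True):
        marker = anchor.get_text(strip=True)  # e.g. "1", "2", ...
        if marker.isdigit():
            citations.append({"marker": marker, "url": anchor["href"]})
    return citations

html = '<p>Claim is disputed.<a href="https://example.org/report">1</a></p>'
print(extract_citations(html))  # -> [{'marker': '1', 'url': 'https://example.org/report'}]
```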
Figure 4: Highlighting functionality in the GPT-4o interface.
Figure 5: GPT-5 interface with automatically activated thinking mode.
Figure 6: Web interface of the Perplexity chat, from which responses were collected.
Figure 7: Sources page within the Perplexity interface, listing all retrieved sources.
Figure 8: Qwen Chat interface for collecting responses.
Implications and Theoretical Considerations
Web Search Strategies and Vulnerabilities
The trade-off between retrieval breadth and source selectivity is central. Broad retrieval (as in GPT-4o/5) increases coverage but also risk, while selective retrieval (as in Perplexity) enhances reliability at the potential cost of diversity. The positive effect of GPT-5's thinking mode suggests that explicit reasoning or multi-step retrieval can mitigate some vulnerabilities.
Groundedness vs. Credibility
Grounding alone is insufficient for trust: assistants can produce responses that are well-grounded in the sense of being supported by cited evidence, yet that evidence may be of low credibility. This distinction is critical for user trust and for the design of fact-checking systems.
Topic Sensitivity and User Framing
Contested topics and presuppositional queries (Claim Believer framing) amplify the risk of exposure to misinformation. This highlights the need for assistants to be robust not only to content but also to user intent and query framing.
Limitations
- The evaluation is limited to English-language claims and a specific set of assistants with web search enabled.
- The methodology relies on existing credibility ratings, which may embed regional or cultural biases.
- Automated user simulations cannot capture the full diversity of real-world interactions.
Future Directions
- Extending the methodology to multilingual settings and additional chat assistants.
- Integrating real-time credibility assessment and dynamic source filtering into retrieval-augmented LLMs.
- Systematic study of the interaction between reasoning capabilities (e.g., thinking modes) and retrieval selectivity.
- Large-scale user studies to validate findings in real-world, multi-turn conversational contexts.
Conclusion
This work establishes a rigorous, reproducible methodology for evaluating the credibility and groundedness of web-search-enabled chat assistants. The results demonstrate that while all evaluated systems can ground their responses, credible grounding is less consistent, and assistants may appear reliable while relying on low-credibility sources. Perplexity emerges as the most robust in both source selection and grounding, while GPT-4o and GPT-5 show greater sensitivity to topic and user framing. The findings underscore the need for credibility-aware retrieval and grounding mechanisms in future assistant architectures, especially for deployment in high-stakes information environments.