WildClaims Dataset Overview
- WildClaims is a large-scale dataset that systematically catalogs over 121,000 factual claims extracted from 3,000 real user–ChatGPT conversations.
- It employs two extraction methods, F_Song and F_Huo, followed by check-worthiness annotation that combines automated classification with manual quality control.
- The dataset underpins research in fact verification, retrieval augmentation, and argument mining by highlighting the prevalence of implicit factual assertions in conversational systems.
WildClaims is a large-scale annotated dataset designed to support the study of implicit factual assertions and check-worthiness in real-world conversational AI interactions. Derived from user–ChatGPT dialogues in the WildChat corpus, WildClaims systematically catalogs factual claims embedded across a diverse range of conversational contexts, each annotated for check-worthiness. The dataset foregrounds the prevalence and characteristics of implicit information access—factual content presented by AI systems even in ostensibly non-informational conversations—and constitutes a crucial resource for advancing fact verification, retrieval augmentation, and argument mining in conversational agents (Joko et al., 22 Sep 2025).
1. Composition and Extraction Methodology
WildClaims is curated from 3,000 real user–ChatGPT conversations filtered to retain only "in the wild" English interactions, excluding domains such as coding or mathematics. The dataset comprises 15,174 utterances, of which 7,587 are system-generated.
Factual claims are extracted exclusively from system utterances. Two automated methods, denoted $F_{\text{Song}}$ and $F_{\text{Huo}}$, are employed to identify explicit factual content:
- $F_{\text{Song}}$ extracts 90,797 claims.
- $F_{\text{Huo}}$ extracts 31,108 claims.
For each system utterance $u_t$, extraction is formalized as $C_t = F(u_t, h_t)$, where $F \in \{F_{\text{Song}}, F_{\text{Huo}}\}$ and $h_t$ represents the multi-turn conversational history. Collectively, this results in a dense set of 121,905 factual claims, reflecting the factual density of modern conversational systems. This dual-method approach allows comprehensive coverage of both explicit and more implicit factual content within conversational turns.
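A minimal sketch of how this dual-method extraction could be organized in code; the extractor callables are hypothetical stand-ins for the two prompting-based methods, whose actual prompts are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Claim:
    text: str           # extracted factual claim
    utterance_id: str   # system utterance the claim came from
    method: str         # which extractor produced it ("song" or "huo")

# Hypothetical stand-ins for the prompting-based extractors F_Song and
# F_Huo; each maps (utterance, history) to a list of claim strings.
Extractor = Callable[[str, List[str]], List[str]]

def extract_claims(
    utterance_id: str,
    utterance: str,
    history: List[str],
    extractors: Dict[str, Extractor],
) -> List[Claim]:
    """C_t = F(u_t, h_t): run each extractor on a system utterance u_t,
    conditioning on the multi-turn conversational history h_t."""
    claims: List[Claim] = []
    for name, extractor in extractors.items():
        for text in extractor(utterance, history):
            claims.append(Claim(text=text, utterance_id=utterance_id, method=name))
    return claims
```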
2. Annotation Pipeline and Quality Control
Extraction is followed by a rigorous annotation for check-worthiness—a measure of whether an identified factual claim merits third-party verification. The annotation procedure is staged as follows:
- Initial extraction is performed with GPT-4.1 prompting techniques, modified to incorporate the conversational context.
- Each claim is then assessed for check-worthiness, defined as requiring external fact-checking.
Manual validation is conducted on a stratified sample: 200 claims (100 per extraction method, each from a unique conversation) are independently annotated by two annotators, with discrepancies resolved by a third. Inter-annotator agreement, quantified via Cohen's kappa, is 0.672 for $F_{\text{Song}}$ and 0.580 for $F_{\text{Huo}}$, indicating moderate to substantial reliability.
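The reported agreement figures follow from a standard Cohen's kappa computation over the two annotators' labels; the label vectors below are illustrative placeholders, not the actual annotations.

```python
from sklearn.metrics import cohen_kappa_score

# Illustrative binary check-worthiness labels (1 = check-worthy) from two
# independent annotators; the real labels would come from the 200-claim
# stratified validation sample described above.
annotator_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
annotator_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.3f}")  # paper reports 0.672 / 0.580
```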
Additionally, automated check-worthiness classification is performed using two reference classifiers, along with their union and intersection. A "check-worthy utterance" is an utterance containing at least one claim with a positive classifier output, formalized as $\exists\, c \in F(u_t, h_t)$ such that $CW(c) = 1$.
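In code, this utterance-level definition reduces to an existential check over the utterance's claim set; the classifiers here are hypothetical callables returning 0/1.

```python
from typing import Callable, Iterable, List

def utterance_is_check_worthy(
    claims: Iterable[str],
    classifiers: List[Callable[[str], int]],
    mode: str = "union",
) -> bool:
    """An utterance is check-worthy iff at least one of its claims gets a
    positive output: exists c in F(u_t, h_t) such that CW(c) = 1."""
    for claim in claims:
        votes = [clf(claim) for clf in classifiers]
        positive = any(votes) if mode == "union" else all(votes)
        if positive:
            return True
    return False
```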
3. Prevalence and Distribution of Check-Worthy Claims
Observational and experimental analyses reveal that implicit factual assertion is ubiquitous in real-world conversational systems. Conservative estimates, which assume at most one factual claim per conversation, yield the following prevalence:
| Extraction Method | % Conversations with Factual Claims | % Check-Worthy (per claim) | % Conversations w/ Check-Worthy Claim |
|---|---|---|---|
| $F_{\text{Song}}$ | 45.1% | 40% | 18% |
| $F_{\text{Huo}}$ | 79.0% | 64% | 51% |
| Union (automatic classifiers) | N/A | N/A | 76% |
A more liberal estimate, using the union of the two automatic check-worthiness classifiers, raises this figure to 76%. This indicates that the majority of user–ChatGPT conversations, regardless of ostensible topic or user intent, contain at least one factual assertion that is check-worthy by established criteria.
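One consistent reading of the table, under the one-claim-per-conversation assumption stated above, is that the last column is roughly the product of the first two; a quick arithmetic check with the table's figures:

```python
# Under the conservative one-claim-per-conversation reading, the share of
# conversations with a check-worthy claim is approximately
# (% conversations with claims) x (% of claims that are check-worthy).
for method, has_claims, check_worthy in [
    ("F_Song", 0.451, 0.40),
    ("F_Huo",  0.790, 0.64),
]:
    print(f"{method}: {has_claims * check_worthy:.0%}")  # ~18% and ~51%
```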
4. Conceptualization and Cross-Domain Claim Detection
WildClaims intersects with foundational research on claim conceptualization and detection (Daxenberger et al., 2017). The nature of "claims" differs markedly across source corpora—ranging from emotional utterances and subjective opinions in online comments to structured arguments in persuasive essays. For WildClaims, which is rooted in "wild," heterogeneous online discourse, implications include:
- The divergence in claim conceptualization can cause transfer performance degradation; thus, systems require robustness to definitional shifts or broad-spectrum training data.
- Cross-domain experiments demonstrate that models exploiting lexical and syntactic features—especially leveraging strong lexical cues such as modal verbs ("should")—generalize better to heterogeneous domains akin to WildClaims; a toy illustration of such a cue follows this list.
- Training on noisy, diverse datasets improves cross-domain portability and the detection of latent claim signals.
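The sketch below illustrates the kind of modal-verb lexical cue such models exploit; the cue list is illustrative only, not the feature set used in the cited work.

```python
import re

# Modal verbs such as "should" are strong lexical cues for claims in
# cross-domain claim detection; this cue list is illustrative only.
MODAL_CUES = re.compile(r"\b(should|must|ought to|needs? to)\b", re.IGNORECASE)

def has_modal_cue(sentence: str) -> bool:
    """Flag sentences carrying modal-verb cues often associated with claims."""
    return bool(MODAL_CUES.search(sentence))

print(has_modal_cue("Governments should regulate AI development."))  # True
print(has_modal_cue("The meeting starts at noon."))                  # False
```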
A regression analysis formalizes the relationship between lexical similarity, claim frequency, and claim/non-claim imbalance and their influence on detection macro-F$_1$:

$$F_1(s, t) = \beta_0 + \beta_1 \, \text{sim}(s, t) + \beta_2 \, n_c(s) + \beta_3 \, r(t)$$

where $F_1(s, t)$ is performance when training on source $s$ and testing on target $t$, $\text{sim}(s, t)$ is lexical similarity, $n_c(s)$ the number of claims in the source, and $r(t)$ the claim/non-claim ratio in the target.
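A sketch of fitting such a regression with ordinary least squares; the transfer pairs and feature values below are synthetic placeholders, not figures from the cited study.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row is one source->target transfer pair:
# [lexical similarity sim(s, t), source claim count n_c(s), target ratio r(t)]
X = np.array([
    [0.82, 1200, 0.45],
    [0.61,  900, 0.30],
    [0.74, 2100, 0.52],
    [0.55,  600, 0.25],
    [0.68, 1500, 0.40],
])
y = np.array([0.68, 0.51, 0.63, 0.44, 0.58])  # observed macro-F1 per pair

reg = LinearRegression().fit(X, y)
print("coefficients:", reg.coef_, "intercept:", reg.intercept_)
```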
5. Implications for Conversational Information Access
WildClaims empirically demonstrates a paradigm shift in human–AI information access. Rather than explicit queries dominating, factual transfer now occurs predominantly via system-generated implicit claims embedded within broader, often creative or editorial, interactions. The resource thus reveals:
- Implicit information transfer is widespread: up to 76% of real-world conversations entail check-worthy assertions.
- Modern conversational agents routinely emit factual content even absent clear user intent to seek information.
- This prevalence challenges traditional information retrieval and dialogue system models, necessitating reconsideration of verification, retrieval augmentation, and user interface design.
6. Research Applications and Methodological Insights
WildClaims offers operational value for multiple research directions:
- Benchmarking and improving algorithms for automatic factual claim detection and classification of check-worthiness.
- Advancing retrieval-augmented generation through integration of automated verification pipelines, directly addressing the risk of inadvertent misinformation spread outside explicit Q&A scenarios.
- Developing user simulators encoding a spectrum of implicit and explicit user intents, thereby broadening the applicability and fidelity of conversational system evaluation.
In light of findings from cross-domain claim identification research (Daxenberger et al., 2017), successful systems for WildClaims will likely integrate ensembles of lexical/syntactic feature-rich models and deep neural architectures, supported by transfer and multi-task learning frameworks attuned to definitional variance and the heterogeneous nature of conversational data.
7. Future Directions
WildClaims highlights several routes for future research:
- Establishing a unified conceptualization of "claim" across datasets to mitigate cross-domain gaps and improve transfer performance.
- Further developing multi-task learning approaches that encourage domain-agnostic representations while adapting to corpus-specific expressions.
- Integrating both rule-based and semantically-rich techniques to capture overt and subtle forms of factual assertion, improving coverage and reliability.
- Designing conversational agents equipped for dual roles: explicit answer provision and real-time verification or flagging of implicit factual assertions, especially as conversational interfaces become principal vectors for information dissemination.
WildClaims thus stands as a pivotal resource for the examination of implicit factual assertion, check-worthiness detection, and the future of reliable knowledge transfer through large-scale conversational AI systems (Joko et al., 22 Sep 2025).