WildChat Dataset Overview
- WildChat is a multilingual conversational dataset of over 1 million anonymized user–chatbot conversations, enriched with demographic and geographic metadata.
- It employs strict user consent protocols and robust anonymization techniques to ensure privacy-preserving, ethical data collection.
- The dataset supports various research applications including instruction tuning, toxicity mitigation, and behavioral analysis in advanced conversational agents.
WildChat is a large-scale, multilingual, and demographically enriched corpus of user–chatbot interaction logs, designed to address the previously unmet need for public datasets that reveal how real users engage with advanced conversational agents such as ChatGPT and GPT-4 in authentic settings. Compiled through an affirmative opt-in process that ensured explicit user consent and robust anonymization, WildChat comprises over one million user–ChatGPT conversations (encompassing more than 2.5 million interaction turns) and is released under the AI2 ImpACT License to promote responsible, privacy-preserving research.
1. Dataset Composition and Structure
WildChat consists of 1,039,785 full conversations, each a multi-turn dialogue between an anonymous user and a ChatGPT model (GPT-3.5-Turbo or GPT-4). In total the dataset records over 2.5 million interaction turns, a mean of roughly 2.5 turns per conversation. The turn distribution is long-tailed: approximately 3.7% of conversations exceed 10 turns.
The dataset is distinctly multilingual: turn-level language classification using lingua-py identifies 68 languages. The distribution is heavily skewed toward English (53% of turns), with Chinese (13%) and Russian (12%) as the next most prevalent. Each chat entry contains the full textual record of both the user prompt—which may comprise substantial conversation context—and the chatbot response.
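As a concrete illustration of this turn-level labeling, the sketch below runs a lingua-py detector over a few hypothetical turns and aggregates the labels into a language distribution. The detector configuration and any thresholds used by the WildChat authors are not specified here; this is only an assumption-laden sketch.

```python
from collections import Counter

from lingua import LanguageDetectorBuilder  # pip install lingua-language-detector

# Build a detector over all languages supported by lingua-py.
detector = LanguageDetectorBuilder.from_all_languages().build()

# Hypothetical user turns; in WildChat each turn's text would be classified the same way.
turns = [
    "How do I reverse a list in Python?",
    "Напиши короткое стихотворение о весне.",
    "请帮我写一封求职信。",
]

labels = []
for text in turns:
    lang = detector.detect_language_of(text)  # returns a Language enum member or None
    labels.append(lang.name if lang else "UNKNOWN")

# Aggregate turn-level labels into a language distribution, as in the paper's statistics.
print(Counter(labels))  # e.g. Counter({'ENGLISH': 1, 'RUSSIAN': 1, 'CHINESE': 1})
```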
A representative summary of key dataset statistics (rendered in LaTeX, following the original paper's summary table) is:
$$
\begin{array}{lcccccc}
\textbf{Dataset} & \#\,\text{Convs} & \#\,\text{Users} & \text{Avg. Turns} & \text{User Tok (mean}\pm\text{std)} & \text{Chatbot Tok (mean}\pm\text{std)} & \#\,\text{Langs} \\
\hline
\text{WildChat} & 1{,}039{,}785 & 204{,}736 & 2.54 & 295.58 \pm 1609.18 & 441.34 \pm 410.91 & 68
\end{array}
$$
Turn distribution, token count statistics, and user prompt length all exhibit heavy tails, with some conversations featuring exceptionally long inputs or complex interaction chains.
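To make the token statistics concrete, the following sketch counts tokens per user prompt with tiktoken's cl100k_base encoding (the tokenizer used by GPT-3.5-Turbo and GPT-4) and summarizes them as mean ± standard deviation; the paper's exact counting procedure is not assumed.

```python
import statistics

import tiktoken  # pip install tiktoken

# cl100k_base is the tokenizer used by GPT-3.5-Turbo and GPT-4.
enc = tiktoken.get_encoding("cl100k_base")

# Hypothetical user prompts; in practice these would be the user turns of each conversation.
user_prompts = [
    "Summarize the plot of Hamlet in two sentences.",
    "Write a SQL query that returns the top 10 customers by revenue.",
]

counts = [len(enc.encode(p)) for p in user_prompts]

# Mean and (population) standard deviation, analogous to the 295.58 ± 1609.18 figure above.
print(f"user tokens: {statistics.mean(counts):.2f} ± {statistics.pstdev(counts):.2f}")
```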
2. Data Collection Methodology and Ethics
WildChat was assembled by hosting two publicly accessible chatbot deployments via Hugging Face Spaces, linked to the GPT-3.5-Turbo and GPT-4 APIs. Collection started on April 9, 2023 and continued through May 1, 2024, with an explicit intention for ongoing updates.
Data acquisition adhered to an explicit, multi-step user consent protocol: users were fully informed about data capture, subsequent use, and public sharing before participating, and no user accounts were required. IP addresses were cryptographically hashed before release, and location is retained only at the coarse country and state level, so no raw sensitive metadata reaches the community.
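The exact anonymization recipe is not reproduced in this overview; the following is a minimal sketch of one plausible approach, keyed (HMAC-SHA256) hashing of IP addresses, which yields stable per-user pseudonyms without exposing raw addresses.

```python
import hashlib
import hmac
import os

# Secret key held only by the dataset curators; never released with the data.
SECRET_KEY = os.environ.get("IP_HASH_KEY", "replace-with-a-long-random-secret").encode()

def hash_ip(ip_address: str) -> str:
    """Map an IP address to a stable pseudonymous identifier via HMAC-SHA256."""
    return hmac.new(SECRET_KEY, ip_address.encode(), hashlib.sha256).hexdigest()

# The same IP always maps to the same hash, so per-user statistics remain possible
# while the raw address never appears in the released records.
print(hash_ip("203.0.113.42"))
```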
Each conversation record includes:
- Full user prompt and chatbot response text
- Timestamp metadata
- Request headers (such as browser versions and accepted languages)
- Geographical metadata (country, state, hashed IP)
This design enables granular longitudinal and cross-sectional analysis while preserving user anonymity.
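For illustration, the sketch below loads released records and inspects the fields listed above. The Hugging Face dataset identifier ("allenai/WildChat-1M") and the column names are assumptions; the dataset card should be treated as authoritative.

```python
from datasets import load_dataset  # pip install datasets

# Dataset identifier is assumed; streaming avoids downloading the full corpus at once.
ds = load_dataset("allenai/WildChat-1M", split="train", streaming=True)

for record in ds.take(1):
    # Field names below are illustrative of the metadata described above.
    print(record.get("timestamp"))
    print(record.get("country"), record.get("state"), record.get("hashed_ip"))
    # The conversation itself is assumed to be a list of {"role": ..., "content": ...} turns.
    for turn in record.get("conversation", []):
        print(turn["role"], ":", turn["content"][:80])
```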
3. Comparative Analysis with Existing Datasets
WildChat has been evaluated against notable contemporaneous datasets, including Alpaca, Open Assistant, Dolly, ShareGPT, and LMSYS-Chat-1M. Distinguishing characteristics emerge across key dimensions:
| Dataset | Conversations | Avg. Turns | Languages | Special Features |
|---|---|---|---|---|
| WildChat | 1,039,785 | 2.54 | 68 | Demographic metadata, toxicity annotations |
| LMSYS-Chat-1M | 1,000,000 | 2.17 | ≈5 | Comparison across open LLMs |
| ShareGPT | 90,000 | 2.0+ | 1 | Freeform user–chatbot logs |
WildChat surpasses its peers in:
- Total number of interactions and multilingual coverage (68 languages; most others have only 1–5).
- Length and diversity of user prompts and chatbot responses (average user token count is 295.58, with a long-tailed distribution).
- Breadth of user intent and conversational scenarios, including a higher variety and prevalence of potentially toxic cases (over 10% of user turns flagged as toxic by automated detectors).
This diversity and scale position it as a superior resource for studying the real-world deployment of advanced conversational agents.
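The toxicity figures cited above come from automated detectors. The sketch below flags user turns with Detoxify, an open-source toxicity classifier used here purely for illustration; the specific detectors and thresholds behind WildChat's toxicity annotations are not assumed.

```python
from detoxify import Detoxify  # pip install detoxify

# Load the multilingual Detoxify model so that non-English turns can also be scored.
model = Detoxify("multilingual")

user_turns = [
    "Can you explain how photosynthesis works?",
    "Write an insulting message about my coworker.",
]

# predict() returns per-category scores in [0, 1] for each input string.
scores = model.predict(user_turns)

# Flag a turn as toxic when its overall toxicity score exceeds an illustrative threshold.
THRESHOLD = 0.5
flags = [s > THRESHOLD for s in scores["toxicity"]]
print(list(zip(user_turns, flags)))
```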
4. Demographic, Geographic, and Temporal Attributes
WildChat includes richly annotated demographic and geographic metadata. Approximately 204,736 unique users are identified via hashed IP addresses, with per-conversation country and state labels. A significant fraction of users originate from the United States (21.60%), Russia (15.55%), and China (10.02%).
The presence of this metadata enables research into:
- Geographic heterogeneity in chatbot use and behavior
- Temporal analysis of interaction patterns across time zones and regions
- Sociolinguistic phenomena, including differences in user intent and content between communities
All demographic attributes are anonymized, supporting responsible investigation into user dynamics without compromising privacy.
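As an illustration of the geographic analyses listed above, the following sketch aggregates conversations per country and approximates unique users via hashed IPs; the record fields are assumed, as before.

```python
from collections import Counter

# Hypothetical metadata rows; in practice these would come from the released records.
records = [
    {"country": "United States", "state": "California", "hashed_ip": "a1...", "turn_count": 3},
    {"country": "Russia", "state": None, "hashed_ip": "b2...", "turn_count": 1},
    {"country": "United States", "state": "Texas", "hashed_ip": "c3...", "turn_count": 5},
]

# Conversations per country, the kind of count behind the 21.60% / 15.55% / 10.02% figures.
by_country = Counter(r["country"] for r in records)
total = sum(by_country.values())
for country, n in by_country.most_common():
    print(f"{country}: {n} conversations ({100 * n / total:.1f}%)")

# Unique users are approximated by distinct hashed IP addresses.
unique_users = len({r["hashed_ip"] for r in records})
print("unique users:", unique_users)
```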
5. Applications, Benchmarks, and Model Training
WildChat's content and metadata underpin multiple research streams:
- Instruction tuning: The dataset is demonstrated as a training resource by fine-tuning a Llama-2 7B model ("WildLlama"), which achieves competitive results on the MT-Bench evaluation; a minimal data-preparation sketch appears at the end of this section.
- Toxicity research and mitigation: The breadth of potentially problematic user inputs provides a robust testbed for studying and benchmarking toxicity detection and control techniques.
- Behavioral analysis and alignment: WildChat enables the study of longitudinal changes, regional differences in chatbot use, and emergent sociolinguistic patterns in multi-turn, multilingual conversations.
The diversity of user intent and scenario types in WildChat supports generalization for both model training and evaluation use cases.
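Relating to the instruction-tuning bullet above, the sketch below converts a WildChat-style conversation into supervised fine-tuning pairs. The role/content format is assumed, and the actual WildLlama training recipe, prompt template, and hyperparameters are not reproduced.

```python
from typing import Dict, List

def to_sft_examples(conversation: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Split a multi-turn conversation into (prompt, response) pairs for supervised fine-tuning.

    Each assistant turn becomes a training target, conditioned on all preceding turns.
    """
    examples = []
    history: List[str] = []
    for turn in conversation:
        if turn["role"] == "assistant":
            examples.append({
                "prompt": "\n".join(history),
                "response": turn["content"],
            })
        history.append(f'{turn["role"]}: {turn["content"]}')
    return examples

# Toy conversation in the role/content format assumed above.
conv = [
    {"role": "user", "content": "Give me three names for a coffee shop."},
    {"role": "assistant", "content": "Bean There, Daily Grind, Brew Haven."},
    {"role": "user", "content": "Make them more whimsical."},
    {"role": "assistant", "content": "Moonbean Cafe, The Dancing Kettle, Cloud Nine Coffee."},
]

for ex in to_sft_examples(conv):
    print(ex["prompt"], "->", ex["response"][:40])
```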
6. Access, Licensing, and Responsible Use
WildChat is available at https://wildchat.allen.ai. Distribution is governed by the AI2 ImpACT License (https://allenai.org/impact-license), which places explicit emphasis on privacy-respecting and ethical research practices. The license is designed to facilitate academic and scientific investigation while ensuring that the treatment of user data conforms to best practices for anonymity, transparency, and compliance.
7. Subsequent Development and Influence
WildChat has served as the basis for subsequent large-scale datasets and empirical studies. Notably, the WILDCHAT-50M dataset (Feuer et al., 30 Jan 2025) extends WildChat to include synthetic conversations derived from over 50 open-weight models, providing a public resource for comparative analysis across instruction-tuned models and fostering improved methodology in synthetic data curation. Likewise, applied studies such as the use of WildChat in statutory ambiguity research (He et al., 1 Sep 2025) and in analyses of developer–LLM interaction quality (Zhong et al., 12 Sep 2025) illustrate its continued importance as a benchmark corpus for the AI research community.