WildChat: 1M ChatGPT Interaction Logs in the Wild

Published 2 May 2024 in cs.CL | (2405.01470v1)

Abstract: Chatbots such as GPT-4 and ChatGPT are now serving millions of users. Despite their widespread use, there remains a lack of public datasets showcasing how these tools are used by a population of users in practice. To bridge this gap, we offered free access to ChatGPT for online users in exchange for their affirmative, consensual opt-in to anonymously collect their chat transcripts and request headers. From this, we compiled WildChat, a corpus of 1 million user-ChatGPT conversations, which consists of over 2.5 million interaction turns. We compare WildChat with other popular user-chatbot interaction datasets, and find that our dataset offers the most diverse user prompts, contains the largest number of languages, and presents the richest variety of potentially toxic use-cases for researchers to study. In addition to timestamped chat transcripts, we enrich the dataset with demographic data, including state, country, and hashed IP addresses, alongside request headers. This augmentation allows for more detailed analysis of user behaviors across different geographical regions and temporal dimensions. Finally, because it captures a broad range of use cases, we demonstrate the dataset's potential utility in fine-tuning instruction-following models. WildChat is released at https://wildchat.allen.ai under AI2 ImpACT Licenses.

Abstract PDF Upgrade to Chat

Authors (6)

Citations (99)

View on Semantic Scholar

Summary

The paper introduces WildChat, a large-scale, multilingual dataset of over 1M real-world ChatGPT interactions with extended, multi-turn dialogues.
It details an ethical data collection process using publicly accessible GPT APIs and robust user consent, ensuring diverse demographic and geographic insights.
Initial evaluations indicate that models fine-tuned on WildChat perform robustly across benchmarks, offering valuable insights into AI safety and conversational depth.

Exploring WildChat: A New Dataset for Conversational AI

Introduction to WildChat

WildChat is a large, public dataset containing over 1 million user-chatbot conversations, gathered from actual interactions between online users and chatbots powered by ChatGPT and GPT-4. This dataset stands out due to its real-world, multi-turn conversations that include a rich mix of demographics and linguistic diversity.

Key Features of the Dataset

Scale and Diversity: WildChat is composed of over 2.5 million interaction turns within more than a million conversations, covering 68 different languages. This makes it one of the most extensive and diverse datasets in terms of language and cultural representation.
Enhanced Data for Analysis: Apart from the interaction data, WildChat includes demographic information and request headers that provide insights into geographical and temporal user behavior patterns.
Utility for Research and Model Training: Initial studies show that models fine-tuned on WildChat perform robustly on various benchmarks, indicating its potential utility for improving conversational AI.

Collecting WildChat: A Deep Dive into Methodology

Data for WildChat was collected through a free, publicly accessible service powered by the GPT-3.5 and GPT-4 APIs. A notable aspect of the data collection process was the robust user consent mechanism ensuring participants were fully informed and consented to the data use. This approach not only aligns with ethical standards but also boosts the dataset's credibility and utility.

Unpacking the Dataset Content

One of the pillars of WildChat is its rich content, characterized by:

Multi-turn Interactions: Unlike datasets that focus on single-shot conversations, WildChat includes extended, multi-turn interactions that mirror more natural conversation flows.
Cross-Geographical Insights: With detailed geographic tags (up to state and country level), WildChat allows for nuanced studies of regional differences in how chatbots are used.

WildChat in Comparison with Other Datasets

When placed side by side with other datasets like Alpaca, ShareGPT, and LMSYS-Chat-1M, WildChat shines particularly in:

Linguistic Diversity: It tops the charts not just in the number of languages covered but also in the proportional representation of non-English conversations.
Conversation Length and Depth: Featuring some of the longest average turns among its peers, it's set up to provide deeper insights into conversational context and user engagement.

Spotlights on Toxicity: A Crucial Conversation

A significant revelation from the data was the presence of toxic interactions, with over 10% involving harmful or unsafe content. This facet of the dataset is particularly critical as it provides a valuable resource for developing more resilient AI models that can effectively handle and mitigate negative interactions.

Future Implications and Continuing the Conversation

The release of WildChat under the AI2 ImpACT Licenses promises to catalyze further developments in conversational AI in several ways:

Enhanced Model Training: By providing a real-world training ground, WildChat helps in developing chatbots that are not only effective but also culturally and linguistically inclusive.
Groundwork for Safety Measures: Insights from toxicity analysis could drive advancements in AI safety, crafting protocols that preempt and prevent misuse.
Open Research Opportunities: With its open-access nature, WildChat invites the global research community to explore novel conversational AI dynamics, user behavior analytics, and more.

In conclusion, WildChat doesn't just add to the number of datasets available but enhances the quality and depth of research that can be conducted in the field of conversational AI. Its real-world, diverse, and detailed nature holds the promise of significant advancements in how chatbots understand and interact with users across the globe.

Markdown Report Issue