WILDCHAT-50M: Synthetic Conversational Dataset

Updated 28 December 2025
  • WILDCHAT-50M is a large-scale synthetic conversation dataset designed for advanced LLM post-training, comprising 50M multi-turn exchanges and over 1.2B tokens.
  • The dataset draws on 54 diverse data-generating model (DGM) checkpoints with vLLM-based generation, ensuring rich stylistic, topical, and architectural diversity in responses.
  • It underpins robust supervised fine-tuning and RLHF pipelines, demonstrating empirical performance improvements through systematic benchmarking and scalable data synthesis.

WILDCHAT-50M is a large-scale, publicly accessible synthetic conversational dataset designed to advance research in LLM post-training, synthetic data generation, and benchmarking. Originating as an extension and generalization of earlier WildChat corpora, WILDCHAT-50M uniquely aggregates responses from over 50 open-weight data-generating models (DGMs), ranging from 0.5 billion to 104 billion parameters. It provides a diverse, systematically constructed resource for both supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) pipelines, as well as empirical analysis of model behavior and synthetic data utility (Feuer et al., 30 Jan 2025).

1. Corpus Composition and Scale

WILDCHAT-50M comprises approximately 50 million distinct multi-turn synthetic conversations, totaling around 125 million utterances (user prompts or model responses) and exceeding 1.2 billion tokens across all entries. Each conversation is generated by distributing curated "in-the-wild" user prompts, originally collected in WildChat-1M, across a suite of 54 distinct DGM checkpoints. The models span 19 unique pretrained families and 35 post-trained or instruction-tuned variants, covering Llama-3 (8B, 70B, 3.3-70B), Mistral-7B and its SFT variants, Qwen-2 (7B), Qwen-2.5 (14B, 32B-Coder, 72B), Cohere Command R+ (104B), AI21 Jamba-1.5 Mini, Gemma-2 (9B, 27B), InternLM2.5-20B, and numerous others. Each prompt yields 1–3 response turns per model. This design achieves broad stylistic, topical, and architectural diversity (Feuer et al., 30 Jan 2025).

| Attribute | Value | Notes |
|---|---|---|
| # Conversations | ≈ 50,000,000 | Multi-turn, synthetic |
| # Utterances | ≈ 125,000,000 | User + model turns |
| # DGMs | 54 (19 families, 35 variants) | 0.5B–104B parameters |
| # Total tokens | > 1.2 billion | All models combined |
| Prompt sources | WildChat-1M (user ↔ GPT) | Original, in-the-wild |

2. Data Generation and Methodology

The construction of WILDCHAT-50M proceeds by allocating each WildChat-1M user prompt or prompt history to all 54 DGMs. Each response is generated using vLLM (v0.7+) with top-p sampling (p = 0.9), temperature 0.7, and no beam search, to encourage response diversity. Generation was executed on a cluster of 12 nodes with 8 H100 GPUs each over approximately two months (circa 10,000 H100-GPU hours). Conversation objects are stored as JSONL, with fields for conversation_id, model, speaker, text, token count, and timestamp. Responses are quality-controlled: empty or timed-out answers are discarded, duplicates are filtered, over-length responses are truncated, and profanity or toxicity in the original prompts is flagged but retained for research use (Feuer et al., 30 Jan 2025).

A minimal Python sketch of this generation loop, written against the vLLM offline API (identifiers such as all_checkpoints, wildchat_1m_prompts, turn_budget, and record are illustrative), is:

from vllm import LLM, SamplingParams  # requires vLLM >= 0.7

# Shared sampling settings: temperature 0.7, top-p 0.9, no beam search.
sampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=turn_budget)

for model_name in all_checkpoints:        # load each DGM engine once
    llm = LLM(model=model_name)
    for prompt in wildchat_1m_prompts:
        context = [{"role": "user", "content": prompt}]
        outputs = llm.chat(context, sampling)
        response = outputs[0].outputs[0].text
        record(conversation_id, model_name, context, response)
        # Optionally append the response to context for multi-turn generation
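Iterating over checkpoints in the outer loop amortizes each DGM's one-time engine startup and weight-loading cost across all prompts; in practice, prompts would also be batched per model rather than generated one at a time.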

Prompt types include information-seeking queries, task-oriented instructions, and open-ended conversational setups; roughly 55–65% of conversations are open-ended conversational exchanges, 25–30% are instruction-style, and 10–15% are direct questions.

3. Synthetic Benchmarking and Comparative Analysis

WILDCHAT-50M enables robust benchmarking of synthetic SFT mixes against other large-scale resources. The curated "Re-Wild SFT Mix," constructed from WILDCHAT-50M (246,750 WildChat-Q72 conversations, 99,800 MMLU examples, 20,000 Tulu-3 Algebra examples), was found to outperform the Allen AI Tulu-3 SFT mixture across nine benchmarks, despite using only ~40% as many samples. Evaluation includes both LLM-judged scoring (via Evalchemy: MixEval, AlpacaEval2, MT-Bench, with GPT-4o-mini as judge) and automated ground-truth metrics (BBH, GPQA, MATH, MUSR, IFEval, MMLU).
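A minimal sketch of assembling such a mix in Python, assuming each source is already loaded as a list of chat-format examples (the source names and loading convention are illustrative; the counts match the composition above):

import random

# Target composition of the Re-Wild SFT mix (counts as reported above).
MIX_SPEC = {
    "wildchat_q72": 246_750,   # WildChat prompts answered by Qwen2.5-72B
    "mmlu": 99_800,
    "tulu3_algebra": 20_000,
}

def build_mix(sources, seed=0):
    """Sample the specified number of examples from each source, then shuffle."""
    rng = random.Random(seed)
    mix = []
    for name, n in MIX_SPEC.items():
        mix.extend(rng.sample(sources[name], n))
    rng.shuffle(mix)
    return mix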

Key comparative findings:

  • DGM selection strongly impacts downstream SFT results; "Avg" composite scores for SFT Llama-3.1-8B using six different DGMs (7–104B) ranged from 0.35 to 0.42.
  • Empirical scaling: increasing synthetic data volume from 100K to 500K examples yields roughly linear improvement for non-GPT DGMs, but plateaus earlier for GPT-based models.
  • Blending responses from two DGMs yields performance that interpolates between the two, with no evidence of performance synergy; see the sketch after this list.
  • On-policy SFT (fine-tuning a base model on its outputs or in-family data) provides incremental performance improvements (Feuer et al., 30 Jan 2025).
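A toy illustration of the interpolation finding, using composite scores from the range reported above (the specific numbers and blend ratio are illustrative):

def blended_score(score_a, score_b, alpha):
    """Composite score when a fraction alpha of the SFT data comes from DGM A."""
    return alpha * score_a + (1 - alpha) * score_b

# Blending a 0.42-scoring DGM with a 0.35-scoring DGM at a 50/50 ratio:
print(blended_score(0.42, 0.35, 0.5))  # 0.385, between the two endpoints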

4. File Formats, Licensing, and Reproducibility

WILDCHAT-50M is distributed via github.com/penfever/wildchat-50m under the AI2 ImpACT License. Usage is limited to research contexts; high-risk, military, or commercial use and deanonymization attempts are prohibited. Data are organized as JSONL—one object per turn—containing fields for conversation_id, model, speaker, text, tokens, and timestamp. Prebuilt Parquet exports facilitate efficient ingestion. Supporting resources include data loading scripts, Axolotl-based SFT configuration files (YAML), notebook-based SFT mix generators, and standard evaluation recipes built on Evalchemy and LM Eval Harness (Feuer et al., 30 Jan 2025).
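A short sketch of reading the per-turn JSONL records in Python (the shard file name is an assumption; the field names follow the schema above):

import json

# Each JSONL line is one conversation turn with the schema described above.
with open("wildchat50m_shard_000.jsonl", encoding="utf-8") as f:
    for line in f:
        turn = json.loads(line)
        # Keys: conversation_id, model, speaker, text, tokens, timestamp
        if turn["speaker"] != "user":          # keep model-generated turns
            print(turn["model"], turn["tokens"], turn["text"][:80])

The Parquet exports can be read analogously, e.g. with pandas.read_parquet, for faster bulk ingestion.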

5. Best Practices in SFT and RLHF Integration

The WILDCHAT-50M dataset supports best practices for SFT and RLHF:

  • For SFT, initializing with the Re-Wild mix (366,550 examples, ~5–6 hrs on a 4×H100 node) using AdamW (lr = 2×10⁻⁵, 1 epoch, cosine schedule, bf16, gradient checkpointing, flash attention) yields models competitive with far larger SFT mixtures; a configuration sketch follows this list.
  • Early training epochs should overweight synthetic chat samples from high-SDQ (synthetic data quality) DGMs (e.g., Qwen2.5-72B), with knowledge-specialized sets (MMLU, persona algebra) introduced later.
  • Retaining 10–20% of the SFT budget for on-policy or in-family SFT (e.g., Llama-3.1-8B outputs) stabilizes stylistic alignment.
  • For RLHF, WILDCHAT-50M facilitates preference model warm-start by supporting pairwise model response comparisons and transfer of stylistic properties, with reward models augmented to penalize or encourage specific presentational styles derived from DGM outputs (Feuer et al., 30 Jan 2025).
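A minimal sketch of these SFT settings expressed as Hugging Face TrainingArguments rather than the released Axolotl YAML (the model ID matches the paper's base model; the output path and batch size are assumptions):

from transformers import AutoModelForCausalLM, TrainingArguments

# Flash attention is selected when the base model is loaded.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    attn_implementation="flash_attention_2",
    torch_dtype="bfloat16",
)

args = TrainingArguments(
    output_dir="rewild-sft",            # assumed output path
    learning_rate=2e-5,                 # AdamW is the default optimizer
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    bf16=True,
    gradient_checkpointing=True,
    per_device_train_batch_size=8,      # assumed; tune to fit a 4xH100 node
)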

WILDCHAT-50M is positioned as the largest publicly available synthetic chat corpus, providing a standardized foundation for research into the effects of DGM diversity, data scaling, and synthetic corpus construction on post-training performance. Its design enables large-scale, reproducible comparative analysis across multiple LLM families and fine-tuning pipelines. The resource complements observational datasets such as WildClaims, which focuses on factual claims and implicit information-seeking phenomena extracted from real user–ChatGPT conversations. (In some contexts, "WildChat-50M" also serves as an informal alias for the full real-user message corpus, but WILDCHAT-50M as released refers specifically to the open-weight, multi-model synthetic dataset.) (Joko et al., 22 Sep 2025; Feuer et al., 30 Jan 2025)

A plausible implication is that the systematic scaling and curation strategies illustrated by WILDCHAT-50M may inform best practices for both open and proprietary LLM research, lowering requirements for large SFT pools while maximizing downstream performance diversity and robustness. The resource invites further work on the comparative utility of synthetic versus real-user conversational data, DGM-specific stylistic transfer, and the intersection of privacy-preserving data dissemination with open-science goals.
