WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training (2501.18511v2)

Published 30 Jan 2025 in cs.LG and cs.CL

Abstract: LLM post-training, from DPO to distillation, can refine behaviors and unlock new skills, but the open science supporting these post-training techniques is still in its infancy. One limiting factor has been the difficulty of conducting large-scale comparative analyses of synthetic data generating models and LLM judges. To close this gap, we introduce WILDCHAT-50M, the largest public chat dataset to date. We extend the existing WildChat dataset to include responses not only from GPT, but from over 50 different open-weight models, ranging in size from 0.5B to 104B parameters. We conduct an extensive comparative analysis and demonstrate the potential of this dataset by creating RE-WILD, our own public SFT mix, which outperforms the recent Tulu-3 SFT mixture from Allen AI with only 40% as many samples. Our dataset, samples and code are available at https://github.com/penfever/wildchat-50m.

An Analysis of "WildChat-50m: A Deep Dive Into the Role of Synthetic Data in Post-Training"

The paper "WildChat-50m: A Deep Dive Into the Role of Synthetic Data in Post-Training" introduces WildChat-50m, the largest publicly available synthetic chat dataset, generated using 50 open-weight LLMs. It provides a comprehensive examination of the utility of synthetic data in post-training and proposes advancements in the methods and resources available for improving LLMs.

Contribution of the WildChat-50m Dataset

WildChat-50m significantly expands the scope for open research in LLM post-training by offering a dataset generated from a diverse set of models ranging from 0.5 billion to 104 billion parameters. The dataset comprises roughly 125 million chat transcripts, making it a valuable resource for studying how synthetic data shapes model refinement. The data was collected with an efficient serving setup that ran models across a variety of GPUs using inference frameworks such as vLLM, ensuring high throughput and cost-effective use of compute.
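To make the collection setup concrete, the following is a minimal sketch of regenerating chat responses with vLLM's offline inference API. The model name, prompts, and sampling settings are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: generating responses with an open-weight model via vLLM's
# offline API. Model choice, prompts, and sampling settings are illustrative.
from vllm import LLM, SamplingParams

prompts = [
    "Explain the trade-offs between SFT and DPO in one paragraph.",
    "What makes synthetic data useful for post-training?",
]

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # any open-weight DGM
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)

# generate() returns one RequestOutput per prompt; each holds the sampled text.
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```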

Comparative Analysis and Performance

The paper analyzes the runtime efficiency and VRAM usage of the data-generating LLMs, noting large differences in throughput and token-processing capability. For instance, Qwen2.5-72B-Instruct generates tokens far more slowly than the much smaller Llama-2-7B-Chat, an important consideration when budgeting compute for synthetic data generation.
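Such throughput comparisons can be reproduced in spirit with a simple timing probe. The function below is a hedged sketch, not the paper's benchmarking harness, and assumes the same vLLM offline API as above.

```python
# Hypothetical throughput probe: time one generation and report tokens/sec.
import time

from vllm import LLM, SamplingParams

def tokens_per_sec(model_name: str, prompt: str, max_tokens: int = 256) -> float:
    """Rough decode-throughput estimate for a single prompt."""
    llm = LLM(model=model_name)  # model loading is excluded from the timing
    params = SamplingParams(max_tokens=max_tokens)
    start = time.perf_counter()
    out = llm.generate([prompt], params)[0]
    elapsed = time.perf_counter() - start
    return len(out.outputs[0].token_ids) / elapsed

# Example: compare a small and a large model on the same prompt.
# print(tokens_per_sec("Qwen/Qwen2.5-0.5B-Instruct", "Describe DPO briefly."))
```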

Furthermore, the paper challenges assumptions about synthetic data quality (SDQ), finding that response similarity across diverse models is unexpectedly high: LLM-generated responses are more aligned with one another in structure and content than human-generated responses to similar tasks are.
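As a toy illustration of what such a similarity analysis could look like (the paper's actual metric may differ), one can compare responses to the same prompt via pairwise cosine similarity of TF-IDF vectors:

```python
# Illustrative similarity check, not the paper's exact metric: pairwise
# cosine similarity of TF-IDF vectors over responses to the same prompt.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

responses = {
    "model_a": "Gradient descent iteratively updates parameters to reduce loss.",
    "model_b": "Parameters are updated iteratively via gradient descent to lower loss.",
    "human":   "You basically nudge the weights downhill until it stops improving.",
}

names = list(responses)
vectors = TfidfVectorizer().fit_transform(responses.values())
sims = cosine_similarity(vectors)

for i, a in enumerate(names):
    for j, b in enumerate(names):
        if i < j:
            print(f"{a} vs {b}: {sims[i, j]:.2f}")
```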

The Re-Wild SFT Mix

By leveraging WildChat-50m, the researchers devised Re-Wild, a novel supervised fine-tuning (SFT) mix that improves downstream performance relative to established mixtures such as Tulu-3. Despite using only 40% as many samples as Tulu-3, Re-Wild achieved superior results across generalist chat and instruction-following benchmarks. The paper attributes this success to Re-Wild's strategic blend of datasets, each emphasizing a different aspect of LLM capability; a sketch of how such a mixture might be assembled follows.
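Below is a hedged sketch of composing an SFT mixture along these lines with the Hugging Face datasets library. The file names and sampling budgets are placeholders, not the actual Re-Wild recipe.

```python
# Hedged sketch of building an SFT data mix; file names and per-source
# budgets are placeholders, not the published Re-Wild composition.
from datasets import concatenate_datasets, load_dataset

# Hypothetical component datasets; substitute the real Re-Wild sources.
chat = load_dataset("json", data_files="wildchat_responses.jsonl", split="train")
math = load_dataset("json", data_files="math_instructions.jsonl", split="train")

# Subsample each source to a target budget, then combine and reshuffle.
mix = concatenate_datasets([
    chat.shuffle(seed=0).select(range(min(200_000, len(chat)))),
    math.shuffle(seed=0).select(range(min(50_000, len(math)))),
]).shuffle(seed=0)

mix.to_json("re_wild_style_mix.jsonl")
```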

Insights and Implications

The findings suggest that the choice of data-generating model (DGM) profoundly affects post-training outcomes. Variation in SDQ is pronounced and is not well predicted by simple proxies such as model size. This points to an opportunity for researchers to attend more closely to the particular DGMs used to produce synthetic training data.

The paper also finds that models trained on responses from DGMs of the same family, or from similarly structured models, tend to perform better, echoing prior findings on the benefits of on-policy training. In addition, the results suggest that blending multiple DGM sources does not necessarily yield better models, contradicting the common hypothesis that greater data diversity is always beneficial.

Theoretical and Practical Implications

WildChat-50m provides substantial theoretical and practical implications for AI research. Theoretically, it allows for more detailed explorations of SDQ, advancing current understanding of post-training impacts. Practically, it equips smaller labs with necessary datasets and insights into cost-effective training approaches, potentially leveling the field between academic institutions and industrial laboratories.

Future Research Directions

This work opens several avenues for future research, including exploring alternative post-training methods and diversifying benchmark tasks to encompass specific domains. Moreover, the paper underscores a need to investigate the nuanced relationships between model architecture, dataset characteristics, and performance across different types of tasks.

In conclusion, the paper makes a compelling case for the role of synthetic data in advancing LLM post-training while providing valuable datasets and insights to the research community. Through resources like WildChat-50m and developments like Re-Wild, researchers are better positioned to pursue optimized training strategies and novel inquiry in the burgeoning field of artificial intelligence.

Authors (2)
  1. Benjamin Feuer
  2. Chinmay Hegde