Scaling Synthetic Data Creation with 1,000,000,000 Personas
The paper "Scaling Synthetic Data Creation with 1,000,000,000 Personas" introduces a comprehensive and scalable methodology for persona-driven data synthesis, presenting Persona Hub, a collection of one billion diverse personas automatically curated from web data. The authors propose that by leveraging these personas, LLMs can be guided to generate diverse and high-quality synthetic data at an unprecedented scale.
Methodology
The persona-driven data synthesis methodology works by embedding personas into data synthesis prompts, steering the LLM toward each persona's perspective and knowledge so that the synthesized data spans a wide array of viewpoints and domains. The approach supports both zero-shot and few-shot prompting, remaining flexible and effective across varied data synthesis scenarios.
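The persona-embedded prompting described above can be sketched in a few lines. This is a hedged illustration: `build_prompt` and its template wording are hypothetical stand-ins, not the paper's actual prompt templates.

```python
# Sketch: embedding a persona into a data-synthesis prompt.
# The template wording and persona strings are illustrative only.

def build_prompt(persona: str, task: str, examples=None) -> str:
    """Compose a persona-driven synthesis prompt.

    Zero-shot: no examples are given, and the LLM creates data from the
    persona and task alone. Few-shot: demonstration pairs are prepended
    so the output imitates their style while varying with the persona.
    """
    lines = []
    if examples:  # few-shot variant
        for prompt_ex, response_ex in examples:
            lines.append(f"Example input: {prompt_ex}\nExample output: {response_ex}\n")
    lines.append(f"{task} with the following persona: {persona}")
    return "\n".join(lines)

zero_shot = build_prompt(
    persona="a nurse who tracks medication dosages for pediatric patients",
    task="Create a challenging math word problem",
)
```

Because only the persona string varies between calls, the same template can be reused across millions of prompts while still producing diverse outputs.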
Persona Hub, the core of this methodology, is constructed through two complementary approaches: Text-to-Persona and Persona-to-Persona. Text-to-Persona transforms massive web text into persona descriptions by asking the LLM who would be likely to read, write, or engage with a given text, capturing a broad spectrum of personas. Persona-to-Persona then enriches the collection by expanding each persona through its interpersonal relationships, which derives long-tail personas that may not be directly represented in web data. Extensive deduplication ensures that Persona Hub retains only diverse, unique entries.
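A minimal sketch of the two derivation steps plus a deduplication pass, assuming a hypothetical `llm()` completion function. Note that the surface-form dedup below is a deliberate simplification; the paper applies MinHash- and embedding-based deduplication at billion scale.

```python
# Sketch of Text-to-Persona, Persona-to-Persona, and a toy dedup pass.
# llm() is a hypothetical completion function; prompts are illustrative.

def text_to_persona(llm, text: str) -> str:
    # Ask who would read, write, or be interested in the given web text.
    return llm("Who is likely to read, write, or be interested in the "
               f"following text? Describe that person in one sentence.\n\n{text}")

def persona_to_persona(llm, persona: str, n: int = 3) -> list:
    # Expand via interpersonal relationships to reach long-tail personas.
    out = llm(f"Who is in a close relationship with this persona: {persona}? "
              f"List {n} related personas, one per line.")
    return [line.strip("- ").strip() for line in out.splitlines() if line.strip()]

def dedup(personas: list) -> list:
    # Crude surface-form dedup for illustration; the real pipeline uses
    # MinHash over n-grams and embedding similarity filtering.
    seen, unique = set(), []
    for p in personas:
        key = " ".join(sorted(p.lower().split()))
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique
```

Chaining these two steps (derive personas from text, then expand each by relationships) is what lets the collection grow past what web data directly mentions.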
Use Cases
The paper showcases the versatility and scalability of Persona Hub through various use cases:
- Mathematical and Logical Reasoning Problems: LLMs prompted with personas generated diverse mathematical problems. Fine-tuning Qwen2-7B with 1.07 million synthesized math problems led to an impressive 64.9% accuracy on the MATH benchmark, rivaling models like gpt-4-turbo-preview at just a 7B scale.
- Instructions (User Prompts): By simulating diverse user requests, the methodology generated rich instruction datasets, contributing to the enhancement of LLM instruction-following capabilities.
- Knowledge-rich Texts: Employing personas to guide the generation of detailed plain texts ensures the coverage of a vast array of knowledge domains, instrumental for both pre-training and post-training phases of LLMs.
- Game NPCs: The methodology excels at creating intricate and contextually relevant NPCs for games, reducing manual character design workload.
- Tool (Function) Development: Anticipating user needs by synthesizing tools for diverse personas can significantly extend LLMs' functionalities, offering a dynamic way to address user queries.
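Across all of these use cases the recipe is the same: only the prompt template changes while the persona stream provides diversity. A hedged sketch of that driver loop, again with a hypothetical `llm()` callable and an illustrative template:

```python
# Sketch: one use case (math problems) driven across many personas.
# llm() and the template are hypothetical; each persona yields a
# different problem, which is how diversity scales with persona count.

MATH_TEMPLATE = "Create a challenging math problem with the following persona: {persona}"

def synthesize(llm, personas, template=MATH_TEMPLATE):
    """Yield one synthesized record per persona."""
    for persona in personas:
        yield {"persona": persona,
               "problem": llm(template.format(persona=persona))}
```

Swapping `MATH_TEMPLATE` for an instruction, NPC, or tool-specification template would cover the other use cases listed above.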
Evaluation and Results
The effectiveness of persona-driven synthetic data creation is evidenced by strong numerical results: the model fine-tuned on 1.07 million synthesized math problems reached 64.9% accuracy on MATH, indicating the high quality of the generated data. Evaluation on both in-distribution and out-of-distribution test sets further demonstrated the robustness and versatility of the approach.
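The benchmark accuracy quoted above amounts to exact-match scoring of final answers against references. A simplified sketch of that scoring, where `extract_final_answer` is an assumed heuristic rather than the paper's exact answer parser:

```python
# Sketch: exact-match accuracy over MATH-style solutions.
# extract_final_answer() is a simplified stand-in for real answer parsing.

import re

def extract_final_answer(solution: str) -> str:
    # Prefer the last \boxed{...}; otherwise fall back to the last number.
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    if boxed:
        return boxed[-1].strip()
    numbers = re.findall(r"-?\d+(?:\.\d+)?", solution)
    return numbers[-1] if numbers else ""

def accuracy(predictions, references) -> float:
    correct = sum(extract_final_answer(p) == extract_final_answer(r)
                  for p, r in zip(predictions, references))
    return correct / len(references)
```

Real MATH evaluation also normalizes equivalent answer forms (fractions, units, LaTeX variants), which this sketch deliberately omits.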
Implications and Future Directions
Practical Implications:
- Data Creation Paradigm Shift: The persona-driven approach suggests a potential shift from human-centric data creation to LLM-driven synthesis, leveraging the expansive persona hub to generate diverse, high-quality datasets.
- Training Data Security Risks: The methodology also raises security concerns, since prompting a target LLM with the full persona collection could extract its knowledge at scale, effectively leaking training data and threatening the advantage of state-of-the-art proprietary LLMs.
- Simulation of Real-world Interactions: Persona Hub enables the simulation of a diverse array of real-world interactions, offering unprecedented insights into user behavior, policy impacts, and complex system dynamics in virtual spaces.
Theoretical Implications:
- Distributed Carrier-based Compression: Treating personas as distributed carriers of an LLM's knowledge provides a new lens for understanding and working with LLMs, potentially enabling comprehensive extraction of a model's parametric memory through exhaustive synthetic data generation.
- Scalability in Multimodal Contexts: While the paper primarily focuses on text-based data synthesis, the proposed methodology holds potential for extending to multimodal LLMs, paving the way for advanced synthetic data generation in visual, audio, and interactive domains.
Future Work:
The authors plan to refine Persona Hub by enriching persona descriptions to a granular level of detail akin to individual Wikipedia articles. They also anticipate extending the methodology to multimodal LLMs for synthetic data creation across additional modalities, and investigating "super personas" as a way of probing the potential superintelligence of LLMs.
Conclusion
This paper presents a detailed and systematic approach to scaling synthetic data creation using a massive collection of personas. It highlights significant advancements in LLM capabilities and points toward a future where LLMs not only process but also create high-quality data autonomously. The research demonstrates strong empirical results and opens up new avenues for leveraging LLMs in diverse application fields, emphasizing the broad practical and theoretical implications of the persona-driven data synthesis methodology.