Evaluating Synthetic Dataset Generation with LLMs
High-quality conversational datasets are foundational for training AI systems that understand and engage with users on a personal level. This is where personas come in: abstract profiles that encapsulate user characteristics such as preferences and background stories. Personas are crucial for building conversational models that foster deeper connections with users and sustain engagement.
Creating Persona-based Conversations
This paper introduces a novel approach to generating persona-based conversations with LLMs. The authors leverage LLMs' strong generation capabilities to expand an initial seed dataset, creating diverse user personas that are then paired to participate in conversations. At the core of their Generator-Critic framework is an iterative process: the Generator produces conversation samples, and the Critic, realized as a mixture of expert evaluators, judges sample quality against set policies such as coherence, faithfulness to the persona, and non-toxicity.
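The generate-then-select loop can be sketched in a few lines. The sketch below is a minimal illustration, not the paper's implementation: `generate_conversation` and `critic_score` are hypothetical stubs standing in for LLM calls, and the loop simply keeps the candidate the Critic scores highest.

```python
import random

# Hypothetical stand-ins for the paper's components; real implementations
# would prompt an LLM at each step.
def generate_conversation(persona_pair, seed=None):
    """Stub Generator: returns one candidate conversation for a persona pair."""
    rng = random.Random(seed)
    return {
        "personas": persona_pair,
        "turns": [f"utterance-{i}" for i in range(4)],  # placeholder turns
        "quality": rng.random(),  # placeholder for a real quality signal
    }

def critic_score(sample):
    """Stub Critic: in the paper this is a mixture of expert evaluators
    (coherence, persona faithfulness, non-toxicity); collapsed here into
    a single placeholder number."""
    return sample["quality"]

def generator_critic_loop(persona_pair, rounds=5):
    """Generate several candidates and keep the one the Critic scores highest."""
    best = None
    for r in range(rounds):
        candidate = generate_conversation(persona_pair, seed=r)
        if best is None or critic_score(candidate) > critic_score(best):
            best = candidate
    return best

best = generator_critic_loop(("loves hiking", "plays the violin"))
```

In a real pipeline the Critic's feedback would also steer regeneration rather than only filtering, but the select-best loop captures the basic iterative structure.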
Maintaining Quality and Diversity
The framework's user generation step not only augments the seed personas but also enforces consistency and non-redundancy across user profiles. Especially significant is the persona expansion module, which uses query induction and bootstrapping to build an extensive set of queries that drives the generation of detailed, specific personas. The user pairing step then matches users for conversations based on profile similarity, and finally conversations are generated and refined iteratively against the same quality criteria.
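To make the pairing step concrete, here is a toy sketch that greedily pairs user profiles by attribute overlap. The Jaccard threshold and the greedy matching are assumptions for illustration; the paper's actual pairing criterion may differ.

```python
def jaccard(a, b):
    """Overlap between two sets of persona attributes (0.0 to 1.0)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def pair_users(profiles, threshold=0.2):
    """Greedily pair profiles whose attribute overlap meets a threshold.
    A stand-in for the framework's user pairing step."""
    pairs, used = [], set()
    for i, p in enumerate(profiles):
        if i in used:
            continue
        for j in range(i + 1, len(profiles)):
            if j not in used and jaccard(p, profiles[j]) >= threshold:
                pairs.append((i, j))
                used.update({i, j})
                break
    return pairs

profiles = [
    {"hiking", "dogs", "coffee"},
    {"hiking", "coffee", "jazz"},
    {"chess", "baking"},
    {"baking", "gardening"},
]
print(pair_users(profiles))  # → [(0, 1), (2, 3)]
```

Pairing users with some shared ground gives the Generator a natural hook for the conversation, while the threshold keeps pairings from being near-duplicates.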
Synthetic-Persona-Chat: The Outcome
The result of this framework is Synthetic-Persona-Chat (SPC), a dataset of rich, faithful persona-based conversations suitable for training personalized conversational models. Moreover, the generation method is dynamic: it can incorporate new topics and user traits and update personas automatically with minimal human intervention, distinguishing it from earlier datasets that were often static and quickly outdated.
Confirmation Through Evaluation
Evaluations along several dimensions confirm the quality of the new dataset. Comparative studies with existing datasets such as Persona-Chat show that SPC's conversations are more coherent, richer in persona-specific content, and less toxic. Models trained on SPC achieve improved predictive performance, suggesting the dataset better represents complex user interactions. In human evaluations, SPC's conversations closely mirror natural human dialogue, and persona attributes align consistently with the utterances, demonstrating the dataset's faithfulness.
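As one concrete flavor of the faithfulness check, consider measuring how many persona attributes surface in a conversation's utterances. The keyword-matching version below is a toy illustration only (the paper relies on model-based evaluation, and the data here is invented):

```python
def faithfulness(conversation, persona_attributes):
    """Toy faithfulness score: fraction of persona attributes mentioned
    verbatim in at least one utterance. Keyword matching is a crude proxy
    for the model-based checks used in real evaluations."""
    text = " ".join(conversation).lower()
    hits = sum(1 for attr in persona_attributes if attr.lower() in text)
    return hits / len(persona_attributes) if persona_attributes else 0.0

convo = [
    "I went hiking with my dog yesterday.",
    "Nice! I mostly stay in and play chess.",
]
score = faithfulness(convo, ["hiking", "chess", "baking"])
print(score)  # → 2 of 3 attributes appear, so 0.666...
```

Exact string matching misses paraphrases ("I love the outdoors" entails "hiking"), which is precisely why stronger evaluations use a model to judge entailment between persona and utterance.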
The Potential and Limitations
The proposed framework marks a considerable advance in unsupervised persona-based dataset creation, and it could plausibly be adapted to generate other specialized datasets, advancing AI training in various niches. However, the framework's success leans heavily on the quality of the underlying LLM, a boundary that future work could aim to push. The generation process also presumes ideal conversation circumstances and does not account for the less predictable elements of human conversation. Future work might refine the quality critics or simulate more dynamic conversational turns.
Conclusion
The paper contributes notably to the field of AI conversations by tackling previous challenges associated with dataset generation, especially regarding persona modeling aspects and dynamism. This novel method presents the AI community with a robust resource in Synthetic-Persona-Chat and a framework that exemplifies how LLMs can serve as powerful tools for generating conversation data that can keep up with the evolving landscape of human-AI interaction.