Synthetic Dialogue Dataset Generation using LLM Agents

Published 30 Jan 2024 in cs.CL and cs.AI (arXiv:2401.17461v1)

Abstract: Linear programming (LP) problems are pervasive in real-life applications. However, despite their apparent simplicity, an untrained user may find it difficult to determine the linear model of their specific problem. We envisage the creation of a goal-oriented conversational agent that will engage in conversation with the user to elicit all information required so that a subsequent agent can generate the linear model. In this paper, we present an approach for the generation of sample dialogues that can be used to develop and train such a conversational agent. Using prompt engineering, we develop two agents that "talk" to each other, one acting as the conversational agent, and the other acting as the user. Using a set of text descriptions of linear problems from NL4Opt available to the user only, the agent and the user engage in conversation until the agent has retrieved all key information from the original problem description. We also propose an extrinsic evaluation of the dialogues by assessing how well the summaries generated by the dialogues match the original problem descriptions. We conduct human and automatic evaluations, including an evaluation approach that uses GPT-4 to mimic the human evaluation metrics. The evaluation results show an overall good quality of the dialogues, though research is still needed to improve the quality of the GPT-4 evaluation metrics. The resulting dialogues, including the human annotations of a subset, are available to the research community. The conversational agent used for the generation of the dialogues can be used as a baseline.

References (13)
  1. Applied Integer Programming: Modeling and Solution.
  2. GPTScore: Evaluate as you desire.
  3. Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
  4. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2122–2132, Austin, Texas. Association for Computational Linguistics.
  5. G-Eval: NLG evaluation using GPT-4 with better human alignment.
  6. Ask what’s missing and what’s useful: Improving clarification question generation using global knowledge. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4300–4312, Online. Association for Computational Linguistics.
  7. OpenAI. 2023. GPT-4 technical report. ArXiv, abs/2303.08774.
  8. Stay hungry, stay focused: Generating informative and specific questions in information-seeking conversations. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 25–40, Online. Association for Computational Linguistics.
  9. Augmenting operations research with auto-formulation of optimization models from problem descriptions. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 29–62, Abu Dhabi, UAE. Association for Computational Linguistics.
  10. NL4Opt competition: Formulating optimization problems based on their natural language descriptions.
  11. An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics, 16(1):138.
  12. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Associates, Inc.
  13. BERTScore: Evaluating text generation with BERT.

Summary

  • The paper introduces a dual-agent framework leveraging GPT-4 for synthesizing dialogues that guide non-experts in constructing linear programming models.
  • It employs a QG Agent and a QA Agent to simulate 476 dialogue interactions that accurately capture key LP problem details.
  • Evaluation using human metrics and GPT-4 chain-of-thought prompting validates the dialogues’ informativeness and readability while highlighting areas for prompt improvement.

Introduction

The paper "Synthetic Dialogue Dataset Generation using LLM Agents" addresses the challenges faced by individuals without specialized mathematical backgrounds in formulating linear programming (LP) models. LP is widely used in real-world applications such as resource allocation, but its broader adoption is hampered by the difficulty non-experts face in defining LP models. To tackle this, the paper proposes a goal-oriented conversational agent that engages users in dialogue to help them construct accurate linear models. The research centers on generating synthetic dialogues for training and evaluating such an agent, using multiple evaluation methods to assess dialogue quality.

Methodology

The core methodology involves leveraging prompt engineering to construct two LLM-based agents: a Question Generation (QG) Agent and a Question Answering (QA) Agent. The QG Agent asks questions to elicit essential information about the LP problem, simulating a goal-oriented conversational agent, while the QA Agent simulates the user, providing information based on predefined problem statements from the NL4Opt dataset. Importantly, the QG Agent lacks direct access to the problem statement, relying solely on the QA Agent's answers for information.

This dual-agent setup facilitates simulated dialogues wherein the QG Agent iteratively gathers the information necessary to construct a valid LP model. The system uses OpenAI's GPT-4 API to automate dialogue generation, creating a dataset of synthetic dialogues. The dataset includes 476 dialogues, with 28 manually annotated for human evaluation, thereby contributing a valuable resource to the research community.
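The dual-agent loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `call_llm` stands in for a GPT-4 chat-completion call supplied by the caller, and the prompt wording and `done_token` convention are assumptions. Note that only the QA agent's prompt contains the problem description.

```python
def generate_dialogue(problem, call_llm, max_turns=10, done_token="[DONE]"):
    """Simulate a QG/QA agent dialogue about an LP problem description.

    `call_llm(system_prompt, history)` is expected to return the next
    utterance as a string; in practice it would wrap a GPT-4 API call.
    """
    qg_system = (
        "You are gathering the variables, constraints and objective of a "
        "linear programming problem. Ask one question at a time; say "
        f"{done_token} when you have all the key information."
    )
    # Only the QA agent is shown the original problem description.
    qa_system = f"Answer questions using only this problem description:\n{problem}"

    dialogue = []
    for _ in range(max_turns):
        question = call_llm(qg_system, dialogue)   # QG agent turn
        dialogue.append(("QG", question))
        if done_token in question:                 # QG signals completion
            break
        answer = call_llm(qa_system, dialogue)     # QA agent turn
        dialogue.append(("QA", answer))
    return dialogue
```

A run terminates either when the QG agent emits the completion token or when the turn budget is exhausted, mirroring the paper's goal-oriented stopping condition.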

Evaluation Strategy

The evaluation of generated dialogues involves both human and automated methods to ensure robustness. The extrinsic evaluation assesses how well the dialogues' summaries align with the original problem descriptions. Human evaluation metrics include Information Recall, Information Precision, Information Repetition, and Readability. Automated metrics involve traditional measures like ROUGE and BERTScore, along with a novel evaluation approach using GPT-4 to simulate human judgement. The GPT-4-based evaluation employs "chain-of-thought" prompting to gauge the accuracy and comprehensiveness of the dialogue summaries.
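To make the automated metrics concrete, ROUGE-L scores the longest common subsequence (LCS) of tokens shared between a generated summary and the reference problem description. Below is a minimal pure-Python sketch of ROUGE-L F1; published evaluations typically use an established package (e.g. `rouge-score`) rather than a hand-rolled version like this.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if x == y:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l_f1(reference, summary):
    """ROUGE-L F1 between a reference description and a generated summary."""
    ref, hyp = reference.lower().split(), summary.lower().split()
    lcs = lcs_len(ref, hyp)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(hyp), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

BERTScore replaces the exact-match LCS with contextual-embedding similarity, which is why its precision variant tracks human Information Precision more closely than surface-overlap metrics.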

Results

Preliminary results indicate that summaries derived from the synthetic dialogues closely match the original problem descriptions, confirming the effectiveness of the dialogue generation approach. Human evaluators report high scores for summary informativeness and readability, though results on Information Repetition were somewhat weaker. Automated metrics correlate reasonably with human assessments; in particular, ROUGE-L and BERTScore Precision align well with human Information Precision ratings.

The use of GPT-4 for automatic evaluation demonstrates competitive results, albeit with a slight tendency to overrate compared to human evaluators. Further refinements in prompt design are required to enhance alignment with human evaluators.
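A chain-of-thought evaluation prompt of the kind described can be sketched as below. The rubric wording, scoring scale, and metric name here are illustrative assumptions, not the paper's actual prompts.

```python
def build_eval_prompt(problem, summary, metric="Information Recall"):
    """Build an illustrative chain-of-thought evaluation prompt for GPT-4.

    The instructions ask the model to reason step by step before scoring,
    mirroring the chain-of-thought setup described in the paper; the exact
    wording is hypothetical.
    """
    return (
        f"You are evaluating a dialogue summary for {metric}.\n\n"
        f"Problem description:\n{problem}\n\n"
        f"Generated summary:\n{summary}\n\n"
        "First, list the key facts of the problem (variables, constraints, "
        "objective). Then, step by step, check which of those facts appear "
        "in the summary. Finally, output a score from 1 (poor) to 5 "
        "(excellent)."
    )
```

Prompt refinements of this kind, such as tightening the rubric or anchoring each score level with examples, are the lever the authors identify for closing the gap with human judgments.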

Implications and Future Work

The research implies significant practical applications in automating LP model formulation, potentially increasing the accessibility of LP methodologies to non-experts. The dialogue generation and evaluation frameworks presented can be adapted for other optimization model contexts, broadening their utility.

Future work includes refining GPT-4 prompts for better alignment with human evaluations, and conducting detailed dialogue turn-level analyses to gain deeper insights into the dialogue generation process. Additionally, expanding evaluation methods tailored specifically to LP modeling tasks will augment the dataset's applicability in more diverse problem-solving scenarios.

Conclusion

The paper contributes a novel dataset and framework for synthetic dialogue generation, enhancing the development of conversational agents for LP model formulation. It demonstrates effective dialogue synthesis using LLM agents, supporting broader applications in automated optimization modeling. Further improvements in evaluation techniques and adaptation to diverse problem types hold promise for ongoing advancements in this domain.

Paper to Video (Beta)

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 2 tweets with 0 likes about this paper.