Synthetic Dialogue Dataset Generation using LLM Agents

Published 30 Jan 2024 in cs.CL and cs.AI (arXiv:2401.17461v1)

Abstract: Linear programming (LP) problems are pervasive in real-life applications. However, despite their apparent simplicity, an untrained user may find it difficult to determine the linear model of their specific problem. We envisage the creation of a goal-oriented conversational agent that will engage in conversation with the user to elicit all information required so that a subsequent agent can generate the linear model. In this paper, we present an approach for the generation of sample dialogues that can be used to develop and train such a conversational agent. Using prompt engineering, we develop two agents that "talk" to each other, one acting as the conversational agent, and the other acting as the user. Using a set of text descriptions of linear problems from NL4Opt available to the user only, the agent and the user engage in conversation until the agent has retrieved all key information from the original problem description. We also propose an extrinsic evaluation of the dialogues by assessing how well the summaries generated by the dialogues match the original problem descriptions. We conduct human and automatic evaluations, including an evaluation approach that uses GPT-4 to mimic the human evaluation metrics. The evaluation results show an overall good quality of the dialogues, though research is still needed to improve the quality of the GPT-4 evaluation metrics. The resulting dialogues, including the human annotations of a subset, are available to the research community. The conversational agent used for the generation of the dialogues can be used as a baseline.

References (13)
  1. Applied Integer Programming: Modeling and Solution.
  2. GPTScore: Evaluate as you desire.
  3. Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
  4. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2122–2132, Austin, Texas. Association for Computational Linguistics.
  5. G-Eval: NLG evaluation using GPT-4 with better human alignment.
  6. Ask what’s missing and what’s useful: Improving clarification question generation using global knowledge. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4300–4312, Online. Association for Computational Linguistics.
  7. OpenAI. 2023. GPT-4 technical report. ArXiv, abs/2303.08774.
  8. Stay hungry, stay focused: Generating informative and specific questions in information-seeking conversations. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 25–40, Online. Association for Computational Linguistics.
  9. Augmenting operations research with auto-formulation of optimization models from problem descriptions. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 29–62, Abu Dhabi, UAE. Association for Computational Linguistics.
  10. NL4Opt competition: Formulating optimization problems based on their natural language descriptions.
  11. An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics, 16(1):138.
  12. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Associates, Inc.
  13. BERTScore: Evaluating text generation with BERT.

Summary

  • The paper introduces a dual-agent framework leveraging GPT-4 for synthesizing dialogues that guide non-experts in constructing linear programming models.
  • It employs a QG Agent and a QA Agent to simulate 476 dialogue interactions that accurately capture key LP problem details.
  • Evaluation using human metrics and GPT-4 chain-of-thought prompting validates the dialogues’ informativeness and readability while highlighting areas for prompt improvement.

Introduction

The paper "Synthetic Dialogue Dataset Generation using LLM Agents" addresses the challenges faced by individuals without specialized mathematical backgrounds in formulating linear programming (LP) models. LP is widely used in real-world applications such as resource allocation, but its broader adoption is hampered by the difficulty non-experts face in defining LP models. To tackle this, the paper proposes a goal-oriented conversational agent that engages users in dialogue to help them construct accurate linear models. The research centers on generating synthetic dialogues for training and evaluating such an agent, using multiple evaluation methods to assess dialogue quality.

Methodology

The core methodology involves leveraging prompt engineering to construct two LLM-based agents: a Question Generation (QG) Agent and a Question Answering (QA) Agent. The QG Agent asks questions to elicit essential information about the LP problem, simulating a goal-oriented conversational agent, while the QA Agent simulates the user, providing information based on predefined problem statements from the NL4Opt dataset. Importantly, the QG Agent lacks direct access to the problem statement, relying solely on the QA Agent's answers for information.

This dual-agent setup facilitates simulated dialogues wherein the QG Agent iteratively gathers the information necessary to construct a valid LP model. The system uses OpenAI's GPT-4 API to automate dialogue generation, creating a dataset of synthetic dialogues. The dataset includes 476 dialogues, with 28 manually annotated for human evaluation, thereby contributing a valuable resource to the research community.
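The dual-agent loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `call_llm` stands in for a GPT-4 chat-completion call supplied by the caller, and the prompt wording and `done_token` convention are assumptions. Note that only the QA agent's prompt contains the problem description.

```python
def generate_dialogue(problem, call_llm, max_turns=10, done_token="[DONE]"):
    """Simulate a QG/QA agent dialogue about an LP problem description.

    `call_llm(system_prompt, history)` is expected to return the next
    utterance as a string; in practice it would wrap a GPT-4 API call.
    """
    qg_system = (
        "You are gathering the variables, constraints and objective of a "
        "linear programming problem. Ask one question at a time; say "
        f"{done_token} when you have all the key information."
    )
    # Only the QA agent is shown the original problem description.
    qa_system = f"Answer questions using only this problem description:\n{problem}"

    dialogue = []
    for _ in range(max_turns):
        question = call_llm(qg_system, dialogue)   # QG agent turn
        dialogue.append(("QG", question))
        if done_token in question:                 # QG signals completion
            break
        answer = call_llm(qa_system, dialogue)     # QA agent turn
        dialogue.append(("QA", answer))
    return dialogue
```

A run terminates either when the QG agent emits the completion token or when the turn budget is exhausted, mirroring the paper's goal-oriented stopping condition.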

Evaluation Strategy

The evaluation of generated dialogues involves both human and automated methods to ensure robustness. The extrinsic evaluation assesses how well the dialogues' summaries align with the original problem descriptions. Human evaluation metrics include Information Recall, Information Precision, Information Repetition, and Readability. Automated metrics involve traditional measures like ROUGE and BERTScore, along with a novel evaluation approach using GPT-4 to simulate human judgement. The GPT-4-based evaluation employs "chain-of-thought" prompting to gauge the accuracy and comprehensiveness of the dialogue summaries.
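To make the automated metrics concrete, ROUGE-L scores the longest common subsequence (LCS) of tokens shared between a generated summary and the reference problem description. Below is a minimal pure-Python sketch of ROUGE-L F1; published evaluations typically use an established package (e.g. `rouge-score`) rather than a hand-rolled version like this.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if x == y:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l_f1(reference, summary):
    """ROUGE-L F1 between a reference description and a generated summary."""
    ref, hyp = reference.lower().split(), summary.lower().split()
    lcs = lcs_len(ref, hyp)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(hyp), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

BERTScore replaces the exact-match LCS with contextual-embedding similarity, which is why its precision variant tracks human Information Precision more closely than surface-overlap metrics.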

Results

Preliminary results indicate that summaries derived from the synthetic dialogues closely match the original problem descriptions, confirming the effectiveness of the dialogue generation approach. Human evaluators report high scores for summary informativeness and readability, though results on Information Repetition were somewhat weaker. Automated metrics correlate reasonably with human assessments; in particular, ROUGE-L and BERTScore Precision align well with human Information Precision ratings.

The use of GPT-4 for automatic evaluation demonstrates competitive results, albeit with a slight tendency to overrate compared to human evaluators. Further refinements in prompt design are required to enhance alignment with human evaluators.
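A chain-of-thought evaluation prompt of the kind described can be sketched as below. The rubric wording, scoring scale, and metric name here are illustrative assumptions, not the paper's actual prompts.

```python
def build_eval_prompt(problem, summary, metric="Information Recall"):
    """Build an illustrative chain-of-thought evaluation prompt for GPT-4.

    The instructions ask the model to reason step by step before scoring,
    mirroring the chain-of-thought setup described in the paper; the exact
    wording is hypothetical.
    """
    return (
        f"You are evaluating a dialogue summary for {metric}.\n\n"
        f"Problem description:\n{problem}\n\n"
        f"Generated summary:\n{summary}\n\n"
        "First, list the key facts of the problem (variables, constraints, "
        "objective). Then, step by step, check which of those facts appear "
        "in the summary. Finally, output a score from 1 (poor) to 5 "
        "(excellent)."
    )
```

Prompt refinements of this kind, such as tightening the rubric or anchoring each score level with examples, are the lever the authors identify for closing the gap with human judgments.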

Implications and Future Work

The research implies significant practical applications in automating LP model formulation, potentially increasing the accessibility of LP methodologies to non-experts. The dialogue generation and evaluation frameworks presented can be adapted for other optimization model contexts, broadening their utility.

Future work includes refining GPT-4 prompts for better alignment with human evaluations, and conducting detailed dialogue turn-level analyses to gain deeper insights into the dialogue generation process. Additionally, expanding evaluation methods tailored specifically to LP modeling tasks will augment the dataset's applicability in more diverse problem-solving scenarios.

Conclusion

The paper contributes a novel dataset and framework for synthetic dialogue generation, enhancing the development of conversational agents for LP model formulation. It demonstrates effective dialogue synthesis using LLM agents, supporting broader applications in automated optimization modeling. Further improvements in evaluation techniques and adaptation to diverse problem types hold promise for ongoing advancements in this domain.

Paper to Video (Beta)

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 2 tweets with 0 likes about this paper.