
BotChat: Evaluating LLMs' Capabilities of Having Multi-Turn Dialogues (2310.13650v1)

Published 20 Oct 2023 in cs.CL

Abstract: Interacting with humans via high-quality multi-turn dialogues is a key feature of LLMs. However, human-based evaluation of such capability involves intensive manual labor. This report provides a preliminary evaluation of existing LLMs for human-style multi-turn chatting, through an LLM-based approach. We start from real-world human dialogues and keep the very first utterances as the ChatSEED. Then we prompt LLMs to generate a full multi-turn dialogue (tens of utterances) based on the ChatSEED, utterance by utterance. Finally, we adopt state-of-the-art LLMs (GPT-4, etc.) as the judge to evaluate the generated dialogues. With different evaluation protocols, we come to substantially identical conclusions. We find that GPT-4 can generate human-style multi-turn dialogues with impressive quality, significantly outperforming its counterparts. It is difficult for a discriminator to distinguish between GPT-4 generated dialogues and human dialogues. In contrast, other LLMs struggle to generate multi-turn dialogues of satisfactory quality due to poor instruction-following capability, a tendency to generate lengthy utterances, or limited general capability. All data and code will be provided at https://github.com/open-compass/BotChat/ and we hope they can serve as a valuable resource for evaluating the multi-turn chatting capabilities of LLMs.

Evaluating LLMs' Capabilities in Multi-Turn Dialogues: Insights from the BotChat Framework

Overview

The paper entitled "BotChat: Evaluating LLMs' Capabilities of Having Multi-Turn Dialogues" introduces an automated evaluation method for assessing the multi-turn conversational abilities of contemporary LLMs. Recognizing the impracticality and resource intensity of human-based evaluations, the authors propose a framework called BotChat. This framework leverages LLMs themselves to both generate multi-turn dialogues and evaluate their quality, offering a unique, labor-efficient means of testing conversational models. The principal focus is on comparing the performance of various LLMs, including the state-of-the-art model GPT-4, across different multi-turn dialogue scenarios.

Methodology and Implementation

BotChat operates in two primary stages: dialogue generation and quality assessment. The procedure begins by extracting the opening utterances of real-world conversations, referred to as ChatSEEDs, which serve as the foundation for generation. From each seed, the LLM under test produces a full multi-turn dialogue one utterance at a time, prompted at every turn to continue the conversation as a human participant would.
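
The sketch below illustrates how such seed-conditioned, utterance-by-utterance generation might be implemented. It is a minimal illustration, not the paper's exact procedure: the `chat` callable, the system-prompt wording, and the `num_turns` default are placeholders for whatever chat backend, prompts, and settings an implementation actually uses.

```python
SYSTEM_PROMPT = (
    "You are having a casual conversation with a friend. "
    "Reply with one short, natural, human-like utterance and nothing else."
)

def generate_dialogue(chat, chatseed, num_turns=16):
    """Grow the ChatSEED into a full multi-turn dialogue, one utterance at a time."""
    dialogue = list(chatseed)            # opening utterances taken from a real human conversation
    while len(dialogue) < num_turns:
        n = len(dialogue)
        messages = [{"role": "system", "content": SYSTEM_PROMPT}]
        for j, utterance in enumerate(dialogue):
            # Frame the most recent utterance as the "user" turn the model must answer,
            # alternating the two speakers' roles backwards from there.
            role = "user" if (n - 1 - j) % 2 == 0 else "assistant"
            messages.append({"role": role, "content": utterance})
        dialogue.append(chat(messages))  # next utterance produced by the LLM under evaluation
    return dialogue

# Hypothetical usage with any chat-completion wrapper:
# dialogue = generate_dialogue(my_chat_fn, ["Hey, how was your trip?", "Great, just got back!"])
```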

Evaluation occurs in three distinct forms: Unitary Evaluation (UniEval), BotChat Arena, and Ground-Truth Evaluation (GTEval). In UniEval, a judge LLM independently assesses each generated dialogue to ascertain its human-likeness. BotChat Arena introduces a competitive framework where two dialogues generated by different LLMs are compared to determine which better mimics human conversation. Finally, GTEval compares LLM-generated dialogues with ground truth human dialogues from datasets, providing a benchmark for assessing the degree of fidelity to human interaction.
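
A minimal sketch of the single-dialogue and pairwise judging protocols follows, assuming a generic `judge` callable backed by a strong judge model such as GPT-4; the prompt wording is illustrative rather than the paper's exact instructions. GTEval can reuse the same pairwise form, with the ground-truth human dialogue occupying one of the two slots.

```python
def _format(dialogue):
    """Render alternating utterances as a readable two-speaker transcript."""
    return "\n".join(f"Speaker {i % 2 + 1}: {u}" for i, u in enumerate(dialogue))

def unieval(judge, dialogue):
    """UniEval-style check: does a single generated dialogue read as fully human-written?"""
    prompt = (
        "Here is a conversation between two speakers:\n"
        f"{_format(dialogue)}\n\n"
        "Could this conversation have been written entirely by humans? Answer Yes or No."
    )
    return judge(prompt).strip().lower().startswith("yes")

def arena_compare(judge, dialogue_a, dialogue_b):
    """BotChat-Arena-style pairwise judgment: which dialogue is more human-like?"""
    prompt = (
        "Conversation A:\n" + _format(dialogue_a) + "\n\n"
        "Conversation B:\n" + _format(dialogue_b) + "\n\n"
        "Which conversation is more natural and human-like? Answer A, B, or Tie."
    )
    return judge(prompt).strip().upper()[:1]  # 'A', 'B', or 'T' for a tie
```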

Key Findings

The paper presents several notable findings from their expansive evaluation across 14 LLMs. GPT-4 consistently excels, producing dialogues that are difficult to distinguish from human conversations. This model stands out not only for its superior generation of human-like dialogue but also for its ability to maintain quality over longer interactions. Other LLMs, such as Vicuna-13B and InternLM-20B, also show commendable performance, albeit falling short of GPT-4's standard in extended dialogues.

Several open-source LLMs deliver satisfactory results in short dialogues; however, their performance declines markedly as dialogue length grows. This degradation points to difficulties in maintaining contextual relevance and natural flow over many turns. Common failure modes include excessive verbosity, breaking character by revealing an AI-assistant identity, and contextually inconsistent or repetitive responses.

Implications and Future Directions

The implications of this research are far-reaching. From a practical standpoint, BotChat offers a scalable, efficient alternative to labor-intensive human evaluation, enabling rapid assessment of LLM dialogue capabilities across diverse models. This methodology could evolve into a standard tool for benchmarking future LLMs and their successive releases.

From a theoretical perspective, understanding the nuances in dialogue generation quality across models guides research into refining LLM architectures and training paradigms. Addressing identified flaws, such as contextual inconsistency and verbosity, could enhance the development of LLMs adept in generating seamless, human-like discourse.

Future research may focus on refining evaluation strategies, potentially incorporating more nuanced aspects of human dialogue such as emotional intelligence and situational awareness. Additionally, expanding BotChat's evaluation protocols to include multilingual dialogues could advance capabilities in global communication contexts.

Conclusion

This paper provides a robust framework for evaluating LLM dialogue capabilities and highlights GPT-4's exceptional proficiency in maintaining human-style conversations over multiple turns. The authors underscore challenges faced by lesser-performing models, offering a roadmap for ongoing enhancements in LLMs’ conversational skills. By employing LLMs for both dialogue generation and assessment, BotChat represents a significant shift towards more efficient and automated evaluation methodologies in AI research.

Authors (8)
  1. Haodong Duan
  2. Jueqi Wei
  3. Chonghua Wang
  4. Yixiao Fang
  5. Songyang Zhang
  6. Dahua Lin
  7. Kai Chen
  8. HongWei Liu