MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues (2402.14762v3)

Published 22 Feb 2024 in cs.CL and cs.AI

Abstract: The advent of LLMs has drastically enhanced dialogue systems. However, comprehensively evaluating the dialogue abilities of LLMs remains a challenge. Previous benchmarks have primarily focused on single-turn dialogues or provided coarse-grained and incomplete assessments of multi-turn dialogues, overlooking the complexity and fine-grained nuances of real-life dialogues. To address this issue, we introduce MT-Bench-101, specifically designed to evaluate the fine-grained abilities of LLMs in multi-turn dialogues. By conducting a detailed analysis of real multi-turn dialogue data, we construct a three-tier hierarchical ability taxonomy comprising 4208 turns across 1388 multi-turn dialogues in 13 distinct tasks. We then evaluate 21 popular LLMs based on MT-Bench-101, conducting comprehensive analyses from both ability and task perspectives and observing differing trends in LLMs' performance across dialogue turns within various tasks. Further analysis indicates that neither utilizing common alignment techniques nor chat-specific designs has led to obvious enhancements in the multi-turn abilities of LLMs. Extensive case studies suggest that our designed tasks accurately assess the corresponding multi-turn abilities. The data and code are available at https://github.com/mtbench101/mt-bench-101.

Authors (11)
  1. Ge Bai (21 papers)
  2. Jie Liu (492 papers)
  3. Xingyuan Bu (24 papers)
  4. Yancheng He (30 papers)
  5. Jiaheng Liu (100 papers)
  6. Zhanhui Zhou (13 papers)
  7. Zhuoran Lin (3 papers)
  8. Wenbo Su (36 papers)
  9. Tiezheng Ge (46 papers)
  10. Bo Zheng (205 papers)
  11. Wanli Ouyang (358 papers)
Citations (33)