Debatrix: Multi-dimensional Debate Judge with Iterative Chronological Analysis Based on LLM (2403.08010v3)
Abstract: How can we construct an automated debate judge to evaluate an extensive, vibrant, multi-turn debate? This task is challenging, as judging a debate involves grappling with lengthy texts, intricate argument relationships, and multi-dimensional assessments. Meanwhile, current research mainly focuses on short dialogues and rarely touches upon evaluating an entire debate. In this paper, leveraging LLMs, we propose Debatrix, which makes the analysis and assessment of multi-turn debates better aligned with majority preferences. Specifically, Debatrix features a vertical, iterative chronological analysis and a horizontal, multi-dimensional evaluation collaboration. To align with real-world debate scenarios, we introduce the PanelBench benchmark, which compares our system's verdicts to actual debate outcomes. The results show a notable improvement over using LLMs directly for debate evaluation. Source code and benchmark data are available at https://github.com/ljcleo/debatrix.
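The abstract describes two axes: a vertical, turn-by-turn (chronological) analysis that digests a long debate incrementally, and a horizontal collaboration that keeps one analysis thread per evaluation dimension before merging them into a verdict. The sketch below shows how such a judge could plausibly be wired together; the `llm` callable, the example dimension names, and all prompt wording are hypothetical stand-ins for illustration, not the paper's actual prompts or dimension set.

```python
# Hypothetical sketch of a Debatrix-style judge: per-dimension analyses
# (horizontal axis) are updated one speech at a time (vertical axis),
# then merged into a final verdict. Not the authors' implementation.

from typing import Callable

# Example dimensions only; the paper's own dimension set may differ.
DIMENSIONS = ["argument", "language", "clash"]


def judge_debate(turns: list[str], llm: Callable[[str], str]) -> str:
    """Judge a multi-turn debate via iterative, per-dimension analysis."""
    # Horizontal axis: one running analysis per evaluation dimension.
    memory = {dim: "" for dim in DIMENSIONS}

    # Vertical axis: fold each speech into every dimension's analysis,
    # so a lengthy debate is processed incrementally rather than at once.
    for i, turn in enumerate(turns, start=1):
        for dim in DIMENSIONS:
            memory[dim] = llm(
                f"Previous {dim} analysis:\n{memory[dim]}\n\n"
                f"Speech {i}:\n{turn}\n\n"
                f"Update the {dim} analysis to cover this speech."
            )

    # Collaboration step: combine the per-dimension summaries into a verdict.
    summaries = "\n\n".join(f"[{d}]\n{a}" for d, a in memory.items())
    return llm(
        "Given these per-dimension analyses of a complete debate, "
        f"decide the winner and justify the decision briefly:\n\n{summaries}"
    )
```

In this framing, the iterative loop bounds how much text any single LLM call must handle, while the per-dimension memories keep assessments (e.g., argumentation quality vs. delivery) from blurring into one score, which is the alignment problem the abstract says direct LLM judging struggles with.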