DEBATE: Devil's Advocate-Based Assessment and Text Evaluation (2405.09935v2)
Abstract: As natural language generation (NLG) models have become prevalent, systematically assessing the quality of machine-generated texts has become increasingly important. Recent studies introduce LLM-based evaluators that operate as reference-free metrics, demonstrating their capability to adeptly handle novel tasks. However, these models generally rely on a single-agent approach, which, we argue, introduces an inherent limit to their performance. This is because an LLM agent's responses carry biases, including preferences for certain text structures or content. In this work, we propose DEBATE, an NLG evaluation framework based on a multi-agent scoring system augmented with the concept of a Devil's Advocate. Within the framework, one agent is instructed to criticize the other agents' arguments, potentially resolving the bias in the agents' answers. DEBATE substantially outperforms previous state-of-the-art methods on two meta-evaluation benchmarks for NLG evaluation, SummEval and TopicalChat. We also show that the extensiveness of the debate among agents and the persona of an agent can influence the performance of the evaluator.
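The abstract describes a multi-agent evaluation loop in which a scoring agent proposes a rating and a Devil's Advocate agent criticizes it before a final score is produced. The sketch below illustrates that pattern; it is a minimal illustration, not the authors' released implementation. The `chat(system, user)` helper, the role prompts, the `rounds` parameter, and the score-parsing step are all assumptions introduced for this example.

```python
# Minimal sketch of a DEBATE-style multi-agent evaluation loop.
# Assumptions (not from the paper's code): `chat(system, user)` wraps any
# chat-completion LLM API and returns a string; prompts, round count, and
# score parsing are illustrative only.

import re
from typing import Callable

def debate_score(
    chat: Callable[[str, str], str],   # hypothetical LLM wrapper: (system, user) -> reply
    source: str,                       # e.g. the article being summarized
    candidate: str,                    # machine-generated text to evaluate
    dimension: str = "coherence",      # evaluation aspect, scored 1-5
    rounds: int = 2,                   # how extensive the debate is
) -> int:
    transcript = []

    scorer_sys = (
        f"You are an evaluator. Rate the {dimension} of the candidate text "
        "on a 1-5 scale and justify your rating."
    )
    advocate_sys = (
        "You are a devil's advocate. Criticize the evaluator's argument and point "
        "out biases, e.g. a preference for a particular structure or content."
    )

    task = f"Source:\n{source}\n\nCandidate:\n{candidate}"

    for _ in range(rounds):
        # Scorer proposes (or revises) a score with reasoning.
        argument = chat(scorer_sys, f"{task}\n\nDebate so far:\n" + "\n\n".join(transcript))
        transcript.append(f"Scorer: {argument}")

        # Devil's advocate pushes back on the scorer's reasoning.
        critique = chat(advocate_sys, f"{task}\n\nDebate so far:\n" + "\n\n".join(transcript))
        transcript.append(f"Devil's advocate: {critique}")

    # Distill the whole debate into a single integer score.
    verdict = chat(
        f"Summarize the debate and output only the final 1-5 {dimension} score.",
        "\n\n".join(transcript),
    )
    match = re.search(r"[1-5]", verdict)
    return int(match.group()) if match else 3  # fall back to the midpoint if parsing fails
```

In this sketch, increasing `rounds` loosely corresponds to the paper's notion of debate extensiveness, and the system prompts stand in for agent personas; both are reported in the abstract to influence evaluator performance.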
- Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
- A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023.
- Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
- Cheng-Han Chiang and Hung-yi Lee. 2023a. Can large language models be an alternative to human evaluations? arXiv preprint arXiv:2305.01937.
- Cheng-Han Chiang and Hung-yi Lee. 2023b. A closer look into automatic evaluation using large language models. arXiv preprint arXiv:2310.05657.
- Cheng-Han Chiang and Hung-yi Lee. 2023c. A closer look into using large language models for automatic evaluation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 8928–8942, Singapore. Association for Computational Linguistics.
- Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325.
- Arthur R. Edwards. 2002. The moderator as an emerging democratic intermediary: The role of the moderator in internet discussions about public issues. Information Polity, 7:3–20.
- SummEval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics (2021).
- Gptscore: Evaluate as you desire. arXiv preprint arXiv:2302.04166.
- Irving L Janis. 2008. Groupthink. IEEE Engineering Management Review, 36(1):36.
- Is chatgpt a good translator? a preliminary study. arXiv preprint arXiv:2301.08745, 1(10).
- Prometheus: Inducing fine-grained evaluation capability in language models. arXiv preprint arXiv:2310.08491.
- Benchmarking cognitive biases in large language models as evaluators. arXiv preprint arXiv:2309.17012.
- Leveraging large language models for nlg evaluation: A survey. arXiv preprint arXiv:2401.07103.
- Encouraging divergent thinking in large language models through multi-agent debate. arXiv preprint arXiv:2305.19118.
- Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
- Yen-Ting Lin and Yun-Nung Chen. 2023. Llm-eval: Unified multi-dimensional automatic evaluation for open-domain conversations with large language models. arXiv preprint arXiv:2305.13711.
- G-eval: NLG evaluation using gpt-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, Singapore. Association for Computational Linguistics.
- Colin MacDougall and Frances Baum. 1997. The devil’s advocate: A strategy to avoid groupthink and stimulate discussion in focus groups. Qualitative health research, 7(4):532–541.
- Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896.
- Shikib Mehri and Maxine Eskenazi. 2020. USR: An unsupervised and reference free evaluation metric for dialog generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 681–707, Online. Association for Computational Linguistics.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
- Hadeel Saadany and Constantin Orasan. 2021. BLEU, METEOR, BERTScore: Evaluation of metrics performance in assessing critical translation errors in sentiment-oriented text. In Proceedings of the Translation and Interpreting Technology Online Conference, pages 48–56, Held Online. INCOMA Ltd.
- A survey of evaluation metrics used for nlg systems. ACM Computing Surveys (CSUR), 55(2):1–39.
- Large language models are not yet human-level evaluators for abstractive summarization. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4215–4233.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- Is chatgpt a good nlg evaluator? a preliminary study. arXiv preprint arXiv:2303.04048.
- A survey on large language model based autonomous agents. arXiv preprint arXiv:2308.11432.
- Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
- Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155.
- An empirical study on challenging math problem solving with gpt-4. arXiv preprint arXiv:2306.01337.
- The rise and potential of large language model based agents: A survey. arXiv preprint arXiv:2309.07864.
- Bartscore: Evaluating generated text as text generation. In Proceedings of the 2021 Conference on Neural Information Processing Systems.
- BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675.
- A survey of large language models. arXiv preprint arXiv:2303.18223.
- MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 563–578, Hong Kong, China. Association for Computational Linguistics.
- Towards a unified multi-dimensional evaluator for text generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing.
Authors: Alex Kim, Keonwoo Kim, Sangwon Yoon