PairEval: Open-domain Dialogue Evaluation with Pairwise Comparison (2404.01015v2)

Published 1 Apr 2024 in cs.CL

Abstract: Building a reliable automated evaluation metric is a necessary but challenging problem for open-domain dialogue systems. Recent studies have proposed evaluation metrics that assess generated responses by considering their relevance to previous dialogue history. Although effective, these metrics evaluate individual responses directly rather than considering their relative quality compared to other responses. To address this, we propose PairEval, a novel dialogue evaluation metric that assesses responses by comparing their quality against responses from different conversations. PairEval is built on top of open-source, moderate-size LLMs, which we specialize in pairwise comparison between dialogue responses. Extensive experiments on multiple benchmarks demonstrate that our metric exhibits a higher correlation with human judgments than baseline metrics. We also find that the proposed comparative metric is more robust in detecting common failures of open-domain dialogue systems, including repetition and speaker insensitivity.
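The core idea is to ask a judge LLM which of two responses better continues a dialogue, and to turn those comparison outcomes into a score for the response under evaluation. The sketch below is a minimal illustration of such a pairwise-comparison metric, not the authors' implementation: `query_llm` is a hypothetical call to a fine-tuned judge model, and the prompt wording, the win-rate aggregation rule, and the `n_pairs` parameter are all assumptions made for illustration.

```python
import random

def query_llm(prompt: str) -> str:
    """Hypothetical call to a judge LLM (e.g. a moderate-size open-source
    model fine-tuned for pairwise comparison); replace with real inference code."""
    raise NotImplementedError

def pairwise_judgement(history: str, response_a: str, response_b: str) -> str:
    """Ask the judge which response better continues the dialogue history.
    Returns 'A' or 'B'."""
    prompt = (
        "Dialogue history:\n"
        f"{history}\n\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n\n"
        "Which response is a better continuation of the dialogue? Answer A or B."
    )
    answer = query_llm(prompt).strip().upper()
    return "A" if answer.startswith("A") else "B"

def paireval_score(history: str, response: str,
                   comparison_responses: list[str], n_pairs: int = 10) -> float:
    """Score a response by its win rate against responses drawn from
    other conversations (an assumed aggregation rule)."""
    rivals = random.sample(comparison_responses,
                           k=min(n_pairs, len(comparison_responses)))
    wins = 0
    for rival in rivals:
        # Randomize presentation order to reduce the judge's position bias.
        if random.random() < 0.5:
            wins += pairwise_judgement(history, response, rival) == "A"
        else:
            wins += pairwise_judgement(history, rival, response) == "B"
    return wins / len(rivals)
```

Randomizing which response appears first is worth keeping in any variant of this scheme, since LLM judges are known to favor a fixed position when comparing candidates.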

Authors (4)
  1. ChaeHun Park (15 papers)
  2. Minseok Choi (35 papers)
  3. Dohyun Lee (6 papers)
  4. Jaegul Choo (161 papers)
Citations (3)