
Rethinking Response Evaluation from Interlocutor's Eye for Open-Domain Dialogue Systems (2401.02256v1)

Published 4 Jan 2024 in cs.CL

Abstract: Open-domain dialogue systems have started to engage in continuous conversations with humans. These dialogue systems need to adapt to the human interlocutor and be evaluated from the interlocutor's perspective. However, it is questionable whether current automatic evaluation methods can approximate the interlocutor's judgments. In this study, we analyzed and examined what features an automatic response evaluator needs to capture the interlocutor's perspective. The first experiment, on the Hazumi dataset, revealed that interlocutor awareness plays a critical role in making automatic response evaluation correlate with the interlocutor's judgments. The second experiment, using massive conversations on X (formerly Twitter), confirmed that dialogue continuity prediction can train an interlocutor-aware response evaluator without human feedback, while also revealing that generated responses are harder to evaluate than human responses.
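The key idea behind the second experiment is that no human annotation is needed: a response can be labeled "good" if the interlocutor keeps the conversation going afterwards. A minimal sketch of how such dialogue-continuity labels might be derived from conversation threads is below (all names and the data layout are hypothetical; the paper's actual setup may differ):

```python
# Sketch: deriving dialogue-continuity labels from a conversation thread.
# Per the abstract, a system response gets a positive label when the
# interlocutor replies afterwards (the dialogue continues), so training
# data for the evaluator can be mined from logs without human feedback.

def continuity_examples(thread, system_speaker):
    """Yield (context, response, label) triples from one thread.

    thread: list of (speaker, utterance) tuples in chronological order.
    A response by `system_speaker` is labeled 1 if the other speaker
    replies at any later turn, else 0.
    """
    examples = []
    for i, (speaker, utt) in enumerate(thread):
        if speaker != system_speaker:
            continue
        context = [u for _, u in thread[:i]]
        if not context:
            continue  # need at least one preceding turn as context
        continued = any(s != system_speaker for s, _ in thread[i + 1:])
        examples.append((context, utt, int(continued)))
    return examples

thread = [
    ("user", "I just watched a great movie."),
    ("bot", "Nice! Which one?"),
    ("user", "The new sci-fi one."),
    ("bot", "I've heard mixed reviews about it."),
]
for ctx, resp, label in continuity_examples(thread, "bot"):
    print(label, resp)  # first bot turn -> 1, last bot turn -> 0
```

In this sketch the (context, response, label) triples would then feed a binary classifier (e.g. a BERT-style encoder over context and response) that scores responses by predicted continuation probability.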

Authors (4)
  1. Yuma Tsuta (1 paper)
  2. Naoki Yoshinaga (17 papers)
  3. Shoetsu Sato (4 papers)
  4. Masashi Toyoda (12 papers)