Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs (2404.12994v2)

Published 19 Apr 2024 in cs.IR and cs.CL

Abstract: In ad-hoc retrieval, evaluation relies heavily on user actions, including implicit feedback. In a conversational setting such signals are usually unavailable due to the nature of the interactions, and evaluation instead often relies on crowdsourced evaluation labels. How user feedback shapes annotators' assessment of individual turns in a conversation has received little study. We focus on how the evaluation of task-oriented dialogue systems (TDSs) is affected by considering user feedback, explicit or implicit, as provided through the follow-up utterance of the turn being evaluated. We explore and compare two methodologies for assessing TDSs: one that includes the user's follow-up utterance and one that does not. We use both crowdworkers and LLMs as annotators to assess system responses across four aspects: relevance, usefulness, interestingness, and explanation quality. Our findings indicate a distinct difference in the ratings assigned by the two annotator groups under the two setups, indicating that user feedback does influence system evaluation. Workers are more susceptible to user feedback on usefulness and interestingness, whereas LLMs are more affected on interestingness and relevance. User feedback leads to a more personalized assessment of usefulness by workers, aligning closely with the user's explicit feedback. Additionally, in cases of ambiguous or complex user requests, user feedback improves agreement among crowdworkers. These findings emphasize the significance of user feedback in refining system evaluations and suggest the potential for automated feedback integration in future research. We publicly release the annotated data to foster research in this area.
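To make the comparison concrete, below is a minimal sketch (not the authors' code) of the two annotation setups the abstract describes: rating a system response in isolation versus rating it with the user's follow-up utterance included. The `Turn` structure, `call_llm` placeholder, 1–5 scale, and prompt wording are illustrative assumptions, not the paper's exact protocol; a crowdsourcing task would show the same prompt to a worker instead of an LLM.

```python
# Sketch of the two evaluation setups: with vs. without the user's follow-up
# utterance (explicit/implicit feedback). All names here are illustrative.
from dataclasses import dataclass

ASPECTS = ["relevance", "usefulness", "interestingness", "explanation quality"]

@dataclass
class Turn:
    user_utterance: str       # the user request being answered
    system_response: str      # the response to be rated
    followup_utterance: str   # the user's next turn, carrying feedback

def build_prompt(turn: Turn, aspect: str, include_feedback: bool) -> str:
    """Assemble an annotation prompt for one aspect, optionally appending
    the follow-up utterance that conveys user feedback."""
    lines = [
        f"Rate the {aspect} of the system response from 1 (poor) to 5 (excellent).",
        f"User: {turn.user_utterance}",
        f"System: {turn.system_response}",
    ]
    if include_feedback:
        lines.append(f"User (follow-up): {turn.followup_utterance}")
    lines.append("Answer with a single integer.")
    return "\n".join(lines)

def call_llm(prompt: str) -> int:
    """Hypothetical placeholder for an LLM annotator call; in the
    crowdsourced setup, a worker would answer the same prompt."""
    raise NotImplementedError

def annotate(turn: Turn, include_feedback: bool) -> dict[str, int]:
    """Collect one rating per aspect for a single dialogue turn."""
    return {aspect: call_llm(build_prompt(turn, aspect, include_feedback))
            for aspect in ASPECTS}

if __name__ == "__main__":
    turn = Turn(
        user_utterance="Can you recommend a sci-fi movie like Interstellar?",
        system_response="You might enjoy Arrival; it also centres on first contact.",
        followup_utterance="I've already seen that one, anything newer?",
    )
    # Setup 1: rate the turn in isolation; Setup 2: rate it with user feedback.
    print(build_prompt(turn, "usefulness", include_feedback=False))
    print(build_prompt(turn, "usefulness", include_feedback=True))
```

Comparing the ratings returned under the two setups, per aspect and per annotator group, is the core of the study design summarized above.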

Authors (3)
  1. Clemencia Siro (15 papers)
  2. Mohammad Aliannejadi (85 papers)
  3. Maarten de Rijke (261 papers)