Context Does Matter: Implications for Crowdsourced Evaluation Labels in Task-Oriented Dialogue Systems (2404.09980v1)
Abstract: Crowdsourced labels play a crucial role in evaluating task-oriented dialogue systems (TDSs). Obtaining high-quality and consistent ground-truth labels from annotators presents challenges. When evaluating a TDS, annotators must fully comprehend the dialogue before providing judgments. Previous studies suggest using only a portion of the dialogue context in the annotation process, but the impact of this limitation on label quality remains unexplored. This study investigates how the amount of dialogue context shown to annotators affects annotation quality, comparing several truncated-context conditions for relevance and usefulness labeling. We further propose using LLMs to summarize the dialogue context into a rich yet concise description, and study how this affects annotator performance. We find that reducing context leads to more positive ratings. Conversely, providing the entire dialogue context yields higher-quality relevance ratings but introduces ambiguity in usefulness ratings. Using only the first user utterance as context produces ratings consistent with those obtained from the entire dialogue, at significantly reduced annotation effort. Our findings show how task design, particularly the availability of dialogue context, affects the quality and consistency of crowdsourced evaluation labels.
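To make the study setup concrete, the sketch below illustrates how the compared context conditions (no context, first user utterance, full dialogue, LLM-generated summary) could be constructed for a crowdsourcing task, and how agreement between two annotators could be measured. This is a minimal sketch, not the authors' released code: the dialogue turn format, the summarization prompt wording, the model name "gpt-4", and the helper names are illustrative assumptions.

```python
# Sketch of building the annotation-context variants and measuring agreement.
# Assumptions: dialogues are lists of {"speaker": "user"|"system", "text": ...}
# turns, and the OpenAI chat API is used for the LLM-summary condition.
from openai import OpenAI
from sklearn.metrics import cohen_kappa_score

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def context_variants(dialogue: list[dict]) -> dict[str, str]:
    """Build the context shown to annotators under each condition.

    `dialogue` contains the turns preceding the system response being judged.
    """
    full = "\n".join(f'{t["speaker"]}: {t["text"]}' for t in dialogue)
    first_user = next(t["text"] for t in dialogue if t["speaker"] == "user")

    # LLM-generated summary of the preceding dialogue (illustrative prompt).
    summary = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": "Summarize this task-oriented dialogue in 2-3 sentences, "
                       "keeping the user's goal and any stated constraints:\n" + full,
        }],
    ).choices[0].message.content

    return {
        "no_context": "",                      # annotators see only the response
        "first_user_utterance": first_user,    # cheapest truncated condition
        "full_dialogue": full,                 # entire preceding dialogue
        "llm_summary": summary,                # rich but short description
    }


def agreement(labels_a: list[int], labels_b: list[int]) -> float:
    """Cohen's kappa between two annotators' relevance or usefulness labels."""
    return cohen_kappa_score(labels_a, labels_b)
```

In such a setup, the same set of system responses would be labeled under each condition, and per-condition agreement and rating distributions would then be compared to quantify the effect of context availability on label quality.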
Authors: Clemencia Siro, Mohammad Aliannejadi, Maarten de Rijke