CAUSE: Counterfactual Assessment of User Satisfaction Estimation in Task-Oriented Dialogue Systems (2403.19056v2)

Published 27 Mar 2024 in cs.CL

Abstract: An important but unexplored aspect of previous work on user satisfaction estimation for Task-Oriented Dialogue (TOD) systems is its robustness in identifying user dissatisfaction: current benchmarks are highly skewed towards dialogues in which the user is satisfied, and the effect of a more balanced set of satisfaction labels on performance is unknown. Balancing the data with more dissatisfactory dialogue samples, however, requires further data collection and human annotation, which is costly and time-consuming. In this work, we leverage LLMs and unlock their ability to generate satisfaction-aware counterfactual dialogues that augment the original dialogues of a test collection, and we gather human annotations to ensure the reliability of the generated samples. We evaluate two open-source LLMs as user satisfaction estimators on our augmented collection against state-of-the-art fine-tuned models. Our experiments show that, when used as few-shot user satisfaction estimators, the open-source LLMs are more robust to an increased proportion of dissatisfaction labels in the test collection than the fine-tuned state-of-the-art models. Our results highlight the need for data augmentation approaches for user satisfaction estimation in TOD systems. We release our aligned counterfactual dialogues, curated through human annotation, to facilitate further research on this topic.
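The few-shot estimation setup the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's actual prompt or data: the prompt template, label names, example dialogues, and helper functions are all assumptions introduced here for clarity.

```python
# Sketch of few-shot user satisfaction estimation with an LLM.
# All prompts, labels, and dialogues below are illustrative assumptions,
# not the paper's actual prompts or test collection.

LABELS = ("satisfied", "dissatisfied")

def build_fewshot_prompt(exemplars, target_dialogue):
    """Assemble a few-shot prompt: k labeled dialogues, then the target."""
    parts = ["Rate the user's satisfaction in each dialogue as "
             "'satisfied' or 'dissatisfied'.\n"]
    for dialogue, label in exemplars:
        parts.append(f"Dialogue:\n{dialogue}\nSatisfaction: {label}\n")
    parts.append(f"Dialogue:\n{target_dialogue}\nSatisfaction:")
    return "\n".join(parts)

def parse_label(model_output):
    """Map a free-form model completion onto one of the two labels."""
    text = model_output.strip().lower()
    for label in LABELS:
        # startswith avoids matching "satisfied" inside "dissatisfied"
        if text.startswith(label):
            return label
    return None  # unparseable completion

exemplars = [
    ("User: Book a table for two at 7pm.\nSystem: Done, table booked.",
     "satisfied"),
    ("User: I asked for a vegan option.\nSystem: Sorry, none available.",
     "dissatisfied"),
]
prompt = build_fewshot_prompt(
    exemplars,
    "User: Cancel my reservation.\nSystem: Your booking is cancelled.")
# `prompt` would be sent to an open-source LLM and its completion
# passed to parse_label() to obtain the predicted satisfaction label.
```

Counterfactual augmentation then amounts to rebalancing the test collection: dissatisfaction-flipped variants of originally satisfactory dialogues are added, and estimators are re-evaluated as the share of dissatisfaction labels grows.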

Authors (6)
  1. Amin Abolghasemi
  2. Zhaochun Ren
  3. Arian Askari
  4. Mohammad Aliannejadi
  5. Maarten de Rijke
  6. Suzan Verberne