Interpretable User Satisfaction Estimation for Conversational Systems with Large Language Models (2403.12388v2)

Published 19 Mar 2024 in cs.IR and cs.AI

Abstract: Accurate and interpretable user satisfaction estimation (USE) is critical for understanding, evaluating, and continuously improving conversational systems. Users express their satisfaction or dissatisfaction with diverse conversational patterns in both general-purpose (ChatGPT and Bing Copilot) and task-oriented (customer service chatbot) conversational systems. Existing approaches based on featurized ML models or text embeddings fall short in extracting generalizable patterns and are hard to interpret. In this work, we show that LLMs can extract interpretable signals of user satisfaction from natural language utterances more effectively than embedding-based approaches. Moreover, an LLM can be tailored for USE via an iterative prompting framework using supervision from labeled examples. The resulting method, Supervised Prompting for User satisfaction Rubrics (SPUR), not only achieves higher accuracy but is also more interpretable, as it scores user satisfaction via learned rubrics with a detailed breakdown.
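The rubric-based scoring idea can be illustrated with a minimal sketch. The rubric items below and the keyword matcher standing in for an LLM judge are hypothetical stand-ins: in SPUR itself, the rubrics are learned from labeled examples and each item is evaluated by prompting an LLM, not by substring matching.

```python
# Illustrative sketch of rubric-based satisfaction scoring in the spirit of SPUR.
# Rubric items and keyword cues are invented for this example; the real method
# learns rubrics from supervision and uses an LLM to judge each item.

SAT_RUBRIC = [
    ("expresses thanks or praise", ["thank", "great", "perfect"]),
    ("confirms the answer helped", ["that worked", "solved"]),
]
DSAT_RUBRIC = [
    ("repeats or rephrases the request", ["again", "i said"]),
    ("expresses frustration", ["wrong", "useless", "not what i asked"]),
]

def score_utterance(utterance, rubric):
    """Count rubric items matched; an LLM judge would decide each item instead."""
    text = utterance.lower()
    return sum(any(kw in text for kw in kws) for _, kws in rubric)

def satisfaction_score(user_utterances):
    """Aggregate SAT minus DSAT rubric hits into one signed score with a
    per-item breakdown available for interpretation."""
    sat = sum(score_utterance(u, SAT_RUBRIC) for u in user_utterances)
    dsat = sum(score_utterance(u, DSAT_RUBRIC) for u in user_utterances)
    return sat - dsat

convo = ["How do I reset my password?", "That worked, thank you!"]
print(satisfaction_score(convo))  # → 2 (two SAT items matched, no DSAT items)
```

Because the score is a sum over named rubric items, each prediction comes with the detailed breakdown the abstract describes, rather than an opaque embedding distance.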

Authors (17)
  1. Ying-Chun Lin
  2. Jennifer Neville
  3. Jack W. Stokes
  4. Longqi Yang
  5. Tara Safavi
  6. Mengting Wan
  7. Scott Counts
  8. Siddharth Suri
  9. Reid Andersen
  10. Xiaofeng Xu
  11. Deepak Gupta
  12. Sujay Kumar Jauhar
  13. Xia Song
  14. Georg Buscher
  15. Saurabh Tiwary
  16. Brent Hecht
  17. Jaime Teevan
Citations (4)