Concept -- An Evaluation Protocol on Conversational Recommender Systems with System-centric and User-centric Factors (2404.03304v3)
Abstract: The conversational recommendation system (CRS) has been criticized regarding its user experience in real-world scenarios, despite recent significant progress achieved in academia. Existing evaluation protocols for CRS may prioritize system-centric factors such as effectiveness and fluency in conversation while neglecting user-centric aspects. Thus, we propose a new and inclusive evaluation protocol, Concept, which integrates both system- and user-centric factors. We conceptualise three key characteristics in representing such factors and further divide them into six primary abilities. To implement Concept, we adopt an LLM-based user simulator and evaluator with scoring rubrics that are tailored for each primary ability. Our protocol, Concept, serves a dual purpose. First, it provides an overview of the pros and cons of current CRS models. Second, it pinpoints the problem of low usability in the "omnipotent" ChatGPT and offers a comprehensive reference guide for evaluating CRS, thereby setting the foundation for CRS improvement.
Authors: Chen Huang, Peixin Qin, Yang Deng, Wenqiang Lei, Jiancheng Lv, Tat-Seng Chua