
Beyond Utility: Evaluating LLM as Recommender (2411.00331v1)

Published 1 Nov 2024 in cs.IR

Abstract: With the rapid development of LLMs, recent studies employed LLMs as recommenders to provide personalized information services for distinct users. Despite efforts to improve the accuracy of LLM-based recommendation models, relatively little attention is paid to beyond-utility dimensions. Moreover, there are unique evaluation aspects of LLM-based recommendation models, which have been largely ignored. To bridge this gap, we explore four new evaluation dimensions and propose a multidimensional evaluation framework. The new evaluation dimensions include: 1) history length sensitivity, 2) candidate position bias, 3) generation-involved performance, and 4) hallucinations. All four dimensions have the potential to impact performance, but are largely unnecessary for consideration in traditional systems. Using this multidimensional evaluation framework, along with traditional aspects, we evaluate the performance of seven LLM-based recommenders, with three prompting strategies, comparing them with six traditional models on both ranking and re-ranking tasks on four datasets. We find that LLMs excel at handling tasks with prior knowledge and shorter input histories in the ranking setting, and perform better in the re-ranking setting, beating traditional models across multiple dimensions. However, LLMs exhibit substantial candidate position bias issues, and some models hallucinate non-existent items much more often than others. We intend our evaluation framework and observations to benefit future research on the use of LLMs as recommenders. The code and data are available at https://github.com/JiangDeccc/EvaLLMasRecommender.

Evaluating LLMs in Recommender Systems: A Multidimensional Framework

The paper "Beyond Utility: Evaluating LLM as Recommender" addresses the evolving role of LLMs within Recommender Systems (RSs). As LLMs like GPT, Claude, and Llama demonstrate significant prowess across diverse NLP tasks, their applicability as recommenders is increasingly being explored. However, conventional evaluations of recommender systems primarily focus on accuracy, leaving other crucial dimensions underexplored when it comes to LLMs. This paper introduces a multidimensional evaluation framework tailored to identify specific LLM-related characteristics in RS applications, going beyond traditional evaluation dimensions.

Multidimensional Evaluation Framework

The paper proposes a comprehensive evaluation framework that covers traditional dimensions such as utility and novelty, together with four novel dimensions specific to LLMs: history length sensitivity, candidate position bias, generation-involved performance, and hallucinations. This framework aims to provide a holistic understanding of the capabilities and limitations of LLMs when deployed in RSs.

  1. History Length Sensitivity: This dimension examines how the length of the user-history input affects the performance of LLM-based recommenders. The paper finds that LLMs perform comparatively well in cold-start scenarios, owing to their ability to draw on world knowledge: even with minimal user data they deliver competitive results, while they benefit less from long histories than traditional models do (a measurement sketch follows this list).
  2. Candidate Position Bias: LLMs exhibit a notable bias towards items placed near the start of a candidate list, an issue largely irrelevant to traditional models, which score candidates independently of their order. The paper quantifies this bias, discusses its detrimental impact on recommendation accuracy, and advocates further methodological work to mitigate it (a simple probe is sketched after this list).
  3. Generation-Involved Performance: By generating rich textual user profiles, LLMs can offer explainable recommendations. This dimension evaluates the effect of incorporating such generative capabilities into the recommendation pipeline. Profiling enhances explainability, but its benefit to accuracy varies: when long raw histories are available, feeding them in directly often outperforms condensed profiles.
  4. Hallucinations: The paper also measures hallucinations, where LLMs recommend items that do not exist in the catalog. While the incidence is generally below 5%, even occasional hallucinations degrade the user experience, so robust item-mapping techniques are needed to filter them out (a catalog-matching check is sketched below).
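
The history-length dimension (item 1) can be probed with a simple sweep: truncate each user's history to its n most recent interactions and track a utility metric as n grows. Below is a minimal Python sketch; the `recommend_topk` interface, the `(history, target, candidates)` test-case shape, and the Hit@10 cutoff are illustrative assumptions, not the paper's exact protocol.

```python
def history_length_curve(recommend_topk, test_cases, lengths=(1, 3, 5, 10, 20)):
    """Measure Hit@10 as a function of how much recent history the model sees.

    recommend_topk(history, candidates) is assumed to return the candidate
    list reordered by the model, best first (hypothetical interface).
    test_cases is a list of (full_history, target_item, candidate_list).
    """
    curve = {}
    for n in lengths:
        hits = 0
        for history, target, candidates in test_cases:
            truncated = history[-n:]  # keep only the n most recent interactions
            ranked = recommend_topk(truncated, candidates)
            hits += int(target in ranked[:10])
        curve[n] = hits / len(test_cases)
    # A history-insensitive model yields a flat curve; per the paper, LLM
    # recommenders tend to do comparatively better at the short end.
    return curve
```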
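
Candidate position bias (item 2) can be quantified without any model internals: plant the ground-truth item at each slot of the candidate list in turn and compare per-slot hit rates. A sketch under the same assumed ranking interface:

```python
import random
from collections import defaultdict

def position_bias_probe(recommend, test_cases, list_size=20):
    """Estimate how often the ground-truth item is ranked first when it is
    planted at each slot of the candidate list.

    recommend(history, candidates) is assumed to return the candidates
    reordered by the model; each negatives pool must contain at least
    list_size - 1 items the user never interacted with.
    """
    hits, counts = defaultdict(int), defaultdict(int)
    for history, target, negatives in test_cases:
        fillers = random.sample(negatives, list_size - 1)
        for slot in range(list_size):
            candidates = fillers[:slot] + [target] + fillers[slot:]
            ranked = recommend(history, candidates)
            counts[slot] += 1
            hits[slot] += int(ranked[0] == target)
    # An order-insensitive recommender produces a flat profile; a spike at
    # slot 0 is the bias toward early candidates described in the paper.
    return {slot: hits[slot] / counts[slot] for slot in sorted(counts)}
```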
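
Hallucination measurement (item 4) reduces to a membership test against the item catalog; fuzzy matching separates minor formatting drift from genuinely non-existent items. A standard-library sketch follows; the 0.9 similarity cutoff is an assumption, not the paper's setting.

```python
import difflib

def hallucination_rate(generated_titles, catalog_titles, cutoff=0.9):
    """Fraction of generated recommendations that match no catalog item.

    A title counts as grounded if difflib finds a catalog entry with a
    similarity ratio of at least `cutoff`; anything below is treated as
    hallucinated (the cutoff is an illustrative choice).
    """
    catalog = [t.lower().strip() for t in catalog_titles]
    misses = 0
    for title in generated_titles:
        match = difflib.get_close_matches(title.lower().strip(), catalog,
                                          n=1, cutoff=cutoff)
        if not match:
            misses += 1
    return misses / max(len(generated_titles), 1)
```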

Implications and Future Directions

The empirical evaluation, covering prominent LLMs such as GPT-4o and Claude-3, reveals nuanced strengths and weaknesses: LLMs excel on tasks where prior world knowledge helps and perform robustly in cold-start situations, but they fall short on longer user histories and exhibit significant candidate position bias. The paper suggests that LLM-powered RSs can surpass traditional models, especially in re-ranking tasks, by leveraging inherent LLM strengths such as world knowledge and generative capability.

While LLMs show promise for enhancing RSs, the paper identifies several future research directions. Addressing candidate position bias, refining hallucination mitigation strategies, and optimizing the integration of user profiles could significantly improve the performance and reliability of LLM-based RSs. Furthermore, fine-tuning LLMs on recommendation data may help close the remaining gap in collaborative-filtering ability.

This comprehensive framework not only facilitates the evaluation of current LLM implementations in RSs but also sets the foundation for future research, encouraging the development of more refined, efficient, and user-centric recommendation solutions powered by LLMs. As the field progresses, such multidimensional evaluation frameworks will become essential in comparing, adapting, and ultimately harnessing the full potential of LLMs in diverse real-world applications.

Authors (7)
  1. Chumeng Jiang
  2. Jiayin Wang
  3. Weizhi Ma
  4. Charles L. A. Clarke
  5. Shuai Wang
  6. Chuhan Wu
  7. Min Zhang