Beyond Utility: Evaluating LLM as Recommender
Abstract: With the rapid development of LLMs, recent studies have employed LLMs as recommenders to provide personalized information services to distinct users. Despite efforts to improve the accuracy of LLM-based recommendation models, relatively little attention has been paid to beyond-utility dimensions. Moreover, LLM-based recommendation models have unique evaluation aspects, which have been largely ignored. To bridge this gap, we explore four new evaluation dimensions and propose a multidimensional evaluation framework. The new evaluation dimensions are: 1) history length sensitivity, 2) candidate position bias, 3) generation-involved performance, and 4) hallucination. All four dimensions can affect performance, yet are largely irrelevant in traditional systems. Using this multidimensional evaluation framework, together with traditional aspects, we evaluate seven LLM-based recommenders under three prompting strategies, comparing them with six traditional models on both ranking and re-ranking tasks across four datasets. We find that, in the ranking setting, LLMs excel at tasks involving prior knowledge and shorter input histories, and that they perform better in the re-ranking setting, beating traditional models across multiple dimensions. However, LLMs exhibit substantial candidate position bias, and some models hallucinate non-existent items far more often than others. We intend our evaluation framework and observations to benefit future research on the use of LLMs as recommenders. The code and data are available at https://github.com/JiangDeccc/EvaLLMasRecommender.
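Two of the proposed dimensions, candidate position bias and hallucination, lend themselves to simple operational definitions. The sketch below is illustrative only and is not the paper's actual protocol: `rank_fn` stands in for any recommender that returns a ranking over a candidate list, and the toy "first-picker" model is a hypothetical stand-in used to show what maximal position bias looks like.

```python
def hallucination_rate(recommendations, candidate_set):
    """Fraction of recommended items that do not exist in the candidate set."""
    if not recommendations:
        return 0.0
    invalid = sum(1 for item in recommendations if item not in candidate_set)
    return invalid / len(recommendations)

def position_bias(rank_fn, candidates, target):
    """Hit@1 for the target item as a function of its position in the prompt.

    rank_fn(candidates) -> ranked list of candidates. A position-insensitive
    model yields the same hit rate wherever the target appears; a biased one
    shows hit rates that vary with the target's prompt position.
    """
    others = [c for c in candidates if c != target]
    hits = []
    for pos in range(len(candidates)):
        ordered = others[:pos] + [target] + others[pos:]
        hits.append(1.0 if rank_fn(ordered)[0] == target else 0.0)
    return hits

# Toy "model" that always picks the first candidate: maximal position bias.
first_picker = lambda cands: list(cands)

cands = ["A", "B", "C", "D"]
print(position_bias(first_picker, cands, "C"))     # [1.0, 0.0, 0.0, 0.0]
print(hallucination_rate(["A", "Z"], set(cands)))  # 0.5
```

An unbiased model would produce a flat `position_bias` profile; averaging over many users and target items turns both quantities into dataset-level metrics.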