Large Language Models are Zero-Shot Rankers for Recommender Systems (2305.08845v2)

Published 15 May 2023 in cs.IR and cs.CL

Abstract: Recently, LLMs (e.g., GPT-4) have demonstrated impressive general-purpose task-solving abilities, including the potential to approach recommendation tasks. Along this line of research, this work aims to investigate the capacity of LLMs that act as the ranking model for recommender systems. We first formalize the recommendation problem as a conditional ranking task, considering sequential interaction histories as conditions and the items retrieved by other candidate generation models as candidates. To solve the ranking task by LLMs, we carefully design the prompting template and conduct extensive experiments on two widely-used datasets. We show that LLMs have promising zero-shot ranking abilities but (1) struggle to perceive the order of historical interactions, and (2) can be biased by popularity or item positions in the prompts. We demonstrate that these issues can be alleviated using specially designed prompting and bootstrapping strategies. Equipped with these insights, zero-shot LLMs can even challenge conventional recommendation models when ranking candidates are retrieved by multiple candidate generators. The code and processed datasets are available at https://github.com/RUCAIBox/LLMRank.

Evaluation of Zero-Shot Ranking by LLMs for Recommender Systems

The paper investigates the efficacy of LLMs, such as GPT-4, in functioning as zero-shot rankers within recommender systems. The approach leverages the general task-solving ability of LLMs without any additional training by recasting the recommendation task as a conditional ranking problem, and the paper examines both the capacities and the limitations of employing LLMs for ranking in recommender systems.

The authors formalize the recommendation problem so that sequential interaction histories are treated as conditions and the items retrieved by other candidate generation models are viewed as candidates. The recommendation task thus becomes a conditional ranking task, in which LLMs are expected to rank candidates using only their intrinsic knowledge. Through carefully constructed natural language prompts, the research examines whether LLMs can utilize historical user behaviors and understand user-item relationships well enough to rank effectively.
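
As a concrete illustration of this setup, the following minimal Python sketch renders a user's interaction history (the condition) and a retrieved candidate list into a single ranking prompt. The item titles, the template wording, and the call_llm helper are illustrative assumptions, not the exact prompt format from the LLMRank repository.

```python
# Minimal sketch of recommendation as a conditional ranking prompt.
# The template wording, item titles, and call_llm() are illustrative assumptions,
# not the exact prompt used in the LLMRank repository.

def build_ranking_prompt(history_titles, candidate_titles):
    """Render the interaction history (condition) and candidates into one prompt."""
    history = "\n".join(f"{i + 1}. {t}" for i, t in enumerate(history_titles))
    candidates = "\n".join(
        f"{chr(ord('A') + i)}. {t}" for i, t in enumerate(candidate_titles)
    )
    return (
        "I've watched the following movies in this order:\n"
        f"{history}\n\n"
        f"Now there are {len(candidate_titles)} candidate movies:\n"
        f"{candidates}\n\n"
        "Please rank all candidate movies by how likely I am to watch them next, "
        "most likely first. Answer with the candidate letters only."
    )

prompt = build_ranking_prompt(
    ["Toy Story", "The Matrix", "Inception"],           # sequential history (condition)
    ["Interstellar", "Finding Nemo", "Blade Runner"],   # items from a candidate generator
)
# ranking = call_llm(prompt)  # hypothetical helper wrapping whichever LLM API is used
```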

Extensive experiments are conducted over two popular datasets using specifically designed prompting strategies. The findings are consolidated into key observations regarding the performance of LLMs as zero-shot rankers:

  1. Order Perception Challenges: LLMs struggle to perceive the order of historical interactions on their own. Prompting strategies that explicitly emphasize recent interactions were therefore devised, improving ranking outcomes over the baseline.
  2. Bias Issues: The order in which candidates are presented significantly affects LLM ranking performance, indicating a position bias, and a tendency to recommend popular items (popularity bias) was also observed. To mitigate these biases, strategies such as bootstrapping over shuffled candidate orders and tailored prompting were proposed, making the rankings more robust (see the sketch after this list).
  3. Effective Zero-Shot Ranking: The LLMs demonstrated promising zero-shot ranking capabilities, particularly when candidates were retrieved by multiple candidate generation models, suggesting applicability in realistic multi-retriever settings. The result indicates that LLMs can leverage intrinsic knowledge from textual item features for ranking.
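
The bootstrapping idea mentioned for position bias can be sketched as follows: rank the same candidate set several times under shuffled orders and aggregate by average position. The rank_candidates_once callable stands in for a single LLM ranking call and is an assumption of this sketch, not an API from the paper's code.

```python
# Sketch of bootstrapping against position bias: repeat the ranking with shuffled
# candidate orders and aggregate by average position. rank_candidates_once() is a
# hypothetical wrapper around one LLM ranking call that returns a full permutation.
import random
from collections import defaultdict

def bootstrap_rank(candidates, rank_candidates_once, rounds=3, seed=0):
    rng = random.Random(seed)
    position_sums = defaultdict(float)
    for _ in range(rounds):
        shuffled = list(candidates)
        rng.shuffle(shuffled)                      # present a new candidate order
        ranking = rank_candidates_once(shuffled)   # LLM returns items, best first
        for pos, item in enumerate(ranking):
            position_sums[item] += pos
    # A lower average position across shuffles indicates an order-robust preference.
    return sorted(candidates, key=lambda item: position_sums[item] / rounds)
```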

A battery of experiments indicated that larger LLMs, such as GPT-3.5 and GPT-4, outperformed other zero-shot recommendation methods by a substantial margin and even competed with conventional models trained specifically on the target datasets. This supports the premise that pre-trained LLMs hold substantial, largely untapped potential for recommendation tasks.
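
For context on how such ranking comparisons are typically scored, here is a minimal NDCG@K sketch under the common leave-one-out protocol with a single ground-truth next item per user; the helper and example values are illustrative, not taken from the paper's evaluation code.

```python
# Minimal NDCG@K sketch for a single ground-truth next item (leave-one-out style).
# With one relevant item, NDCG@K reduces to 1 / log2(rank + 1) if the target is in
# the top K, and 0 otherwise.
import math

def ndcg_at_k(ranked_items, target_item, k=10):
    top_k = ranked_items[:k]
    if target_item not in top_k:
        return 0.0
    rank = top_k.index(target_item) + 1   # 1-based position of the ground-truth item
    return 1.0 / math.log2(rank + 1)

# Example: the target ranked 3rd -> NDCG@10 = 1 / log2(4) = 0.5
print(ndcg_at_k(["a", "b", "c", "d"], "c", k=10))
```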

The paper situates its findings within the broader context of transfer learning for recommender systems, illustrating that LLMs have capabilities beyond narrow-domain tasks thanks to their pre-training on vast corpora of language data. It underscores limitations of traditional recommendation models that LLMs could potentially ameliorate, particularly in cases where matching users to candidate items requires broader background knowledge.

While this work sheds light on leveraging LLMs' capabilities in the recommendation domain, challenges remain, such as computational overhead and biases inherited from LLM training corpora. Future directions could include mechanisms that let LLMs incorporate user feedback to refine recommendations and hybrid models that integrate LLMs with traditional system architectures for improved, scalable performance.

In conclusion, the paper provides foundational insights that could be pivotal in evolving recommender systems into adaptable, context-aware engines capable of leveraging large volumes of semantic data without necessitating extensive re-training. This research opens avenues for advancing recommender system designs utilizing sophisticated LLM capabilities, marking a substantive exploration into AI-driven personalization.

Authors (7)
  1. Yupeng Hou (33 papers)
  2. Junjie Zhang (79 papers)
  3. Zihan Lin (22 papers)
  4. Hongyu Lu (29 papers)
  5. Ruobing Xie (97 papers)
  6. Julian McAuley (238 papers)
  7. Wayne Xin Zhao (196 papers)
Citations (228)