How to Index Item IDs for Recommendation Foundation Models
The paper "How to Index Item IDs for Recommendation Foundation Models" explores the critical role of item indexing in enhancing the performance of recommendation systems that utilize foundation models such as LLMs. These models transform recommendation tasks into natural language tasks, directly generating recommended items instead of employing conventional ranking mechanisms. A significant challenge in this approach is ensuring that the generated text corresponds to actual items in the database, thereby addressing the hallucination problem—where the model generates non-existent items. This necessitates the creation of unique, LLM-compatible item IDs.
The authors assess item indexing methods in the context of the P5 recommendation foundation model, which converts recommendation tasks into language generation tasks by pre-training an LLM on a diverse collection of personalized prompts built from recommendation data. The paper first reviews several trivial indexing methods and demonstrates their limitations:
- Random Indexing (RID): Assigns each item a random number that is then split into sub-word tokens, so unrelated items can share tokens purely by chance, injecting noisy, arbitrary relationships into training.
- Title Indexing (TID): Uses item titles as IDs; titles can be long (stretching the generation target), and unrelated items may share title words, introducing spurious semantic overlap that does not reflect actual user behavior.
- Independent Indexing (IID): Assigns each item its own new token (and embedding), avoiding the spurious overlaps of RID and TID, but encoding no prior relationship between items, so the vocabulary grows with the catalog and every item embedding must be learned from scratch (the sketch after this list contrasts RID and IID at the tokenizer level).
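A minimal sketch of the difference between RID and IID at the tokenizer level, assuming the Hugging Face `transformers` T5 tokenizer (P5 is T5-based); the item IDs themselves are made up for illustration, and the exact sub-word splits depend on the tokenizer.

```python
# pip install transformers sentencepiece
from transformers import T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-small")

# RID: random numeric IDs are split into sub-word pieces, so two unrelated
# items can accidentally share tokens (here, a shared piece for "89").
print(tok.tokenize("item_8912"))    # e.g. ['▁item', '_', '89', '12']
print(tok.tokenize("item_8934"))    # e.g. ['▁item', '_', '89', '34']

# IID: register one brand-new token per item, so no pieces are shared between
# items -- but the vocabulary grows with the catalog and each new embedding
# must be learned from scratch.
tok.add_tokens(["<item_8912>", "<item_8934>"])
print(tok.tokenize("<item_8912>"))  # ['<item_8912>']
```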
To improve on these, the paper introduces four non-trivial indexing methods that inject prior information into the IDs while keeping ID lengths manageable:
- Sequential Indexing (SID): Numbers items consecutively in the order they first appear as user interaction sequences are scanned, so items consumed by the same user receive nearby IDs and share tokens, capturing collaborative signal. Its results, however, depend on the order in which users and items are processed, which introduces performance variability (sketched after this list).
- Collaborative Indexing (CID): Applies spectral clustering to an item co-occurrence graph built from user interactions, grouping items into hierarchical clusters so that items that frequently co-occur share longer ID prefixes, i.e., more tokens (sketched after this list).
- Semantic (Content-based) Indexing (SemID): Builds IDs from hierarchical item metadata, using category information so that items of a similar nature share ID tokens.
- Hybrid Indexing (HID): Combines CID or SemID with IID, appending a unique per-item token to the collaborative or semantic prefix so that each ID both encodes prior structure and unambiguously distinguishes the item (SemID and HID are sketched after this list).
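A minimal sketch of sequential indexing (SID), assuming per-user interaction lists that are already sorted by time; the starting offset and the plain-integer format are illustrative choices, not necessarily those of the paper's implementation.

```python
def sequential_index(user_histories: dict[str, list[str]],
                     start: int = 1001) -> dict[str, int]:
    """Number items consecutively in order of first appearance.

    Items consumed by the same user receive nearby numbers whose string forms
    share leading digits, so co-occurring items tend to share sub-word tokens.
    """
    item_to_id: dict[str, int] = {}
    next_id = start
    for items in user_histories.values():  # user order is fixed up front
        for item in items:                  # items in interaction (time) order
            if item not in item_to_id:
                item_to_id[item] = next_id
                next_id += 1
    return item_to_id

histories = {"u1": ["shoes", "socks", "laces"], "u2": ["socks", "racket"]}
print(sequential_index(histories))
# {'shoes': 1001, 'socks': 1002, 'laces': 1003, 'racket': 1004}
```

Because the result depends on how users and their interactions are ordered before scanning, different orderings yield different IDs, which is the variability noted above.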
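A rough sketch of collaborative indexing (CID) using scikit-learn's `SpectralClustering` on an item co-occurrence matrix; the cluster count, leaf size, recursion scheme, and token spellings are assumptions for illustration rather than the paper's exact configuration.

```python
# pip install numpy scikit-learn
from itertools import combinations
import numpy as np
from sklearn.cluster import SpectralClustering

def co_occurrence(histories, items):
    """A[i, j] = number of users whose history contains both item i and item j."""
    pos = {it: i for i, it in enumerate(items)}
    A = np.zeros((len(items), len(items)))
    for hist in histories.values():
        for a, b in combinations(set(hist), 2):
            A[pos[a], pos[b]] += 1
            A[pos[b], pos[a]] += 1
    return A

def cid(items, A, n_clusters=2, max_leaf=3, prefix=()):
    """Recursively spectral-cluster items; an item's ID is its path of cluster
    tokens plus a short within-leaf index."""
    if len(items) <= max(max_leaf, n_clusters):
        return {it: prefix + (f"<leaf_{i}>",) for i, it in enumerate(items)}
    labels = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                                random_state=0).fit_predict(A + 1e-6)
    ids = {}
    for c in range(n_clusters):
        members = [i for i, lab in enumerate(labels) if lab == c]
        ids.update(cid([items[i] for i in members],
                       A[np.ix_(members, members)],
                       n_clusters, max_leaf, prefix + (f"<c{c}>",)))
    return ids

histories = {"u1": ["shoes", "socks", "laces"], "u2": ["socks", "racket"],
             "u3": ["racket", "balls"], "u4": ["shoes", "laces"]}
items = sorted({it for h in histories.values() for it in h})
print(cid(items, co_occurrence(histories, items)))
# e.g. {'laces': ('<c0>', '<leaf_0>'), 'shoes': ('<c0>', '<leaf_1>'), ...}
```

The cluster-path tokens (e.g. `<c0>`) would be added to the language model's vocabulary, so items that co-occur often end up sharing a longer ID prefix.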
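Finally, a minimal sketch of semantic indexing (SemID) from category metadata, together with a hybrid variant (HID) that ends the ID with an IID-style unique token; the within-category index, the metadata format, and the token spellings are illustrative assumptions.

```python
from collections import defaultdict

def semantic_index(categories: dict[str, list[str]]) -> dict[str, tuple[str, ...]]:
    """SemID: category-path tokens plus a within-category index to keep IDs unique."""
    ids, seen = {}, defaultdict(int)
    for item, path in categories.items():
        prefix = tuple(f"<{c}>" for c in path)
        ids[item] = prefix + (f"<{seen[prefix]}>",)
        seen[prefix] += 1
    return ids

def hybrid_index(categories: dict[str, list[str]]) -> dict[str, tuple[str, ...]]:
    """HID (SemID + IID): same category prefix, but the final token is a unique,
    independently learned per-item token."""
    return {item: tuple(f"<{c}>" for c in path) + (f"<item_{i}>",)
            for i, (item, path) in enumerate(categories.items())}

meta = {
    "trail_runner":  ["sports", "running", "shoes"],
    "road_runner":   ["sports", "running", "shoes"],
    "tennis_racket": ["sports", "tennis"],
}
print(semantic_index(meta)["road_runner"])  # ('<sports>', '<running>', '<shoes>', '<1>')
print(hybrid_index(meta)["road_runner"])    # ('<sports>', '<running>', '<shoes>', '<item_1>')
```

Items in the same category share the same prefix tokens, mirroring how CID makes co-occurring items share prefixes; the hybrid ID trades one extra vocabulary entry per item for an unambiguous final token.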
Empirical analysis on the Amazon Sports, Amazon Beauty, and Yelp datasets shows that CID, SemID, and their hybrid variants significantly outperform the trivial indexing methods as well as competitive baselines such as SASRec and S³-Rec.
These findings underscore the substantial effect that item indexing strategies have on recommendation systems built on foundation models. Encoding prior item relationships in a form compatible with LLM tokenization is a promising direction for advancing recommender systems. Future research could explore dynamic indexing mechanisms or adaptive models that adjust to evolving item catalogs, further aligning the recommendation process with the LLM paradigm.