Where to Go Next for Recommender Systems? ID- vs. Modality-based Recommender Models Revisited (2303.13835v4)

Published 24 Mar 2023 in cs.IR

Abstract: Recommendation models that utilize unique identities (IDs) to represent distinct users and items have been state-of-the-art (SOTA) and dominated the recommender systems (RS) literature for over a decade. Meanwhile, the pre-trained modality encoders, such as BERT and ViT, have become increasingly powerful in modeling the raw modality features of an item, such as text and images. Given this, a natural question arises: can a purely modality-based recommendation model (MoRec) outperform or match a pure ID-based model (IDRec) by replacing the itemID embedding with a SOTA modality encoder? In fact, this question was answered ten years ago, when IDRec beat MoRec by a strong margin in both recommendation accuracy and efficiency. We aim to revisit this 'old' question and systematically study MoRec from several aspects. Specifically, we study several sub-questions: (i) which recommendation paradigm, MoRec or IDRec, performs better in practical scenarios, especially in the general setting and warm item scenarios where IDRec has a strong advantage? Does this hold for items with different modality features? (ii) can the latest technical advances from other communities (i.e., natural language processing and computer vision) translate into accuracy improvement for MoRec? (iii) how to effectively utilize item modality representation: can we use it directly or do we have to adjust it with new data? (iv) are there some key challenges for MoRec to be solved in practical applications? To answer them, we conduct rigorous experiments for item recommendations with two popular modalities, i.e., text and vision. We provide the first empirical evidence that MoRec is already comparable to its IDRec counterpart with an expensive end-to-end training method, even for warm item recommendation. Our results potentially imply that the dominance of IDRec in the RS field may be greatly challenged in the future.

Citations (121)


Summary

The paper compares IDRec with MoRec, showing that modality encoders like RoBERTa and Swin Transformers achieve competitive performance with SASRec backbones.
The paper demonstrates that sequential neural networks, particularly SASRec, surpass DSSM by effectively leveraging user interaction sequences to boost accuracy.
The paper highlights challenges such as high training costs and convergence issues, pointing to future research in optimized integrations of ID and modality features.

Revisiting ID- vs. Modality-Based Recommender Models: Implications and Future Directions

The paper "Where to Go Next for Recommender Systems? ID- vs. Modality-based Recommender Models Revisited" examines ID-based recommender models (IDRec) against modality-based recommender models (MoRec) in contemporary recommender systems (RS). The comparison concentrates on the general setting and on warm items, rather than on the cold-start setting where modality-based approaches have traditionally excelled.

Comparative Analysis: IDRec and MoRec

The authors methodically analyze longstanding questions in the RS domain by juxtaposing IDRec, which has been state-of-the-art for over a decade, with MoRec, which leverages modality-specific features through pre-trained encoders like BERT for text and Vision Transformers for images. Utilizing large-scale real-world datasets and two predominant RS architectures—DSSM and SASRec—the paper investigates the capability of MoRec to challenge IDRec's dominance, specifically in scenarios that do not solely focus on the cold-start setting.

Key observations from the empirical studies include:

SASRec Superiority: SASRec-based architectures surpass DSSM-based systems consistently, demonstrating the significance of user interaction sequences in improving recommendation accuracy. This performance disparity underscores the importance of utilizing sequential neural networks for exploiting modality features effectively.
MoRec Performance: MoRec models are competitive primarily with SASRec backbones. For text-based recommendation, MoRec equipped with advanced encoders like RoBERTa reaches parity with IDRec and even exceeds it in some scenarios. For image-based recommendation, MoRec achieves results comparable to IDRec, provided it uses high-performing vision encoders like Swin Transformers.
Warm-Start Scenarios: In warm-start settings, where items have numerous interactions, IDRec's advantage remains evident. Nonetheless, MoRec achieves comparable performance, which highlights its potential beyond the traditionally emphasized cold-start contexts.
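
To make the IDRec/MoRec distinction concrete, the sketch below contrasts the two ways of producing an item vector: a lookup in a per-ID table versus a function of the item's raw text. It is only an illustration under stated assumptions, not the paper's implementation: `ToyTextItemEncoder` is a hypothetical stand-in that averages hash-seeded token vectors where a real MoRec would run a pre-trained encoder such as BERT or RoBERTa, and all class and function names here are invented for the example.

```python
import hashlib
import random

DIM = 8  # illustrative embedding size, not a value from the paper


def _seeded_vector(key, dim=DIM):
    """Deterministic pseudo-random vector derived from a string key."""
    seed = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


class IdItemEncoder:
    """IDRec-style item tower: one vector per known item ID."""

    def __init__(self, item_ids):
        self.table = {i: _seeded_vector("id:" + i) for i in item_ids}

    def encode(self, item):
        # Raises KeyError for items that were never seen during training.
        return self.table[item["id"]]


class ToyTextItemEncoder:
    """MoRec-style item tower (toy stand-in for a pre-trained text encoder)."""

    def encode(self, item):
        tokens = item["title"].lower().split()
        vecs = [_seeded_vector("tok:" + t) for t in tokens]
        # Mean-pool the token vectors into a single item vector.
        return [sum(col) / len(vecs) for col in zip(*vecs)]


def score(user_vec, item_vec):
    """Dot-product relevance score shared by both paradigms."""
    return sum(u * v for u, v in zip(user_vec, item_vec))


if __name__ == "__main__":
    seen = {"id": "a", "title": "red running shoes"}
    unseen = {"id": "z", "title": "blue running shoes"}
    user = _seeded_vector("user:demo")

    id_encoder = IdItemEncoder([seen["id"]])
    text_encoder = ToyTextItemEncoder()

    print("IDRec score (seen item):   ", score(user, id_encoder.encode(seen)))
    print("MoRec score (seen item):   ", score(user, text_encoder.encode(seen)))
    print("MoRec score (unseen item): ", score(user, text_encoder.encode(unseen)))
```

The only component that changes between the two paradigms is the item tower; the user side and the scoring function are left untouched, which mirrors the paper's framing of replacing the itemID embedding with a modality encoder.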

Technological and Methodological Implications

The potential of MoRec to harness advancements from NLP and computer vision (CV) is explored further. The paper identifies that:

Scaling Effects: Larger model variants of modality encoders deliver improved performance in RS tasks, consistent with trends observed in NLP and CV.
Pre-Training Advantage: Pre-trained modality encoders outperform their randomly initialized counterparts. The improvement is particularly apparent for image recommendations, indicating that pre-training yields features that are useful for RS tasks.
Transferability of Representations: Despite the potential of pre-trained encoders, current modality representations may lack the universality required for RS tasks. Specifically, the traditional two-stage (TS) pipeline, in which item features are extracted once and then kept fixed, falls substantially short of end-to-end (E2E) training, in which the encoder is adjusted together with the recommender.
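
The TS versus E2E contrast can be sketched with a deliberately tiny model. The code below is an assumption-laden toy, not the paper's training setup: a small matrix `W` plays the role of the modality encoder, a vector `u` plays the role of the user representation, and `train_step` is a hypothetical helper that applies one gradient step on a logistic loss. The only difference between the two modes is whether the encoder weights receive an update.

```python
import math


def matvec(W, x):
    """Apply the toy 'encoder' W to raw item features x."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def train_step(W, u, x, y, lr, update_encoder):
    """One gradient step on the logistic loss for a single (user, item, label).

    update_encoder=False mimics the two-stage pipeline (frozen item features);
    update_encoder=True mimics end-to-end training (encoder is also adjusted).
    """
    z = matvec(W, x)               # item representation
    g = sigmoid(dot(u, z)) - y     # d(loss)/d(score)
    new_u = [ui - lr * g * zi for ui, zi in zip(u, z)]
    if update_encoder:
        new_W = [[w - lr * g * ui * xj for w, xj in zip(row, x)]
                 for row, ui in zip(W, u)]
    else:
        new_W = [row[:] for row in W]  # encoder stays exactly as it was
    return new_W, new_u


if __name__ == "__main__":
    W0 = [[0.5, -0.2], [0.1, 0.3]]
    u0 = [0.2, -0.1]
    x, y = [1.0, 2.0], 1.0  # one positive interaction

    W_ts, u_ts = train_step(W0, u0, x, y, lr=0.1, update_encoder=False)
    W_e2e, u_e2e = train_step(W0, u0, x, y, lr=0.1, update_encoder=True)

    print("score before:", dot(u0, matvec(W0, x)))
    print("score TS:    ", dot(u_ts, matvec(W_ts, x)))
    print("score E2E:   ", dot(u_e2e, matvec(W_e2e, x)))
```

In a real system the E2E branch backpropagates through a large pre-trained encoder for every item in every batch, which is where the extra training cost reported in the paper comes from.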

Challenges and Future Prospects

While the research provides robust insights, several challenges persist for E2E-based MoRec, in particular its high training cost and its convergence difficulties, including model collapse under improper hyper-parameter tuning. These point to future research on computational efficiency and on improved architectures or fusion strategies.

One promising pathway involves devising more sophisticated methods for integrating ID and modality features to complement their respective strengths. Further, investigating transfer learning approaches for modality representations could foster the development of more generalizable recommender systems suited for diverse applications across domains.
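
As a purely hypothetical illustration of what integrating ID and modality features could look like at its simplest, the sketch below blends the two item vectors with a gate that grows with the item's interaction count. Neither `fuse` nor `gate_from_interactions` comes from the paper, and the `half_point` parameter is an invented knob; the summary above calls for more sophisticated methods than this.

```python
def fuse(id_vec, mod_vec, gate):
    """Convex combination of an ID vector and a modality vector.

    gate=1.0 returns the pure ID vector; gate=0.0 the pure modality vector.
    """
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must lie in [0, 1]")
    if len(id_vec) != len(mod_vec):
        raise ValueError("vectors must have the same length")
    return [gate * a + (1.0 - gate) * b for a, b in zip(id_vec, mod_vec)]


def gate_from_interactions(n_interactions, half_point=20):
    """Hypothetical heuristic: rely more on the ID vector for warmer items.

    half_point is the interaction count at which both sources weigh equally.
    """
    return n_interactions / (n_interactions + half_point)


if __name__ == "__main__":
    id_vec = [1.0, 0.0]
    mod_vec = [0.0, 1.0]
    for n in (0, 20, 200):
        g = gate_from_interactions(n)
        print(n, g, fuse(id_vec, mod_vec, g))
```

A cold item with no interactions falls back entirely on its modality vector here, while a heavily interacted item leans on its ID vector; a learned gate would be the natural next step beyond this fixed rule.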

Overall, the findings suggest that while ID-based systems remain relevant, advances in pre-trained text and vision encoders make modality-based approaches viable contenders. The work encourages RS research to track developments in the broader AI field more closely, which could eventually shift the dominant recommendation paradigm.

GitHub

GitHub - westlake-repl/IDvs.MoRec: Recommender Systems, Modality-based Recommendation, Text Recommendation, Vision Recommendation (150 stars)