- The paper presents an extensive taxonomy and evaluation framework for multimodal recommender systems, outlining diverse modeling techniques.
- It methodically reviews models from matrix factorization to graph neural networks and self-supervised learning, highlighting their advantages and challenges.
- The survey emphasizes future research needs, including effective modality fusion and standardized evaluation practices for real-world applications.
A Comprehensive Survey on Multimodal Recommender Systems: Taxonomy, Evaluation, and Future Directions
This paper presents an extensive survey of multimodal recommender systems (MMRec), highlighting key aspects such as taxonomy, evaluation metrics, and future research directions. The authors aim to consolidate current advancements and provide a structured overview to aid researchers entering the field.
Overview of Multimodal Recommender Systems
Multimodal recommender systems leverage data from various modalities like text, image, and audio to enhance recommendation performance. Traditional recommender systems often depend on collaborative filtering and content-based methods, which struggle with sparse interaction data and cold-start scenarios. In contrast, MMRec systems integrate auxiliary multimodal information to enrich user and item representations, alleviating sparsity and revealing preference signals that interaction data alone cannot capture.
Taxonomy and Classification
The authors classify MMRec models into several categories based on the methods employed: Matrix Factorization, Multilayer Perceptron, (Variational) Autoencoder, Attention Networks, Graph Neural Networks (GNN), Self-supervised Learning, and Pretraining. The paper provides detailed insights into each class, discussing specific models and their unique approaches to utilizing multimodal data.
- Matrix Factorization Models: These models, such as VBPR, incorporate visual features into item representations using linear transformations and concatenate them with ID embeddings, employing Matrix Factorization for preference prediction.
- Deep Learning Approaches: Including CNNs, attention mechanisms, and RNNs, these methods capture user preferences and item semantics in richer detail. For instance, attention mechanisms can weight different item aspects according to each user's specific preferences.
- Graph Neural Networks: GNNs represent interaction data as graphs, enabling multi-hop neighborhood aggregation. Variants like MMGCN and DualGNN improve performance by modeling both user-item interactions and inter-item relations.
- Self-supervised Learning and Pretraining: These methods aim to enhance feature representations by leveraging inherent data characteristics without explicit labels. BM3 and MMGCL exemplify such approaches, applying contrastive learning across modalities to refine learned representations.
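The VBPR-style scoring described above can be sketched as follows: the preference score is the usual ID-embedding dot product plus a visual term, where a linear projection maps pretrained CNN features into a visual latent space. This is a minimal illustration with random weights, not the trained model; all variable names and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

n_users, n_items = 100, 50
d_id, d_visual, d_proj = 16, 512, 16

# Latent ID embeddings learned from interactions (random here for brevity).
user_id_emb = rng.normal(size=(n_users, d_id))
item_id_emb = rng.normal(size=(n_items, d_id))

# Users' visual preference factors, plus a linear projection E that maps
# pretrained CNN features (e.g. 512-d) into the visual latent space.
user_vis_emb = rng.normal(size=(n_users, d_proj))
E = rng.normal(size=(d_visual, d_proj))
item_cnn_feat = rng.normal(size=(n_items, d_visual))

def vbpr_score(u, i):
    """Preference score: ID-embedding dot product plus a visual term."""
    visual_item = item_cnn_feat[i] @ E  # project the CNN feature
    return user_id_emb[u] @ item_id_emb[i] + user_vis_emb[u] @ visual_item

scores = np.array([vbpr_score(0, i) for i in range(n_items)])
top5 = np.argsort(-scores)[:5]  # top-5 recommendations for user 0
```

In the full model these parameters are learned jointly with a pairwise ranking loss (BPR); the sketch only shows how the visual pathway augments the ID-based score.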
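The neighborhood aggregation behind the GNN-based models above can be sketched as one degree-normalized propagation step over the user-item interaction matrix. This is a generic illustration of message passing on the bipartite graph, not the exact MMGCN or DualGNN architecture; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n_users, n_items, d = 4, 6, 8

# Binary user-item interaction matrix R (1 = user interacted with item).
R = (rng.random((n_users, n_items)) < 0.4).astype(float)

user_emb = rng.normal(size=(n_users, d))
item_emb = rng.normal(size=(n_items, d))

def propagate(user_emb, item_emb, R):
    """One round of degree-normalized neighborhood aggregation:
    each user averages its items' embeddings, and vice versa."""
    du = np.maximum(R.sum(axis=1, keepdims=True), 1)  # user degrees
    di = np.maximum(R.sum(axis=0, keepdims=True), 1)  # item degrees
    new_user = (R @ item_emb) / du
    new_item = (R.T @ user_emb) / di.T
    return new_user, new_item

u1, i1 = propagate(user_emb, item_emb, R)  # 1-hop representations
```

Stacking several such steps lets multi-hop collaborative signals flow into the embeddings; multimodal variants run this propagation per modality and fuse the results.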
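The contrastive objective used by self-supervised methods in this category can be sketched with a generic InfoNCE-style loss over two "views" of the same item embeddings (e.g. two modalities or two augmentations). This is a simplified stand-in, not the exact BM3 or MMGCL objective.

```python
import numpy as np

def info_nce(view_a, view_b, temperature=0.2):
    """InfoNCE loss: each row of view_a should match its own row in
    view_b (positive pair) against all other rows (negatives)."""
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    logits = (a @ b.T) / temperature  # pairwise cosine similarities
    # Softmax cross-entropy with the diagonal as the positive class.
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(1)
base = rng.normal(size=(8, 16))
# Two lightly perturbed "views" of the same embeddings align well...
loss_aligned = info_nce(base + 0.01 * rng.normal(size=base.shape), base)
# ...while unrelated embeddings do not.
loss_random = info_nce(rng.normal(size=base.shape), base)
```

Minimizing such a loss pulls representations of the same item across modalities together while pushing apart different items, which is the core mechanism these methods use to refine features without labels.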
Evaluation and Challenges
The survey highlights common evaluation metrics such as Recall, NDCG, and MAP, crucial for gauging recommendation accuracy. However, the paper notes the need for standardized datasets and evaluation protocols, since data splitting strategies vary across studies. This inconsistency complicates fair comparison between models and weakens conclusions about real-world applicability.
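The top-k metrics named above can be computed as in the following minimal sketch, assuming binary relevance (an item is either in the user's held-out set or not); the function names are illustrative.

```python
import numpy as np

def recall_at_k(ranked_items, relevant, k):
    """Fraction of the user's relevant items found in the top-k ranking."""
    hits = len(set(ranked_items[:k]) & relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked_items, relevant, k):
    """Normalized DCG with binary relevance and log2 rank discounting."""
    dcg = sum(1.0 / np.log2(rank + 2)
              for rank, item in enumerate(ranked_items[:k])
              if item in relevant)
    ideal = sum(1.0 / np.log2(rank + 2)
                for rank in range(min(len(relevant), k)))
    return dcg / ideal

ranked = [3, 1, 7, 5, 2]   # one user's model-produced ranking
relevant = {1, 5, 9}       # that user's held-out ground-truth items
r = recall_at_k(ranked, relevant, 5)  # 2 of 3 relevant items retrieved
n = ndcg_at_k(ranked, relevant, 5)    # discounted by their ranks
```

Reported numbers are averages of such per-user scores over the test set, which is exactly why inconsistent data splits make cross-paper comparisons unreliable.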
Future Directions
The paper identifies several challenges and potential research avenues:
- Effective Modality Fusion: Determining optimal ways to combine multimodal data without losing modality-specific information remains an open research question.
- Standardization of Data and Evaluation: Establishing common datasets and evaluation metrics can greatly facilitate comparability and robustness across studies.
- Cross-domain and Sequential Recommendations: Leveraging multimodal data for cross-domain scenarios and capturing user behavior sequences can enhance recommendation systems' adaptability and accuracy in diverse real-world settings.
Conclusion
Through a methodically organized review, this paper provides a comprehensive understanding of the state-of-the-art in multimodal recommender systems. It underscores the necessity for innovative techniques in modality fusion and calls for standardized evaluation practices to advance this dynamic field. The survey also highlights the potential of multimodal approaches to transform recommendation systems, delivering richer, more personalized user experiences.