- The paper introduces PMMRec, a multi-modal recommender framework that integrates text and image data to tackle cold-start and transferability issues.
- The methodology leverages cross-modal contrastive learning and self-supervised denoising to enhance robustness and achieve superior hit ratio and NDCG performance.
- The architecture’s modular design enables independent pre-training and flexible transfer across domains, supporting various deployment environments.
Multi-Modality is All You Need for Transferable Recommender Systems
Introduction
The paper examines the inherent limitations of ID-based recommender systems, namely the cold-start problem and poor transferability, and proposes an alternative built on multi-modal data representations. Drawing on the observation that universal user interaction patterns (e.g., next-item transitions) recur across platforms and can be captured through multi-modal item embeddings, the authors introduce PMMRec, a flexible framework that leverages text and image content for cross-domain recommendation tasks.
Figure 1: An example from the HM dataset and the Bili dataset. Although the content similarities between different platforms might be low, the commonalities of universal transition patterns (e.g., next-item transition) between different platforms are still high, making it beneficial to transfer knowledge across different domains and platforms.
Architecture of PMMRec
Framework Components
PMMRec adopts a modular architecture comprising item encoders, a fusion module, and a user encoder, allowing each component to be independently pre-trained and transferred across platforms. The item encoders (a text encoder and a vision encoder) build on pre-trained models such as RoBERTa and Vision Transformer (ViT) to extract modality-specific feature embeddings. The fusion module merges these embeddings into a single multi-modal item representation, which a transformer-based user encoder then processes to model user behavior sequences.
Figure 2: The architecture of PMMRec. Item and user encoders are coupled with a multi-modal fusion module, enabling representation alignment and robustness enhancement.
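The sketch below illustrates this modular layout in PyTorch. It is a minimal reconstruction, not the paper's exact configuration: the linear `text_encoder` and `vision_encoder` stand in for the pre-trained RoBERTa/ViT backbones, and the concatenate-then-project fusion, dimensions, and layer counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Fusion(nn.Module):
    """Merges per-modality item embeddings into one multi-modal item vector."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, text_emb, img_emb):
        return self.proj(torch.cat([text_emb, img_emb], dim=-1))

class PMMRecSketch(nn.Module):
    def __init__(self, dim=256, n_layers=2, n_heads=4):
        super().__init__()
        # Stand-ins for the pre-trained RoBERTa / ViT item encoders.
        self.text_encoder = nn.Linear(768, dim)    # would wrap RoBERTa in practice
        self.vision_encoder = nn.Linear(768, dim)  # would wrap ViT in practice
        self.fusion = Fusion(dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.user_encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, text_feats, img_feats):
        # text_feats, img_feats: (batch, seq_len, 768) raw modality features
        items = self.fusion(self.text_encoder(text_feats),
                            self.vision_encoder(img_feats))
        return self.user_encoder(items)  # (batch, seq_len, dim) user-sequence states
```

Because each sub-module is a separate `nn.Module`, any subset of parameters (item encoders only, user encoder only, a single modality) can be saved and reloaded independently, which is what makes the transfer settings discussed later straightforward.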
Objectives and Learning
Cross-modal Contrastive Learning
To align the representations of different modalities effectively, PMMRec uses a Next-item enhanced cross-modal Contrastive Learning (NICL) objective. NICL not only aligns the text and image modalities but also incorporates next-item positive samples to embed recommendation semantics directly into the item encoders, thereby facilitating robust transfer learning across platforms.
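A minimal sketch of one plausible reading of this objective follows: a symmetric InfoNCE loss aligns the text and image embeddings of the same item, and the next item's embeddings are additionally treated as positives. The temperature and the exact positive/negative construction are assumptions, not the paper's precise formulation.

```python
import torch
import torch.nn.functional as F

def info_nce(query, positives, temperature=0.07):
    """InfoNCE with in-batch negatives; `positives` is (batch, dim), aligned row-wise."""
    query = F.normalize(query, dim=-1)
    positives = F.normalize(positives, dim=-1)
    logits = query @ positives.t() / temperature           # (batch, batch)
    labels = torch.arange(query.size(0), device=query.device)
    return F.cross_entropy(logits, labels)

def nicl_loss(text_emb, img_emb, next_text_emb, next_img_emb):
    # Cross-modal alignment: text <-> image of the same item.
    loss = info_nce(text_emb, img_emb) + info_nce(img_emb, text_emb)
    # Next-item enhancement: the next item's embeddings also act as positives,
    # injecting sequential recommendation semantics into the item encoders.
    loss = loss + info_nce(text_emb, next_img_emb) + info_nce(img_emb, next_text_emb)
    return loss
```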
Self-supervised Denoising
PMMRec introduces two self-supervised objectives, Noised Item Detection (NID) and Robustness-aware Contrastive Learning (RCL), to combat inherent data noise. NID trains the model to detect synthetic noise by identifying items in a sequence that have been perturbed (e.g., shuffled). RCL further strengthens robustness by contrasting original user sequences with corrupted versions, encouraging representations that remain stable across varied domains.
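The sketch below shows one simple way to realize these two objectives: NID as per-position binary classification over noise labels, and RCL as a contrastive loss pairing each user's original sequence representation with its corrupted counterpart. The head design and loss shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NIDHead(nn.Module):
    """Binary classifier: was the item at this position perturbed (noised)?"""
    def __init__(self, dim):
        super().__init__()
        self.clf = nn.Linear(dim, 2)

    def forward(self, hidden, noise_labels):
        # hidden: (batch, seq_len, dim); noise_labels: (batch, seq_len) in {0, 1}
        logits = self.clf(hidden)
        return F.cross_entropy(logits.flatten(0, 1), noise_labels.flatten())

def rcl_loss(seq_repr, corrupted_repr, temperature=0.07):
    """Contrast each user's original sequence representation with its corrupted
    counterpart; other users in the batch serve as negatives."""
    a = F.normalize(seq_repr, dim=-1)
    b = F.normalize(corrupted_repr, dim=-1)
    logits = a @ b.t() / temperature
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)
```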
Empirical Evaluation
PMMRec outperforms state-of-the-art recommender systems across numerous datasets. Extensive experiments show consistent gains in hit ratio (HR) and normalized discounted cumulative gain (NDCG), with notable improvements on cold-start items.
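For reference, the two reported metrics have standard single-target definitions, implemented minimally below: HR@k checks whether the held-out item appears in the top-k list, and NDCG@k discounts a hit by its rank.

```python
import math

def hit_ratio_at_k(ranked_items, ground_truth, k=10):
    """HR@k: 1 if the held-out item appears in the top-k ranked list."""
    return 1.0 if ground_truth in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, ground_truth, k=10):
    """NDCG@k with a single relevant item: 1 / log2(rank + 2) if hit, else 0."""
    if ground_truth in ranked_items[:k]:
        rank = ranked_items.index(ground_truth)  # 0-based rank of the hit
        return 1.0 / math.log2(rank + 2)
    return 0.0
```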
Transfer Learning Versatility
The framework supports multiple transfer learning settings: full model transfer, item encoder-only transfer, user encoder-only transfer, and modality-specific transfers (text or vision). Each setting caters to different operational requirements, revealing PMMRec’s versatility in adapting to both resource-rich and resource-constrained deployment environments.
Figure 3: Convergence curves on downstream datasets under different transfer learning settings.
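Given the modular architecture sketched earlier, these settings reduce to loading different parameter subsets from a pre-trained checkpoint. The sketch below assumes the (hypothetical) attribute names from the earlier `PMMRecSketch` class; the actual checkpoint layout in the paper's code may differ.

```python
import torch

# Hypothetical key prefixes matching the PMMRecSketch attribute names above.
PREFIXES = {
    "full":   ("",),                                           # full model transfer
    "item":   ("text_encoder.", "vision_encoder.", "fusion."), # item encoder only
    "user":   ("user_encoder.",),                              # user encoder only
    "text":   ("text_encoder.",),                              # text modality only
    "vision": ("vision_encoder.",),                            # vision modality only
}

def load_transfer(model, ckpt_path, setting="full"):
    """Copy only the pre-trained parameters selected by the transfer setting;
    everything else keeps its downstream (re-)initialization."""
    state = torch.load(ckpt_path, map_location="cpu")
    prefixes = PREFIXES[setting]
    kept = {k: v for k, v in state.items()
            if any(k.startswith(p) for p in prefixes)}
    model.load_state_dict(kept, strict=False)
    return model
```

Resource-rich deployments would favor `"full"` transfer, while constrained ones can transfer only the lightweight user encoder or a single modality's item encoder.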
Implications and Future Directions
PMMRec opens a path toward a more generalized recommendation paradigm in which traditional ID embeddings are no longer required. Advances in foundation models for NLP and CV can further enhance PMMRec's capabilities, driving it toward unifying recommendation modeling with broader AI systems. Future research should explore additional modalities and optimize computational efficiency for complex scenarios such as dynamic multi-behavior modeling and real-time recommendation.
Conclusion
PMMRec shows promise as a versatile, transferable recommender framework, offering substantial improvements on cold-start issues and strong cross-domain applicability. Future work should integrate additional multi-modal data types and refine the learning objectives toward more adaptive, general-purpose recommendation models.