Learning Item Representations Directly from Multimodal Features for Effective Recommendation (2505.04960v1)

Published 8 May 2025 in cs.IR and cs.MM

Abstract: Conventional multimodal recommender systems predominantly leverage Bayesian Personalized Ranking (BPR) optimization to learn item representations by amalgamating item identity (ID) embeddings with multimodal features. Nevertheless, our empirical and theoretical findings unequivocally demonstrate a pronounced optimization gradient bias in favor of acquiring representations from multimodal features over item ID embeddings. As a consequence, item ID embeddings frequently exhibit suboptimal characteristics despite the convergence of multimodal feature parameters. Given the rich informational content inherent in multimodal features, in this paper, we propose a novel model (i.e., LIRDRec) that learns item representations directly from these features to augment recommendation performance. Recognizing that features derived from each modality may capture disparate yet correlated aspects of items, we propose a multimodal transformation mechanism, integrated with modality-specific encoders, to effectively fuse features from all modalities. Moreover, to differentiate the influence of diverse modality types, we devise a progressive weight copying fusion module within LIRDRec. This module incrementally learns the weight assigned to each modality in synthesizing the final user or item representations. Finally, we utilize the powerful visual understanding of Multimodal LLMs (MLLMs) to convert the item images into texts and extract semantic embeddings upon the texts via LLMs. Empirical evaluations conducted on five real-world datasets validate the superiority of our approach relative to competing baselines. It is worth noting that the proposed model, equipped with embeddings extracted from MLLMs and LLMs, can further improve the recommendation accuracy of NDCG@20 by an average of 4.21% compared to the original embeddings.

Summary

  • The paper introduces LIRDRec, a novel recommender that learns item representations solely from multimodal (visual and textual) features without relying on item IDs.
  • It employs a dual mechanism of feature transformation and progressive weight copying to dynamically integrate modality-specific information and accelerate convergence.
  • Experimental results on diverse datasets demonstrate that LIRDRec outperforms traditional ID-based models in accuracy and cold-start scenarios.

This paper, "Learning Item Representations Directly from Multimodal Features for Effective Recommendation" (2505.04960), proposes a novel multimodal recommender system (MMRS) called LIRDRec that learns item representations directly from multimodal features, specifically visual and textual information, without relying on traditional item identity (ID) embeddings.

The paper identifies a limitation in existing MMRSs, which typically combine item ID embeddings with multimodal features for representation learning and optimize with pairwise losses such as Bayesian Personalized Ranking (BPR). Through empirical analysis and a theoretical proof, the authors demonstrate a significant optimization gradient bias favoring multimodal features over randomly initialized item ID embeddings, especially in early training stages. This bias can leave item ID embeddings suboptimal even after the multimodal feature parameters converge, hindering full exploitation of the rich multimodal information.

To address this, LIRDRec introduces a paradigm shift by eliminating item ID embeddings and constructing item representations solely from multimodal features. The core idea is to directly leverage the inherent information content of multimodal data for recommendation.

LIRDRec employs a dual-pronged strategy (a code sketch of both components appears after this list):

  1. Multimodal Feature Transformation (MFT): This mechanism aims to enhance uni-modal features and capture cross-modal relationships.
    • Modality-specific Feature Projection: Each uni-modal feature (e.g., image or text embeddings) is projected into a shared low-dimensional space using a deep neural network (DNN). For item $i$ with modality $m$, the latent representation $\hat{\mathbf{X}}^m_i$ is derived via:

      $$\hat{\mathbf{X}}^m_i = \phi\bigl(\mathbf{X}^m_i \mathbf{W}_1^m + \mathbf{b}_1^m\bigr) \mathbf{W}^m_2$$

      where $\mathbf{X}^m_i$ is the initial feature vector, $\phi$ is the Leaky ReLU activation, and $\mathbf{W}$ and $\mathbf{b}$ are trainable parameters.

    • 2-D Discrete Cosine Transform (DCT): A 2-D DCT is applied to each modality-specific feature matrix $\mathbf{X}^m$ (of dimension $|\mathcal{I}| \times d_m$) as a pre-processing step to decorrelate features, which helps capture comprehensive contextual information.

    • Shared Representation: The DCT-transformed modality-specific features are concatenated and fed into another DNN to learn a shared multimodal representation $\hat{\mathbf{X}}^{sh}$:

      $$\hat{\mathbf{X}}^{sh} = \phi\Bigl( \bigl(T(\mathbf{X}^{0}) \oplus \cdots \oplus T(\mathbf{X}^{|\mathcal{M}|-1}) \bigr) \mathbf{W}_1^{sh} + \mathbf{b}_1^{sh} \Bigr) \mathbf{W}^{sh}_2$$

      where $T(\cdot)$ denotes the 2-D DCT and $\oplus$ denotes concatenation.

    These resulting latent features ($\hat{\mathbf{X}}^0, \ldots, \hat{\mathbf{X}}^{|\mathcal{M}|-1}, \hat{\mathbf{X}}^{sh}$) are concatenated to form the initial latent item representation $\widetilde{\mathbf{X}}$.

  2. Progressive Weight Copying (PWC): This module differentiates the influence of diverse modality types by assigning weights to different segments (chunks) of the user or item representations.

    • A user representation $\mathbf{E}_u$ (obtained from graph learning) is split into $B$ chunks.
    • Each chunk $\mathbf{E}_u^b$ is processed by a DNN to obtain a weight $a_b$:

      $$a_b = \phi\bigl(\mathbf{E}_u^b \mathbf{W}_1^b + \mathbf{b}_1^b\bigr) \mathbf{W}^b_2 + \mathbf{b}_2^b$$

    • Weights are progressively copied to a target network ($\theta_b$) using an exponential decay function involving a decay base $\gamma$ and a decay rate $\tau$, which ensures dynamic and smooth weight adjustment:

      $$\eta = \gamma^n$$

      $$a_b = \frac{\tau \cdot \eta}{\tau \cdot \eta + 1 - \tau} \cdot a_b + \frac{1 - \tau}{\tau \cdot \eta + 1 - \tau} \cdot \theta_b$$

      $$\tau = \tau \cdot \eta; \quad \theta_b = a_b$$

    • The final user representation $\mathbf{H}_u$ is formed by concatenating the weighted chunks:

      $$\mathbf{H}_u = a_0 \mathbf{E}_u^0 \oplus a_1 \mathbf{E}_u^1 \oplus \cdots \oplus a_{B-1} \mathbf{E}_u^{B-1}$$

PWC differs from traditional attention in that it learns per-chunk weights via DNNs rather than dot products, and it progressively copies those weights to a target network.
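
As referenced above, the following is a minimal PyTorch-style sketch of how the MFT and PWC components might be implemented. It is not the authors' code: the class names, layer widths, the use of scipy's `dctn` for the 2-D DCT, and the batch-averaged target update in PWC are illustrative assumptions.

```python
import torch
import torch.nn as nn
from scipy.fft import dctn  # 2-D DCT; an assumed choice of implementation


class ModalityProjector(nn.Module):
    """Two-layer DNN: X_hat = LeakyReLU(X W1 + b1) W2."""
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, out_dim, bias=False)
        self.act = nn.LeakyReLU()

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))


class MFT(nn.Module):
    """Multimodal Feature Transformation: per-modality projections plus a DCT-based shared branch."""
    def __init__(self, modal_dims, hidden_dim, out_dim):
        super().__init__()
        self.projectors = nn.ModuleList(
            [ModalityProjector(d, hidden_dim, out_dim) for d in modal_dims])
        self.shared = ModalityProjector(sum(modal_dims), hidden_dim, out_dim)

    def forward(self, modal_feats):  # list of [num_items, d_m] feature matrices
        per_modal = [proj(x) for proj, x in zip(self.projectors, modal_feats)]
        # 2-D DCT decorrelates each modality matrix before the shared branch
        # (fixed preprocessing; in practice it could be computed once up front).
        dct_feats = [torch.from_numpy(dctn(x.cpu().numpy(), norm="ortho")).float()
                     for x in modal_feats]
        shared = self.shared(torch.cat(dct_feats, dim=-1))
        return torch.cat(per_modal + [shared], dim=-1)  # initial item representation X_tilde


class PWC(nn.Module):
    """Progressive Weight Copying: per-chunk weights blended with a slowly updated target copy."""
    def __init__(self, emb_dim, num_chunks, gamma=0.99, tau=0.5):
        super().__init__()
        assert emb_dim % num_chunks == 0
        chunk = emb_dim // num_chunks
        self.weight_nets = nn.ModuleList(
            [nn.Sequential(nn.Linear(chunk, chunk), nn.LeakyReLU(), nn.Linear(chunk, 1))
             for _ in range(num_chunks)])
        self.register_buffer("theta", torch.zeros(num_chunks))  # target-network weights
        self.gamma, self.tau, self.step = gamma, tau, 0

    def forward(self, emb):  # emb: [batch, emb_dim], e.g. user representations E_u
        chunks = emb.chunk(len(self.weight_nets), dim=-1)
        eta = self.gamma ** self.step
        mix = self.tau * eta / (self.tau * eta + 1 - self.tau)
        weighted = []
        for b, (net, c) in enumerate(zip(self.weight_nets, chunks)):
            a_b = net(c).squeeze(-1)                     # a_b = phi(E^b W1 + b1) W2 + b2
            a_b = mix * a_b + (1 - mix) * self.theta[b]  # blend with the target copy theta_b
            if self.training:
                self.theta[b] = a_b.detach().mean()      # theta_b <- a_b (batch average: an assumption)
            weighted.append(a_b.unsqueeze(-1) * c)
        if self.training:
            self.tau *= eta                              # tau <- tau * eta
            self.step += 1
        return torch.cat(weighted, dim=-1)               # a_0 E^0 ⊕ ... ⊕ a_{B-1} E^{B-1}
```

In the full model, the MFT output would serve as the item representations fed into the graph-learning stage described next, while PWC re-weights the chunked user (and item) representations.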

Graph learning is integrated into LIRDRec. User ID embeddings $\mathbf{E}$ are still used to capture user preferences; they are propagated through LightGCN layers together with the concatenated item representations $\widetilde{\mathbf{X}}$ on a user-item bipartite graph $\mathcal{G}$. The item representations $\widetilde{\mathbf{X}}$ are additionally propagated on a pre-constructed item-item graph $\mathbf{S}$ (built from multimodal similarity, similar to FREEDOM) to obtain the final item representations $\widetilde{\mathbf{H}}_i$. User and item representations are learned by summing the representations from multiple LightGCN layers:

$$\mathbf{E}_u = \mathrm{READOUT}(\mathbf{E}_u^0, \ldots, \mathbf{E}_u^{L_{ui}})$$

$$\widetilde{\mathbf{X}}_i = \mathrm{READOUT}(\widetilde{\mathbf{X}}_i^0, \ldots, \widetilde{\mathbf{X}}_i^{L_{ui}})$$

The final item representation $\widetilde{\mathbf{H}}_i$ is obtained after propagation on the item-item graph, with a residual connection:

$$\widetilde{\mathbf{H}}_i = \widetilde{\mathbf{X}}_i^{L_{ii}} + \widetilde{\mathbf{X}}_i^{0}$$
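
A compact sketch of this propagation follows, assuming pre-built, symmetrically normalized sparse adjacency tensors and taking the user-item readout as the layer-0 input to the item-item graph (an assumption; function and variable names are illustrative, not the paper's code):

```python
import torch

def propagate(user_emb, item_emb, norm_adj_ui, norm_adj_ii, n_ui_layers=2, n_ii_layers=1):
    """LightGCN-style propagation with sum readout, plus item-item refinement (illustrative)."""
    n_users = user_emb.size(0)
    ego = torch.cat([user_emb, item_emb], dim=0)      # layer-0 embeddings on the bipartite graph
    layers = [ego]
    for _ in range(n_ui_layers):                      # parameter-free neighborhood aggregation
        ego = torch.sparse.mm(norm_adj_ui, ego)
        layers.append(ego)
    readout = torch.stack(layers, dim=0).sum(dim=0)   # READOUT = sum over layers 0..L_ui
    e_u, x_tilde = readout[:n_users], readout[n_users:]

    h_i = x_tilde
    for _ in range(n_ii_layers):                      # propagation on the item-item graph S
        h_i = torch.sparse.mm(norm_adj_ii, h_i)
    h_i = h_i + x_tilde                               # residual: X^{L_ii} + X^{0}
    return e_u, h_i                                   # E_u (later re-weighted by PWC into H_u) and H_i
```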

The model is optimized using the pairwise BPR loss with L2 regularization on user and item representations:

$$\mathcal{L} = \sum_{(u,i,j)\in \mathcal{R}} \left(-\log \sigma(\mathbf{H}_u^\top \widetilde{\mathbf{H}}_i - \mathbf{H}_u^\top \widetilde{\mathbf{H}}_j)\right) + \lambda \cdot \left(\|\mathbf{H}_u\|^2_2 + \|\widetilde{\mathbf{H}}_i\|^2_2\right)$$

Recommendation scores are calculated as the dot product $\mathbf{H}_u^\top \widetilde{\mathbf{H}}_i$.
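
A minimal sketch of this objective, assuming mini-batches of (user, positive item, negative item) triples; the function name and the default regularization coefficient are placeholders:

```python
import torch
import torch.nn.functional as F

def bpr_loss(h_u, h_i_pos, h_i_neg, reg_lambda=1e-4):
    """Pairwise BPR loss with L2 regularization (illustrative sketch).

    h_u, h_i_pos, h_i_neg: [batch, d] user, positive-item, and negative-item
    representations (H_u, H_i, H_j in the paper's notation).
    """
    pos_scores = (h_u * h_i_pos).sum(dim=-1)              # H_u^T H_i
    neg_scores = (h_u * h_i_neg).sum(dim=-1)              # H_u^T H_j
    ranking = -F.logsigmoid(pos_scores - neg_scores).sum()
    reg = reg_lambda * (h_u.pow(2).sum() + h_i_pos.pow(2).sum())
    return ranking + reg
```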

For implementation, the paper utilizes pre-extracted visual (4096-dim) and textual (384-dim) features. Additionally, it explores using features extracted via Multimodal LLMs (MLLMs) and LLMs. Specifically, images are converted to text using Meta's "Llama-3.2-11B-Vision", and embeddings for both image-derived text and original text descriptions are extracted using "e5-mistral-7b-instruct" (4096-dim).
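
A rough sketch of this two-stage pipeline is shown below, assuming the Instruct variant of the vision model, the standard Hugging Face transformers and sentence-transformers APIs, and a generic prompt; the paper's exact prompt, model variant, and generation settings are not reproduced here.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration
from sentence_transformers import SentenceTransformer

# Step 1: image -> descriptive text with the MLLM (Instruct variant assumed).
mllm_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
mllm = MllamaForConditionalGeneration.from_pretrained(
    mllm_id, torch_dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained(mllm_id)

image = Image.open("item_12345.jpg")  # hypothetical item image
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this product for a shopper."}]}]  # assumed prompt
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, return_tensors="pt").to(mllm.device)
output_ids = mllm.generate(**inputs, max_new_tokens=128)
caption = processor.decode(output_ids[0], skip_special_tokens=True)

# Step 2: text -> 4096-dim semantic embeddings with an LLM-based encoder.
encoder = SentenceTransformer("intfloat/e5-mistral-7b-instruct")
item_texts = [caption, "Original item description goes here."]  # placeholder description
embeddings = encoder.encode(item_texts)  # numpy array of shape (2, 4096)
```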

Experiments are conducted on five public datasets (Baby, Sports, Clothing, Electronics, MicroLens) from Amazon reviews and a short-video platform. LIRDRec is compared against collaborative filtering baselines (MF, LightGCN) and state-of-the-art multimodal baselines (VBPR, GRCN, LATTICE, SLMRec, MICRO, BM3, FREEDOM, LGMRec). Evaluation metrics are Recall@K and NDCG@K (for K=10 and 20) using the all-ranking protocol.
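
For reference, Recall@K and NDCG@K for a single user under the all-ranking protocol can be computed roughly as follows (a generic sketch with binary relevance, not the paper's evaluation code):

```python
import numpy as np

def recall_ndcg_at_k(ranked_items, relevant_items, k=20):
    """Compute Recall@K and NDCG@K for one user.

    ranked_items: all candidate item IDs sorted by predicted score (all-ranking protocol).
    relevant_items: set of ground-truth test items for the user.
    """
    top_k = ranked_items[:k]
    hits = np.array([1.0 if item in relevant_items else 0.0 for item in top_k])
    recall = hits.sum() / max(len(relevant_items), 1)
    dcg = (hits / np.log2(np.arange(2, k + 2))).sum()
    ideal_hits = min(len(relevant_items), k)
    idcg = (1.0 / np.log2(np.arange(2, ideal_hits + 2))).sum()
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return recall, ndcg
```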

Key experimental findings:

  • General Performance: LIRDRec consistently outperforms all baselines across all datasets. Using MLLM-based features (LIRDRec$_{\text{MLLM}}$) further boosts performance, achieving an average NDCG@20 improvement of 4.21% over the standard LIRDRec and significantly outperforming other multimodal models using traditional features.
  • Quick Startup: LIRDRec shows much faster convergence in early training epochs compared to baselines like FREEDOM and BM3, indicating its ability to rapidly learn effective representations from multimodal features without relying on historical interactions encoded in ID embeddings.
  • Cold-Start Performance: LIRDRec is more robust in cold-start scenarios (items unseen during training) compared to baselines. This is attributed to its direct use of multimodal features and the information propagation capability of GCNs on the item-item graph, allowing unseen items to benefit from the learned feature space and item similarity structures.
  • Ablation Study: The PWC module has the most significant impact on performance, highlighting the importance of dynamically weighting modality contributions. The 2-D DCT transformation also contributes positively. An interesting observation is that LIRDRec trained only on textual features (LIRDRec$_{\text{w/o V}}$) can outperform models using both modalities but lacking the shared representation learned via DCT. This emphasizes the effectiveness of the MFT component. Textual features are found to be more important than visual features in the evaluated datasets.
  • Hyperparameter Sensitivity: Performance is more sensitive to the regularization coefficient $\lambda$ than to the decay rate $\tau$. Larger decay rates ($\ge 0.9$) for PWC and a higher number of GCN layers in the user-item graph (especially for larger graphs) tend to yield better results.

The paper concludes that learning item representations directly from multimodal features is a viable and effective alternative to ID-based approaches, leading to improved recommendation performance and faster startup. The MFT and PWC modules are crucial for effectively processing and combining multimodal information, and using MLLM-extracted features further enhances performance. Future work includes extending LIRDRec to sequential recommendation and incorporating the PWC technique for modeling temporal preference shifts across modalities. A case study demonstrates the utility of MLLMs in generating descriptive text for items lacking such information, enriching the multimodal data available to the recommender.
