Multimodal Recommender Systems
- Multimodal recommender systems are frameworks that combine heterogeneous data modalities to capture user preferences and address cold-start issues.
- They employ modality-specific extractors (e.g., BERT for text, ResNet for images) and utilize early, late, or hybrid fusion strategies to integrate features.
- Recent research emphasizes dynamic modality weighting, interpretability, and end-to-end fine-tuning to boost recommendation accuracy and efficiency.
Multimodal recommender systems (MRS) extend conventional collaborative filtering by integrating information from heterogeneous data sources—text, images, video, audio—to more accurately model user preferences, alleviate data sparsity, and provide robust recommendations under cold-start and missing-modality conditions. By combining diverse content modalities, MRS are able to represent item and user characteristics at a finer semantic granularity and leverage deep interactions between modalities that classical, unimodal systems cannot capture.
1. Formal Definition, Problem Setting, and Key Challenges
Formally, let $\mathcal{U}$ and $\mathcal{I}$ denote the sets of users and items, with interaction observations encoded as a binary matrix $R \in \{0,1\}^{|\mathcal{U}|\times|\mathcal{I}|}$, where $R_{ui}=1$ indicates user $u$ interacted with item $i$. Each item $i$ is associated with $M$ modalities of side-content features $x_i^{(1)},\dots,x_i^{(M)}$ (e.g., textual description, image, audio clip, video). The goal is to estimate a scoring function

$$\hat{y}_{ui} = f\big(u, i, x_i^{(1)},\dots,x_i^{(M)}\big)$$

that represents the likelihood of future interaction.
MRS must address several specific challenges:
- Cold-start: As new users/items may lack interaction history, prediction must rely on content modalities.
- Scalability: Real-world recommendation involves very large user/item graphs, mandating compact encoders and efficient fusion.
- Missing modalities: Not every item will have all modalities available (e.g., a product without video).
- Semantic gap: Low-level features (e.g., pixels, spectrograms) may not correspond directly to human or user-preference semantics.
2. Modality-Specific Feature Extraction
Each modality $m$ is preprocessed and embedded into a dense vector space. Typically, a modality-specific extractor $\phi_m$ is applied to raw content $x_i^{(m)}$, yielding

$$e_i^{(m)} = \phi_m\big(x_i^{(m)}\big) \in \mathbb{R}^{d_m}.$$
Representative extractors include:
| Modality | Extractors | Output Dimensionality |
|---|---|---|
| Text | TF-IDF, Word2Vec, BERT, RoBERTa, Sentence-Transformer | 300–1,024 |
| Image | VGG, ResNet, Inception, ViT, EfficientNet | 512–4,096 |
| Video | C3D, two-stream, frame-wise Transformers | up to several thousand |
| Audio | Spectrogram+CNN, MFCC+LSTM/GRU, raw waveform encoders | 128–1,024 |
The embeddings may be precomputed via frozen backbones or updated during downstream training, depending on resource trade-offs.
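As a concrete illustration, the sketch below extracts frozen text and image embeddings with off-the-shelf backbones. The specific checkpoints ("bert-base-uncased", ResNet-50) and the Hugging Face/torchvision calls are illustrative choices, not requirements of any particular MRS.

```python
# Illustrative feature extraction with frozen pretrained backbones
# ("bert-base-uncased" and ResNet-50 are example choices, not requirements).
import torch
from transformers import AutoModel, AutoTokenizer
from torchvision.models import resnet50, ResNet50_Weights

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
text_enc = AutoModel.from_pretrained("bert-base-uncased").eval()

weights = ResNet50_Weights.DEFAULT
img_enc = resnet50(weights=weights).eval()
img_enc.fc = torch.nn.Identity()          # drop the classifier head -> 2048-d features
preprocess = weights.transforms()          # standard ImageNet preprocessing

@torch.no_grad()
def embed_item(description: str, image):   # image: a PIL.Image
    t = tok(description, return_tensors="pt", truncation=True)
    e_text = text_enc(**t).last_hidden_state[:, 0]          # [CLS] token, 768-d
    e_img = img_enc(preprocess(image).unsqueeze(0))          # 2048-d
    return {"text": e_text, "image": e_img}
```

Keeping the backbones frozen and caching the embeddings is the cheaper option; unfreezing them corresponds to the end-to-end fine-tuning discussed in Section 7.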
3. User and Item Encoder Architectures
After extracting modality embeddings, user and item representations are computed. Typical encoders:
- MLP (Multi-Layer Perceptron): stacked fully connected layers over (concatenated) modality embeddings, e.g., $h_i = \mathrm{MLP}\big([e_i^{(1)} \,\|\, \dots \,\|\, e_i^{(M)}]\big)$.
- RNN/GRU/LSTM: For sequential modeling over user histories.
- Transformer: Self-attention across user histories or sets of modality embeddings, e.g., $\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\big(QK^{\top}/\sqrt{d}\big)V$ applied to the sequence of embedded items a user has interacted with.
- Graph Neural Networks (GNNs): Encoding the user–item bipartite graph via propagation rules, e.g., $h_u^{(l+1)} = \sum_{i \in \mathcal{N}(u)} \tfrac{1}{\sqrt{|\mathcal{N}(u)|\,|\mathcal{N}(i)|}}\, h_i^{(l)}$, where $\mathcal{N}(u)$ denotes the neighbors of user $u$ in the interaction graph (a minimal propagation sketch follows at the end of this section).
Aggregation of user histories often combines sequence encoders (RNNs, self-attention) with multimodal item features, sometimes fusing before, sometimes after graph propagation.
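As a minimal sketch of the GNN encoder above, the following implements LightGCN-style propagation (parameter-free neighborhood averaging with a layer-mean readout); the function name and the choice of LightGCN are assumptions made for illustration.

```python
# Minimal LightGCN-style propagation over the user-item bipartite graph.
# norm_adj is the symmetric-normalized adjacency (1/sqrt(|N(u)||N(i)|) weights).
import torch

def propagate(emb: torch.Tensor, norm_adj: torch.Tensor, n_layers: int = 3) -> torch.Tensor:
    """emb: (n_users + n_items, d) initial ID/content embeddings.
    norm_adj: sparse normalized bipartite adjacency of matching size."""
    layers = [emb]
    h = emb
    for _ in range(n_layers):
        h = torch.sparse.mm(norm_adj, h)   # neighborhood aggregation, no transforms/nonlinearity
        layers.append(h)
    return torch.stack(layers).mean(0)     # average over propagation layers

```

Multimodal content embeddings can simply be added to the ID embeddings in `emb` before propagation, which corresponds to fusing before graph propagation as noted above.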
4. Multimodal Fusion Strategies
Fusion of modality embeddings is the centerpiece of MRS design, dictating the level and nature of interaction between data sources. Canonical strategies are:
- Early Fusion ("feature fusion"): Concatenate or aggregate all modality embeddings before passing to the encoder:
Typical operators: concatenation, element-wise sum/product, bilinear pooling.
- Late Fusion ("score fusion"): Compute independent predictions per modality, then combine (sum, weighted sum, attention):
with .
- Hybrid Fusion: Combine both levels—e.g., early fusion at input combined with attention or gating at the scoring stage.
- Cross-modal Attention: Use a shared attention mechanism to weight contributions per modality or between features, e.g., $\alpha_{ui}^{(m)} \propto \exp\!\big(w^{\top}\tanh(W[h_u \,\|\, e_i^{(m)}])\big)$ with $e_i = \sum_m \alpha_{ui}^{(m)} e_i^{(m)}$ (see the code sketch at the end of this section).
Table: Fusion Techniques
| Fusion Type | Pros | Cons |
|---|---|---|
| Early fusion | Captures low-level cross-modal interactions | Increases input dimensionality, sensitive to missing modalities |
| Late fusion | Graceful handling of missing data, modular | Potentially misses higher-order cross-modal correlations |
| Hybrid/attention | Adaptive, supports per-user/item weighting | Higher complexity, nontrivial to optimize |
The choice of fusion mechanism is task- and domain-dependent; empirical results favor hybrid models in cold-start or highly sparse regimes.
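The sketch below illustrates the attention-based (hybrid/cross-modal) row of the table: per-(user, item) softmax weights over projected modality embeddings. The module name and the concat-tanh scoring form are illustrative assumptions; the operator can be swapped for concatenation or summation to recover early fusion.

```python
# Sketch of attention-based fusion: per-(user, item) weights over modality embeddings.
# The scoring network and module name are illustrative, not a canonical design.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(2 * d, d), nn.Tanh(), nn.Linear(d, 1))

    def forward(self, user: torch.Tensor, modal: torch.Tensor) -> torch.Tensor:
        """user: (B, d) user embeddings; modal: (B, M, d) projected modality embeddings."""
        u = user.unsqueeze(1).expand_as(modal)                                    # (B, M, d)
        alpha = torch.softmax(self.score(torch.cat([u, modal], dim=-1)), dim=1)  # (B, M, 1)
        return (alpha * modal).sum(dim=1)                                         # fused item embedding (B, d)
```

Because the weights are conditioned on the user, the same item can emphasize different modalities for different users, which is the behavior the dynamic-weighting direction in Section 7 generalizes.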
5. Loss Functions and Training Objectives
MRS frameworks optimize objectives rooted in ranking or rating prediction, often integrating supervised and self-supervised components:
- Pointwise (cross-entropy) loss: $\mathcal{L}_{\mathrm{CE}} = -\sum_{(u,i)} \big[R_{ui}\log\hat{y}_{ui} + (1-R_{ui})\log(1-\hat{y}_{ui})\big]$
- Pairwise (BPR) loss: $\mathcal{L}_{\mathrm{BPR}} = -\sum_{(u,i,j)} \log\sigma\big(\hat{y}_{ui} - \hat{y}_{uj}\big)$, where $j$ is a sampled negative item (see the sketch after this list).
- Multimodal regularization (e.g., alignment penalties): $\mathcal{L}_{\mathrm{align}} = \sum_i \sum_{m\neq m'} \big\| e_i^{(m)} - e_i^{(m')} \big\|_2^2$
- Self-supervised objectives:
- Contrastive InfoNCE: $\mathcal{L}_{\mathrm{InfoNCE}} = -\log \dfrac{\exp(\mathrm{sim}(z_i, z_i^{+})/\tau)}{\sum_{j}\exp(\mathrm{sim}(z_i, z_j)/\tau)}$
- Reconstruction/Masked-Modality Prediction.
Recent work employs auxiliary losses to improve representation robustness and cross-modal alignment, often yielding increased accuracy and generalization.
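A minimal sketch of such a combined objective pairs the BPR ranking loss with a simple cross-modal alignment penalty; the weighting factor `lambda_align` and the cosine/MSE form of the penalty are illustrative assumptions rather than a fixed recipe.

```python
# Sketch of a combined BPR + cross-modal alignment objective.
# lambda_align and the MSE-on-normalized-embeddings penalty are illustrative choices.
import torch
import torch.nn.functional as F

def bpr_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    # -log sigma(y_ui - y_uj), averaged over sampled (u, i, j) triples
    return -F.logsigmoid(pos_scores - neg_scores).mean()

def alignment_loss(e_text: torch.Tensor, e_image: torch.Tensor) -> torch.Tensor:
    # penalize disagreement between modality embeddings of the same item
    return F.mse_loss(F.normalize(e_text, dim=-1), F.normalize(e_image, dim=-1))

def total_loss(pos, neg, e_text, e_image, lambda_align: float = 0.1):
    return bpr_loss(pos, neg) + lambda_align * alignment_loss(e_text, e_image)
```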
6. Evaluation Protocols, Datasets, and Empirical Findings
Evaluation of MRS occurs on public benchmarks with side-content: Amazon (books, electronics, clothing), MovieLens (w/ images), Yelp (w/ reviews/photos), Taobao/TikTok (short video/audio). Core metrics:
| Metric | Definition |
|---|---|
| HR@K | Hit Rate at K |
| NDCG@K | Normalized Discounted Cumulative Gain at K |
| MRR | Mean Reciprocal Rank |
| Precision@K | Precision at K |
| Recall@K | Recall at K |
Empirical results consistently find that MRS with proper fusion (particularly early or cross-modal attention-based, often utilizing graph encoders) outperform unimodal or ID-only baselines by 5–15% relative in NDCG@10, especially in cold-start and sparse regimes.
Notably, the interaction between modality, fusion strategy, and encoder architecture determines gains:
- Early/cross-modal fusion improves accuracy in low-data regimes, at the expense of training stability and parameter count.
- Late fusion handles missing modalities, offering modular deployment.
A strong empirical guideline is to align the dimensions of embeddings post-extraction and to tune fusion operators for the target recommendation scenario.
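For reference, the sketch below computes the two most common metrics under leave-one-out evaluation (one held-out item per user). The per-user `rank` variable (0-based position of the held-out item among scored candidates) and the single-relevant-item simplification are assumptions of this illustration.

```python
# Sketch of per-user HR@K and NDCG@K for leave-one-out evaluation.
# rank: 0-based position of the held-out item in the model's sorted candidate list.
import math

def hit_rate_at_k(rank: int, k: int) -> float:
    return 1.0 if rank < k else 0.0

def ndcg_at_k(rank: int, k: int) -> float:
    # single relevant item => DCG = 1 / log2(rank + 2), IDCG = 1
    return 1.0 / math.log2(rank + 2) if rank < k else 0.0

# Example: held-out item ranked 3rd among the candidates
print(hit_rate_at_k(2, 10), ndcg_at_k(2, 10))   # 1.0, 0.5
```

Per-user values are averaged over the test set to obtain the reported HR@K and NDCG@K.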
7. Open Problems and Future Directions
Principal research directions in MRS include:
- Dynamic Modality Weighting: Learn user/item-specific attention over modalities, superseding static global fusion weights.
- Interpretability: Develop approaches capable of explaining recommendations and attributing them to specific modalities and their semantically relevant components.
- End-to-End Joint Fine-Tuning: Replace static pre-trained extractors with recommender-aware fine-tuning of feature backbones, closing the semantic gap.
- Efficient Multimodal Pre-training: Leverage large-scale self-supervised pre-training on user–item–content graphs to seed encoder and fusion layers, akin to CLIP/ALIGN paradigms.
- Extension beyond Text and Image: Incorporate audio, video, 3D, and sensor data; exploit generative cross-modal models (e.g., text→image synthesis) to alleviate missing-data scenarios.
Systematic integration of advanced feature extractors, flexible fusion and encoder architectures, and jointly optimized supervised/self-supervised objectives will further enhance the robustness, sparsity tolerance, and transparency of multimodal recommender systems (Xu et al., 22 Jan 2025).