Multimodal Recommender Systems
- Multimodal Recommender Systems are advanced frameworks that integrate multiple data modalities like text, images, audio, and video for comprehensive user-item modeling.
- They address critical issues such as cold-start, data sparsity, and explainability by leveraging diverse content cues and sophisticated fusion strategies.
- Current research focuses on robust multimodal fusion techniques, dynamic attention mechanisms, and the integration of foundation models to enhance scalability and interpretability.
Multimodal Recommender Systems (MMRSs) represent a paradigm in recommendation research where systems leverage multiple heterogeneous data modalities—such as text, images, audio, and video—for user and item representation, preference inference, and ranking. These systems aim to surpass the limitations of unimodal recommenders, particularly with respect to semantic richness, cold-start and sparsity, diversity, and explainability. The field encompasses a broad spectrum, including supervised and self-supervised architectures, deep multimodal fusion mechanisms, graph-based propagation, adversarial and robust optimizations, and powerful foundation models such as Multimodal LLMs (MLLMs). The following sections provide a comprehensive overview of the problem definition, methodological landscape, fusion strategies, performance impact, theoretical foundations, and open research directions in MMRSs.
1. Problem Definition and Motivations
A Multimodal Recommender System jointly processes multiple modalities associated with each item and, potentially, each user. For item $i$, the modality set $\mathcal{M}$ may include textual descriptions, images, audio clips, or video segments. The classical recommendation objective,
$$\hat{y}_{ui} = f(\mathbf{e}_u, \mathbf{e}_i),$$
where $\mathbf{e}_u$ and $\mathbf{e}_i$ are learned user and item embeddings, is extended to
$$\hat{y}_{ui} = f\big(\mathbf{e}_u, \mathbf{e}_i, \{\mathbf{x}_i^{m}\}_{m \in \mathcal{M}}\big),$$
where $\mathbf{x}_i^{m}$ is the feature vector for modality $m$ and $f(\cdot)$ is a function (often a neural ranking or matching module) that combines behavioral and content features (Xu et al., 22 Jan 2025, Liu et al., 2023).
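As a minimal sketch of this extended objective, assuming pre-extracted per-modality features and a simple MLP matching module (the module names and dimensions are illustrative, not drawn from any cited system):

```python
import torch
import torch.nn as nn

class MultimodalScorer(nn.Module):
    """Scores a user-item pair from behavioral embeddings plus per-modality item features."""

    def __init__(self, embed_dim: int, modality_dims: dict):
        super().__init__()
        # Project each modality (e.g., "text": 768, "image": 2048) into the shared space.
        self.projections = nn.ModuleDict(
            {m: nn.Linear(d, embed_dim) for m, d in modality_dims.items()}
        )
        # Matching module f(.) over concatenated behavioral and content features.
        self.matcher = nn.Sequential(
            nn.Linear(embed_dim * (2 + len(modality_dims)), embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, 1),
        )

    def forward(self, user_emb, item_emb, modality_feats):
        # modality_feats: {modality name -> tensor of shape (batch, d_m)}
        projected = [self.projections[m](x) for m, x in modality_feats.items()]
        joint = torch.cat([user_emb, item_emb] + projected, dim=-1)
        return self.matcher(joint).squeeze(-1)  # predicted preference score for (u, i)

# Example: 64-dim behavioral embeddings, BERT text (768) and ResNet image (2048) features.
scorer = MultimodalScorer(embed_dim=64, modality_dims={"text": 768, "image": 2048})
```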
Core motivations include:
- Semantic Enrichment: Capturing visual style, audio cues, and behavioral context that text alone cannot represent (Liu et al., 31 Mar 2024).
- Cold-Start and Data Sparsity Relief: Leveraging universally available content embeddings when interaction signals are lacking (Liu et al., 2023, Zhou et al., 7 Aug 2025).
- Preference and Diversity Enhancement: Modeling nuanced user tastes and surfacing long-tail (unpopular) items via content-side cues (Lin et al., 17 Jul 2024, Malitesta et al., 2023).
- Explainability: Tracing recommendations to interpretable content attributes such as color, category, or sentiment.
2. Modal Feature Extraction and Representation
Each modality $m$ is processed by a dedicated encoder, producing a $d_m$-dimensional representation $\mathbf{x}_i^{m}$ for item $i$:
- Text: BERT (Xu et al., 22 Jan 2025), Sentence-BERT, or custom RNN/LSTM models (Xv et al., 18 Jun 2024, Liu et al., 31 Mar 2024).
- Vision: CNNs (ResNet, VGG), Vision Transformers (ViT), and Large Vision-Language Models (LVLMs) for image regions or global features (Pomo et al., 6 Aug 2025, Liu et al., 31 Mar 2024).
- Audio/Video: Spectrogram-based CNNs, 3D ConvNets, or temporal models (Xu et al., 22 Jan 2025, Ramisa et al., 17 Sep 2024).
The per-modality outputs $\{\mathbf{x}_i^{m}\}$ for item $i$ are combined into a unified representation via fusion (cf. Section 3). In advanced systems (e.g., MLLMRec), image inputs may be translated to high-level semantic text descriptions using MLLMs, then combined with raw textual metadata for downstream encoding (Dang et al., 21 Aug 2025).
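A minimal sketch of per-modality feature extraction with off-the-shelf encoders; the specific checkpoints (bert-base-uncased, ResNet-50) and the mean-pooling choice are illustrative assumptions rather than prescriptions from the cited systems:

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
from torchvision import models

# Text encoder: mean-pooled BERT token embeddings (768-d).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_encoder = AutoModel.from_pretrained("bert-base-uncased").eval()

def encode_text(description: str) -> torch.Tensor:
    inputs = tokenizer(description, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = text_encoder(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)                   # (768,)

# Vision encoder: ResNet-50 with the classification head removed (2048-d).
weights = models.ResNet50_Weights.DEFAULT
vision_encoder = models.resnet50(weights=weights)
vision_encoder.fc = torch.nn.Identity()
vision_encoder.eval()
preprocess = weights.transforms()

def encode_image(path: str) -> torch.Tensor:
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return vision_encoder(img).squeeze(0)               # (2048,)
```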
3. Multimodal Fusion Strategies
Integrating modalities—termed "fusion"—is the central technical challenge in MMRSs. The fusion taxonomy includes (Liu et al., 2023, Lopez-Avila et al., 14 May 2025):
- Early Fusion: Combine pre-encoded embeddings at input (concatenation, sum, attention-weighted sum), then downstream encoding (e.g., via LightGCN) (Xu et al., 22 Jan 2025, Malitesta et al., 2023).
- Late Fusion: Independent encodings per modality, predictions from each, aggregated via weighted sum or gating at output (Zhou et al., 7 Aug 2025, Lopez-Avila et al., 14 May 2025).
- Cross-Attention: Fine-grained co-attention between modalities (e.g., text/vision), often at multiple neural layers; instantiated in architectures like CADMR (Khalafaoui et al., 3 Dec 2024) and ALBEF (Ramisa et al., 17 Sep 2024).
- Manifold-Aware Fusion: Spherical Bézier or slerp-based combining of normalized embeddings, maintaining representations on the hyperspherical manifold as in CM³ (Zhou et al., 2 Aug 2025).
- Disentangled/Augmented Fusion: Learning both modality-shared and modality-specific representations with contrastive or difference amplification regularizers (e.g., MDE (Zhou et al., 8 Feb 2025)) and dynamic node-level trade-off weighting.
Table: Fusion Strategies in Representative MMRSs
| Fusion Mechanism | Fusion Type | Model Example(s) |
|---|---|---|
| Early (concat/sum) | Early | VBPR, MMGCN, GUME |
| Late (ensemble) | Late | MGCE, MCLN |
| Cross-attention | Intermediate | CADMR, ALBEF |
| Spherical Bézier/slerp | Manifold-aware | CM³ |
| Disentangled/Dynamic | Mixed | MDE, MMSR |
Late (ensemble) fusion often preserves modality-specific signals and avoids overfitting to spurious correlations (Zhou et al., 7 Aug 2025, Zhou et al., 8 Feb 2025).
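A minimal sketch contrasting three of the strategies above: early attention-weighted fusion over a shared embedding space, manifold-aware slerp of two normalized modality embeddings, and late score-level fusion. The interfaces are simplified assumptions, not implementations of the cited models:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFusion(nn.Module):
    """Early fusion: attention-weighted sum of modality embeddings that have already
    been projected into a shared embedding space."""

    def __init__(self, embed_dim: int):
        super().__init__()
        self.attn = nn.Linear(embed_dim, 1)

    def forward(self, modality_embs: torch.Tensor) -> torch.Tensor:
        # modality_embs: (batch, num_modalities, embed_dim)
        weights = torch.softmax(self.attn(modality_embs), dim=1)  # (batch, M, 1)
        return (weights * modality_embs).sum(dim=1)               # (batch, embed_dim)

def slerp(a: torch.Tensor, b: torch.Tensor, t: float = 0.5) -> torch.Tensor:
    """Manifold-aware fusion: spherical interpolation of two L2-normalized embeddings,
    keeping the fused vector on the unit hypersphere."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    cos = (a * b).sum(dim=-1, keepdim=True).clamp(-1 + 1e-7, 1 - 1e-7)
    omega = torch.arccos(cos)
    return (torch.sin((1 - t) * omega) * a + torch.sin(t * omega) * b) / torch.sin(omega)

def late_fusion(scores: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    """Late fusion: softmax-weighted sum of per-modality prediction scores.
    scores: (batch, num_modalities), weights: (num_modalities,)."""
    return (torch.softmax(weights, dim=0) * scores).sum(dim=-1)
```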
4. Learning Objectives, Model Architectures, and Training
Most MMRSs are trained with pairwise ranking losses such as BPR, $\mathcal{L}_{\mathrm{BPR}} = -\sum_{(u,i,j)} \ln \sigma\big(\hat{y}_{ui} - \hat{y}_{uj}\big)$, where $i$ is an observed item and $j$ a sampled negative, supplemented by modality-alignment objectives (a minimal sketch follows the list below):
- InfoNCE/Contrastive Losses: Forcing positive (same-item, cross-modality) pairs together and contrasting against negatives (Liu et al., 31 Mar 2024, Xu et al., 22 Jan 2025).
- Self-Supervised Modality Matching: Regularizing embeddings of different modalities to be similar where they share item semantics while discriminating modality-unique features (Zhou et al., 8 Feb 2025, Khalafaoui et al., 3 Dec 2024).
- Noise-Robust Losses: Denoised BPR (D-BPR) mixing correct/incorrect feedback based on content reliability estimation (Xv et al., 18 Jun 2024).
- Uniformity and Alignment: As in CM³, aligning user/item neighbors while enforcing uniform embedding spread, with calibrated repulsion for dissimilar multimodal pairs (Zhou et al., 2 Aug 2025).
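A minimal sketch of a BPR ranking loss paired with a cross-modal InfoNCE alignment term; the temperature value and the use of in-batch negatives are common defaults assumed here, not values taken from the cited papers:

```python
import torch
import torch.nn.functional as F

def bpr_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    """Pairwise BPR: observed items should score higher than sampled negatives."""
    return -F.logsigmoid(pos_scores - neg_scores).mean()

def cross_modal_infonce(text_emb: torch.Tensor, image_emb: torch.Tensor,
                        temperature: float = 0.2) -> torch.Tensor:
    """InfoNCE alignment: the text and image views of the same item form a positive
    pair; all other items in the batch act as negatives."""
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.t() / temperature               # (batch, batch)
    labels = torch.arange(logits.size(0), device=logits.device)   # diagonal = positives
    return F.cross_entropy(logits, labels)

# Typical combined objective: ranking loss plus a weighted alignment regularizer.
# loss = bpr_loss(pos, neg) + 0.1 * cross_modal_infonce(txt, img)
```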
Architecturally, MMRSs use:
- GCN-style Graph Neural Networks: Propagate multimodal and behavioral signals over user–item or item–item graphs (LightGCN, MMGCN, GUME); a minimal propagation sketch follows this list (Lin et al., 17 Jul 2024, Xu et al., 22 Jan 2025).
- Autoencoders with Cross-Attention: CADMR pretrains disentangled encoders and refines user-item reconstructions with multi-head attention on fused embeddings (Khalafaoui et al., 3 Dec 2024).
- MLLM-driven Summarization: An emergent paradigm that uses MLLMs to generate item and user summaries from multimodal content, with fine-tuned prediction heads for end-to-end sequential ranking (Ye et al., 19 Aug 2024, Dang et al., 21 Aug 2025).
- Adapter and Parameter-efficient Fine-tuning: LoRA, adapters, and soft-prompting enable efficient integration of large multimodal backbones (Lopez-Avila et al., 14 May 2025).
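A minimal sketch of LightGCN-style propagation over the user–item graph; here the item embeddings are assumed to already contain fused multimodal features, and adj_norm denotes the symmetrically normalized bipartite adjacency matrix (an assumption about how the graph is prepared):

```python
import torch

def lightgcn_propagate(adj_norm: torch.Tensor,
                       user_emb: torch.Tensor,
                       item_emb: torch.Tensor,
                       num_layers: int = 3):
    """Parameter-free neighborhood averaging with layer-wise mean readout, in the
    style of LightGCN. adj_norm is a sparse (num_users + num_items)^2 matrix."""
    num_users = user_emb.size(0)
    x = torch.cat([user_emb, item_emb], dim=0)
    layer_outputs = [x]
    for _ in range(num_layers):
        x = torch.sparse.mm(adj_norm, x)  # propagate signals one hop over the graph
        layer_outputs.append(x)
    final = torch.stack(layer_outputs, dim=0).mean(dim=0)
    return final[:num_users], final[num_users:]
```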
5. Impact, Evaluation, and Empirical Trends
MMRSs have demonstrated clear empirical advantages across standard Top-K ranking metrics: Recall@K, NDCG@K, HR@K, MRR@K, and AUC.
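For reference, a minimal sketch of the two most commonly reported metrics under a binary-relevance protocol (a simplification relative to some benchmark setups):

```python
import math

def recall_at_k(ranked_items: list, relevant: set, k: int) -> float:
    """Fraction of a user's held-out relevant items that appear in the top-k list."""
    hits = sum(1 for item in ranked_items[:k] if item in relevant)
    return hits / max(len(relevant), 1)

def ndcg_at_k(ranked_items: list, relevant: set, k: int) -> float:
    """Binary-relevance NDCG: discounted gain of hits, normalized by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(ranked_items[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0
```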
Key findings include:
- Performance Under Sparsity: Multimodal gains are most pronounced for users/items with few interactions and in the recall (candidate generation) stage (Zhou et al., 7 Aug 2025, Liu et al., 31 Mar 2024). For instance, in MGCE, Recall@20 improves by over 60% for cold-start users.
- Domain and Modality Effects: Textual features predominate in e-commerce, while visual features are critical in video or fashion domains (Zhou et al., 7 Aug 2025). Text-only models match or beat multimodal counterparts in 6/11 e-commerce cases, while visual-only models often excel on short-video data.
- Cold-Start Robustness: CADMR and CM³ outperform baselines even when training data is reduced to 20%, and calibration (e.g., in CM³) further improves performance on unseen items (Zhou et al., 2 Aug 2025, Khalafaoui et al., 3 Dec 2024).
- Modality Contribution: Ablation studies reveal that both alignment and distinction objectives matter (MDE: −5% Recall@5 w/o either), and node-level or user-aware fusion improves over static weighting (Zhou et al., 8 Feb 2025, Lin et al., 17 Jul 2024).
- Model-Scale Paradox: Larger models do not guarantee superior results; architecture and integration strategy dictate final performance (Zhou et al., 7 Aug 2025).
Selected empirical highlights: in CADMR (Khalafaoui et al., 3 Dec 2024), NDCG@10 improves by 400% over SOTA baselines on Amazon datasets, and CM³’s calibrated uniformity mechanism plus MLLM features yields up to +5.4% NDCG@20 (Zhou et al., 2 Aug 2025).
6. Theory, Challenges, and Fairness
Theoretical Principles
- Alignment and Uniformity: On the hypersphere, contrastive learning seeks to align true user–item pairs while enforcing global embedding dispersion for negative sampling efficacy (Zhou et al., 2 Aug 2025); a sketch of both terms follows this list.
- Calibration via Side Information: CM³ shows that uniform repulsion can be attenuated for semantically similar items, ensuring alignment is not sacrificed for uniformity (Zhou et al., 2 Aug 2025).
- Disentanglement: Disentangled representation learning methods (e.g., PAMD, CADMR) seek to separate shared from modality-unique factors, typically with a total-correlation penalty (Khalafaoui et al., 3 Dec 2024, Zhou et al., 8 Feb 2025).
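A minimal sketch of the alignment and uniformity terms in their standard hypersphere formulation; CM³'s calibrated repulsion would additionally down-weight the uniformity penalty for semantically similar multimodal pairs, which is not shown here:

```python
import torch
import torch.nn.functional as F

def alignment(x: torch.Tensor, y: torch.Tensor, alpha: float = 2.0) -> torch.Tensor:
    """Positive pairs (e.g., a user and an interacted item, or two modality views of
    the same item) should lie close together on the unit hypersphere."""
    x, y = F.normalize(x, dim=-1), F.normalize(y, dim=-1)
    return (x - y).norm(dim=-1).pow(alpha).mean()

def uniformity(x: torch.Tensor, t: float = 2.0) -> torch.Tensor:
    """Embeddings should spread uniformly over the hypersphere: log of the mean
    Gaussian potential over all pairwise squared distances."""
    x = F.normalize(x, dim=-1)
    sq_dists = torch.pdist(x, p=2).pow(2)
    return sq_dists.mul(-t).exp().mean().log()
```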
Robustness and Noise
- Noise-Robust Training: Mirror Gradient implicitly regularizes the gradient norm to favor flat minima, yielding stability against input perturbations and feedback noise (Zhong et al., 17 Feb 2024).
- Denoising: DA-MRS explicitly prunes noisy modality or behavior graphs, and its denoised BPR loss models accidental clicks with Bernoulli mixing (Xv et al., 18 Jun 2024).
Popularity Bias and Diversity
- Bias Amplification: Multimodal features do not automatically improve long-tail exposure; single-modality models, especially with visual features, can worsen concentration on popular items (low APLT, iCov) (Malitesta et al., 2023).
- Mitigation: Modal-aware regularization, causal weighting, adversarial re-ranking, and explicit coverage constraints are required for fair MMRS operation (Malitesta et al., 2023, Zhou et al., 7 Aug 2025).
7. Current Trends and Open Research Directions
Recent advances include:
- MLLMs as Foundation Models: Direct use of vision-language models (e.g., BLIP-2, LLaVA, Qwen2-VL) for item and user summarization, with structured prompts for semantically rich, interpretable embeddings (Dang et al., 21 Aug 2025, Pomo et al., 6 Aug 2025, Ye et al., 19 Aug 2024); a toy prompt sketch follows this list.
- Hybrid and Plug-and-Play Pipelines: MMRSs now support modular integration of new modalities, graph augmentation, and robust learning routines, e.g., LightGCN + DA-MRS or GUME modules (Lin et al., 17 Jul 2024, Xv et al., 18 Jun 2024).
- Interpretability and Semantic Transparency: Decoded LVLM embeddings yield human-readable attribute lists that support explainable recommendations and competitive recall in pure content-based hybrids (Pomo et al., 6 Aug 2025).
- Scalable, Real-Time Serving: Adapters, quantization, and two-stage retrieval strategies are being explored to control inference costs for long histories or wide candidate sets (Ye et al., 19 Aug 2024, Lopez-Avila et al., 14 May 2025).
- Conversational and Agent-Based MMRSs: Embedding MMRSs in multi-agent or conversational frameworks, incorporating feedback loops and market awareness, expands their applicability to dynamic, real-world contexts (Thakkar et al., 22 Oct 2024).
- Dynamic Fusion and Node-Level Gating: Moving beyond global fusion weights toward dynamic, user- or item-specific attention/gating, improving both accuracy and fairness (Zhou et al., 8 Feb 2025, Hu et al., 2023).
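As an illustration of the structured-prompt idea mentioned above, a hypothetical item-summarization prompt template; the wording, fields, and downstream encoding step are assumptions, not prompts from the cited systems:

```python
ITEM_SUMMARY_PROMPT = """You are describing a catalog item for a recommender system.
Title: {title}
Category: {category}
One product image is attached.

Summarize the item in 3-5 bullet points covering visual style, key attributes,
and the kind of user it is likely to appeal to. Use short, factual phrases."""

def build_item_prompt(title: str, category: str) -> str:
    # The filled-in prompt, together with the item image, would be passed to an MLLM
    # (e.g., LLaVA or Qwen2-VL); the generated summary is then encoded like any other
    # textual metadata and fused with behavioral embeddings.
    return ITEM_SUMMARY_PROMPT.format(title=title, category=category)
```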
Ongoing challenges center on:
- Robustness to modality noise and missing data
- Cross-domain and novel modality transfer
- Unified end-to-end training of multimodal, graph, and summarization modules
- Scalable fusion and attention mechanisms
- Fairness, transparency, and user trust
The breadth of MMRS research underscores the need for continuous benchmarking across modalities, domains, and tasks; careful model design and ablation; and integration of foundation models with explainable, context- and task-aware fusion strategies.