Multimodal Sequential Recommendation

Updated 22 June 2026

Multimodal Sequential Recommendation is a dynamic paradigm that integrates sequential user interactions with heterogeneous item features such as images, text, and metadata.
It employs techniques such as modality alignment, cross-modal fusion, and denoising to address challenges in representation learning and cold-start scenarios.
Architectural innovations including graph-based fusion, diffusion-based denoising, and parameter-efficient fine-tuning drive significant empirical performance improvements.

Multimodal Sequential Recommendation (MSR) is the branch of recommender systems that models how user preferences evolve over time by combining temporally ordered interaction histories with the heterogeneous side information available for each item, such as images, text, and category metadata. By enriching item and user representations beyond traditional ID-based approaches, MSR aims to improve recommendation accuracy, robustness to cold-start and domain transfer, and interpretability. MSR research encompasses architectural advances, efficient fine-tuning, novel fusion and denoising strategies, modality alignment constraints, and fine-grained analysis of user-item relations across modalities.

1. Problem Formulation and Systematic Challenges

The canonical MSR setting involves a user set $U = \{u_1, \ldots, u_{|U|}\}$, an item set $V = \{i_1, \ldots, i_{|V|}\}$, and a per-user time-ordered interaction history $S_u = (s_{u,1}, \ldots, s_{u,T})$. Each item $i \in V$ is described by multiple modalities $M = \{\text{image}, \text{text}, \text{ID}, \text{category}\}$, with features $x^m_i$ for each $m \in M$. The central task is to learn a scoring function $f_\theta$ that ranks candidate items based on user $u$'s historical interactions and multimodal signals, so that the true next item $i_{T+1}$ receives the highest relevance score.
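To make the formulation concrete, the following is a minimal sketch of a scoring function, not any published model. It assumes a deliberately simple stand-in for $f_\theta$: item representations are the average of their modality features, the user state is the average of the history items, and relevance is a dot product. The helper names (`item_representation`, `score_candidates`), the toy features, and the choice to exclude already-seen items from the candidates are all illustrative assumptions.

```python
from typing import Dict, List

Vector = List[float]


def mean_vectors(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def item_representation(features: Dict[str, Vector]) -> Vector:
    """Hypothetical fusion: average the per-modality features x_i^m."""
    return mean_vectors(list(features.values()))


def score_candidates(
    history: List[str], items: Dict[str, Dict[str, Vector]]
) -> Dict[str, float]:
    """Toy f_theta: dot product of a pooled history vector with each candidate.

    Assumption: items already in the history are not scored again.
    """
    reps = {item_id: item_representation(f) for item_id, f in items.items()}
    user_state = mean_vectors([reps[item_id] for item_id in history])
    return {
        item_id: sum(a * b for a, b in zip(user_state, rep))
        for item_id, rep in reps.items()
        if item_id not in history
    }


# Toy catalogue with two modalities per item (values are made up).
items = {
    "i1": {"image": [1.0, 0.0], "text": [0.8, 0.2]},
    "i2": {"image": [0.9, 0.1], "text": [1.0, 0.0]},
    "i3": {"image": [0.0, 1.0], "text": [0.1, 0.9]},
    "i4": {"image": [0.7, 0.3], "text": [0.9, 0.1]},
}
scores = score_candidates(["i1", "i2"], items)
ranking = sorted(scores, key=scores.get, reverse=True)  # best candidate first
```

In a real system the averaging steps would be replaced by learned encoders (for example a Transformer over the history), but the interface stays the same: history and multimodal features in, a ranked candidate list out.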

Key technical challenges include:

Representation alignment: Bridging the semantic gap between pretrained image/text embeddings (e.g., CLIP, ViT, BERT) and the representation space optimal for sequential recommendation (Fan et al., 3 Jun 2025, Zhong et al., 8 Nov 2025).
Cross-modal fusion: Integrating heterogeneous features at various abstraction levels and with dynamic weighting depending on context, user, or item category (Xu et al., 4 Mar 2026, Zhang et al., 16 Jan 2026).
Modality difference and synergy: Capturing the fact that user interests and item relationships vary by modality, requiring explicit modeling of modality-specific graphs and their interactions (Li et al., 2024, Zhang et al., 16 Jan 2026, Xu et al., 4 Mar 2026).
Robustness to noise and sparsity: Dealing with implicit feedback noise (e.g., spurious clicks), missing modalities, as well as cold items or domains unseen during training (Cui et al., 7 Aug 2025, Li et al., 2024).
Efficiency and scalability: Achieving fast transfer/fine-tuning with limited resources, and supporting practical deployment (e.g., via parameter-efficient methods) (Fu et al., 2024, Zhong et al., 8 Nov 2025, Fan et al., 3 Jun 2025).

2. Architectural Approaches: From Dual-Tower to Graph and Diffusion Models

A diversity of architectural paradigms has been explored for MSR:

ID-Agnostic and Universal Pipelines: MMSR replaces ID embeddings with Transformer-based text/image encoders, fuses via attention or lightweight MLPs, and stacks standard sequential architectures (e.g., SASRec) for robust, transferable pipelines (Li et al., 2024, Song et al., 2023).
Graph-based Fusion: MMSR (Adaptive Multi-Modalities Fusion) and MuSTRec model item and user histories as sequences or graphs, assigning each item multimodal nodes, and using dual attention to separately fuse sequential (intra-modality) and cross-modal (inter-modality) signals (Hu et al., 2023, Sahyouni et al., 6 Feb 2026). The fusion order can be adaptively gated for each user, enabling early, late, or hybrid fusion by learning a continuous gate.
Diffusion-based Denoising: M³BSR introduces conditional diffusion processes at both the modality (denoising image/text features using ID context) and behavior level (denoising noisy behaviors such as clicks using cleaner ones such as favoriting), to suppress both feature and feedback noise (Cui et al., 7 Aug 2025).
Mixture-of-Experts and Information-Theoretic Decomposition: PRISM and CAMMSR deploy mixtures of experts to disentangle unique, redundant, and synergistic information from each modality. Experts are adaptively weighted for each user/sequence based on user-centric, category, or contextual cues, enforcing fine-grained and dynamic synergy modeling (Zhang et al., 16 Jan 2026, Xu et al., 4 Mar 2026).
Interest-Centralized and Multiscale Models: MDSRec constructs explicit modal-aware item relation graphs, applies interest-centralized attention over modality-specific interest centers, and uses gating to align sequence-level embeddings with clustered interest tokens (Li et al., 2024). BiVRec learns structured interest representations in both ID and multimodal spaces, aligning and contrasting them for enhanced cross-view transfer (Hu et al., 2024).
Attention-based Models and Online Distillation: Attention-based multimodal sequential models perform per-modality self-attention and then fuse attention maps rather than embeddings, with further gains from multi-task learning over auxiliary prediction/reconstruction losses (Oh et al., 2024). ODMT uses an ID-aware multimodal Transformer combined with cross-modal distillation losses for joint optimization (Ji et al., 2023).
LLM and MLLM-based Pipelines: Recent approaches leverage Multimodal LLMs (MLLMs) or LLMs augmented with vision encoders, incorporating specialized modules for item summarization, recurrent user-preference summarization, and lightweight supervised fine-tuning (SFT) for the recommendation head (Ye et al., 2024, Wang et al., 24 Dec 2025, Wang et al., 6 Mar 2026, Zhong et al., 8 Nov 2025). Further strategies textualize images for efficient prompt construction, fuse collaborative signals via keyword context, and apply Group Relative Policy Optimization (GRPO) for chain-of-thought (CoT) enhanced learning.
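The idea of a continuous gate that interpolates between early and late fusion, mentioned for the graph-based fusion models above, can be illustrated with a small sketch. This is an assumption-laden toy, not the published architecture: the sequence encoder is a placeholder (tanh of a recency-weighted sum), modality fusion is a plain average, and the gate is passed in as a fixed scalar, whereas the described models learn it per user. All function names are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def average(vectors: List[Vector]) -> Vector:
    """Element-wise mean, used here as a stand-in for modality fusion."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def encode_sequence(seq: List[Vector]) -> Vector:
    """Placeholder sequence encoder: tanh of a recency-weighted sum."""
    weights = [(t + 1) / len(seq) for t in range(len(seq))]
    dim = len(seq[0])
    return [
        math.tanh(sum(w * x[d] for w, x in zip(weights, seq))) for d in range(dim)
    ]


def gated_fusion(history: Dict[str, List[Vector]], gate: float) -> Vector:
    """Blend early fusion (gate=1.0) and late fusion (gate=0.0)."""
    modalities = list(history.values())
    # Early fusion: fuse modalities at every time step, then encode the sequence.
    fused_steps = [average(list(step)) for step in zip(*modalities)]
    early = encode_sequence(fused_steps)
    # Late fusion: encode each modality's sequence separately, then fuse.
    late = average([encode_sequence(seq) for seq in modalities])
    return [gate * e + (1.0 - gate) * l for e, l in zip(early, late)]


# Two-step history with made-up image and text features.
history = {
    "image": [[1.0, 0.0], [0.0, 2.0]],
    "text": [[0.0, 1.0], [2.0, 0.0]],
}
early_only = gated_fusion(history, gate=1.0)
late_only = gated_fusion(history, gate=0.0)
hybrid = gated_fusion(history, gate=0.5)
```

Because the encoder is nonlinear, the two fusion orders give different vectors, and intermediate gate values produce a hybrid of the two.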

3. Multimodal Fusion, Difference, and Synergy Mechanisms

Fusion and synergy modeling in MSR deviate significantly from naïve concatenation approaches:

Adaptive and Category-guided Fusion: CAMMSR introduces the CAMoE module, which dynamically gates expert networks by modality and item category, allowing the model to prioritize different modality perspectives for each item depending on contextual cues (Xu et al., 4 Mar 2026).
Contrastive and Difference Learning: Explicit contrastive objectives—pairwise across modalities, as in modality-swap contrastive learning, or alignment losses across ID and multimodal views—serve to enforce consistency and specificity in user and item embedding spaces (Li et al., 2024, Hu et al., 2024, Wang et al., 2023).
Information Decomposition: PRISM, grounded in partial information decomposition, separates unique, redundant, and synergistic signals, combining the corresponding expert outputs via adaptive, sequence-aware fusion (Zhang et al., 16 Jan 2026).
Behavioral Guidance and Multi-Behavior Integration: M³BSR not only denoises modalities but also models behavior types (e.g., click, favorite) with separate denoising processes, using deeper behaviors such as favoriting as anchors for resolving noise in shallower feedback such as clicks (Cui et al., 7 Aug 2025).
Temporal and Frequency-Aware Mechanisms: MMM4Rec and MuSTRec employ temporal state-space or frequency-domain filtering to differentially weight recent/modal interactions, overcoming the "flat" fusion of Transformer-style models (Fan et al., 3 Jun 2025, Sahyouni et al., 6 Feb 2026).
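Category-guided expert gating of the kind described above can be sketched as a softmax gate over a small set of experts. The sketch below is illustrative only and does not reproduce CAMoE or PRISM: the three experts (image-only, text-only, shared average), the one-hot category input, and the hand-set gate matrix `GATE_ROWS` are all assumptions standing in for learned components.

```python
import math
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Features = Dict[str, Vector]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical experts: each maps an item's modality features to a vector.
def image_expert(f: Features) -> Vector:
    return f["image"]


def text_expert(f: Features) -> Vector:
    return f["text"]


def shared_expert(f: Features) -> Vector:
    return [(a + b) / 2.0 for a, b in zip(f["image"], f["text"])]


def category_gated_mixture(
    features: Features,
    category_onehot: Vector,
    gate_rows: List[Vector],
    experts: List[Callable[[Features], Vector]],
) -> Tuple[Vector, List[float]]:
    """Weight expert outputs with a softmax gate conditioned on the category."""
    logits = [sum(w * c for w, c in zip(row, category_onehot)) for row in gate_rows]
    weights = softmax(logits)
    outputs = [expert(features) for expert in experts]
    dim = len(outputs[0])
    mixed = [sum(w * out[d] for w, out in zip(weights, outputs)) for d in range(dim)]
    return mixed, weights


EXPERTS = [image_expert, text_expert, shared_expert]
# One gate row per expert; in practice these weights would be learned.
GATE_ROWS = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]

item = {"image": [1.0, 0.0], "text": [0.0, 1.0]}
visual_vec, visual_w = category_gated_mixture(item, [1.0, 0.0], GATE_ROWS, EXPERTS)
textual_vec, textual_w = category_gated_mixture(item, [0.0, 1.0], GATE_ROWS, EXPERTS)
```

With these toy weights, the first category routes most of the mass to the image expert and the second to the text expert, which is the qualitative behavior category-guided fusion aims for.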

4. Efficient Fine-Tuning, Parameterization, and Scalability

Recent MSR advances address the practical bottlenecks of heavy fine-tuning and excessive memory cost:

Parameter-Efficient Fine-Tuning (PEFT) and Decoupled Adaptation: IISAN and MMSR adopt small side networks (e.g., Side Adaptation Networks) on top of frozen multimodal backbones, reducing both backward-path memory and training time per epoch by over an order of magnitude relative to full fine-tuning. This line of work also introduces a composite efficiency metric (TPME) that balances training time, parameter count, and GPU memory (Fu et al., 2024).
Single-Pass/Token Summarization and Prompt Compression: Speeder, MMSRARec, and MLLM-MSR utilize item- and user-level summarization or multimodal token compression to minimize token count for LLM-based inference, with reported speedups of up to 4× for both fine-tuning and online serving (Zhong et al., 8 Nov 2025, Wang et al., 24 Dec 2025, Ye et al., 2024).
Gradient Balancing and Adaptive Modality Learning: REVEAL introduces Feedback-guided Visual Extraction (FVE) for prompt tuning and Adaptive Visual Learning (AVL) for dynamic visual-text gradient reweighting, allowing plug-and-play enhancement to almost any MSR backbone (Li et al., 8 Jun 2026).
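The decoupled side-network idea behind the PEFT methods above can be sketched as follows. This is a simplified illustration under stated assumptions, not the IISAN implementation: `frozen_backbone` is a stand-in for a pretrained encoder whose weights never change, and `SideNetwork` uses one scalar gate and one scalar adapter per layer in place of real adapter blocks. Since only the side branch holds trainable parameters and it merely reads the backbone's hidden states, gradients would not need to flow back through the backbone.

```python
from typing import List

Vector = List[float]


def frozen_backbone(x: Vector) -> List[Vector]:
    """Stand-in for a frozen pretrained encoder; returns per-layer hidden states."""
    h1 = [2.0 * v for v in x]
    h2 = [v + 1.0 for v in h1]
    return [h1, h2]


class SideNetwork:
    """Small trainable branch that consumes the frozen hidden states."""

    def __init__(self, num_layers: int, gate: float = 0.5, scale: float = 0.1):
        # Both lists would be trainable parameters in a real setup.
        self.gates = [gate] * num_layers
        self.scales = [scale] * num_layers  # scalar stand-in for a small adapter

    def forward(self, hidden_states: List[Vector]) -> Vector:
        side = [0.0] * len(hidden_states[0])
        for g, s, h in zip(self.gates, self.scales, hidden_states):
            adapted = [s * v for v in h]
            # Gated merge of the adapted backbone state with the running side state.
            side = [g * a + (1.0 - g) * p for a, p in zip(adapted, side)]
        return side

    def num_trainable(self) -> int:
        return len(self.gates) + len(self.scales)


hidden = frozen_backbone([1.0, -1.0])   # computed once, no gradients needed
side_net = SideNetwork(num_layers=len(hidden))
item_embedding = side_net.forward(hidden)
```

Because the backbone outputs do not depend on trainable parameters, they could also be cached across epochs, which is one plausible source of the training-time savings described above.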

5. Empirical Results and Benchmarks

MSR systems have been evaluated extensively on both public and industrial-scale datasets, using metrics such as Hit Rate (HR@K), NDCG@K, AUC, and MRR@K:

Performance Gains: Recent MSR models (e.g., M³BSR, PRISM, CAMMSR, MuSTRec, MMSRARec) report relative gains of 6–33% in NDCG or Recall over both classical sequential baselines (GRU4Rec, SASRec, BERT4Rec) and earlier multimodal approaches (MMSR, MMMLP, FDSA) (Cui et al., 7 Aug 2025, Zhang et al., 16 Jan 2026, Xu et al., 4 Mar 2026, Sahyouni et al., 6 Feb 2026, Wang et al., 24 Dec 2025).
Cold-Start and Transfer: Multimodal and ID-agnostic pipelines consistently alleviate cold-item sparsity, yielding 30–50× boosts in HR@10 for cold items/topics (Li et al., 2024, Song et al., 2023, Wang et al., 2023). Pretrained representations enable robust cross-domain transfer with limited new data (Fan et al., 3 Jun 2025, Song et al., 2023).
Ablation Studies: Empirical decoupling of components demonstrates that modality-specific denoising, contrastive identity/multimodal alignment, mixture-of-experts routing, and category-guided fusion each contribute significantly to final performance, with removal often incurring 3–14% performance drops on core benchmarks (Cui et al., 7 Aug 2025, Li et al., 2024, Hu et al., 2024, Zhang et al., 16 Jan 2026).
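For reference, the ranking metrics named in this section can be computed as in the sketch below, which assumes the common next-item protocol with exactly one ground-truth item per test sequence. Under that assumption NDCG@K reduces to $1/\log_2(\text{rank}+1)$ when the target is in the top K. The function names and the example ranks are illustrative, and `relative_gain` simply shows the arithmetic behind statements such as "x% relative improvement".

```python
import math
from typing import Dict, List, Optional


def hit_rate_at_k(rank: Optional[int], k: int) -> float:
    """1 if the ground-truth item is ranked within the top k, else 0."""
    return 1.0 if rank is not None and rank <= k else 0.0


def ndcg_at_k(rank: Optional[int], k: int) -> float:
    """NDCG with a single relevant item: 1 / log2(rank + 1) inside the top k."""
    return 1.0 / math.log2(rank + 1) if rank is not None and rank <= k else 0.0


def mrr_at_k(rank: Optional[int], k: int) -> float:
    """Reciprocal rank, truncated at k."""
    return 1.0 / rank if rank is not None and rank <= k else 0.0


def evaluate(ranks: List[Optional[int]], k: int) -> Dict[str, float]:
    """Average each metric over test sequences.

    `ranks` holds the 1-based rank of the true next item per sequence,
    or None if the item was not retrieved at all.
    """
    n = len(ranks)
    return {
        f"HR@{k}": sum(hit_rate_at_k(r, k) for r in ranks) / n,
        f"NDCG@{k}": sum(ndcg_at_k(r, k) for r in ranks) / n,
        f"MRR@{k}": sum(mrr_at_k(r, k) for r in ranks) / n,
    }


def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement in percent."""
    return (new - baseline) / baseline * 100.0


# Made-up ranks for four test sequences.
metrics = evaluate([1, 3, None, 12], k=10)
gain = relative_gain(metrics["NDCG@10"], 0.30)  # versus a hypothetical baseline
```

Note that reported numbers also depend on the candidate set (full ranking versus sampled negatives), so metric values from different papers are only comparable when the protocol matches.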

6. Advanced Topics: Reasoning, Interpretability, and Robustness

Reasoning and Chain-of-Thought: MLLMRec-R1 incorporates GRPO-based training with multimodal Chain-of-Thought (CoT) supervision, textualizing visual histories for scalable RL-finetuning and robust ranking improvements (Wang et al., 6 Mar 2026).
Interpretability: MMSRARec utilizes offline MLLM-driven summarization to map item histories to concise keywords, making decision rationales traceable and human-interpretable (Wang et al., 24 Dec 2025).
Fusion Robustness: Approaches such as MDSRec, BiVRec, and REVEAL are reported to be robust to missing or noisy modalities, partial category corruption, and behavioral feedback noise, often showing better cold-start and transfer resilience than both unimodal and naïve fusion baselines (Li et al., 2024, Hu et al., 2024, Li et al., 8 Jun 2026).
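The core of GRPO, referenced above for reasoning-oriented training, is a group-relative advantage: several responses are sampled for the same prompt, and each reward is normalized by the group's mean and standard deviation instead of by a learned value function. The sketch below shows only that normalization step. The reward values are made up, the use of population standard deviation and the `eps` term are implementation assumptions, and how MLLMRec-R1 defines its rewards is not specified here.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population standard deviation
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four sampled rankings of the same user history,
# e.g. 1.0 if the true next item was ranked first and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that beat their group average receive positive advantages and are reinforced, while below-average ones are pushed down. If every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal.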

7. Limitations and Future Directions

Contemporary MSR models, while substantially more robust and performant than prior baselines, exhibit limitations:

Most focus on textual and visual modalities; audio, user-profile, and graph modalities remain under-integrated (Li et al., 2024, Li et al., 2024).
Fine-grained user modeling (e.g., per-session or demographic fusion) is only partially explored (Zhang et al., 16 Jan 2026).
Scalability to extremely long histories and industrial-scale corpora may require further innovations in model compression, online adaptation, and hybrid GNN/Transformer integration (Fu et al., 2024, Zhong et al., 8 Nov 2025).
End-to-end optimization, particularly for LLM/MLLM-based MSR, is an active research area, including joint training/inference and dynamic prompt or CoT adaptation (Wang et al., 24 Dec 2025, Wang et al., 6 Mar 2026).