Multimodal Generative Recommendation (MGR)
- Multimodal Generative Recommendation (MGR) is a framework that leverages generative models to synthesize item identifiers by integrating diverse data modalities like text, images, audio, and structured metadata.
- MGR employs advanced tokenization and alignment strategies—such as vector quantization, contrastive pretraining, and modality-specific tokens—to overcome challenges of modality dominance and misalignment.
- By combining autoregressive sequence models, gated fusion, and relation-aware self-attention, MGR systems deliver stronger personalization, better scalability, and consistent empirical gains in recommendation quality.
Multimodal Generative Recommendation (MGR) refers to a family of frameworks and methodologies that leverage generative modeling techniques to recommend items by integrating and generating across multiple data modalities—such as text, images, audio, and structured metadata. In contrast to conventional retrieval- or embedding-based recommender systems, MGR architectures operate directly over rich semantic item representations, often producing recommendations by autoregressively generating item identifiers or content in a modality-aware manner. The development of MGR addresses limitations of previous recommender paradigms in modality alignment, robustness, scalability, and the ability to capture cross-modal user preferences.
1. Core Principles and Challenges
MGR systems are characterized by two primary advances over prior architectures: (1) the use of generative models that predict or synthesize target item identifiers/content given a user’s multimodal interaction history, and (2) the design of shared or unified item representations that capture complementary signals from heterogeneous modalities.
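To make this recipe concrete, the minimal sketch below flattens a user's history of per-item semantic IDs into one token sequence and scores candidate codes for the next item. The model, vocabulary size, and code tuples are illustrative stand-ins, not any specific published architecture.

```python
import torch
import torch.nn as nn

# Hypothetical semantic IDs: each item is a short tuple of discrete codes
# produced upstream by a multimodal tokenizer (see Section 2).
history = [(12, 7, 3), (45, 2, 9), (31, 7, 0)]   # three items, 3 codes each

VOCAB, DIM = 256, 64                             # assumed code vocabulary / width

class NextCodeModel(nn.Module):
    """Toy autoregressive scorer over semantic-ID tokens."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):                   # tokens: (batch, seq_len)
        h = self.encoder(self.embed(tokens))     # causal masking omitted for brevity
        return self.head(h[:, -1])               # logits for the next code

tokens = torch.tensor([[c for item in history for c in item]])
logits = NextCodeModel()(tokens)                 # shape (1, VOCAB)
```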
However, the inherent heterogeneity of modalities creates several challenges:
- Modality Sensitivity: Naïve multimodal fusion strategies frequently allow one modality (commonly text) to dominate the learned representation, suppressing or losing the semantic signal from others, e.g., vision (Zhu et al., 30 Mar 2025); a simple diagnostic for this collapse is sketched after this list.
- Modality Correspondence and Alignment: Simple concatenation (late fusion) leads to misalignment—generated semantic IDs for different modalities may no longer refer to the same item (Zhu et al., 30 Mar 2025). Cross-modal prediction and explicit alignment mechanisms are needed.
- Conditional Generation across Modalities: The assumption that the next token in a generative model is a continuation of the previous one breaks down when modalities switch (e.g., from text to image), confusing the predictive process (Zhu et al., 30 Mar 2025).
- Scalability: Generative modeling with large multimodal vocabularies (e.g., billion-scale POI indices) requires efficient indexing and decoding strategies (Lin et al., 22 Aug 2025).
- Generalizability and Transferability: Domain- or modality-specific tokenizers hamper adaptation across domains or content types (Zheng et al., 6 Apr 2025).
- Robustness to Noisy and Missing Modalities: The fusion scheme must adaptively manage varying data quality and incomplete multimodal inputs (Liu et al., 30 May 2025).
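As a rough illustration of the first challenge, one can compare a fused item embedding against each unimodal embedding; if one modality's similarity dwarfs the rest, fusion has likely collapsed onto that modality. The function below is an illustrative heuristic, not a method from the cited papers.

```python
import torch
import torch.nn.functional as F

def dominance_scores(fused, modality_embs):
    """Cosine similarity of the fused embedding to each unimodal embedding."""
    return {name: F.cosine_similarity(fused, emb, dim=0).item()
            for name, emb in modality_embs.items()}

# If 'text' scores near 1.0 while 'image' sits near 0, the fused
# representation has effectively discarded the visual signal.
fused = torch.randn(64)
scores = dominance_scores(fused, {"text": torch.randn(64),
                                  "image": torch.randn(64)})
```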
2. Item Representation and Tokenization in MGR
A foundational element for generative recommendation is the transformation of raw multimodal content into discrete codes or tokenized identifiers that the generative model can sequence or produce. Several approaches have been proposed:
- Vector Quantization with Modality Separation: Systems like MMGRec and MGR-LF++ generate per-modality semantic codes for items, often via a residual-quantized VAE (RQ-VAE) or a similar quantizer (Liu et al., 25 Apr 2024, Zhu et al., 30 Mar 2025); a minimal residual-quantization sketch follows this list. Early fusion (a single shared encoder) loses modality specificity; late fusion (concatenated unimodal IDs) preserves it at the expense of alignment challenges (Zhu et al., 30 Mar 2025).
- Contrastive Modality Alignment: MGR-LF++ introduces contrastive pretraining tasks that map image semantic IDs to text semantic IDs and vice versa, ensuring the generative model learns to produce corresponding semantic codes across modalities (Zhu et al., 30 Mar 2025); a generic alignment loss of this kind is sketched at the end of this section.
- Special Tokens for Modality Transitions: To address conditional generation failures when switching modalities, special tokens are inserted at boundaries to demarcate modality context, guiding the model to treat the next token as a new modality segment (Zhu et al., 30 Mar 2025).
- Universal Tokenization: UTGRec deploys a universal tokenizer that condenses multimodal (text and image) content into a shared sequence of discrete codes, leveraging multimodal LLMs (e.g., Qwen2-VL) and tree-structured codebooks with prefix residual operations to ensure both cross-modality and cross-domain transferability (Zheng et al., 6 Apr 2025).
- Quantitative Language Unification: MQL4GRec converts text and image descriptors into a shared quantitative token language via an RQ-VAE mechanism, enabling unified knowledge transfer and cross-modality knowledge sharing (Zhai et al., 20 Feb 2025).
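The residual-quantization idea behind several of these tokenizers is compact enough to sketch directly: each codebook level encodes the residual left over from the previous level, so an item embedding becomes a short tuple of discrete codes. Shapes, codebook sizes, and the nearest-neighbor assignment below are illustrative, not the trained RQ-VAE of any particular system.

```python
import torch

def residual_quantize(x, codebooks):
    """x: (dim,) item embedding; codebooks: list of (K, dim) tensors."""
    codes, residual = [], x
    for cb in codebooks:
        idx = torch.cdist(residual[None], cb).argmin()  # nearest codeword
        codes.append(int(idx))
        residual = residual - cb[idx]                   # quantize the remainder
    return codes

torch.manual_seed(0)
codebooks = [torch.randn(256, 64) for _ in range(3)]    # 3 levels x 256 codes
item_embedding = torch.randn(64)                        # from a multimodal encoder
print(residual_quantize(item_embedding, codebooks))     # e.g. [137, 52, 201]
```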
These tokenization strategies not only facilitate cross-modal generation but also embed rich semantic and collaborative knowledge, crucial for effective downstream sequence modeling.
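Cross-modal contrastive alignment of the kind described above is typically implemented as a symmetric InfoNCE-style objective over paired embeddings. The sketch below shows that generic form; the temperature and in-batch negatives are standard choices, not MGR-LF++'s exact configuration.

```python
import torch
import torch.nn.functional as F

def cross_modal_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """Pull paired image/text embeddings together, push apart non-pairs."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / temperature          # (batch, batch) similarities
    targets = torch.arange(len(img))            # i-th image pairs with i-th text
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

loss = cross_modal_alignment_loss(torch.randn(32, 64), torch.randn(32, 64))
```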
3. Generative Sequence Modeling and Decoding
Once discrete multimodal codes are available, MGR architectures typically use autoregressive sequence models (usually Transformers) to predict the next item:
- Autoregressive Generation: The model predicts the next item’s code sequence (i.e., its semantic ID) one token at a time, given the user’s historical sequence (also encoded as codes). Beam search is often employed at inference to generate top-k item candidates efficiently (Liu et al., 25 Apr 2024); a plain decoding sketch appears at the end of this section.
- Relation-aware Self-Attention: MMGRec addresses the lack of a natural sequential order in some domains by incorporating relation-aware self-attention, replacing absolute positional encodings with user-specific pairwise relation encodings (Liu et al., 25 Apr 2024), as sketched just after this list.
- Unified Multi-modal Personalization: Frameworks such as UniMP treat downstream recommendation-related tasks—recommendation, preference prediction, explanation generation, even image generation—as next-token generation problems within a unified personalized generative model (Wei et al., 15 Mar 2024).
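A common way to realize relation-aware self-attention is an additive learned pairwise bias on the attention logits in place of absolute positional encodings. The single-head sketch below shows this pattern and leaves the source of the bias (user-specific item-item relations in MMGRec) abstract.

```python
import torch
import torch.nn.functional as F

def relation_aware_attention(q, k, v, relation_bias):
    """q, k, v: (seq, dim); relation_bias: (seq, seq) learned pairwise scores."""
    scores = q @ k.T / q.shape[-1] ** 0.5 + relation_bias  # bias replaces positions
    return F.softmax(scores, dim=-1) @ v

seq_len, dim = 8, 64
q = k = v = torch.randn(seq_len, dim)
# Random values stand in for the learned user-specific relation encodings.
out = relation_aware_attention(q, k, v, torch.randn(seq_len, seq_len))
```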
For scalability in large-scale catalogs, efficient tokenization (e.g., hierarchical POI indexing in Spacetime-GR) and generation strategies are central to real-world deployment (Lin et al., 22 Aug 2025).
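At inference, producing top-k candidates reduces to beam search over code levels, keeping the top-B partial semantic IDs at each step. In the sketch below, `step_logits` is a hypothetical stand-in for the trained model; production systems typically also constrain decoding to prefixes of valid item IDs.

```python
import torch

def beam_search(step_logits, num_levels=3, beam=4):
    """step_logits(prefixes) -> (num_prefixes, vocab) log-probs (assumed API)."""
    beams = [([], 0.0)]                              # (code prefix, score)
    for _ in range(num_levels):
        logp = step_logits([b[0] for b in beams])    # score next code per beam
        cand = [(p + [c], s + logp[i, c].item())
                for i, (p, s) in enumerate(beams)
                for c in range(logp.shape[1])]
        beams = sorted(cand, key=lambda x: -x[1])[:beam]
    return beams                                     # top-B item semantic IDs

# Toy stand-in model: random log-probs over a 16-code vocabulary.
fake = lambda prefixes: torch.log_softmax(torch.randn(len(prefixes), 16), dim=-1)
print(beam_search(fake))
```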
4. Fusion and Alignment of Multimodal Information
Effective use of multiple modalities requires carefully designed fusion and alignment mechanisms:
- Late Fusion with Alignment: MGR-LF++’s late fusion architecture avoids modality dominance by learning separate representations for each modality and fusing them after explicit contrastive alignment (Zhu et al., 30 Mar 2025).
- Gated and Adaptive Fusion: RLMultimodalRec employs a gating mechanism to dynamically adjust the importance of each modality when item content quality varies, improving robustness and interpretability (Liu et al., 30 May 2025); a minimal gate of this kind is sketched at the end of this section.
- Joint Graph and Hypergraph Structures: LGMRec’s paradigm decouples collaborative and modality-specific signals at the local graph level and then mines global dependency structures via cross-modal hypergraphs, integrating them through additive fusion and normalization (Guo et al., 2023).
- Disentangled and Interpretable Spaces: Models like DGVAE disentangle semantic representations and exploit mutual information maximization to align latent spaces originating from different modalities, promoting interpretability in recommendations (Zhou et al., 25 Feb 2024).
Approaches such as EGRA further propose enhanced behavior graphs built from pretrained modality-aware embeddings and bi-level dynamic alignment weighting for fine-grained and progressive representation alignment across entities and training epochs (Zhang et al., 22 Aug 2025).
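A minimal version of the gated fusion mechanism mentioned above fits in a few lines: a small network predicts per-modality weights, so low-quality or missing modalities can be down-weighted per item. Dimensions and the softmax gate are illustrative choices, not RLMultimodalRec's exact design.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Weight each modality embedding by a learned, input-dependent gate."""
    def __init__(self, dim, num_modalities=2):
        super().__init__()
        self.gate = nn.Linear(dim * num_modalities, num_modalities)

    def forward(self, modality_embs):              # list of (batch, dim) tensors
        stacked = torch.stack(modality_embs, 1)    # (batch, M, dim)
        w = torch.softmax(self.gate(torch.cat(modality_embs, -1)), -1)
        return (w.unsqueeze(-1) * stacked).sum(1)  # (batch, dim) fused embedding

fused = GatedFusion(64)([torch.randn(8, 64), torch.randn(8, 64)])
```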
5. Knowledge Transfer, Adaptation, and Practical Scaling
Modern MGR systems emphasize domain and modality transfer, efficiency, and industrial deployment:
- Unified Quantitative Language and Multi-domain Knowledge Transfer: By expressing both text and images as a shared token vocabulary and training on multi-domain data, MQL4GRec enables robust transfer and leverages unified language for cross-domain and cross-task generalization (Zhai et al., 20 Feb 2025).
- Multimodal Pretraining and Adaptation: Recent surveys highlight prevailing strategies such as reconstructive/contrastive/autoregressive pretraining, prompt tuning, and module-based adapter tuning (e.g., LoRA) for modular adaptation (Liu et al., 31 Mar 2024).
- Industrial Training Frameworks: At production scale, frameworks like MTGRBoost address the computational costs of dense sequence modeling over massive datasets by introducing dynamic hash tables for real-time insertion/deletion, dynamic sequence balancing for GPU load, two-level embedding ID deduplication, and operator fusion for kernel efficiency. This enables near-linear scaling when deployed on hundreds of GPUs (Wang et al., 19 May 2025).
- User-level Compression and Feature Reuse: MTGR compresses candidate scoring by grouping all candidate items for a user into a single batch, fully reusing user-side computations, and decoupling user and candidate complexity. Cross features from DLRMs are preserved within a transformer-like model by sequencing all candidate–user interactions jointly (Han et al., 24 May 2025).
- Scalability to Billion-scale Catalogs: Spacetime-GR employs geographic-aware hierarchical POI indexing (reducing a 100M-entry vocabulary to ~400K tokens) together with carefully engineered post-training adaptation methods for real-time, online industrial POI recommendation (Lin et al., 22 Aug 2025); the vocabulary arithmetic is illustrated below.
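Composing item identifiers from a few small per-level vocabularies addresses a huge catalog with few unique tokens: L levels of K codes each cover K^L items while the model only needs about L·K token types. The flat base decomposition below is a generic toy, not Spacetime-GR's geographic index.

```python
def hierarchical_code(item_id, levels=3, base=512):
    """Decompose a flat item id into `levels` digits of a base-`base` code."""
    codes = []
    for _ in range(levels):
        item_id, digit = divmod(item_id, base)
        codes.append(digit)
    return codes[::-1]

# 512**3 ≈ 134M addressable items, using only 3 * 512 = 1,536 token types
# (plus per-level offsets); hierarchical POI indexing exploits the same idea.
print(hierarchical_code(99_999_999))  # [381, 240, 255]
```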
6. Extensions, Applications, and Emerging Directions
Ongoing research and deployment contexts illuminate both current limitations and promising frontiers.
- Personalized Content Generation: Some models propose “generate, not recommend”—moving from filtering existing items to directly synthesizing new personalized multimodal content (e.g., personalized movie posters, micro-video covers) via any-to-any large multimodal models trained with a mix of supervised and reinforcement learning objectives (Liu et al., 2 Jun 2025).
- Spatiotemporal Modeling: Spacetime-GR and related methods explicitly encode spatiotemporal context (e.g., through token embeddings for temporal/geographic factors) to better model user behavior in domains such as location-based services (Lin et al., 22 Aug 2025, Kanzawa et al., 4 Oct 2024).
- Cold-Start, Long-tail, and Robustness: Enhanced item graphs (EGRA) and modular gated fusion (RLMultimodalRec) have proven effective at improving long-tail and cold-start recommendation accuracy, including interpretability and modular adaptability for future extension to new modalities (Zhang et al., 22 Aug 2025, Liu et al., 30 May 2025).
- Multimodal User Interaction: Cutting-edge systems embrace cross-modal search, instruction following, and dialogue-based refinement in generative recommendation, leveraging MLLMs for interactive, personalized experiences (Ramisa et al., 17 Sep 2024, Wei et al., 15 Mar 2024).
- Transferable, Adaptable, and Universal Models: Universal tokenization frameworks (UTGRec) and quantized LLMs (MQL4GRec) facilitate rapid adaptation to new domains, supporting scalable joint pretraining and efficient fine-tuning (Zheng et al., 6 Apr 2025, Zhai et al., 20 Feb 2025).
- Efficient Ranking and Deployment: Deployments in industrial settings (Meituan, large-scale e-commerce) validate both the relevance and performance potential of MGR models at massive scale, maintaining sub-linear inference costs, robust candidate scoring, and the integration of cross features into manageable transformer-based pipelines (Han et al., 24 May 2025, Wang et al., 19 May 2025).
7. Empirical Results, Open Challenges, and Future Opportunities
Empirical studies across Amazon, MovieLens, TikTok, Kwai, and large-scale industrial data provide evidence for the superiority of MGR frameworks over traditional and unimodal baselines:
- Metric Gains: MGR models demonstrate relative improvements of 7–20% in NDCG, Recall, and related ranking metrics over leading alternatives (Zhu et al., 30 Mar 2025, Zhai et al., 20 Feb 2025, Zheng et al., 6 Apr 2025, Liu et al., 30 May 2025); standard definitions of these metrics are sketched after this list.
- Ablations and Interpretability: Fine-grained ablations (e.g., fusion strategies, alignment losses, tokenization approaches) validate the contribution of each architecture module. DGVAE, for instance, delivers interpretability by projecting recommendations into human-language summaries (Zhou et al., 25 Feb 2024).
- Robustness: Strategies such as contrastive alignment, gating, and regularization (e.g., Mirror Gradient) consistently improve robustness to noisy, incomplete, or shifting multimodal inputs (Zhong et al., 17 Feb 2024, Liu et al., 30 May 2025).
- Industrial Throughput: Training frameworks achieve up to 2.4× throughput improvements and near-linear GPU scaling; inference maintains sub-linear growth with catalog size (Wang et al., 19 May 2025, Han et al., 24 May 2025).
- Open Problems: Remaining challenges include efficient model adaptation for new domains/modalities, learning fusion strategies that balance low-level and semantic features, computational efficiency of large generative models, and principled interpretability across modalities (Liu et al., 31 Mar 2024, Zhu et al., 30 Mar 2025, Ramisa et al., 17 Sep 2024).
- Emerging Directions: Future research is likely to focus on expanding to more modalities (e.g., audio/video, structured sensor data) (Liu et al., 31 Mar 2024), more universal tokenization strategies (Zheng et al., 6 Apr 2025, Zhai et al., 20 Feb 2025), and conversational/interactive recommendation interfaces (Ramisa et al., 17 Sep 2024).
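For reference, the Recall@k and NDCG@k figures cited above follow the standard top-k ranking definitions, sketched here on a toy ranked list.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of relevant items recovered in the top-k positions."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Discounted gain of hits, normalized by the ideal ordering's gain."""
    dcg = sum(1 / math.log2(i + 2) for i, item in enumerate(ranked[:k])
              if item in relevant)
    idcg = sum(1 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / idcg

ranked = ["i3", "i7", "i1", "i9", "i5"]      # model's top-5 item IDs
relevant = {"i7", "i5"}                      # held-out ground truth
print(recall_at_k(ranked, relevant, 5), ndcg_at_k(ranked, relevant, 5))
```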
Summary Table: Representative MGR Models, Innovations, and Gains
| Model | Innovations | Reported Gains |
|---|---|---|
| MGR-LF++ (Zhu et al., 30 Mar 2025) | Contrastive modality alignment, special tokens, late fusion | +20% vs. unimodal and early-fusion baselines |
| MMGRec (Liu et al., 25 Apr 2024) | Rec-ID via Graph RQ-VAE, autoregressive generation, relation-aware self-attention | Up to +7% NDCG |
| UTGRec (Zheng et al., 6 Apr 2025) | Universal tokenizer, tree codebooks, dual decoders, knowledge alignment | Statistically significant gains over baselines |
| MQL4GRec (Zhai et al., 20 Feb 2025) | Unified quantitative language, cross-modal generation tasks | +11.18%–14.82% NDCG |
| MTGR, MTGRBoost (Han et al., 24 May 2025; Wang et al., 19 May 2025) | User-level compression, GLN, dynamic hash tables | ~2× training speedup, sub-linear inference; largest real-world gains at Meituan |
| Spacetime-GR (Lin et al., 22 Aug 2025) | Geographic-aware POI indexing, multimodal semantic embeddings, spatiotemporal tokens | +1–1.3% AUC; +6% CTR online |
All claims in the table trace to the referenced papers and their documented experimental results.
In sum, Multimodal Generative Recommendation now encompasses a suite of architectures and tools for seamlessly integrating, generating, and aligning recommendations across diverse modalities and domains. By directly operating over semantic tokenizations and leveraging robust generative sequence models, MGR systems realize richer, more adaptive, and more scalable recommender systems. These advances—grounded in empirical improvements and industrial validation—herald a new stage for recommender research and deployment with multimodal and generative intelligence at its core.