
All-Modality Generative Recommendation

Updated 15 May 2026
  • All-modality generative recommendation is an emerging paradigm that unifies multimodal user interaction and content synthesis, enabling generation of personalized text, images, and audio.
  • It combines autoregressive models with cross-modal tokenization to fuse discrete semantic representations across modalities, and uses reinforcement learning to optimize generation.
  • Empirical studies report better personalization and flexibility than retrieval-based systems, with gains on metrics such as NDCG and CLIP similarity.

All-modality generative recommendation is an emerging paradigm in information access systems that unifies multimodal content understanding, user preference modeling, and personalized content generation within a single end-to-end generative framework. Rather than re-ranking existing items for user selection, all-modality generative recommenders synthesize new or predicted user-relevant items—spanning images, text, audio, and other modalities—directly conditioned on a user’s multimodal interaction history. This approach integrates advances in large multimodal models (LMMs), hierarchically discrete semantic tokenization, sequence modeling, and reinforcement learning-driven reward optimization, surpassing classic retrieval-based methods in both flexibility and personalization (Liu et al., 2 Jun 2025).

1. Paradigm Shift: From Retrieval to Generative, Multimodal Recommendation

Traditional recommender systems can satisfy user preferences only by selecting from a fixed corpus of pre-existing items. Even with sophisticated multimodal fusion, these systems cannot produce content that is not already in that corpus (Liu et al., 2 Jun 2025, Hou et al., 31 Oct 2025). All-modality generative recommenders address this limitation in two complementary ways:

  • Autoregressive generation—Modeling the next user-relevant item as a sequence of discrete tokens (semantic IDs), potentially corresponding to quantized features across text, image, and other modalities, rather than as a label from a closed vocabulary (Zhang et al., 19 Nov 2025, Vandenhirtz et al., 3 Feb 2026).
  • Modality-universal modeling—Leveraging architectures capable of ingesting and emitting arbitrary combinations of textual, visual, and other content modalities; the same model can decode text, images, or even audio/video representations conditioned on a user's full multimodal interaction history (Liu et al., 2 Jun 2025, Wei et al., 2024).

This results in a system that can, for example, produce a personalized image aligned with a user's demonstrated historical preferences—an operation not accessible to retrieval-based recommenders (Liu et al., 2 Jun 2025). The approach is extensible to video and audio, provided paired data and tokenization mechanisms are available.

2. Model Architectures and Cross-Modal Tokenization

All-modality generative recommendation requires a modeling pipeline that (a) unifies multimodal user and item representations and (b) supports autoregressive or iterative content generation.

Typical architectures pair a unified backbone with modality-agnostic decoding heads, so the same model can generate images, text, or multimodal outputs as required (Liu et al., 2 Jun 2025, Wei et al., 2024). For instance, UniMP uses a frozen vision-language foundation model with cross-attention at every alternate transformer block to blend visual and textual semantics (Wei et al., 2024).

3. Training Objectives: Supervised, Contrastive, and Reinforcement Learning

The training pipeline in all-modality generative recommendation proceeds through distinct but complementary stages, including supervised fine-tuning (SFT) and reinforcement learning with a GRPO objective:

$$\mathcal{L}_{\mathrm{SFT}} = - \sum_{u}\sum_{k=1}^{|\mathcal{H}_u|} \sum_{j=1}^{J} \log p_{\mathrm{LMM}}\bigl(\mathcal{I}_{u_k;j} \mid \mathcal{P}(\mathcal{H}_u[1:k-1]),\mathcal{I}_{u_k;<j}\bigr)$$
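
To make the indexing in the SFT objective concrete, the following minimal sketch computes the token-level negative log-likelihood for a single user. The function name `sft_loss` and the toy probabilities are illustrative assumptions; in a real system the probabilities would come from the LMM's softmax output, conditioned on the prompt built from the preceding history and on the item's earlier tokens.

```python
import math

def sft_loss(token_probs_per_item):
    """Negative log-likelihood summed over items k and token positions j.

    token_probs_per_item[k][j] is the probability the model assigned to the
    ground-truth j-th token of the k-th target item (hypothetical input; a
    real pipeline would read these from the model's output distribution).
    """
    loss = 0.0
    for item_probs in token_probs_per_item:   # sum over k
        for p in item_probs:                  # sum over j
            loss -= math.log(p)
    return loss

# Toy user with two target items, each tokenized into J = 3 tokens.
probs = [[0.5, 0.25, 1.0], [1.0, 0.5, 0.5]]
loss = sft_loss(probs)
print(round(loss, 4))
```

Summing over users, as in the outer sum of the objective, amounts to calling the function once per user and adding the results.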

$$\mathcal{L}_{\mathrm{GRPO}} = \mathbb{E}_{\mathcal{H},\{\mathcal{I}_j\}}\left[\frac{1}{G}\sum_{j=1}^{G}\Bigl(\min\{\omega_j A_j,\mathrm{clip}(\omega_j,1\pm\varepsilon)A_j\}-\beta\,\mathrm{KL}[\pi_\theta(\cdot|\mathcal{H})\|\pi_{\mathrm{ref}}(\cdot|\mathcal{H})]\Bigr)\right]$$
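
The clipped group objective can be sketched in a few lines for one history and one group of G sampled items. This is a hedged illustration, not the cited implementation: the text above does not define the advantage, so the sketch assumes the standard GRPO choice of standardizing rewards within the group. The function names, the default values of `clip_eps` and `beta`, and the scalar `kl` estimate are all illustrative assumptions.

```python
import math
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (assumed GRPO convention)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def grpo_objective(logp_new, logp_old, rewards, kl, clip_eps=0.2, beta=0.04):
    """Average clipped surrogate minus a KL penalty for one group.

    logp_new / logp_old: sequence log-probabilities of each sampled item under
    the current and the sampling policy; kl: a scalar KL estimate against the
    reference policy (hypothetical inputs).
    """
    advantages = group_advantages(rewards)
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)                          # omega_j
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))  # clip(omega_j, 1 +/- eps)
        terms.append(min(ratio * adv, clipped * adv) - beta * kl)
    return sum(terms) / len(terms)

rewards = [1.0, 0.0]
# Identical policies: ratios are 1 and the standardized advantages cancel.
same = grpo_objective([-1.0, -1.0], [-1.0, -1.0], rewards, kl=0.0)
# Ratio of 2 for both samples: the positive-advantage term is clipped at 1.2,
# while the negative-advantage term keeps the unclipped (more pessimistic) value.
shifted = grpo_objective([math.log(2.0)] * 2, [0.0, 0.0], rewards, kl=0.0)
```

The `min` keeps the more pessimistic of the clipped and unclipped terms, which is why only the positive-advantage sample is affected by clipping in the second call.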

Leading models enhance tokenization robustness and codebook utilization through multi-aspect alignment (MACRec), hierarchical residual quantization (FusID), and explicit fusion of collaborative signals (CEMG, MSCGRec) (Zhang et al., 19 Nov 2025, Vandenhirtz et al., 3 Feb 2026, Lin et al., 25 Dec 2025, Kim et al., 13 Jan 2026).
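
As a rough illustration of hierarchical residual quantization, the sketch below maps an embedding to a semantic ID by choosing one codeword per level and passing the remaining residual to the next level. The codebooks here are fixed toy values and the function names are hypothetical; in RQ-VAE-style tokenizers the codebooks are learned jointly with an encoder and decoder, which this sketch omits.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec in squared Euclidean distance."""
    def dist(code):
        return sum((a - b) ** 2 for a, b in zip(code, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def residual_quantize(vec, codebooks):
    """Return (semantic_id, final_residual) with one code index per level."""
    residual = list(vec)
    semantic_id = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        semantic_id.append(idx)
        # Subtract the chosen codeword; the next level quantizes what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return semantic_id, residual

# Toy two-level codebooks (assumed values, not learned).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],    # coarse level
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25]],   # fine level
]
sid, res = residual_quantize([1.2, 1.0], codebooks)
print(sid)
```

Each additional level refines the approximation, so items that share a prefix of code indices are close in embedding space. This is the property that lets an autoregressive model generate items token by token.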

4. Modalities: Conditioning, Fusion, and Generalization

All-modality generative recommendation systems are designed to operate over arbitrary combinations of modalities:

  • Multimodal item representation—Each item is represented with parallel tokenizations for different modalities (e.g., text, image, collaborative, spatial), either concatenated in late fusion or merged through explicit fusion layers (e.g., soft-gating, MLP, cross-attention) (Zhang et al., 19 Nov 2025, Vandenhirtz et al., 3 Feb 2026, Lin et al., 25 Dec 2025).
  • User-aware and context-aware conditioning—User histories combine sequentially interleaved multimodal tokens, with personalized prompts guiding generation (Liu et al., 2 Jun 2025). New modalities are incorporated by extending pretrained encoders and quantizers or adapting reward functions to modality-relevant metrics (e.g., LPIPS/SSIM for images, ASR+CLIP for speech) (Liu et al., 2 Jun 2025).
  • Generalization—While most published models target text and images, frameworks are explicitly proposed as modality-agnostic, with potential for audio/video extension, given appropriate discrete tokenization and evaluation metrics (Liu et al., 2 Jun 2025, Wei et al., 2024).
  • Collaborative and knowledge signals—Recent models explicitly include collaborative features as an additional modality, encoded and quantized alongside content representations and fused prior to code generation (MSCGRec, CEMG) (Vandenhirtz et al., 3 Feb 2026, Lin et al., 25 Dec 2025).
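
One simple way to picture the fusion step is a soft gate that weights each modality embedding before summing. The sketch below is an assumption-laden illustration: the source names soft-gating but does not give its formula, so a softmax over per-modality scalar logits is used here, and the names `soft_gate_fuse` and `gate_logits` are hypothetical. In a trained model the logits would be produced by a learned network, for example one guided by collaborative signals.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def soft_gate_fuse(embeddings, gate_logits):
    """Weighted sum of same-dimension modality embeddings.

    embeddings: dict mapping modality name -> list of floats
    gate_logits: dict mapping modality name -> scalar logit (assumed given)
    """
    names = sorted(embeddings)
    weights = softmax([gate_logits[n] for n in names])
    dim = len(embeddings[names[0]])
    fused = [0.0] * dim
    for name, w in zip(names, weights):
        for i, value in enumerate(embeddings[name]):
            fused[i] += w * value
    return fused, dict(zip(names, weights))

item = {"text": [1.0, 0.0], "image": [0.0, 1.0], "collab": [1.0, 1.0]}
fused, weights = soft_gate_fuse(item, {"text": 0.0, "image": 0.0, "collab": 0.0})
```

With equal logits every modality receives weight 1/3; raising one logit shifts the fused vector toward that modality. The fused vector would then be passed to the quantizer.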

A representative overview:

| Model | Tokenization | Fusion | Modalities (paper) |
| --- | --- | --- | --- |
| Janus-Pro-1B | VQ-VAE (images), text | Unified Transformer | Image, text |
| MACRec | RQ-VAE (text, image), cross-modal alignment | T5 + scoring ensemble | Text, image |
| MSCGRec | RQ (text, image, coll) | Token concat + custom rel. pos. | Text, image, collab |
| UniMP | Flat sequence, VL cross-attn | Foundation LM + Cross-attn | Image, text, ID |
| CEMG | RQ-VAE (fusion) | Soft-gating fusion (collab-guided) | Text, image, collab |

5. Empirical Evaluation and Comparative Performance

Evaluation frameworks for all-modality generative recommendation combine information-retrieval metrics (e.g., HR@K and NDCG@K) with generation-quality metrics (e.g., CLIP-based similarity and aesthetic scores).
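
For reference, the retrieval-side metrics cited in this section can be computed as follows in the common next-item setting, where each test case has exactly one relevant item. This is a minimal sketch under that single-target assumption. The cosine helper stands in for a CLIP-style similarity, which is typically a cosine between embeddings; the function names and toy inputs are illustrative, and no embedding model is invoked here.

```python
import math

def hit_rate_at_k(ranked, target, k):
    """1.0 if the target item appears in the top-k list, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0

def ndcg_at_k(ranked, target, k):
    """NDCG@K with one relevant item: the ideal DCG is 1, so the score
    is 1 / log2(rank + 1) for a hit at 1-based position `rank`."""
    for pos, item in enumerate(ranked[:k]):
        if item == target:
            return 1.0 / math.log2(pos + 2)
    return 0.0

def cosine(u, v):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

ranked = ["a", "b", "c"]
hr = hit_rate_at_k(ranked, "b", k=2)
ndcg = ndcg_at_k(ranked, "b", k=2)
```

Reported HR@K and NDCG@K values are averages of these per-case scores over all test users.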

Empirically, all-modality generative recommenders are reported to outperform both single-modality and retrieval-based systems across the domains and modalities studied. Notable empirical findings include:

  • Janus+SFT+RL achieves the highest relevance and aesthetic scores on MovieLens and PixelRec; RL tuning lifts performance by up to 10 points on CLIP-based metrics (Liu et al., 2 Jun 2025).
  • MACRec outperforms MQL4GRec by +5–6% NDCG@10 via deep cross-modal alignment (Zhang et al., 19 Nov 2025).
  • TriAlignGR’s multitask (SID, text, image) alignment surpasses baseline HR@5/NDCG@5 by +13–15% (Zeng et al., 5 May 2026).
  • MSCGRec, fusing semantic, collaborative, and constrained decoding, improves NDCG@10 by up to +36% versus prior generative methods and matches or exceeds top discriminative models at industrial scale (Vandenhirtz et al., 3 Feb 2026).
  • Large benchmark competitions (e.g., Tencent Advertising Algorithm Challenge 2025) identify key contributions from heavy multimodal quantization, explicit feedback/action-type conditioning, and conversion-weighted loss functions (Pan et al., 4 Apr 2026).

6. Challenges, Limitations, and Open Directions

Current all-modality generative recommendation approaches face several critical challenges and research frontiers:

  • Tokenization conflicts and codebook underutilization—Without adequate contrastive or cross-modal alignment, different modalities can yield redundant or ambiguous semantic IDs, undermining discriminative power (Kim et al., 13 Jan 2026). Strategies such as contrastive tokenization, modality-aware product quantization, and chain-of-thought interest mining mitigate these issues (Zhai et al., 20 Jun 2025, Kim et al., 13 Jan 2026, Zeng et al., 5 May 2026).
  • Content degradation and semantic opacity—Sequential quantization can discard high-level semantic and latent user interest information (SID Content Degradation—SCD) and produce token sequences without explicit semantic mapping (SID Semantic Opacity—SSO). TriAlignGR addresses this via two-stage CMSA/MDIM before quantization and triangular multitask alignment (Zeng et al., 5 May 2026).
  • Bias, fairness, and explanation—Distributional and selection biases, particularly popularity bias, pose risks. LLM4Rec introduces causal debiasing (propensity scoring, do-calculus, adversarial losses) and enforces demographic fairness, alongside LLM-based explainable generation (Ma et al., 2 Oct 2025).
  • Scalability and deployment—While compact code vocabulary and constrained decoding enable large-scale deployment, Trie-based decoding and long code sequences remain potential bottlenecks for massive catalogs (Lin et al., 25 Dec 2025). Memory and inference costs must be continually monitored.
  • Generalization to novel modalities—Audio, video, spatial, and multi-source context (e.g., sensor/IoT data) are only at the threshold of integration, requiring new tokenizers, reward functions, and training paradigms (Liu et al., 2 Jun 2025, Wei et al., 2024). Open benchmarks (e.g., TencentGR-10M) are starting to include such breadth at scale (Pan et al., 4 Apr 2026).
  • Robustness and hallucination—Generated content must remain grounded in user history and actual item candidates; hallucination, spurious alignment, and reward hacking can undermine trust and effectiveness (Liu et al., 2 Jun 2025, Hou et al., 31 Oct 2025).
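
The Trie-based constrained decoding mentioned above, which also helps keep outputs grounded in actual item candidates, can be sketched as follows. The sketch is simplified and rests on stated assumptions: it uses greedy decoding instead of beam search, and `score_fn` is a hypothetical stand-in for the model's next-token scores. All function names are illustrative.

```python
def build_trie(semantic_ids):
    """Nested-dict prefix tree over the catalog's semantic IDs."""
    root = {}
    for sid in semantic_ids:
        node = root
        for tok in sid:
            node = node.setdefault(tok, {})
    return root

def allowed_next(trie, prefix):
    """Tokens that extend `prefix` toward at least one catalog item."""
    node = trie
    for tok in prefix:
        node = node.get(tok)
        if node is None:
            return []
    return sorted(node)

def constrained_greedy_decode(trie, score_fn, length):
    """Pick the best-scoring valid token at each step (greedy, no beams)."""
    prefix = []
    for _ in range(length):
        candidates = allowed_next(trie, prefix)
        if not candidates:
            break
        prefix.append(max(candidates, key=lambda t: score_fn(prefix, t)))
    return prefix

catalog = [(1, 4, 2), (1, 5, 0), (3, 4, 2)]
trie = build_trie(catalog)
# Toy context-free scores; a real model would condition on the prefix.
scores = {0: 0.4, 1: 0.9, 2: 0.5, 3: 0.1, 4: 0.2, 5: 0.8}
decoded = constrained_greedy_decode(trie, lambda prefix, t: scores[t], length=3)
print(decoded)
```

In the last step token 2 has a higher toy score than token 0, but only 0 completes a catalog item after the prefix (1, 5), so the decoder cannot emit an ID that maps to no item. The cost noted in the scalability bullet shows up here as one trie lookup per step per hypothesis, which grows with code length and beam width.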

7. Synthesis and Prospects

All-modality generative recommendation represents a foundational shift in recommender system design, dissolving modality boundaries and retrieval constraints in favor of end-to-end generative architectures capable of personalized content synthesis. This paradigm, exemplified by approaches leveraging hierarchical cross-modal tokenization, unified autoregressive modeling, RL-driven personalization, and rigorous empirical evaluation, demonstrates clear advantages in personalization, extensibility, and generative capacity (Liu et al., 2 Jun 2025, Zhang et al., 19 Nov 2025, Vandenhirtz et al., 3 Feb 2026).

The crucial architectural ingredients—discrete semantic token spaces, robust multimodal alignment, fusion with collaborative signals, and pragmatic deployment strategies—are rapidly converging toward scalability and effectiveness for real-world information systems. Ongoing research focuses on multimodal scalability, robustness, fairness, and support for new content types, with open industrial benchmarks accelerating progress.

Ultimately, all-modality generative recommendation provides both a practical framework and a conceptual blueprint for next-generation intelligent assistants and content platforms, capable of holistic, user-driven synthesis across text, vision, and beyond (Hou et al., 31 Oct 2025, Wei et al., 2024).
