All-Modality Generative Recommendation

Updated 15 May 2026

All-modality generative recommendation is an emerging paradigm that unifies multimodal user interaction and content synthesis, enabling generation of personalized text, images, and audio.
It uses autoregressive models and cross-modal tokenization to fuse discrete semantic representations across modalities, and applies reinforcement learning to optimize the generated output.
Empirical studies report better personalization and flexibility than retrieval-based systems, with gains on metrics such as NDCG and CLIP similarity.

All-modality generative recommendation is an emerging paradigm in information access systems that unifies multimodal content understanding, user preference modeling, and personalized content generation within a single end-to-end generative framework. Rather than re-ranking existing items for user selection, all-modality generative recommenders synthesize new or predicted user-relevant items (spanning images, text, audio, and other modalities) directly conditioned on a user’s multimodal interaction history. This approach integrates advances in large multimodal models (LMMs), hierarchical discrete semantic tokenization, sequence modeling, and reinforcement learning-driven reward optimization, and is reported to surpass classic retrieval-based methods in both flexibility and personalization (Liu et al., 2 Jun 2025).

1. Paradigm Shift: From Retrieval to Generative, Multimodal Recommendation

Traditional recommender systems are bounded by a fixed corpus: user preferences can be satisfied only by selecting from pre-existing items. Even with sophisticated multimodal fusion, these systems cannot produce novel content outside that corpus (Liu et al., 2 Jun 2025, Hou et al., 31 Oct 2025). All-modality generative recommenders address this limitation in two complementary ways:

Autoregressive generation—Modeling the next user-relevant item as a sequence of discrete tokens (semantic IDs), potentially corresponding to quantized features across text, image, and other modalities, rather than as a label from a closed vocabulary (Zhang et al., 19 Nov 2025, Vandenhirtz et al., 3 Feb 2026).
Modality-universal modeling—Leveraging architectures capable of ingesting and emitting arbitrary combinations of textual, visual, and other content modalities; the same model can decode text, images, or even audio/video representations conditioned on a user's full multimodal interaction history (Liu et al., 2 Jun 2025, Wei et al., 2024).

This results in a system that can, for example, produce a personalized image aligned with a user's demonstrated preferences, an operation not available to retrieval-based recommenders (Liu et al., 2 Jun 2025). The approach extends to video and audio, provided paired data and tokenization mechanisms are available.

2. Modeling Pipeline: Unified Representation and Content Generation

All-modality generative recommendation requires a modeling pipeline that (a) unifies multimodal user and item representations and (b) supports autoregressive or iterative content generation. This is achieved through:

Any-to-any Large Multimodal Models (LMMs)—Unified transformers ingesting both text and image tokens (typically via wordpiece tokenizers for text and VQ-VAE or residual quantization VAE tokenizers for images) and outputting the same (Liu et al., 2 Jun 2025). Autoregression is maintained over a joint vocabulary.
Hierarchical and cross-modal semantic quantization—Item content in each modality is mapped to sequences of discrete tokens using schemes such as multi-level residual quantization, with codebooks trained either through reconstruction (VQ-VAE), self-supervised representation learning (DINO), or contrastive objectives (Zhang et al., 19 Nov 2025, Vandenhirtz et al., 3 Feb 2026, Zhai et al., 20 Jun 2025). Recent advances (e.g., SimCIT) foreground contrastive alignment over mere reconstruction to maximize code discrimination and multimodal integration (Zhai et al., 20 Jun 2025).
Unified input format and fusion—User histories are serialized as joint sequences of semantic IDs across modalities, sometimes with modality separator tokens or learned inter-modality alignments (Liu et al., 2 Jun 2025, Hou et al., 31 Oct 2025, Zhu et al., 30 Mar 2025).
Personalization via prompt design—Generation is conditioned on user context through template-based prompts, e.g., "Please generate a movie poster that would attract my interest. Here are the movies I have watched: …" (Liu et al., 2 Jun 2025). All modalities in the user's history inform autoregressive self-attention.
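
The multi-level residual quantization mentioned above can be illustrated with a minimal sketch. It assumes fixed toy codebooks (in practice the codebooks are learned, for example with an RQ-VAE), and the function names `nearest_code` and `residual_quantize` are hypothetical, not taken from any cited system. Each level quantizes what the previous level left unexplained, so the resulting list of code indices forms a coarse-to-fine semantic ID.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_code(vec: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def residual_quantize(
    vec: Vector, codebooks: Sequence[Sequence[Vector]]
) -> Tuple[List[int], List[float]]:
    """Quantize vec level by level; each level encodes the remaining residual."""
    residual = list(vec)
    semantic_id: List[int] = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        semantic_id.append(idx)
        # Subtract the chosen code so the next level sees only the leftover error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return semantic_id, residual


# Toy, hand-written codebooks (assumption): level 0 is coarse, level 1 is fine.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]

sid, leftover = residual_quantize([1.2, 0.9], CODEBOOKS)
print(sid, leftover)
```

Items whose embeddings are close share a prefix of the semantic ID, which is what lets an autoregressive model treat the ID as a short token sequence rather than an atomic label.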

Typical architectures use a unified backbone with modality-agnostic decoding heads, which allows the same model to generate images, text, or multimodal outputs as required (Liu et al., 2 Jun 2025, Wei et al., 2024). For instance, UniMP uses a frozen vision-language foundation model with cross-attention in every other transformer block to blend visual and textual semantics (Wei et al., 2024).
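
As a sketch of the unified input format described in this section, a user history can be flattened into one token sequence with modality separators. The separator strings (`<txt>`, `<img>`, `<eoi>`) and the token naming scheme are assumptions made for illustration; actual systems define their own vocabularies.

```python
from typing import Dict, List

# Hypothetical special tokens; real vocabularies differ between systems.
MODALITY_SEP = {"text": "<txt>", "image": "<img>"}
ITEM_END = "<eoi>"


def sid_tokens(modality: str, sid: List[int]) -> List[str]:
    """Turn a semantic ID into level-tagged tokens, e.g. <image_0_7>."""
    return [f"<{modality}_{level}_{code}>" for level, code in enumerate(sid)]


def serialize_history(history: List[Dict[str, List[int]]]) -> List[str]:
    """Flatten per-item, per-modality semantic IDs into one joint sequence."""
    tokens: List[str] = []
    for item in history:
        for modality in ("text", "image"):
            if modality in item:  # an item may lack some modalities
                tokens.append(MODALITY_SEP[modality])
                tokens.extend(sid_tokens(modality, item[modality]))
        tokens.append(ITEM_END)
    return tokens


history = [
    {"text": [3, 1], "image": [7, 0]},
    {"text": [2, 2]},  # second item has no image
]
sequence = serialize_history(history)
print(" ".join(sequence))
```

Tagging each token with its quantization level keeps codes from different levels and modalities distinct inside the joint vocabulary.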

3. Training Objectives: Supervised, Contrastive, and Reinforcement Learning

The training pipeline in all-modality generative recommendation combines several complementary stages, followed by a constrained decoding step at inference time:

Supervised Fine-Tuning (SFT)—Models are trained to predict the next item (in its tokenized multimodal representation) given user history, minimizing cross-entropy over the true sequence of code tokens (Liu et al., 2 Jun 2025, Zhang et al., 19 Nov 2025, Hou et al., 31 Oct 2025). Typical loss formulations:

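$\mathcal{L}_{\mathrm{SFT}} = - \sum_{u}\sum_{k=1}^{|\mathcal{H}_u|} \sum_{j=1}^{J} \log p_{\mathrm{LMM}}\bigl(\mathcal{I}_{u_k;j} \mid \mathcal{P}(\mathcal{H}_u[1:k-1]),\mathcal{I}_{u_k;<j}\bigr)$

Here $\mathcal{H}_u$ is the interaction history of user $u$, $\mathcal{P}(\cdot)$ is the prompt built from the first $k-1$ interactions, and $\mathcal{I}_{u_k;j}$ is the $j$-th of the $J$ tokens that encode the $k$-th item.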

Cross-modal alignment and contrastive losses—To ensure the semantic IDs learned from different modalities are sufficiently aligned, contrastive or InfoNCE-style objectives are used both at the codebook learning stage (contrastive quantization, cross-modal reconstruction alignment) and optionally for the entire model (Zhang et al., 19 Nov 2025, Zhai et al., 20 Jun 2025, Vandenhirtz et al., 3 Feb 2026).
Reinforcement Learning Fine-Tuning—Online RL mechanisms (notably Group Relative Policy Optimization, GRPO) are introduced to adjust generation policies based on downstream composite rewards, including future relevance, semantic similarity (CLIP/DINO), diversity, and content aesthetics (NIMA) (Liu et al., 2 Jun 2025). The RL objective is:

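$\mathcal{L}_{\mathrm{GRPO}} = \mathbb{E}_{\mathcal{H},\{\mathcal{I}_j\}}\left[\frac{1}{G}\sum_{j=1}^{G}\Bigl(\min\bigl\{\omega_j A_j,\ \mathrm{clip}(\omega_j,1-\varepsilon,1+\varepsilon)\,A_j\bigr\}-\beta\,\mathrm{KL}\bigl[\pi_\theta(\cdot\mid\mathcal{H})\,\|\,\pi_{\mathrm{ref}}(\cdot\mid\mathcal{H})\bigr]\Bigr)\right]$

Here $G$ is the number of candidate generations $\mathcal{I}_j$ sampled per history $\mathcal{H}$, $\omega_j$ is the probability ratio between the current policy and the sampling policy for $\mathcal{I}_j$, $A_j$ is the advantage of $\mathcal{I}_j$ computed relative to the other rewards in its group, $\varepsilon$ is the clipping range, and $\beta$ weights the KL penalty toward the reference policy $\pi_{\mathrm{ref}}$. As written, the expression is a quantity to maximize.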

Constrained Decoding—Inference leverages prefix-tree (Trie)-masked beam search over valid token paths, ensuring only known or permissible item code sequences are generated. This is essential for both reliability and computational efficiency at industrial scale (Vandenhirtz et al., 3 Feb 2026, Lin et al., 25 Dec 2025).
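
The prefix-tree constraint can be sketched as follows. This is a minimal illustration under stated assumptions: all item codes have the same length, and `toy_score` is a hypothetical stand-in for model log-probabilities. At each step the beam is expanded only with tokens that the trie allows after the current prefix, so every finished hypothesis is a valid catalog item.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Token = int
Prefix = Tuple[Token, ...]


def build_trie(sequences: Sequence[Sequence[Token]]) -> Dict:
    """Nested-dict trie over all valid item code sequences."""
    root: Dict = {}
    for seq in sequences:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
    return root


def allowed_next(trie: Dict, prefix: Prefix) -> List[Token]:
    """Tokens that keep the prefix on a path to some valid item."""
    node = trie
    for tok in prefix:
        node = node[tok]
    return list(node)


def constrained_beam_search(
    score_fn: Callable[[Prefix, Token], float],
    trie: Dict,
    length: int,
    beam_width: int,
) -> List[Tuple[Prefix, float]]:
    """Beam search in which invalid continuations are never scored."""
    beams: List[Tuple[Prefix, float]] = [((), 0.0)]
    for _ in range(length):
        candidates = []
        for prefix, score in beams:
            for tok in allowed_next(trie, prefix):
                candidates.append((prefix + (tok,), score + score_fn(prefix, tok)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


# Toy catalog of three items, each a length-3 code (assumption).
CATALOG = [(1, 2, 3), (1, 2, 4), (2, 1, 1)]
# The unconstrained "model" would prefer (1, 3, 4), which is not a real item.
TARGET = (1, 3, 4)


def toy_score(prefix: Prefix, tok: Token) -> float:
    """Hypothetical scorer: higher when tok is close to the preferred token."""
    return -float(abs(tok - TARGET[len(prefix)]))


beams = constrained_beam_search(toy_score, build_trie(CATALOG), length=3, beam_width=2)
print(beams)
```

In this toy setup the scorer's preferred sequence is not in the catalog, and the constraint redirects the search to the closest valid items instead of emitting an invalid code.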

Leading models enhance tokenization robustness and codebook utilization through multi-aspect alignment (MACRec), hierarchical residual quantization (FusID), and explicit fusion of collaborative signals (CEMG, MSCGRec) (Zhang et al., 19 Nov 2025, Vandenhirtz et al., 3 Feb 2026, Lin et al., 25 Dec 2025, Kim et al., 13 Jan 2026).
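
For the reinforcement learning stage described in this section, the two core ingredients of a GRPO-style update can be sketched in a few lines: advantages normalized within a group of sampled generations, and the clipped surrogate term. This is a simplified illustration that omits the KL penalty and the actual policy model; the function names and the default `clip_eps=0.2` are assumptions.

```python
import math
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_surrogate(
    ratios: List[float], advantages: List[float], clip_eps: float = 0.2
) -> float:
    """Group average of min(w * A, clip(w, 1 - eps, 1 + eps) * A); KL term omitted."""
    terms = []
    for w, a in zip(ratios, advantages):
        w_clipped = min(max(w, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(w * a, w_clipped * a))
    return sum(terms) / len(terms)


# Rewards for G = 4 generations sampled from the same user history (toy values).
rewards = [1.0, 2.0, 3.0, 4.0]
advantages = group_relative_advantages(rewards)
# Ratios of 1.0 correspond to the first update step after sampling.
objective = clipped_surrogate([1.0, 1.0, 1.0, 1.0], advantages)
print(advantages, objective)
```

Because advantages are centered within the group, no separate value network is needed; generations are rewarded only for being better than their siblings sampled from the same history.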

4. Modalities: Conditioning, Fusion, and Generalization

All-modality generative recommendation systems are designed to operate over arbitrary combinations of modalities:

Multimodal item representation—Each item is represented with parallel tokenizations for different modalities (e.g., text, image, collaborative, spatial), either concatenated in late fusion or merged through explicit fusion layers (e.g., soft-gating, MLP, cross-attention) (Zhang et al., 19 Nov 2025, Vandenhirtz et al., 3 Feb 2026, Lin et al., 25 Dec 2025).
User-aware and context-aware conditioning—User histories combine sequentially interleaved multimodal tokens, with personalized prompts guiding generation (Liu et al., 2 Jun 2025). New modalities are incorporated by extending pretrained encoders and quantizers or adapting reward functions to modality-relevant metrics (e.g., LPIPS/SSIM for images, ASR+CLIP for speech) (Liu et al., 2 Jun 2025).
Generalization—While most published models target text and images, frameworks are explicitly proposed as modality-agnostic, with potential for audio/video extension, given appropriate discrete tokenization and evaluation metrics (Liu et al., 2 Jun 2025, Wei et al., 2024).
Collaborative and knowledge signals—Recent models explicitly include collaborative features as an additional modality, encoded and quantized alongside content representations and fused prior to code generation (MSCGRec, CEMG) (Vandenhirtz et al., 3 Feb 2026, Lin et al., 25 Dec 2025).
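
A soft-gating fusion layer of the kind listed above can be sketched as a softmax-weighted sum of modality embeddings. In this illustration the gate logits are dot products with a guiding collaborative embedding; that choice, and the name `soft_gated_fusion`, are assumptions for the sketch and not the formulation of any cited model.

```python
import math
from typing import Dict, List, Tuple


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def soft_gated_fusion(
    modality_embs: Dict[str, List[float]], guide: List[float]
) -> Tuple[List[float], Dict[str, float]]:
    """Weight each modality by its agreement with the guide, then sum."""
    names = sorted(modality_embs)
    gates = softmax([dot(guide, modality_embs[n]) for n in names])
    dim = len(guide)
    fused = [
        sum(g * modality_embs[n][i] for g, n in zip(gates, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, gates))


# Toy embeddings (assumption): the collaborative guide agrees with the text view.
embs = {"text": [1.0, 0.0], "image": [0.0, 1.0]}
fused, gates = soft_gated_fusion(embs, guide=[2.0, 0.0])
print(fused, gates)
```

The fused vector would then be passed to the quantizer, so the resulting semantic IDs reflect whichever modality the gate trusts most for that item.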

A representative overview:

Model	Tokenization	Fusion	Modalities (as reported)
Janus-Pro-1B	VQ-VAE (image), text tokens	Unified Transformer	Image, text
MACRec	RQ-VAE (text, image), cross-modal alignment	T5 + scoring ensemble	Text, image
MSCGRec	RQ (text, image, collab)	Token concatenation + custom relative position encoding	Text, image, collab
UniMP	Flat sequence, vision-language cross-attention	Foundation LM + cross-attention	Image, text, ID
CEMG	RQ-VAE (fusion)	Soft-gating fusion (collab-guided)	Text, image, collab

5. Empirical Evaluation and Comparative Performance

Evaluation frameworks for all-modality generative recommendation are grounded in both information retrieval and generation metrics:

IR-style Top-k metrics: HitRate@K, Recall@K, and NDCG@K, which measure the ranking quality of predicted item sets, with NDCG computed per standard IR conventions (Zhang et al., 19 Nov 2025, Vandenhirtz et al., 3 Feb 2026, Hou et al., 31 Oct 2025).
Semantic and perceptual metrics: CLIP similarity (historical relevance, potential future relevance), DINO (visual similarity), PCS (profile-conditioned similarity), LPIPS, SSIM, MS-SSIM, NIMA (aesthetics) for image generation (Liu et al., 2 Jun 2025).
Qualitative & Human Evaluation: Pairwise win/lose/tie studies comparing generated samples; preference aggregation showing RL-augmented models preferred by human annotators (Liu et al., 2 Jun 2025).
Efficiency & Scalability: Memory footprint, codebook utilization, inference speed (users/second), cold-start and long-tail item coverage, scaling laws (Vandenhirtz et al., 3 Feb 2026, Lin et al., 25 Dec 2025, Zhai et al., 20 Jun 2025).
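
The IR-style Top-k metrics listed above can be computed per user as in the following sketch, which assumes binary relevance and the common logarithmic discount for NDCG. Individual papers may differ in details such as how they average over users.

```python
import math
from typing import List, Set


def hit_rate_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """1.0 if any relevant item appears in the top k, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant items retrieved in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG with a 1 / log2(rank + 1) discount (ranks from 1)."""
    dcg = sum(
        1.0 / math.log2(pos + 2)  # pos is 0-based, so rank + 1 = pos + 2
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


ranked = ["a", "b", "c", "d"]
relevant = {"b", "d"}
print(hit_rate_at_k(ranked, relevant, 3))
print(recall_at_k(ranked, relevant, 3))
print(ndcg_at_k(ranked, relevant, 3))
```

In next-item prediction there is typically a single held-out target per user, in which case Recall@K and HitRate@K coincide.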

The cited studies report that all-modality generative recommenders outperform both single-modality and retrieval-based baselines on their respective benchmarks. Notable findings include:

Janus+SFT+RL achieves the highest relevance and aesthetic scores on MovieLens and PixelRec; RL tuning lifts performance by up to 10 points on CLIP-based metrics (Liu et al., 2 Jun 2025).
MACRec outperforms MQL4GRec by +5–6% NDCG@10 via deep cross-modal alignment (Zhang et al., 19 Nov 2025).
TriAlignGR’s multitask (SID, text, image) alignment surpasses baseline HR@5/NDCG@5 by +13–15% (Zeng et al., 5 May 2026).
MSCGRec, which combines semantic and collaborative signals with constrained decoding, improves NDCG@10 by up to +36% versus prior generative methods and matches or exceeds top discriminative models at industrial scale (Vandenhirtz et al., 3 Feb 2026).
Large benchmark competitions (e.g., Tencent Advertising Algorithm Challenge 2025) identify key contributions from heavy multimodal quantization, explicit feedback/action-type conditioning, and conversion-weighted loss functions (Pan et al., 4 Apr 2026).

6. Challenges, Limitations, and Open Directions

Current all-modality generative recommendation approaches face several critical challenges and research frontiers:

Tokenization conflicts and codebook underutilization—Without adequate contrastive or cross-modal alignment, different modalities can yield redundant or ambiguous semantic IDs, undermining discriminative power (Kim et al., 13 Jan 2026). Strategies such as contrastive tokenization, modality-aware product quantization, and chain-of-thought interest mining mitigate these issues (Zhai et al., 20 Jun 2025, Kim et al., 13 Jan 2026, Zeng et al., 5 May 2026).
Content degradation and semantic opacity—Sequential quantization can discard high-level semantic and latent user interest information (SID Content Degradation—SCD) and produce token sequences without explicit semantic mapping (SID Semantic Opacity—SSO). TriAlignGR addresses this via two-stage CMSA/MDIM before quantization and triangular multitask alignment (Zeng et al., 5 May 2026).
Bias, fairness, and explanation—Distributional and selection biases, particularly popularity bias, pose risks. LLM4Rec introduces causal debiasing (propensity scoring, do-calculus, adversarial losses) and enforces demographic fairness, alongside LLM-based explainable generation (Ma et al., 2 Oct 2025).
Scalability and deployment—While compact code vocabulary and constrained decoding enable large-scale deployment, Trie-based decoding and long code sequences remain potential bottlenecks for massive catalogs (Lin et al., 25 Dec 2025). Memory and inference costs must be continually monitored.
Generalization to novel modalities—Audio, video, spatial, and multi-source context (e.g., sensor/IoT data) are only at the threshold of integration, requiring new tokenizers, reward functions, and training paradigms (Liu et al., 2 Jun 2025, Wei et al., 2024). Open benchmarks (e.g., TencentGR-10M) are starting to include such breadth at scale (Pan et al., 4 Apr 2026).
Robustness and hallucination—Generated content must remain grounded in user history and actual item candidates; hallucination, spurious alignment, and reward hacking can undermine trust and effectiveness (Liu et al., 2 Jun 2025, Hou et al., 31 Oct 2025).
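
Codebook underutilization, the first challenge listed above, is commonly diagnosed with two simple statistics: the fraction of codes that are ever assigned and the perplexity of the code usage distribution. The sketch below is a generic diagnostic under that assumption, not a procedure taken from the cited papers.

```python
import math
from collections import Counter
from typing import List, Tuple


def codebook_utilization(assignments: List[int], codebook_size: int) -> Tuple[float, float]:
    """Return (fraction of codes used, perplexity of the usage distribution)."""
    counts = Counter(assignments)
    used_fraction = len(counts) / codebook_size
    total = len(assignments)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    # Perplexity equals the number of codes in use when usage is uniform.
    return used_fraction, math.exp(entropy)


# Toy assignments (assumption): only 2 of 4 codes are ever selected.
used, perplexity = codebook_utilization([0, 0, 1, 1], codebook_size=4)
print(used, perplexity)
```

A perplexity far below the codebook size indicates that a few codes absorb most items, which reduces the discriminative power of the resulting semantic IDs.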

7. Synthesis and Prospects

All-modality generative recommendation marks a shift in recommender system design, replacing modality boundaries and retrieval constraints with end-to-end generative architectures capable of personalized content synthesis. Approaches built on hierarchical cross-modal tokenization, unified autoregressive modeling, and RL-driven personalization report advantages in personalization, extensibility, and generative capacity (Liu et al., 2 Jun 2025, Zhang et al., 19 Nov 2025, Vandenhirtz et al., 3 Feb 2026).

The crucial architectural ingredients—discrete semantic token spaces, robust multimodal alignment, fusion with collaborative signals, and pragmatic deployment strategies—are rapidly converging toward scalability and effectiveness for real-world information systems. Ongoing research focuses on multimodal scalability, robustness, fairness, and support for new content types, with open industrial benchmarks accelerating progress.

Ultimately, all-modality generative recommendation provides both a practical framework and a conceptual blueprint for next-generation intelligent assistants and content platforms, capable of holistic, user-driven synthesis across text, vision, and beyond (Hou et al., 31 Oct 2025, Wei et al., 2024).