Multi-modal Embedding Extraction

Updated 15 May 2026

Multi-modal embedding extraction is a process that creates unified representations by fusing data from text, image, and audio modalities.
Key methodologies integrate Transformer architectures and learned hierarchical codebooks to balance semantic expressiveness with collaborative alignment.
This approach enables effective cross-modal contrastive learning and robust tokenization, with reported relative improvements of up to 20% in recommendation quality.

Generative Recommendation Systems (GRSs) represent a paradigm shift in personalized information access: they unify candidate retrieval, ranking, and, in some scenarios, content creation within a single autoregressive generative framework. Building on advances in large language models (LLMs), multimodal representation learning, and scalable Transformer architectures, GRSs frame recommendation as conditional sequence generation, treating user–item interaction modeling as a structured generative process over semantic or collaborative token spaces. They differ from classical discriminative or multi-stage retrieval pipelines in their potential to integrate world knowledge, exploit scaling laws, and enable new capabilities in reasoning, multi-modality, and adaptive deployment at internet scale (Hou et al., 31 Oct 2025, Yang et al., 9 Jul 2025, Wang et al., 19 Feb 2025).

1. Core Concepts and Mathematical Foundations

A GRS learns a conditional distribution over recommended sequences $S = (s_1, \ldots, s_T)$ given user $u$ and context $c$, which may include the user profile, session state, item features, or rich prompts:

$P(S \mid u, c;\theta) = \prod_{t=1}^T P(s_t \mid s_{<t}, u, c; \theta)$

This formulation subsumes classical next-item prediction, sequential slate generation, and multi-step recommendations. The end-to-end training objective is typically the sequence negative log-likelihood (NLL):

$\mathcal{L}_{\text{NLL}} = -\sum_{(u, c, S) \in D} \sum_{t=1}^{|S|} \log P(s_t \mid s_{<t}, u, c;\theta)$
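The factorization and loss above can be illustrated with a minimal NumPy sketch. It assumes the model has already produced one row of vocabulary logits per step, conditioned on $s_{<t}$, $u$, and $c$; the function names and the toy inputs are hypothetical and not taken from any cited system.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def sequence_nll(step_logits, target_tokens):
    """Negative log-likelihood of one target sequence S = (s_1, ..., s_T).

    step_logits[t] holds the scores over the vocabulary at step t
    (assumed to be already conditioned on s_<t, u and c).
    target_tokens[t] is the index of the observed token s_t.
    """
    log_probs = log_softmax(np.asarray(step_logits, dtype=float))
    targets = np.asarray(target_tokens)
    # Pick log P(s_t | s_<t, u, c) at every step and sum over the sequence.
    picked = log_probs[np.arange(len(targets)), targets]
    return -picked.sum()

# Toy example: 3 steps, vocabulary of 4 tokens, uniform logits.
# Each step then contributes log(4), so the total is 3 * log(4).
toy_logits = np.zeros((3, 4))
toy_nll = sequence_nll(toy_logits, [0, 2, 1])
```

Summing `sequence_nll` over all $(u, c, S)$ triples in the dataset gives $\mathcal{L}_{\text{NLL}}$; in practice the sum is usually averaged over a mini-batch.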

GRSs may represent items as semantic IDs, i.e., discrete, quantized codewords reflecting content and collaborative attributes (Liu et al., 29 Sep 2025, Wang et al., 2024); as text tokens (item titles or descriptions); or as rich multimodal feature tokens. In many industrial deployments, tokenization is performed via learned, hierarchical codebooks designed to balance semantic expressiveness, collaborative alignment, and code assignment diversity (Wang et al., 2024, Zhang et al., 19 Nov 2025).
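As a rough illustration of how a hierarchical codebook turns an item embedding into a semantic ID, the sketch below applies plain residual quantization with fixed, hand-written codebooks. This is an assumption made for brevity: in the systems cited above the codebooks are learned (for example inside a residual-quantized VAE) and trained with additional alignment and diversity losses that are not shown here.

```python
import numpy as np

def residual_quantize(embedding, codebooks):
    """Map one item embedding to a semantic ID, one code per codebook level."""
    residual = np.asarray(embedding, dtype=float)
    codes = []
    for codebook in codebooks:
        # Nearest codeword to the current residual (squared Euclidean distance).
        distances = ((codebook - residual) ** 2).sum(axis=1)
        index = int(distances.argmin())
        codes.append(index)
        # The next level quantizes whatever this level failed to explain.
        residual = residual - codebook[index]
    return tuple(codes), residual

# Hypothetical two-level codebooks for 2-D embeddings.
toy_codebooks = [
    np.array([[0.0, 0.0], [1.0, 1.0]]),     # coarse level
    np.array([[0.0, 0.0], [0.25, -0.25]]),  # fine level
]
semantic_id, leftover = residual_quantize([1.2, 0.8], toy_codebooks)
# semantic_id is (1, 1); leftover is the part no level captured.
```

The tuple of code indices is what the generative model emits token by token; items whose embeddings are close share a prefix, which is what gives the ID its hierarchical structure.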

2. Architectural Paradigms

GRSs encompass multiple architectural strategies:

Decoder-only and encoder–decoder Transformers: Sequence modeling frameworks where item (or action) tokens are generated autoregressively, conditioned on histories, user features, and possibly action or context tokens (Yang et al., 9 Jul 2025, Zou et al., 16 Apr 2026).
LLM-based GRS: Direct fine-tuning or prompting of LLMs, with options to inject collaborative signals via prompt engineering, adapter layers, or explicit ID tokenization (Wang et al., 19 Feb 2025, Yang et al., 9 Jul 2025, Liu et al., 29 Sep 2025).
Multi-modal GRS: Integration of text, image, or video modalities into the token space. Approaches include parallel quantization of multimodal embeddings, late-fusion with modality markers, contrastive cross-modal alignment during codebook learning, and explicit cross-modal generation losses (Zhang et al., 19 Nov 2025, Zhu et al., 30 Mar 2025).
Retrieval-oriented generative frameworks: Use of session-level embedding generation for high-cardinality retrieval, bypassing strict autoregressive positional constraints to enable efficient candidate generation (Liang et al., 16 Aug 2025).
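The retrieval-oriented variant can be sketched as scoring a catalog against a single session embedding. The example below uses exhaustive inner-product scoring as a stand-in for an ANN index and assumes the session embedding has already been produced by the generative model; the function name and toy vectors are hypothetical.

```python
import numpy as np

def top_k_items(session_embedding, item_embeddings, k):
    """Return indices of the k items with the highest inner-product score."""
    scores = item_embeddings @ session_embedding
    k = min(k, len(scores))
    # argpartition isolates the top-k set; the sort then orders it best-first.
    candidates = np.argpartition(-scores, k - 1)[:k]
    return candidates[np.argsort(-scores[candidates])]

# Toy catalog of four items in a 2-D embedding space.
toy_items = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]])
toy_session = np.array([1.0, 0.5])
retrieved = top_k_items(toy_session, toy_items, k=2)
```

At production scale the exhaustive matrix product would be replaced by an approximate index, but the contract stays the same: one embedding in, a ranked candidate list out, with no token-by-token decoding.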

A concise taxonomy is provided in the table below.

Paradigm	Key Features	Example Methods
SID-based GR	Autoregressive on semantic ID tokens; codebook design for content and collaborative signals	TIGER, LC-Rec, LETTER (Wang et al., 2024, Liu et al., 29 Sep 2025)
LLM-as-RS	Direct generative modeling over item text; no discrete codebook	Qwen3, OneRec
Multi-modal GR	Codebook/tokenization for text, image, audio; contrastive alignment	MACRec (Zhang et al., 19 Nov 2025), MGR-LF++ (Zhu et al., 30 Mar 2025)
End-to-end Retrieval	Session-level, position-free embedding generation; ANN retrieval	TBGRecall (Liang et al., 16 Aug 2025), TencentGR (Pan et al., 4 Apr 2026)

3. Tokenization, Multimodal Fusion, and Collaborative Alignment

Efficient and expressive item tokenization is central. SID-based architectures construct identifiers through residual-quantized VAEs or MoE-based codebooks, with regularization losses enforcing semantic reconstruction, collaborative similarity (via contrastive learning on CF embeddings), and codeword utilization diversity (Wang et al., 2024, Zhang et al., 19 Nov 2025). Multimodal GRSs address the longstanding challenge of integrating text, vision, and other signals:

Parallel cross-modal quantization: Jointly learns text/image codebooks, regularized by layerwise contrastive and alignment losses to minimize collision rates and maximize codebook utilization (Zhang et al., 19 Nov 2025).
Late fusion with modality markers: Concatenates modality-specific semantic code sequences, inserting special tokens to clarify modality transitions and applying cross-modal contrastive objectives for alignment (Zhu et al., 30 Mar 2025).
Hybrid generation and ensemble scoring: During decoding, constrained beam search is applied over per-modality vocabularies, with ensemble or explicit alignment losses optimizing the final selection (Zhang et al., 19 Nov 2025).

These mechanisms are reported to improve recommendation performance (by up to 20% relative), yield a more balanced distribution of code assignments, and increase robustness to cold-start and multi-domain signals.
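Two of the ideas above, late fusion with modality markers and a cross-modal contrastive objective, can be sketched in a few lines. The marker strings, the token prefixes, and the one-directional (text-to-image) InfoNCE form are illustrative assumptions, not the exact formulations of the cited methods.

```python
import numpy as np

TEXT_MARKER, IMAGE_MARKER = "<text>", "<image>"  # hypothetical marker tokens

def late_fuse(text_codes, image_codes):
    """Concatenate per-modality code sequences, each led by a modality marker."""
    return ([TEXT_MARKER] + [f"t{c}" for c in text_codes]
            + [IMAGE_MARKER] + [f"v{c}" for c in image_codes])

def info_nce(text_emb, image_emb, temperature=0.1):
    """Text-to-image InfoNCE: row i of text_emb should match row i of image_emb."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature
    # Stable log-softmax over the candidate images for each text row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The matching pair sits on the diagonal.
    return -np.mean(np.diag(log_probs))

fused = late_fuse([3, 1], [2])
aligned_loss = info_nce(np.eye(3), np.eye(3))
misaligned_loss = info_nce(np.eye(3), np.eye(3)[[1, 2, 0]])
```

The loss is low when each item's text and image embeddings point the same way and high when they are mismatched, which is the pressure that pulls the per-modality codebooks toward a shared semantic space.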

4. Training Objectives, Fine-Tuning, and Preference Optimization

Beyond classical sequence NLL, GRSs employ a suite of objectives to address exposure bias, preference ranking, and policy alignment:

GFlowNet-based fine-tuning: Treats multi-step generation as trajectory sampling, optimizing the flow such that $\pi_\theta(\tau) \propto R(\tau)$ for trajectory $\tau$, with rewards incorporating observed positives, collaborative filtering signals, and token-level similarity to exemplars (Wang et al., 19 Jun 2025).
Listwise direct preference optimization: Moves beyond independent token prediction by directly enforcing item-level partial-order preferences (e.g., purchase ≻ click ≻ exposure) through listwise softmax losses (Fu et al., 9 Feb 2026).
Page-wise and session-wise supervision: Densifies gradients and corrects one-to-many ambiguities in paginated or sessionized requests, accelerating convergence and reducing hallucination rates (Zou et al., 16 Apr 2026, Liang et al., 16 Aug 2025).
Reinforcement Learning (RL) and hybrid reward models: Incorporates user preference models, group-relative policy optimization, and supervised NLL regularization to align generation with observed and inferred satisfaction targets (Zou et al., 16 Apr 2026, Xing et al., 27 Feb 2026).
Reflection-correction mechanisms: Applies structured error localization and correction after initial token generation, supervised and further optimized by RL under task/trajectory-level rewards (Xing et al., 27 Feb 2026).
Model editing for cold-start collapse: Facilitates training-free injection of new item token sequences via targeted parameter updates and position-wise gating, yielding strong cold-item recall with updates an order of magnitude faster than retraining (Shen et al., 15 Mar 2026).
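Two of the objectives listed above lend themselves to a compact sketch: a listwise softmax loss over a graded preference order (purchase ≻ click ≻ exposure) and the group-relative advantage used in group-relative policy optimization. Both functions below are simplified illustrations under stated assumptions. The listwise form, which contrasts each item with all lower-graded items, is one plausible instantiation rather than the loss of any cited paper, and the grade encoding (2, 1, 0) is hypothetical.

```python
import numpy as np

def listwise_preference_loss(scores, grades):
    """Softmax loss pushing each item above every item with a lower grade."""
    scores = np.asarray(scores, dtype=float)
    grades = np.asarray(grades)
    losses = []
    for i in range(len(scores)):
        lower = scores[grades < grades[i]]
        if lower.size == 0:
            continue  # nothing is ranked below this item
        group = np.concatenate(([scores[i]], lower))
        # -log softmax of item i within {i} plus its lower-graded items.
        peak = group.max()
        log_norm = peak + np.log(np.exp(group - peak).sum())
        losses.append(log_norm - scores[i])
    return float(np.mean(losses)) if losses else 0.0

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (mean 0, unit scale)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

toy_grades = [2, 1, 0]  # assumed encoding: purchase=2, click=1, exposure=0
consistent = listwise_preference_loss([3.0, 1.0, -1.0], toy_grades)
inverted = listwise_preference_loss([-1.0, 1.0, 3.0], toy_grades)
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Scores that respect the preference order give a small loss and inverted scores a large one. The advantage function needs no learned value model, since each sampled generation is judged only against the others in its group.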

5. Industrial-Scale Datasets, Benchmarks, and System Considerations

With the deployment of GRSs in production, new public benchmarks and system frameworks have emerged:

All-modality advertising datasets: TencentGR-1M/10M (Pan et al., 4 Apr 2026) offer large-scale, multimodal logs with multi-action signals (impression, click, conversion), supporting open evaluation of large generative recommenders. Feature schemas span hashed IDs, multi-modal embeddings, and temporally structured action sequences, with evaluation metrics weighted to reflect practical business objectives (e.g., conversion gain).
Scalability and system co-design: TurboGR (Chai et al., 13 May 2026) demonstrates optimized, jagged-operator-aware training on NPUs, achieving >54% model FLOPs utilization (MFU) and near-linear scalability via fused attention kernels, dynamic jagged load balancing, semi-asynchronous communication, and negative sampling offload. This enables training of GRSs with more than 0.2B parameters and long (8k+) sequences on industrial hardware, a critical step toward production-scale deployment.
Best practices from challenge competitions: Winning solutions in large advertising challenges highlight critical design levers including robust time encoding, semantic ID quantization with long-tail regularization, very large negative banks under contrastive losses, ANN-based scalable inference, and deep backbone architectures. These gains are, however, sensitive to model size, negative pool size, and training stability.

Dataset	#Users	#Items	Interactions	Modalities	Key Features
TencentGR-1M	1M	4.7M	90M (avg. 91/user)	Text, Image	Clicks, exposures
TencentGR-10M	10M	17.5M	973M (avg. 97/user)	Text, Image	Clicks, exposures, conversions

6. Open Problems, Challenges, and Future Directions

GRSs introduce new capabilities but also surface domain-specific technical challenges:

Scaling and bottlenecks: Discrete codebook tokenizations (SID-based) saturate in performance under size scaling, while LLM-as-RS models exhibit smooth scaling laws and up to 20% higher recall at fixed data scale (Liu et al., 29 Sep 2025, Wang et al., 19 Feb 2025). The limited capacity of learned codes to retain semantic information is a critical bottleneck, suggesting a need for hybrid or end-to-end tokenization schemes.
Benchmarking and robustness: Static, single-turn datasets remain insufficient; the field lacks standardized, multi-turn, multi-modal interactive benchmarks (Hou et al., 31 Oct 2025). Robustness to adversarial and cold-start scenarios is an ongoing concern, partially addressed by model editing (Shen et al., 15 Mar 2026) and synthetic agent-based simulations (Hou et al., 31 Oct 2025).
Preference and alignment: Capturing nuanced, multi-objective, and long-horizon user preferences involves integrating reward models, RL fine-tuning, listwise and groupwise ranking objectives, and regularizer-guided token learning (Fu et al., 9 Feb 2026, Zou et al., 16 Apr 2026).
Real-time efficiency: GRSs must optimize prompt processing, ANN lookup, and beam search, minimize latency, and manage resource-intensive negative sampling. Industry solutions rely on system-level co-design, including hierarchical sparse parallelism, pipeline orchestration, and asynchronous updates (Chai et al., 13 May 2026).
Explainability, fairness, and ethical considerations: Interpretable decision-making, fairness under streaming data and demographic shifts, and resilience to popularity and position biases receive increasing scrutiny (Hou et al., 31 Oct 2025).

Ongoing research focuses on dynamic benchmarking with user agents, unified multi-task LLM backbones, hybrid continuous–discrete representations, and advanced training techniques for continual learning, prompt adaptation, and efficient inference. The confluence of data-centric engineering, scaling law–informed architecture, and optimized system design defines the present and near future of generative recommendation research at scale (Wang et al., 19 Feb 2025, Chai et al., 13 May 2026, Hou et al., 31 Oct 2025).

7. Summary Table: Key GRS Methodologies

GRS Type	Tokenization & Input	Objective(s)	Major Algorithms/Models
SID-based GenRec	Quantized item codes	NLL, InfoNCE, RL/DPO	TIGER, LC-Rec, LETTER (Wang et al., 2024)
Multi-Modal GR	Codebooks per modality	Contrastive, reconstruction	MACRec (Zhang et al., 19 Nov 2025), MGR-LF++ (Zhu et al., 30 Mar 2025)
LLM-as-RS	Item text, prompts	Generative LM fine-tuning	Qwen3, OneRec
Retrieval-oriented GenRec	Embeddings, session IDs	Contrastive, InfoNCE	TBGRecall (Liang et al., 16 Aug 2025), TencentGR (Pan et al., 4 Apr 2026)
RL/Preference Aligned	SID/sequence tokens	RL (GRPO, DPO), ranking	GenRec (Zou et al., 16 Apr 2026), RankGR (Fu et al., 9 Feb 2026)
Error Correction/Editing	Token-level sequences	SFT, RL, parameter editing	GRC (Xing et al., 27 Feb 2026), GenRecEdit (Shen et al., 15 Mar 2026)

Each methodology addresses fundamental requirements of large-scale, industrial recommendation: unification of retrieval and ranking, adaptation to rich multimodal content, alignment to true user preference hierarchies, and robustness to both system scaling and dynamic inputs.

Generative Recommendation Systems thus consolidate decades of research into a new, cohesive paradigm, leveraging LLM-induced scaling laws, end-to-end differentiability, and unified sequence modeling to create highly adaptable, knowledge-infused, and potentially fully conversational recommendation engines (Hou et al., 31 Oct 2025, Wang et al., 19 Feb 2025, Yang et al., 9 Jul 2025). Key open directions include surmounting tokenization-induced bottlenecks, fully harnessing cross-modal context, and translating algorithmic advances into reliably scalable, fair, and interpretable production deployments.