
MQL4GRec: Quantitative Language for Recommendations

Updated 4 September 2025
  • MQL4GRec is a generative recommendation framework that transforms multimodal item content into concise, discrete codewords forming a universal quantitative language.
  • It employs modality-specific encoders and Residual-Quantized VAEs to generate token sequences that align text and image data for robust cross-domain transfer.
  • Empirical evaluations show notable improvements in NDCG and HR metrics, underscoring its effectiveness in bridging PLMs with recommendation systems.

Multimodal Quantitative Language for Generative Recommendation (MQL4GRec) is a generative recommendation framework that unifies multi-domain, multimodal item content into a concise, transferable language composed of discrete codewords. Its core objective is to bridge the gap between general-purpose pre-trained language models (PLMs) and the specific requirements of recommender systems, particularly in leveraging complementary information from diverse modalities (text, image) for robust knowledge transfer and accurate recommendation generation (Zhai et al., 20 Feb 2025). MQL4GRec advances the state-of-the-art by offering a tokenized, quantitative representation that serves as a universal interface across domains and modalities.

1. Unified Quantitative Language Representation

MQL4GRec introduces a unique approach to item representation by transforming multimodal content from various domains into a shared, compact quantitative language. Instead of relying on natural language descriptions or random identifiers, every item is translated into a sequence of discrete codewords, forming a vocabulary that is consistent and interpretable across domains and modalities. This is achieved through quantitative translators—modality-specific encoders followed by multi-level residual quantization—yielding codeword tuples that encode rich semantic information in a concise form.

For each item and each modality (text or image):

  • A frozen encoder (e.g., LLaMA for text, ViT for image) generates a high-dimensional latent $z$.
  • A Residual-Quantized Variational AutoEncoder (RQ-VAE) quantizes $z$ into a tuple of codewords. With $r_1 = z$ and a codebook $\{ v_{ik} \}$ at each quantization level $i$, the model selects
    $$c_i = \arg\min_k \lVert r_i - v_{ik} \rVert_2^2, \qquad r_{i+1} = r_i - v_{i c_i}.$$
  • The resulting sequence $[c_1, \ldots, c_L]$ is annotated with a modality-specific prefix, ensuring information such as source modality is not lost during unification (a minimal sketch of this procedure follows the list).
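
To make the quantization step concrete, here is a minimal sketch of greedy residual quantization plus modality prefixing, assuming NumPy, toy codebook sizes, and an illustrative token format; it is not the authors' implementation.

```python
import numpy as np

def residual_quantize(z, codebooks):
    """Greedy residual quantization of a latent z against per-level codebooks.

    codebooks: list of (K, d) arrays, one codebook per quantization level.
    Returns the selected codeword indices [c_1, ..., c_L].
    """
    codes = []
    r = z  # r_1 = z, the encoder output
    for level_codebook in codebooks:
        # c_i = argmin_k || r_i - v_ik ||_2^2
        dists = np.sum((level_codebook - r) ** 2, axis=1)
        c = int(np.argmin(dists))
        codes.append(c)
        # r_{i+1} = r_i - v_{i c_i}
        r = r - level_codebook[c]
    return codes

def to_tokens(codes, modality):
    """Prefix codewords by modality so text and image tokens stay distinguishable."""
    prefixes = "abcd" if modality == "text" else "ABCD"  # one letter per level (illustrative)
    return [f"<{prefixes[i]}_{c}>" for i, c in enumerate(codes)]

# toy usage: 4 levels, 256 codewords each, 32-dimensional latents
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 32)) for _ in range(4)]
z = rng.normal(size=32)
print(to_tokens(residual_quantize(z, codebooks), "text"))  # e.g. ['<a_17>', '<b_203>', ...]
```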

This tightly coupled, token-based quantitative language offers practical advantages for cross-domain and cross-modal transfer, in contrast to text-based and random-ID baselines, which typically lack such transferability or semantic compactness (Zhai et al., 20 Feb 2025).

2. Quantitative Translators and Modality Handling

Quantitative translators form the architectural backbone of MQL4GRec, deterministically mapping each item’s multimodal content into token sequences in the quantitative language. The RQ-VAE components are modality-aware, utilizing different frozen encoders for text and images, and maintaining distinct codebooks. To prevent modality confusion, codewords in the quantitative sequences are prefixed according to their source (e.g., lower-case for text, upper-case for image), which allows the downstream generative model to compute cross-modal relationships explicitly.

The system also addresses codeword collision—a scenario where distinct items are mapped to the same token sequence—by introducing a reallocation strategy, ranking items based on the distance between their residuals and the assigned codebook vectors, then disambiguating tokens accordingly.
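
The summary above gives only the ranking criterion, so the following sketch is one plausible reading of the reallocation strategy: within a group of colliding items, the item whose residual is closest to the assigned codeword keeps the original tuple, and the remaining items are moved to their next-nearest unused codeword at the last level. All names and details here are hypothetical.

```python
from collections import defaultdict
import numpy as np

def reallocate_collisions(item_codes, last_residuals, last_codebook):
    """Resolve items mapped to identical codeword tuples (one plausible reading).

    item_codes:     {item_id: tuple of codeword indices}
    last_residuals: {item_id: residual vector r_L entering the last level}
    last_codebook:  (K, d) array of last-level codeword vectors
    """
    groups = defaultdict(list)
    for item, codes in item_codes.items():
        groups[codes].append(item)

    resolved = dict(item_codes)
    taken = set(item_codes.values())
    for codes, items in groups.items():
        if len(items) == 1:
            continue
        # rank colliding items by distance between residual and assigned codeword
        items.sort(key=lambda it: np.sum((last_residuals[it] - last_codebook[codes[-1]]) ** 2))
        for item in items[1:]:  # best-matching item keeps the original tuple
            order = np.argsort(np.sum((last_codebook - last_residuals[item]) ** 2, axis=1))
            for k in order:
                candidate = codes[:-1] + (int(k),)
                if candidate not in taken:
                    resolved[item] = candidate
                    taken.add(candidate)
                    break
    return resolved

# toy usage: two items that collide on the tuple (3, 7)
rng = np.random.default_rng(0)
cb = rng.normal(size=(16, 8))
codes = {"item_a": (3, 7), "item_b": (3, 7)}
res = {"item_a": cb[7] + 0.01, "item_b": cb[7] + 0.5}
print(reallocate_collisions(codes, res, cb))  # item_a keeps (3, 7); item_b is shifted
```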

This modality-aware quantization ensures that both shared and complementary knowledge between modalities are efficiently captured, overcoming the traditional mismatch between generic PLM knowledge and the specificity required in recommendation tasks (Zhai et al., 20 Feb 2025).

3. Quantitative Language Generation and Alignment Tasks

Upon obtaining tokenized quantitative representations, MQL4GRec defines generation tasks to infuse semantic and recommendation-specific knowledge:

  • Next Item Generation (NIG): Predicts the token sequence (quantitative language) for the next item, separately for text and image modalities.
  • Asymmetric Item Generation (AIG): Enables cross-modal knowledge transfer by predicting tokens of one modality given those of another, e.g., predicting image tokens from text token history.
  • Quantitative Language Alignment (QLA): Explicitly aligns the quantitative token representations between modalities via joint learning (a toy sketch of how such training pairs can be assembled follows this list).
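
Concretely, each task reduces to pairing an input token sequence with a target token sequence over the shared codeword vocabulary. The sketch below illustrates that pairing under illustrative assumptions; the token format, the item-level reading of QLA, and the helper name `build_examples` are not taken from the paper.

```python
def build_examples(history_text, history_image, target_text, target_image):
    """Build (task, input, target) token-sequence triples for the three task families.

    history_text / history_image: lists of per-item token lists (chronological),
    target_text / target_image:   token lists for the ground-truth next item.
    """
    flat_t = [tok for item in history_text for tok in item]
    flat_i = [tok for item in history_image for tok in item]
    examples = []
    # NIG: predict the next item's tokens within each modality
    examples.append(("NIG-text",  flat_t, target_text))
    examples.append(("NIG-image", flat_i, target_image))
    # AIG: predict one modality's tokens from the other modality's history
    examples.append(("AIG-t2i", flat_t, target_image))
    examples.append(("AIG-i2t", flat_i, target_text))
    # QLA: align the two quantitative languages item by item (illustrative reading)
    for t_item, i_item in zip(history_text, history_image):
        examples.append(("QLA-t2i", t_item, i_item))
        examples.append(("QLA-i2t", i_item, t_item))
    return examples

# toy usage with a two-item history and one target item
hist_t = [["<a_3>", "<b_41>"], ["<a_9>", "<b_7>"]]
hist_i = [["<A_5>", "<B_12>"], ["<A_2>", "<B_30>"]]
print(build_examples(hist_t, hist_i, ["<a_1>", "<b_2>"], ["<A_8>", "<B_4>"])[:2])
```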

Training is divided into pre-training and fine-tuning stages:

  • In pre-training, the system is trained on the source domain using a subset of generation tasks to establish a robust mapping between raw modalities and the quantitative language.
  • In fine-tuning on the target domain, all generation and alignment tasks are combined to transfer and enrich recommendation knowledge via a conditional language generation (negative log-likelihood) objective.

The loss for a generation task is as follows:

$$\mathcal{L}(\theta) = -\sum_j \log P_\theta(Y_j \mid Y_{<j}, X)$$

where $Y$ is the output sequence (the quantitative tokens of the next item), $Y_j$ its $j$-th token, and $X$ the input history expressed in the quantitative language.
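
A minimal sketch of this objective, assuming the model's per-step probability distributions (already conditioned on $X$ and the preceding target tokens) have been computed:

```python
import numpy as np

def generation_loss(step_probs, target_ids):
    """Sequence-level negative log-likelihood: -sum_j log P(Y_j | Y_<j, X).

    step_probs: (T, V) array; row j is the distribution over the vocabulary
                at position j, conditioned on X and Y_<j.
    target_ids: length-T list of ground-truth token ids for the next item.
    """
    eps = 1e-12  # numerical floor to avoid log(0)
    return -sum(np.log(step_probs[j, t] + eps) for j, t in enumerate(target_ids))

# toy usage: 3 target tokens over a 5-token vocabulary
probs = np.array([[0.7, 0.1, 0.1, 0.05, 0.05],
                  [0.2, 0.5, 0.1, 0.1, 0.1],
                  [0.1, 0.1, 0.6, 0.1, 0.1]])
print(generation_loss(probs, [0, 1, 2]))  # -(log 0.7 + log 0.5 + log 0.6)
```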

Furthermore, candidate items generated through separate modality branches are re-ranked by score fusion, for example by averaging their output probabilities if an item appears in both lists.
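
A hedged sketch of this late-fusion step, assuming each modality branch returns a dictionary of candidate scores (items proposed by only one branch keep that branch's score, which is an assumption rather than a detail stated above):

```python
def fuse_candidates(text_scores, image_scores, top_k=10):
    """Merge per-modality candidate scores (e.g., generation probabilities).

    Items proposed by both branches receive the average of their two scores;
    items proposed by only one branch keep that branch's score.
    """
    fused = {}
    for item, s in text_scores.items():
        fused[item] = (s + image_scores[item]) / 2 if item in image_scores else s
    for item, s in image_scores.items():
        fused.setdefault(item, s)
    return sorted(fused, key=fused.get, reverse=True)[:top_k]

# toy usage
print(fuse_candidates({"i1": 0.9, "i2": 0.4}, {"i1": 0.7, "i3": 0.6}, top_k=3))
# -> ['i1', 'i3', 'i2']
```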

4. Knowledge Transfer Across Domains and Modalities

A primary innovation of MQL4GRec is its capacity for efficient recommendation knowledge transfer. By expressing multimodal item data from disparate domains in the same quantitative language, MQL4GRec facilitates universal pre-training and subsequent fine-tuning on downstream tasks. This unification addresses the common limitations of naive PLM-based or unimodal generative recommenders, particularly for:

  • Cold-start scenarios: When user or item interaction data is sparse, quantitative language allows leveraging well-aligned content knowledge across modalities and domains.
  • Heterogeneous catalogs: The model’s vocabulary spans all participating domains, allowing it to bridge gaps between structurally dissimilar datasets.

These transfer properties are empirically validated via substantial improvements in NDCG (11.18%, 14.82%, and 7.95% across three benchmarks), marking a quantitative advance over strong baselines such as VIP5, P5, and TIGER (Zhai et al., 20 Feb 2025).

5. Empirical Evaluation and Ablations

MQL4GRec is thoroughly evaluated using Amazon Product Reviews benchmarks, specifically by:

  • Pre-training on six domains (Pet Supplies, Cell Phones, Automotive, etc.).
  • Fine-tuning on three sequential recommendation datasets (Musical Instruments, Arts Crafts and Sewing, Video Games).
  • Benchmarking against established baselines: GRU4Rec, SASRec, BERT4Rec, MISSRec, P5-CID, VIP5, and TIGER.

Evaluation metrics include HR@K and NDCG@K for $K = 1, 5, 10$. MQL4GRec consistently outperforms both classical sequential recommenders and recent generative/multimodal approaches. Ablation studies confirm that both the set of designed generation tasks and the quantitative language pre-training contribute significantly to performance, with notable performance drops when these components are omitted.
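
For reference, HR@K and NDCG@K for next-item prediction (a single relevant item per test case) can be computed as in the sketch below; this is the standard definition, not code from the paper.

```python
import numpy as np

def hr_ndcg_at_k(ranked_lists, ground_truth, k):
    """HR@K and NDCG@K for next-item prediction with one relevant item per user."""
    hits, ndcg = 0.0, 0.0
    for ranked, target in zip(ranked_lists, ground_truth):
        top_k = ranked[:k]
        if target in top_k:
            hits += 1.0
            rank = top_k.index(target)       # 0-based position of the hit
            ndcg += 1.0 / np.log2(rank + 2)  # single relevant item, so IDCG = 1
    n = len(ground_truth)
    return hits / n, ndcg / n

# toy usage: two users, K = 5
print(hr_ndcg_at_k([["a", "b", "c"], ["x", "y", "z"]], ["b", "q"], k=5))
# -> (0.5, ~0.315)
```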

Inference time is increased due to the need for autoregressive generation, a common property in sequence-to-sequence frameworks. However, this is balanced by the universality and conciseness of the quantitative language representation.

6. Limitations, Challenges, and Future Directions

A limitation of the current MQL4GRec design is its reliance on the availability of full item content; scenarios with missing modality data are not explicitly addressed. The inference speed is constrained by the sequential nature of token generation, which is a known bottleneck for autoregressive models.

Future research directions indicated include:

  • Inference optimization: Exploring more efficient decoding strategies to mitigate generation latency.
  • Robustness to incomplete data: Extending the framework to handle missing text or image content, potentially through semi-supervised learning or imputation within the quantitative language space.
  • Broader modality integration: While the current design focuses on text and image, the architecture is amenable to extension to audio, video, and structured metadata by building appropriate quantitative translators.
  • Enhanced fusion: Investigating more sophisticated multi-view or hierarchical codeword fusion and alignment mechanisms to further boost universality and transfer.

7. Significance and Impact

MQL4GRec advances generative recommendation by framing the learning and transfer of recommendation knowledge as a quantitative language modeling problem. This unification not only improves the ability to generalize across domains and modalities but also provides a tractable tokenization that aligns with state-of-the-art autoregressive generative modeling paradigms. The framework is empirically substantiated to yield robust and transferable recommendations and provides a solid foundation for further multimodal and cross-domain innovations in generative recommender systems (Zhai et al., 20 Feb 2025).

References (1)