LLaDA-Rec: Diffusion-Based Generative Recommendation
- LLaDA-Rec is a discrete diffusion-based framework for generative recommendation, producing semantic IDs in parallel with bidirectional attention.
- It overcomes autoregressive limitations by employing iterative masked token generation and dual masking objectives to better model intra-item and inter-item dependencies.
- Empirical evaluations on real-world datasets show significant improvements in Recall@5 and NDCG, with ablation studies confirming the contribution of each component.
LLaDA-Rec is a discrete diffusion-based generative recommendation framework that advances next-item prediction by generating semantic ID (SID) sequences in parallel with bidirectional attention. Traditional semantic-ID recommenders are constrained by left-to-right (autoregressive) decoding, which impedes global semantic modeling and accumulates errors through fixed generation order. LLaDA-Rec circumvents these limitations by recasting recommendation as an iterative, masked parallel generation task using a discrete diffusion process, supported by a tailored parallel tokenization scheme, dual masking objectives, and an inference-stage adaptive beam search. These innovations enable LLaDA-Rec to model both intra-item and inter-item dependencies and yield improved performance across several real-world datasets.
1. Problem Formulation and Limitations of Autoregressive Decoders
Generative recommendation predicts the next item a user will interact with by generating its semantic ID $s = (s_1, \dots, s_L)$, a length-$L$ discrete token sequence in which each token $s_l$ is drawn from a shared vocabulary of size $V$. Given a user history of $n$ items, represented as an ordered SID sequence $h = (s^{(1)}, \dots, s^{(n)})$, the task is to model

$$p_\theta\bigl(s^{(n+1)} \mid s^{(1)}, \dots, s^{(n)}\bigr).$$
Prior methods such as TIGER and LC-Rec use autoregressive decoding, factorizing the conditional probability left to right:

$$p_\theta\bigl(s^{(n+1)} \mid h\bigr) = \prod_{l=1}^{L} p_\theta\bigl(s^{(n+1)}_l \mid s^{(n+1)}_{<l},\, h\bigr).$$
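A minimal sketch of how such an autoregressive decoder scores a candidate SID under this factorization; the `decoder(history, prefix)` interface is a hypothetical stand-in, not the actual API of TIGER or LC-Rec:

```python
import torch

def autoregressive_score(decoder, history, sid):
    """Score a candidate SID under the left-to-right factorization:
    log p(sid | history) = sum_l log p(sid[l] | sid[:l], history).

    decoder(history, prefix) -> logits over the token vocabulary for the
    next position; this signature is illustrative only.
    """
    total = 0.0
    for l in range(len(sid)):
        logits = decoder(history, sid[:l])                  # conditions only on the prefix
        total += logits.log_softmax(dim=-1)[sid[l]].item()
    return total  # a low-probability early token drags down every continuation
```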
Autoregression causes two fundamental problems:
- Unidirectional Constraints: Each token can attend only to the tokens before it, hindering modeling of global coherence within the SID.
- Error Accumulation: Mistakes in early token predictions propagate, increasing the likelihood of mispredicting subsequent tokens.
2. Discrete Diffusion Framework for Parallel SID Generation
LLaDA-Rec replaces autoregressive decoding with a discrete diffusion process that generates all SID tokens in parallel under bidirectional attention. Generation proceeds in $T$ denoising steps, starting from a fully masked SID and iteratively reconstructing the masked tokens.
Forward (Noise) Process
A family of masking operators, parameterized by a mask ratio $t \in [0, 1]$, is defined: each token of the target SID is independently replaced by a special [MASK] token with probability $t$,

$$q_t\bigl(\tilde{s}_l = \text{[MASK]} \mid s_l\bigr) = t, \qquad q_t\bigl(\tilde{s}_l = s_l \mid s_l\bigr) = 1 - t.$$

As $t$ increases toward $1$, the original token sequence is progressively masked until all positions are hidden.
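A minimal sketch of this forward corruption in PyTorch; the reserved `MASK_ID` value is an illustrative assumption, not a detail from the paper:

```python
import torch

MASK_ID = 0  # hypothetical id reserved for the [MASK] token

def forward_mask(tokens: torch.Tensor, t: float) -> torch.Tensor:
    """Independently replace each token with [MASK] with probability t.

    tokens: (batch, length) integer SID tokens.
    t:      mask ratio in [0, 1]; t = 1 hides every position.
    """
    mask = torch.rand_like(tokens, dtype=torch.float) < t
    return torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)

# Example: a batch of two length-4 SIDs, roughly half the positions masked.
sids = torch.tensor([[11, 42, 7, 93], [5, 18, 64, 29]])
noisy = forward_mask(sids, t=0.5)
```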
Reverse (Denoising) Process
A single mask predictor $f_\theta$, a bidirectional Transformer, reconstructs clean tokens from partially masked input. For each diffusion step, a mask ratio $t$ is selected, and $f_\theta$ is trained to recover the original tokens $s$ given the masked sequence $\tilde{s}$ and the user-history context $h$, i.e., to predict $p_\theta(s_l \mid \tilde{s}, h)$ for every masked position $l$.
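The standard (single-sample) reverse process can be sketched as a confidence-based unmasking loop; the `mask_predictor(history, sid) -> logits` signature and the linear unmasking schedule are assumptions for illustration, and Section 5 describes the beam-search variant used for top-$K$ recommendation:

```python
import torch

def reverse_denoise(mask_predictor, history, L, T, mask_id=0):
    """Single-sample reverse process (no beam search).

    Starts from an all-[MASK] SID of length L and, over T steps, commits the
    most confident predictions while keeping the rest masked.
    mask_predictor(history, sid) -> logits of shape (L, vocab_size) is an
    assumed interface, not the paper's exact one.
    """
    sid = torch.full((L,), mask_id, dtype=torch.long)
    for step in range(T):
        probs = mask_predictor(history, sid).softmax(dim=-1)   # (L, V)
        probs[:, mask_id] = 0.0                                 # never predict the reserved mask token
        conf, pred = probs.max(dim=-1)                          # per-position confidence
        conf[sid != mask_id] = float("-inf")                    # never overwrite filled slots
        n_fill = (L * (step + 1)) // T - int((sid != mask_id).sum())  # linear unmasking schedule
        fill_pos = conf.topk(max(n_fill, 0)).indices            # most confident masked slots
        sid[fill_pos] = pred[fill_pos]
    return sid
```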
Masking-based Training Objectives
LLaDA-Rec employs two complementary losses:
- User-history Masking Loss ($\mathcal{L}_{\text{his}}$): the cross-entropy over masked history tokens,
$$\mathcal{L}_{\text{his}} = -\,\mathbb{E}\Bigl[\sum_{l \in \mathcal{M}_{\text{his}}} \log p_\theta\bigl(h_l \mid \tilde{h}, \tilde{s}\bigr)\Bigr],$$
where $\tilde{h}$ and $\tilde{s}$ denote the masked history and masked next-item SID, and $\mathcal{M}_{\text{his}}$ is the set of masked history positions.
- Next-item Masking Loss ($\mathcal{L}_{\text{next}}$): the cross-entropy over masked next-item tokens,
$$\mathcal{L}_{\text{next}} = -\,\mathbb{E}\Bigl[\sum_{l \in \mathcal{M}_{\text{next}}} \log p_\theta\bigl(s_l \mid \tilde{h}, \tilde{s}\bigr)\Bigr].$$
The final training objective sums the two:
$$\mathcal{L} = \mathcal{L}_{\text{his}} + \mathcal{L}_{\text{next}}.$$
3. Parallel Tokenization and Bidirectional Representation Learning
SID generation benefits from parallel and order-agnostic tokenization to leverage the bidirectional model. Standard Residual Quantization VAE (RQ-VAE) hierarchically orders tokens, which is misaligned with bidirectional Transformers. LLaDA-Rec addresses this with a Multi-Head Vector Quantized VAE (Multi-Head VQ-VAE):
- Each item embedding (e.g., from Sentence-T5) is encoded by an MLP into a latent vector $z \in \mathbb{R}^{d}$, which is split into $L$ equal sub-vectors $z_1, \dots, z_L$.
- Each sub-vector $z_l$ is quantized by nearest-neighbor search in a codebook $\mathcal{C}_l$ to yield token $s_l$.
- The $L$ tokens form the SID $s = (s_1, \dots, s_L)$, with no cross-token ordering bias.
- Reconstruction proceeds by concatenating the quantized sub-vector (token) embeddings and decoding with an MLP.
This approach achieves parallelism in ID construction, matching the bidirectional architecture and facilitating global semantic modeling.
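A minimal sketch of the multi-head quantization step, assuming one codebook per head and Euclidean nearest-neighbor lookup (both illustrative choices; the paper's details may differ):

```python
import torch

def multihead_quantize(z: torch.Tensor, codebooks: list) -> list:
    """Split an item latent z into L equal sub-vectors and quantize each by
    nearest-neighbor search in its own codebook.

    z:          (d,) item latent produced by the MLP encoder.
    codebooks:  list of L tensors, each (K, d // L) -- one codebook per head.
    Returns the semantic ID as a list of L token indices.
    """
    L = len(codebooks)
    sub_vectors = z.chunk(L)                            # L pieces of size d // L
    sid = []
    for z_l, C_l in zip(sub_vectors, codebooks):
        dists = torch.cdist(z_l.unsqueeze(0), C_l)      # (1, K) Euclidean distances
        sid.append(int(dists.argmin()))
    return sid

# Illustrative usage with made-up sizes: d = 128, L = 4 heads, K = 256 codes per head.
d, L, K = 128, 4, 256
codebooks = [torch.randn(K, d // L) for _ in range(L)]
item_latent = torch.randn(d)
sid = multihead_quantize(item_latent, codebooks)         # e.g. [17, 203, 5, 88]
```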
4. Dual Masking Regimens in Model Training
Every training example (user history plus next item) is subjected to two simultaneous masking regimens:
- History-level Masking: A random mask ratio $t_{\text{his}}$ is sampled and applied to the tokens in $h$, driving $\mathcal{L}_{\text{his}}$.
- Next-item-level Masking: An independent ratio $t_{\text{next}}$ masks tokens in $s^{(n+1)}$ for $\mathcal{L}_{\text{next}}$.
The model processes the concatenation $[\tilde{h}; \tilde{s}]$ as input and is optimized to predict the true tokens at every [MASK] position. This strategy enhances the model's capacity for both intra-item and inter-item dependency modeling.
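A sketch of one training step under this dual masking scheme, assuming a `model(tokens) -> logits` interface and a reserved `MASK_ID` (both illustrative):

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical reserved [MASK] token id

def dual_mask_loss(model, history, next_item):
    """Dual-masking training step (illustrative sketch).

    history:   (H,) SID tokens of the user's past items.
    next_item: (L,) SID tokens of the ground-truth next item.
    model(tokens) -> logits of shape (seq_len, vocab_size) is an assumed
    interface for the bidirectional mask predictor.
    """
    t_his, t_next = torch.rand(2)                      # independent mask ratios
    his_mask = torch.rand(history.shape) < t_his
    next_mask = torch.rand(next_item.shape) < t_next

    noisy_his = torch.where(his_mask, torch.full_like(history, MASK_ID), history)
    noisy_next = torch.where(next_mask, torch.full_like(next_item, MASK_ID), next_item)

    logits = model(torch.cat([noisy_his, noisy_next]))  # one pass over the concatenation
    H = history.numel()

    loss_his = F.cross_entropy(logits[:H][his_mask], history[his_mask]) if his_mask.any() else 0.0
    loss_next = F.cross_entropy(logits[H:][next_mask], next_item[next_mask]) if next_mask.any() else 0.0
    return loss_his + loss_next                         # L = L_his + L_next
```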
5. Inference: Discrete Diffusion Decoding and Adaptive Beam Search
Standard discrete diffusion yields only one sample, and classic beam search presumes left-to-right generation. To generate top-$K$ SID candidates, LLaDA-Rec introduces an adaptive-order beam search:
- Initialization: All $L$ positions of the next-item SID are masked; maintain a beam set $\mathcal{B}$ of size $B$.
- At Each Step ($i = 1, \dots, T$):
- For each unfilled position $j$ in each beam, compute the token distribution $p_\theta(s_j \mid \cdot)$ over the vocabulary and record its maximum probability.
- Select the unfilled positions with the highest max-probabilities into a candidate set $\mathcal{P}_i$.
- Expand the beam over the positions in $\mathcal{P}_i$ by scoring the highest-probability tokens at each chosen position, iteratively pruning candidates by their joint probabilities.
- Update each sequence in $\mathcal{B}$ with the chosen tokens; re-mask the remaining unfilled positions for the next iteration.
After $T$ steps, all beams are completed SIDs. The final top-$K$ recommendations are selected by their joint model probability.
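A minimal sketch of one plausible instantiation of this decoder, filling a single position per step (so $T = L$); the interface, schedule, and pruning details are illustrative assumptions rather than the paper's exact procedure:

```python
import torch

def adaptive_beam_search(mask_predictor, history, L, B, mask_id=0):
    """Adaptive-order beam search, one position filled per step.

    mask_predictor(history, sid) -> logits of shape (L, vocab_size) is an
    assumed interface. Returns the B highest-scoring SIDs with their
    joint log-probabilities.
    """
    beams = [(torch.full((L,), mask_id, dtype=torch.long), 0.0)]
    for _ in range(L):                                    # L steps fill every position
        candidates = []
        for sid, score in beams:
            log_probs = mask_predictor(history, sid).log_softmax(dim=-1)  # (L, V)
            log_probs[:, mask_id] = float("-inf")         # never emit the reserved mask token
            unfilled = (sid == mask_id).nonzero().squeeze(-1)
            # Adaptive order: fill the position the model is currently most confident about.
            pos = unfilled[log_probs[unfilled].max(dim=-1).values.argmax()]
            top_lp, top_tok = log_probs[pos].topk(B)
            for lp, tok in zip(top_lp.tolist(), top_tok.tolist()):
                new_sid = sid.clone()
                new_sid[pos] = tok
                candidates.append((new_sid, score + lp))  # accumulate joint log-probability
        # Prune to the B best partial SIDs by joint probability.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:B]
    return beams
```

Filling one position per step is the simplest schedule; committing several positions per step would reduce the number of Transformer passes at some cost in accuracy.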
6. Empirical Benchmarking and Ablation
LLaDA-Rec has been evaluated on three Amazon 2023 subsets—Scientific, Instrument, and Game—spanning tens of thousands of users and items, with average session lengths 8.1–8.9. Metrics include Recall@1,5,10 and NDCG@5,10, against a spectrum of ID-based, diffusion-based, and semantic-ID generative recommender baselines:
- ID-based: GRU4Rec, SASRec, BERT4Rec, FMLP-Rec, LRURec
- Diffusion-based: DreamRec, DiffuRec
- Semantic-ID generative: VQ-Rec, TIGER / TIGER-SAS, LETTER, LC-Rec (RQ-VAE), RPG
Across all datasets and metrics, LLaDA-Rec achieves either the best result or one statistically indistinguishable from the best, with absolute gains of roughly 2–4 points in Recall@5 over prior semantic-ID methods and similar lifts in NDCG.
Ablation studies show clear dependencies:
- Substituting Multi-Head VQ-VAE with RQ-VAE or orthogonal quantization drops performance by 5–15%.
- Removing either masking loss ($\mathcal{L}_{\text{his}}$ or $\mathcal{L}_{\text{next}}$) reduces Recall@5 by 10–20%.
- Replacing adaptive beam search with greedy sampling collapses all metrics to the 1–2% range, since only a single candidate is produced.
7. Theoretical Implications, Complexity, and Future Directions
Discrete diffusion unifies generative modeling and retrieval by directly outputting semantic IDs, obviating the need for latent matching with item embedding tables. The use of bidirectional attention and adaptive generation order counters the primary drawbacks of autoregressive decoders.
Inference requires $T$ Transformer passes (each involving full bidirectional self-attention over the concatenated history and target SID) plus per-position beam pruning. Empirically, a moderate number of denoising steps and a moderate beam size strike a favorable balance between accuracy and latency.
Identified limitations and areas for further work include:
- Fixed-length SIDs: Extending the method to variable-length SIDs would generalize its applicability.
- Decoding Efficiency: Adapting distillation or one-step denoising techniques from diffusion LLMs may accelerate inference.
- Enhanced Personalization: Integrating richer user profile information, textual or graph-based, into the diffusion process offers a promising extension.
In summary, LLaDA-Rec demonstrates that discrete diffusion, paired with parallel tokenization, dual masking, and adaptive-order beam search, constitutes an effective and robust framework for semantic-ID–based generative recommendation, overcoming key challenges faced by prior autoregressive approaches.