MLLMRec-R1: Multimodal Rec via GRPO
- The paper introduces MLLMRec-R1, a GRPO-based post-training framework for multimodal sequential recommendation that reduces online costs by converting images into textual descriptions.
- It employs a three-stage multimodal chain-of-thought process—including pseudo-CoT generation, refinement via DeepSeek-R1, and confidence-aware filtering—to mitigate reward inflation and enhance ranking accuracy.
- Empirical results on MovieLens, Netflix, and MicroLens indicate statistically significant improvements in HR and NDCG metrics, validating the framework's efficiency and robustness.
MLLMRec-R1 is a GRPO-based post-training framework for multimodal sequential recommendation (MSR) that is designed to make reasoning useful, affordable, and stable in a recommendation setting. The framework addresses two obstacles identified for R1-style post-training in MSR: the prohibitive cost of jointly encoding visual content for historical interactions and candidate items during group-based rollout, and reward inflation under naive Chain-of-Thought (CoT) supervision, where higher training rewards do not reliably translate into better ranking. Its central design is to textualize visual signals offline, construct high-quality multimodal CoT supervision through refinement and confidence-aware assessment, and apply GRPO on a mixed dataset that contains mostly standard recommendation samples plus only a small fraction of high-confidence CoT-enriched samples (Wang et al., 6 Mar 2026).
1. Problem formulation and motivating constraints
MLLMRec-R1 studies sequential recommendation with multimodal item content. For a user $u$, the recent interaction sequence is written as
$$S_u = [i_1, i_2, \dots, i_n],$$
and the model predicts the next item $i_{n+1}$. The recommendation decision is cast as instruction-following over a candidate pool
$$\mathcal{C}_u = \{i_{n+1}\} \cup \mathcal{N}_u,$$
where $\mathcal{N}_u$ denotes sampled negatives. In the reported setup, the candidate set typically contains one positive and nine negatives, so the model selects from ten candidates, and all models use only the most recent nine interactions, with the last one serving as prediction target (Wang et al., 6 Mar 2026).
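The candidate pool described above can be sketched as follows. This is a minimal illustration, not the paper's sampler: the function name `build_candidates`, the uniform sampling, the exclusion of already-interacted items from the negatives, and the fixed seed are all assumptions made for the example.

```python
import random
from typing import Iterable, List, Optional, Set


def build_candidates(
    target: str,
    catalog: Iterable[str],
    interacted: Set[str],
    num_negatives: int = 9,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Return a shuffled pool with one positive and `num_negatives` negatives."""
    rng = rng or random.Random(0)  # fixed seed only to keep the example reproducible
    # ASSUMPTION: negatives are drawn uniformly from items the user has not seen.
    pool = [item for item in catalog if item != target and item not in interacted]
    negatives = rng.sample(pool, num_negatives)
    candidates = negatives + [target]
    rng.shuffle(candidates)  # avoid leaking the positive through its position
    return candidates


catalog = [f"ITEM_{n:04d}" for n in range(100)]  # hypothetical item ids
history = ["ITEM_0001", "ITEM_0002", "ITEM_0003"]
candidates = build_candidates("ITEM_0004", catalog, set(history))
print(len(candidates))
```

Shuffling matters in an instruction-following setup because a fixed position for the positive would give the model a shortcut unrelated to user preference.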
The paper identifies two fundamental constraints. The first is visual token explosion. In the reported formulation, one image can become about 196 visual tokens. In MSR, visual inputs exist on both sides of the decision problem: every historical item may carry an image, and every candidate item may also carry an image. Under GRPO, rollout cost is multiplied again by group sampling. The second constraint is reward inflation under CoT supervision. The paper argues that recommendation CoTs can accidentally encode shortcuts or target-related artifacts, so optimizing them directly may raise training reward without improving ranking quality (Wang et al., 6 Mar 2026).
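A rough token count makes the first constraint concrete. The sketch below uses the 196-tokens-per-image figure and the ten-candidate setup from the text; treating all nine recent interactions as history images, a group size of 8, and every rollout re-processing the full prompt are assumptions for illustration only.

```python
# Back-of-the-envelope count of visual tokens under raw-image conditioning.
TOKENS_PER_IMAGE = 196  # figure quoted in the text
HISTORY_ITEMS = 9       # ASSUMPTION: every recent interaction carries one image
CANDIDATE_ITEMS = 10    # one positive plus nine negatives
GROUP_SIZE = 8          # ASSUMPTION: illustrative GRPO group size


def visual_tokens_per_prompt(
    history: int = HISTORY_ITEMS,
    candidates: int = CANDIDATE_ITEMS,
    per_image: int = TOKENS_PER_IMAGE,
) -> int:
    """Visual tokens needed to show history and candidate images once."""
    return (history + candidates) * per_image


def visual_tokens_per_group(group_size: int = GROUP_SIZE) -> int:
    """Visual tokens if each rollout in a group re-processes the prompt."""
    return group_size * visual_tokens_per_prompt()


print(visual_tokens_per_prompt())  # 3724
print(visual_tokens_per_group())   # 29792
```

Under these assumptions a single prompt already carries several thousand visual tokens before any text, which is the overhead that offline textualization is meant to remove.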
These constraints motivate the framework’s systems-level choices. The appendix summarizes the online cost as a sum of per-stage terms, with the GRPO term dominating in practice. The overall design therefore attempts to reduce the effective sequence length by removing raw visual tokens from online training and to stabilize policy improvement by keeping the RL reward tightly aligned with verifiable recommendation correctness rather than with free-form explanation quality (Wang et al., 6 Mar 2026).
2. Offline visual textualization and multimodal representation
A defining component of MLLMRec-R1 is the offline textualization of visual signals. Each item $i$ is represented as a multimodal input
$$x_i = (v_i, t_i),$$
where $v_i$ is the cover image and $t_i$ is the title. An MLLM generates a caption
$$c_i = \mathrm{MLLM}(v_i, t_i).$$
The implementation uses Qwen-VL / Qwen3-VL for this textualization stage (Wang et al., 6 Mar 2026).
The purpose of these captions is not generic captioning but recommendation-oriented semantic compression. The reported descriptions emphasize fine-grained cover-aligned signals such as style, tone, visual motifs, scene composition, character presence, color palette, and aesthetic cues. The paper’s case study associates such descriptions with patterns including monochrome or low-saturation style, serious character-centered compositions, and period-drama tone (Wang et al., 6 Mar 2026).
This design has an important architectural consequence. Although the framework is positioned as a method for multimodal sequential recommendation, the online recommender no longer consumes raw images during SFT or GRPO. Instead, multimodal information is preserved in textual form and injected into prompts and CoT construction. The reported formulation therefore moves the expensive multimodal processing entirely into an offline phase, while the online policy behaves as a text-prompted recommender initialized from a multimodal-capable family (Wang et al., 6 Mar 2026).
The paper presents this as the key efficiency mechanism. By converting images once into reusable textual descriptions, MLLMRec-R1 avoids repeated visual encoder invocation during SFT, GRPO rollout, and inference. The trade-off is implicit rather than fully quantified in a dedicated runtime table: textualization reduces cost substantially in principle, but it may also lose some visual detail relative to direct raw-image conditioning. The framework compensates for that loss by pairing captions with sequence-level reasoning supervision rather than relying on captions alone (Wang et al., 6 Mar 2026).
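The "caption once, reuse everywhere" idea can be sketched as a small on-disk cache. Everything here is hypothetical scaffolding: `build_caption_cache`, `render_item`, the JSON layout, and the stub captioner are illustrative names, and a real pipeline would replace the stub with an actual MLLM call.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict


def build_caption_cache(
    items: Dict[str, dict],
    captioner: Callable[[str, str], str],
    cache_path: Path,
) -> Dict[str, str]:
    """Caption each item at most once and persist the results as JSON."""
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    for item_id, meta in items.items():
        if item_id not in cache:  # skip items captioned in an earlier run
            cache[item_id] = captioner(meta["image_path"], meta["title"])
    cache_path.write_text(json.dumps(cache, indent=2))
    return cache


def render_item(item_id: str, title: str, cache: Dict[str, str]) -> str:
    """Text-only item line for prompts; no image is loaded here."""
    return f"[{item_id}] {title} | cover: {cache[item_id]}"


calls = []


def stub_captioner(image_path: str, title: str) -> str:
    # ASSUMPTION: stands in for an MLLM captioning call.
    calls.append(image_path)
    return f"placeholder description of the cover for {title}"


items = {"ITEM_0001": {"image_path": "covers/0001.jpg", "title": "Example Movie"}}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "captions.json"
    first = build_caption_cache(items, stub_captioner, path)
    second = build_caption_cache(items, stub_captioner, path)  # served from cache
line = render_item("ITEM_0001", "Example Movie", second)
print(line)
```

The second call never reaches the captioner, which mirrors how SFT, rollout, and inference can all read the same stored descriptions.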
3. Multimodal CoT construction, refinement, and confidence filtering
The framework’s CoT pipeline has three stages: caption generation, pseudo-CoT generation, and CoT refinement. Given only the multimodal history, the MLLM generates a step-by-step reasoning trace, referred to as the pseudo-CoT.
The paper emphasizes that this pseudo-CoT is history-conditioned and generated without target-item information. Its role is to expose sequence-level multimodal preference cues rather than to serve as final reasoning supervision (Wang et al., 6 Mar 2026).
The pseudo-CoT is then refined by DeepSeek-R1 into a refined CoT. This refinement stage is intended to denoise shallow or inconsistent pseudo-CoTs and produce a stronger trace with better multimodal grounding, improved logical consistency, greater signal density, and reduced leakage risk. The paper also reports a sanitization step that removes accidental mentions of item identifiers or future interactions when present (Wang et al., 6 Mar 2026).
Filtering is performed through a confidence-aware assessment based on two consistency scores. For each item in the sequence, a multimodal encoder produces image and title embeddings, and a sequence-level modality consistency score measures how well image and title semantics align across the history. The prediction consistency score compares the refined CoT’s predicted next-item profile to the true target. The two scores are combined into a single confidence score, and a quantile threshold on that score selects the retained high-confidence CoT subset (Wang et al., 6 Mar 2026).
The reported retention ratios are deliberately small, with separate settings for SFT and GRPO, because on-policy RL is more sensitive to noisy supervision. The appendix additionally evaluates CoT quality on 200 samples using GPT-5.2, Claude-4.5, and human assessment across five dimensions (modality consistency, prediction consistency, signal density, leakage risk, and coverage hardness) and reports that refined CoT outperforms pseudo-CoT across all five dimensions (Wang et al., 6 Mar 2026).
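One way to picture the confidence-aware filter is the sketch below. It is not the paper's formula: cosine similarity as the consistency measure, the convex combination with weight `alpha`, and the top-fraction selection rule are all assumptions chosen to illustrate the two-score-plus-quantile structure.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def modality_consistency(image_embs: List[Vector], title_embs: List[Vector]) -> float:
    """Mean image/title agreement over the history (ASSUMPTION: cosine)."""
    sims = [cosine(v, t) for v, t in zip(image_embs, title_embs)]
    return sum(sims) / len(sims)


def prediction_consistency(predicted_emb: Vector, target_emb: Vector) -> float:
    """Agreement between the CoT's predicted profile and the true target."""
    return cosine(predicted_emb, target_emb)


def confidence(mc: float, pc: float, alpha: float = 0.5) -> float:
    # ASSUMPTION: a convex combination; the paper's combination rule may differ.
    return alpha * mc + (1.0 - alpha) * pc


def select_high_confidence(scores: List[float], keep_ratio: float) -> List[int]:
    """Indices whose score reaches the top-`keep_ratio` quantile threshold."""
    k = max(1, math.ceil(keep_ratio * len(scores)))
    threshold = sorted(scores, reverse=True)[k - 1]
    return [i for i, s in enumerate(scores) if s >= threshold]  # ties are kept


kept = select_high_confidence([0.9, 0.1, 0.5, 0.7, 0.3], keep_ratio=0.4)
print(kept)  # [0, 3]
```

Because the threshold is a quantile rather than a fixed value, the retained fraction stays roughly constant across datasets even if their score distributions differ.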
4. SFT, GRPO, and mixed-grained data augmentation
The final training set mixes two granularities of supervision. For selected high-confidence samples, the refined CoT is attached; otherwise the CoT field is left empty, so most examples remain ordinary recommendation instructions. This mixed-grained data augmentation is the paper’s main defense against reward inflation: only a small, filtered subset carries CoT, and the majority of training data remains standard recommendation data (Wang et al., 6 Mar 2026).
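The mixing step can be sketched as a simple join between recommendation samples and the filtered CoT subset. The field names (`id`, `instruction`, `cot`) and the choice to place the CoT inside the prompt text are assumptions for illustration; the source only states that CoT is attached to selected samples.

```python
from typing import Dict, List, Set


def build_mixed_dataset(
    samples: List[dict],
    refined_cot: Dict[str, str],
    selected_ids: Set[str],
) -> List[dict]:
    """Attach refined CoT only to selected samples; others keep cot=None."""
    mixed = []
    for sample in samples:
        keep = sample["id"] in selected_ids
        cot = refined_cot.get(sample["id"]) if keep else None
        mixed.append({**sample, "cot": cot})
    return mixed


def render_prompt(sample: dict) -> str:
    # ASSUMPTION: CoT is injected as extra context in the prompt text.
    parts = [sample["instruction"]]
    if sample["cot"]:
        parts.append("Reasoning: " + sample["cot"])
    return "\n".join(parts)


samples = [
    {"id": "s1", "instruction": "Rank the candidates for user A."},
    {"id": "s2", "instruction": "Rank the candidates for user B."},
    {"id": "s3", "instruction": "Rank the candidates for user C."},
]
refined = {"s1": "prefers period dramas", "s2": "prefers slashers"}
mixed = build_mixed_dataset(samples, refined, selected_ids={"s1"})
cot_fraction = sum(m["cot"] is not None for m in mixed) / len(mixed)
print(round(cot_fraction, 2))
```

Note that `s2` has a refined CoT but is not selected, so it stays a plain recommendation sample; this is the behavior that distinguishes selective mixing from using all CoT data.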
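SFT uses the usual maximum-likelihood objective
$$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\mathbb{E}_{(x, y) \sim \mathcal{D}} \sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid x, y_{<t}).$$
GRPO then samples a group of outputs $\{o_j\}_{j=1}^{G}$ per prompt from the rollout policy $\pi_{\theta_{\mathrm{old}}}$, with temperature and top-$p$ both set to 0.9 during rollout. The sequence-level relative advantage is broadcast to tokens:
$$\hat{A}_{j,t} = \frac{R_j - \operatorname{mean}\big(\{R_k\}_{k=1}^{G}\big)}{\operatorname{std}\big(\{R_k\}_{k=1}^{G}\big)}.$$
The clipped GRPO objective takes the standard form
$$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{j=1}^{G}\frac{1}{|o_j|}\sum_{t=1}^{|o_j|}\min\Big(\rho_{j,t}(\theta)\,\hat{A}_{j,t},\ \operatorname{clip}\big(\rho_{j,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_{j,t}\Big)\right],$$
where
$$\rho_{j,t}(\theta) = \frac{\pi_\theta(o_{j,t} \mid x, o_{j,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{j,t} \mid x, o_{j,<t})}.$$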
The algorithmic core is therefore conventional GRPO, but the surrounding pipeline is recommendation-specific (Wang et al., 6 Mar 2026).
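The group-relative advantage and clipped surrogate can be sketched in plain Python. This is a generic illustration of conventional GRPO, not the paper's training code: the function names, the `clip_eps=0.2` default, the small `eps` stabilizer, and the omission of any KL regularizer are assumptions.

```python
import math
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """PPO-style clipped surrogate for a single token."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


def grpo_surrogate(
    logp_new: List[List[float]],
    logp_old: List[List[float]],
    rewards: List[float],
    clip_eps: float = 0.2,
) -> float:
    """Average clipped surrogate over outputs, then over tokens per output."""
    advantages = group_advantages(rewards)
    total = 0.0
    for new_tokens, old_tokens, adv in zip(logp_new, logp_old, advantages):
        # The sequence-level advantage is broadcast to every token.
        terms = [
            clipped_term(math.exp(n - o), adv, clip_eps)
            for n, o in zip(new_tokens, old_tokens)
        ]
        total += sum(terms) / len(terms)
    return total / len(advantages)


advs = group_advantages([1.0, 0.0, 1.0, 0.0])
print([round(a, 3) for a in advs])
```

When every output in a group earns the same reward, the standardized advantages are all zero and the group contributes no gradient, which is one reason group size affects learning signal.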
The reward is intentionally minimal and has three cases. If the output violates the required format [ITEM_xxxx] Title > ..., it receives a format-violation reward. If the format is valid, it receives a format reward. If, in addition, the predicted item matches the ground truth, a hit bonus is added on top of the format reward.
A central feature of the framework is that the reasoning text itself is not rewarded explicitly. The paper presents this as a stability decision: CoT is injected as supervision context, but GRPO is optimized only against verifiable recommendation success and format validity (Wang et al., 6 Mar 2026).
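A format-plus-hit reward of this kind can be sketched as below. The numeric values (`FORMAT_PENALTY`, `FORMAT_REWARD`, `HIT_BONUS`) are placeholders, not the paper's constants; the regular expression, the assumption that item ids are numeric, and the choice to score a hit on the top-ranked item are likewise assumptions for illustration.

```python
import re
from typing import List, Optional

# ASSUMPTION: placeholder values; the source does not give the constants here.
FORMAT_PENALTY = 0.0
FORMAT_REWARD = 0.5
HIT_BONUS = 1.0

# One segment of the expected "[ITEM_xxxx] Title > ..." output.
SEGMENT = re.compile(r"\[ITEM_(\d+)\]\s+\S.*")


def parse_ranking(output: str) -> Optional[List[str]]:
    """Return ranked item ids, or None if any segment breaks the format."""
    ids = []
    for segment in output.strip().split(">"):
        match = SEGMENT.fullmatch(segment.strip())
        if match is None:
            return None
        ids.append(match.group(1))
    return ids


def reward(output: str, target_id: str) -> float:
    ranking = parse_ranking(output)
    if ranking is None:
        return FORMAT_PENALTY
    score = FORMAT_REWARD
    if ranking[0] == target_id:  # ASSUMPTION: a hit means the top-1 item is correct
        score += HIT_BONUS
    return score


print(reward("[ITEM_0012] Alien > [ITEM_0400] Heat", "0012"))  # 1.5
print(reward("[ITEM_0400] Heat > [ITEM_0012] Alien", "0012"))  # 0.5
print(reward("I think the user wants Alien", "0012"))          # 0.0
```

Nothing in this function inspects reasoning text, which matches the stated design: only format validity and recommendation correctness are scored. A production parser would also need to handle titles that themselves contain the `>` separator.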
The implementation uses Qwen3-VL-8B-Instruct for multimodal CoT data construction, DeepSeek-R1 for CoT refinement, and a Qwen-based LLM recommender backbone for MLLMRec-R1. Training is reported on 8 RTX PRO 6000 GPUs with LoRA rank 16, gradient accumulation 8, per-device batch size 2 for SFT and 4 for GRPO, and epoch counts that vary by dataset: SFT uses 3 epochs on MovieLens and Netflix and 5 on MicroLens, while GRPO uses 3 epochs on MovieLens and MicroLens and 2 on Netflix (Wang et al., 6 Mar 2026).
5. Empirical results, ablations, and reported limitations
The evaluation covers MicroLens, Netflix, and MovieLens-1M. The datasets are split in a 7:1:2 train/validation/test ratio by number of sequences. Their reported statistics are: MicroLens with 25,411 users, 41,081 items, and 223,263 interactions; Netflix with 13,187 users, 17,366 items, and 68,933 interactions; and MovieLens with 6,040 users, 3,952 items, and 1,000,209 interactions. Computed as interactions divided by users times items, these counts correspond to densities of roughly 0.021%, 0.030%, and 4.19%, respectively. The main metrics are HR and NDCG under one-positive-plus-nine-negative evaluation, with additional experiments at 100 candidates (Wang et al., 6 Mar 2026).
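With exactly one positive per candidate list, HR@K and NDCG@K reduce to simple functions of the positive item's rank. The sketch below uses the standard single-relevant-item definitions; the helper names and the example ranks are illustrative and do not come from the paper.

```python
import math
from typing import Dict, List


def hit_at_k(rank: int, k: int) -> float:
    """1.0 if the positive item (1-indexed rank) appears in the top k."""
    return 1.0 if rank <= k else 0.0


def ndcg_at_k(rank: int, k: int) -> float:
    """With a single relevant item the ideal DCG is 1, so NDCG = 1 / log2(rank + 1)."""
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0


def evaluate(ranks: List[int], k: int) -> Dict[str, float]:
    """Average both metrics over a list of positive-item ranks."""
    n = len(ranks)
    return {
        f"HR@{k}": sum(hit_at_k(r, k) for r in ranks) / n,
        f"NDCG@{k}": sum(ndcg_at_k(r, k) for r in ranks) / n,
    }


# Hypothetical ranks of the positive item among ten candidates.
metrics = evaluate([1, 2, 6], k=5)
print({name: round(value, 4) for name, value in metrics.items()})
```

HR only records whether the positive item made the cut, while NDCG also credits how high it was placed, which is why the two metrics can move by different amounts in the results.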
| Dataset | MLLMRec-R1 | Relative gain over second best |
|---|---|---|
| MovieLens-1M | HR@3 0.7630, HR@5 0.8368, NDCG@3 0.6524, NDCG@5 0.6784 | 15.69%, 9.19%, 15.82%, 11.82% |
| MicroLens | HR@3 0.6627, HR@5 0.7906, NDCG@3 0.5845, NDCG@5 0.6365 | 9.86%, 11.95%, 13.14%, 12.69% |
| Netflix | HR@3 0.7150, HR@5 0.8670, NDCG@3 0.5902, NDCG@5 0.6293 | 9.66%, 5.67%, 13.61%, 7.44% |
The paper states that these improvements are statistically significant. Under a 100-candidate setting, the model remains strong. On MovieLens, it reports HR@5 = 0.2444 versus RecZero 0.1963, HR@10 = 0.2983 versus 0.2418, NDCG@5 = 0.1806 versus 0.1432, and NDCG@10 = 0.1969 versus 0.1530, corresponding to gains of 24.50%, 23.37%, 26.12%, and 28.69%, respectively (Wang et al., 6 Mar 2026).
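The relative gains quoted for the 100-candidate setting follow directly from the reported metric pairs. The short check below recomputes them as (ours − baseline) / baseline; the helper name `relative_gain` is illustrative.

```python
def relative_gain(ours: float, baseline: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return (ours - baseline) / baseline * 100.0


# (MLLMRec-R1, RecZero) pairs on MovieLens with 100 candidates, as quoted in the text.
pairs = {
    "HR@5": (0.2444, 0.1963),
    "HR@10": (0.2983, 0.2418),
    "NDCG@5": (0.1806, 0.1432),
    "NDCG@10": (0.1969, 0.1530),
}
gains = {name: round(relative_gain(ours, base), 2) for name, (ours, base) in pairs.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:.2f}%")
# HR@5: 24.50%, HR@10: 23.37%, NDCG@5: 26.12%, NDCG@10: 28.69%
```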
The ablation study is central to the paper’s argument. On MovieLens, the full model’s HR@3 = 0.7630 drops to 0.6139 without GRPO, 0.6465 without mixed-grained data augmentation, 0.6665 without multimodal CoT, 0.6924 without CoT refinement, 0.7095 without pseudo-CoT, and 0.7243 without captions. Similar degradations appear on MicroLens and Netflix. The strongest evidence for the paper’s reward-inflation diagnosis is the w/o MDA variant, which uses all CoT data without selective mixing and performs substantially worse than the full system (Wang et al., 6 Mar 2026).
The appendix also reports qualitative and optimization evidence. Reward curves rise rapidly in the first 0.2–0.5 epoch and stabilize around 2–3 epochs, with Netflix showing noisier but still improving behavior. Increasing the group size $G$ from 2 to 8 improves HR and NDCG, a moderate retention setting for CoT filtering yields the best stability/effectiveness trade-off, and overly aggressive filtering reduces coverage and hurts performance. In the MovieLens case study for user sequence 1127, a text-only GRPO model overemphasizes Friday the 13th franchise repetition, whereas MLLMRec-R1 ranks All Quiet on the Western Front at top-1 and places Metropolitan and Blood Simple high by leveraging finer multimodal cues (Wang et al., 6 Mar 2026).
The paper also leaves several limitations visible. It does not provide a direct wall-clock or GPU-memory comparison table for the claimed efficiency gains, even though it gives strong asymptotic motivation. It leaves broader evaluation on other LLM backbones such as LLaMA or Gemma to future work because prompt adjustment would be costly. It also does not extensively tune some GRPO decoding hyperparameters because preliminary settings worked well. This suggests that the method’s efficiency and robustness claims are strongly supported at the design level and by ranking results, but are less directly benchmarked as systems measurements in the reported excerpt (Wang et al., 6 Mar 2026).
6. Position within adjacent R1-style and multimodal recommendation research
MLLMRec-R1 sits at the intersection of multimodal recommendation, reasoning-oriented post-training, and RL-based ranking. Its nearest conceptual precursor in multimodal recommendation is MLLMRec, which converts item images into textual descriptions with Gemma3-27b, reasons over user interaction histories to produce preference text, and refines an item-item graph with threshold-controlled denoising and topology-aware enhancement. That earlier paper does not mention “MLLMRec-R1” at all, so the later framework should not be read as an explicitly named versioned successor inside the original graph-based work (Dang et al., 21 Aug 2025).
Within R1-style recommendation, RecLLM-R1 provides a two-stage SFT + GRPO paradigm with CoT for recommendation, but its method description is fundamentally text-centric and does not concretely specify image, audio, or other non-text encoders (Xie et al., 24 Jun 2025). R2ec instead integrates intrinsic reasoning directly into a text-only recommender and optimizes reasoning and recommendation jointly with outcome-derived reward, without requiring reasoning annotations (You et al., 22 May 2025). ConvRec-R1 / Rank-GRPO argues that recommendation RL should optimize at the rank level rather than the token or whole-sequence level, using rank-level importance ratios and causal rank rewards in conversational recommendation (Zhu et al., 23 Oct 2025). Retrv-R1 addresses multimodal retrieval rather than recommendation, but contributes a broader R1-style pattern—coarse-to-fine retrieval, compressed candidate reasoning, selective detail inspection, synthetic CoT activation, and GRPO with efficiency-aware reward—that is explicitly reported to transfer to multimodal recommendation after fine-tuning (Zhu et al., 3 Oct 2025).
Taken together, these papers suggest a broader evolution. MLLMRec emphasizes offline multimodal textualization and purified user/item semantics; RecLLM-R1 and R2ec emphasize reasoning-centric post-training for recommendation; Rank-GRPO emphasizes credit-assignment granularity for ranked generation; and Retrv-R1 emphasizes efficient reasoning over large multimodal candidate sets. MLLMRec-R1 occupies a distinct position in this landscape by combining offline visual textualization, automatically refined multimodal CoT, confidence-aware sparse CoT injection, and standard GRPO with a lightweight format-plus-hit reward in a multimodal sequential recommendation setting (Wang et al., 6 Mar 2026).
A further implication is methodological rather than historical. MLLMRec-R1 does not reward explanations directly and does not keep raw images in the rollout loop. Instead, it treats reasoning as a carefully curated training signal and keeps RL reward tied to verifiable recommendation outcomes. That design choice differentiates it from both direct raw-MLLM rollout approaches and from reasoning frameworks that rely on dense process-level reward. The reported code release at https://github.com/wangyu0627/MLLMRec-R1 positions the framework as a practical reference implementation for this specific design point (Wang et al., 6 Mar 2026).