- The paper introduces an offline pipeline that textualizes visual signals into detailed captions and refines pseudo-CoT rationales for efficient multimodal recommendation.
- It integrates GRPO optimization with strict filtering for high-confidence reasoning traces, yielding performance gains of up to 15.82% on key ranking metrics.
- Empirical evaluations on diverse datasets demonstrate enhanced model stability, reduced shortcut learning, and practical scalability for large candidate sets.
MLLMRec-R1: Advancing Multimodal Sequential Recommendation via Incentivized Reasoning in LLMs
Introduction and Motivation
The paper addresses the inherent bottlenecks in multimodal sequential recommendation (MSR) when leveraging Multimodal LLMs (MLLMs) with Group Relative Policy Optimization (GRPO). Traditional architectures for MSR demand the joint encoding of numerous visual tokens, leading to substantial compute and latency, especially as the history length and candidate set size increase. Critically, despite this expense, the expected gains over text-only LLMs often remain marginal (Figure 1).
Figure 1: (a) Visual tokens make compute grow with history length and candidate set size, yet often yield limited gains over LLMs, revealing an efficiency bottleneck. (b) Some CoT data may boost training reward scores but hurt test performance, revealing shortcut learning and poor generalization.
Additionally, existing Chain-of-Thought (CoT) supervision frameworks exhibit reward inflation: models achieve high training rewards that do not translate into improved test-time ranking, enabling shortcut learning and undermining generalization. The central contributions focus on optimizing both efficiency and stability in MSR pipelines by textualizing multimodal content offline and constructing filtered multimodal CoT datasets, thereby controlling spurious signals and improving generalization.
Multimodal CoT Compression Pipeline
MLLMRec-R1 introduces an offline pipeline that compresses visual signals into fine-grained textual captions via an MLLM, thus eliminating the overhead of visual-token computation while preserving cross-modal semantics for downstream tasks. Structured pseudo-CoT rationales are then generated from interaction histories, expressing grounded multimodal cues in textual form. These are further refined by a text-only reasoning model (DeepSeek-R1), yielding high-quality supervision for GRPO without target-item leakage or label contamination (Figure 2).
Figure 2: Multimodal CoT construction pipeline: caption generation, pseudo-CoT construction, and CoT refinement enable efficient, target-free multimodal supervision.
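A minimal sketch of the three offline stages is given below. The helper names (`caption_with_mllm`, `build_pseudo_cot`, `refine_with_reasoner`) and the prompt layout are illustrative assumptions standing in for the paper's MLLM captioner, pseudo-CoT prompting, and DeepSeek-R1 refinement calls:

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    title: str
    image_path: str

def caption_with_mllm(image_path: str) -> str:
    """Stage 1: textualize an item image into a fine-grained caption.
    Placeholder for an offline MLLM inference call."""
    raise NotImplementedError

def build_pseudo_cot(history: list[Interaction], captions: list[str]) -> str:
    """Stage 2: assemble a grounded pseudo-CoT rationale from titles plus
    captions. The target item is excluded by construction, avoiding leakage."""
    lines = [f"- {item.title}: {cap}" for item, cap in zip(history, captions)]
    return "User history with visual evidence:\n" + "\n".join(lines)

def refine_with_reasoner(pseudo_cot: str) -> str:
    """Stage 3: rewrite the pseudo-CoT with a text-only reasoning model
    (DeepSeek-R1 in the paper) into a high-quality supervision trace."""
    raise NotImplementedError

def build_cot_sample(history: list[Interaction]) -> str:
    captions = [caption_with_mllm(item.image_path) for item in history]
    return refine_with_reasoner(build_pseudo_cot(history, captions))
```

Because all three stages run offline, the recommender itself never ingests visual tokens at training or serving time.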
Empirical ablations confirm that removing any stage (caption generation, pseudo-CoT construction, or refinement) causes significant degradation, underscoring the necessity of each component for robust multimodal reasoning and preference modeling.
Mixed-Grained Data Augmentation and Filtering
To combat reward inflation and shortcut learning, the proposed framework injects only high-confidence CoT samples into the training set, mixing them with standard no-CoT prompts at a low ratio. Quality is enforced via two metrics:
- Modality Consistency: Scores alignment between title and image embeddings.
- Prediction Consistency: Assesses the accuracy of the refined CoT profile relative to the target item.
Low-scoring samples are filtered out, and the mixing ratio $p$ is tuned (typically $0.1$ for SFT, $0.05$ for GRPO) for optimal stability and generalization (Figure 3); a minimal sketch of this filtering-and-mixing step appears below.
Figure 3: Modality consistency and prediction consistency scores govern filtered augmentation in the instruction data, blending high-confidence CoT with standard prompts.
This selective augmentation maintains generalization under GRPO and avoids performance inflation from noisy or spurious reasoning signals.
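The following sketch illustrates how the two consistency scores could gate augmentation, assuming precomputed title/image embeddings. The thresholds `tau_m` and `tau_p`, the sample field names, and the choice of scorer for prediction consistency are assumptions; the paper specifies only the two criteria and the mixing ratio $p$:

```python
import random
import numpy as np

def modality_consistency(title_emb: np.ndarray, image_emb: np.ndarray) -> float:
    """Cosine similarity between title and image embeddings (higher = better aligned)."""
    denom = np.linalg.norm(title_emb) * np.linalg.norm(image_emb) + 1e-8
    return float(title_emb @ image_emb / denom)

def prediction_consistency(cot_profile: str, target_title: str) -> float:
    """Score how well the refined CoT profile anticipates the target item.
    The concrete scorer (e.g., an LLM-based judge) is an assumption here."""
    raise NotImplementedError

def build_training_set(cot_samples: list[dict], plain_samples: list[dict],
                       p: float = 0.1, tau_m: float = 0.5,
                       tau_p: float = 0.5) -> list[dict]:
    """Filter low-consistency CoT samples, then mix the survivors with
    standard no-CoT prompts at ratio p (0.1 for SFT, 0.05 for GRPO)."""
    kept = [s for s in cot_samples
            if modality_consistency(s["title_emb"], s["image_emb"]) >= tau_m
            and prediction_consistency(s["cot"], s["target"]) >= tau_p]
    n_cot = min(int(p * len(plain_samples)), len(kept))
    return plain_samples + random.sample(kept, n_cot)
```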
GRPO Optimization and Lightweight Reward Rules
GRPO post-training is applied after initial SFT fine-tuning, eliciting reasoning traces under a fixed instruction template. For each context, GRPO samples a group of responses, computes group-relative advantages, and enforces stability via a KL penalty against the reference policy (DeepSeek-R1). The reward function is explicitly decomposed into lightweight rule-based terms (Figure 4).
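In standard GRPO notation, with $G$ responses $o_1,\dots,o_G$ sampled per query $q$ and rewards $r_i$, the group-relative advantage and the KL-regularized clipped objective take the form below; the split of $r_i$ into a format-validity term and a hit term is a sketch consistent with the paper's lightweight format and hit checks, not its exact formula:

$$
\hat{A}_i = \frac{r_i - \operatorname{mean}\{r_1,\dots,r_G\}}{\operatorname{std}\{r_1,\dots,r_G\}},
\qquad
r_i = r_{\text{format}}(o_i) + r_{\text{hit}}(o_i),
$$

$$
\mathcal{J}(\theta) = \mathbb{E}\Bigg[\frac{1}{G}\sum_{i=1}^{G}
\min\Big(\rho_i \hat{A}_i,\ \operatorname{clip}(\rho_i,\,1-\epsilon,\,1+\epsilon)\,\hat{A}_i\Big)\Bigg]
- \beta\,\mathbb{D}_{\text{KL}}\big[\pi_\theta \,\|\, \pi_{\text{ref}}\big],
\qquad
\rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}.
$$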
This structure ensures consistent and aligned optimization towards top-ranked, semantically matched recommendations.
Empirical Evaluation and Numerical Analysis
Extensive experiments on the MovieLens-1M, MicroLens, and Netflix datasets substantiate statistically significant performance gains: MLLMRec-R1 outperforms all baselines, with relative improvements of up to 15.82% on NDCG@3 and comparable gains on HR@3 across all datasets. Radar-chart evaluations further show that CoT refinement improves data quality along five dimensions (modality consistency, prediction consistency, signal density, leakage risk, and coverage hardness), outscoring pseudo-CoT across annotators (GPT-5.2, Claude-4.5, and humans) (Figure 5).
Figure 5: CoT refinement yields higher scores over pseudo-CoT on five data-quality dimensions in both automated and human evaluations.
Model-scale analyses reveal that larger Qwen3-VL backbones consistently deliver higher ranking metrics, corroborating the necessity of scale for leveraging multimodal CoT supervision (Figure 6). Hyperparameter sensitivity analyses confirm improvements from larger group sizes $G$, a moderate KL coefficient $\beta$, and carefully tuned filtering ratios $p$ (Figure 7).
Figure 6: Larger Qwen3-VL backbones significantly improve HR@3 and NDCG@3 across datasets.
Figure 7: Hyperparameter studies reveal clear impacts of group size, filtering ratio, and KL coefficient on HR and NDCG metrics.
Reward convergence is rapid within 0.2–0.5 epochs, stabilizing after 2–3 epochs with higher reward plateaus on MovieLens and MicroLens (Figure 8).
Figure 8: GRPO-based training reward curves on benchmark datasets show rapid early convergence and stable late-stage plateaus.
Case studies further demonstrate reduced shortcut reliance and stronger alignment with fine-grained multimodal preferences (Figure 9).
Figure 9: Multimodal CoT reasoning improves within-group distinction and target ranking in MovieLens sequence test cases.
Implications and Future Research Directions
MLLMRec-R1 demonstrates that combining offline textualization of multimodal signals with filtered CoT data and lightweight GRPO reward design yields scalable, stable improvements in preference modeling for sequential recommendation. The approach manages efficiency bottlenecks and generalization failures inherent in naive multimodal LLM architectures. The framework offers practical tractability for large candidate sets and long interaction histories.
Theoretical implications include the utility of modality consistency and prediction consistency as fundamental data quality metrics. Observed gains validate the role of targeted CoT supervision and groupwise RL signal aggregation in incentivizing deep reasoning. The results motivate further exploration of adaptive filtering, automatic leakage-risk assessment, and model-scale selection.
Practically, the research unlocks efficient deployment of multimodal recommender systems in resource-constrained environments, avoiding visual-token compute that grows with history length and candidate set size. The methodology extends directly to other cross-modal reasoning domains (e.g., visual question answering, entity recognition) and can inform future reinforcement-learning post-training paradigms for MLLMs.
Future work should investigate more robust reward structures beyond format and hit checks, adaptive augmentation ratios, and transferability to domains with heterogeneous modalities (audio, video, text). Integration with retrieval-augmented generation and automatic CoT quality evaluators also represents promising directions.
Conclusion
MLLMRec-R1 sets a new state-of-the-art for multimodal sequential recommendation with significant quantitative and qualitative improvements over both LLM- and MLLM-based architectures. By textualizing multimodal evidence, constructing high-quality CoT supervision, and employing mixed-grained augmentation with GRPO-based optimization, the framework achieves superior alignment, ranking, and generalization. The approach offers both algorithmic efficiency and practical scalability, with broad implications for the design of robust, reasoning-driven recommender systems in multimodal domains (2603.06243).