Aligning Multimodal Sequential Recommendations via Robust Direct Preference Optimization with Sparse MoE

Published 31 Mar 2026 in cs.IR and cs.CL | (2603.29259v1)

Abstract: Preference-based alignment objectives have been widely adopted, from RLHF-style pairwise learning in LLMs to emerging applications in recommender systems. Yet, existing work rarely examines how Direct Preference Optimization (DPO) behaves under implicit feedback, where unobserved items are not reliable negatives. We conduct systematic experiments on multimodal sequential recommendation to compare common negative-selection strategies and their interaction with DPO training. Our central finding is that a simple modification, replacing deterministic hard negatives with stochastic sampling from a dynamic top-K candidate pool, consistently improves ranking performance. We attribute its effectiveness to two factors: (1) reducing erroneous suppressive gradients caused by false negatives, and (2) retaining informative hard signals while smoothing optimization via controlled stochasticity. With an optional sparse Mixture-of-Experts encoder for efficient capacity scaling, RoDPO improves NDCG@5 by up to 5.25% on three Amazon benchmarks, with nearly unchanged inference cost.


Authors (5)

Summary

The paper introduces RoDPO, enhancing direct preference optimization for multimodal sequential recommendation with a novel stochastic Top-K negative sampling strategy.
It integrates a sparse Mixture-of-Experts encoder to achieve scalable and efficient multimodal fusion of text and image features.
Experimental results show significant ranking improvements (up to +5.25% NDCG@5 and +7.67% MRR@5) across diverse e-commerce benchmarks.

Robust Preference Alignment for Multimodal Sequential Recommendation with RoDPO

Introduction

The paper "Aligning Multimodal Sequential Recommendations via Robust Direct Preference Optimization with Sparse MoE" (2603.29259) presents RoDPO, a framework for aligning Multimodal Sequential Recommendation (MSR) models with user preferences under implicit feedback. This work addresses the systemic brittleness observed with Direct Preference Optimization (DPO) when naively adapted to recommendation, due to the prevalence of false negatives among unobserved items. The authors’ primary contribution is a negative sampling strategy, Stochastic Top-$K$ Negative Sampling, that stabilizes DPO training and substantially improves ranking accuracy. The framework integrates a sparse Mixture-of-Experts (MoE) encoder for scalable multimodal fusion and maintains competitive inference efficiency.

Problem Setting and Challenges

MSR leverages sequential user–item interactions augmented with heterogeneous modalities (text, image) to model dynamic preferences. Traditional training regimes utilize pointwise objectives (e.g., Cross-Entropy), which fail to encode relative preference rankings. Recent advances in LLM alignment, especially RLHF and DPO, motivate the adoption of pairwise preference optimization, but porting these methods to RecSys raises domain-specific obstacles.

Naive application of DPO in RecSys rests on the ill-posed assumption that all non-interacted items are true negatives. In real-world implicit feedback, however, many items are simply unexposed. Hard negative mining thus frequently selects false negatives for penalization, yielding corrupted gradients that degrade ranking boundaries and model generalization.

Figure 1: Schematic of the "False Negative" dilemma—NLP negatives are reliable, RecSys hard negatives often include unobserved positives, and Top-K sampling mitigates mis-penalization.

RoDPO Framework

RoDPO comprises two central innovations: (1) robust preference construction via stochastic Top-$K$ negative sampling and (2) a scalable multimodal encoder architecture augmented with Sparse MoE.

Multimodal Sequential Encoder

The model is agnostic to architectural backbone but employs modality-specific Transformer encoders to process item IDs and pre-extracted multimodal features, further refined via a temporal module and an optional Sparse MoE layer for parameter-efficient scaling. The fusion weights among modalities are learned, enabling adaptive integration of complementary signals while preserving sequential and temporal dynamics.
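As a rough illustration of what learned fusion weights can look like, the sketch below combines per-modality embeddings using softmax-normalized weights. The softmax parameterization, the function name `fuse_modalities`, and the toy vectors are assumptions made for illustration; the source states only that the fusion weights are learned.

```python
import math

def fuse_modalities(embeddings, fusion_logits):
    """Weighted sum of modality embeddings (hypothetical helper).

    embeddings    : list of equal-length vectors, one per modality
    fusion_logits : one learnable scalar per modality (assumed parameterization)
    """
    # Softmax turns unconstrained logits into positive weights that sum to 1.
    m = max(fusion_logits)
    exps = [math.exp(l - m) for l in fusion_logits]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Combine the modality vectors dimension by dimension.
    dim = len(embeddings[0])
    fused = [sum(w * emb[j] for w, emb in zip(weights, embeddings))
             for j in range(dim)]
    return fused, weights

# Toy ID, text, and image embeddings for one item (made-up values).
id_emb, text_emb, image_emb = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
fused, weights = fuse_modalities([id_emb, text_emb, image_emb], [0.0, 0.0, 0.0])
```

With equal logits every modality contributes equally; during training the logits would be updated so that more informative modalities receive larger weights.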

Stochastic Top-$K$ Negative Sampling

Rather than deterministically choosing the hardest negative, RoDPO defines a dynamic candidate pool consisting of the top $K$ scoring non-target items at each iteration, from which one is sampled uniformly as the negative. This procedure maintains the advantages of hard negatives (strong gradients) while introducing stochasticity to amortize the risk of repeatedly misclassifying unobserved positives as negatives.
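The sampling step described above can be sketched in a few lines. This is a minimal sketch, not the authors' implementation: the function name `sample_topk_negative`, the list-based score representation, and the toy scores are assumptions for illustration.

```python
import random

def sample_topk_negative(scores, target, k, rng=random):
    """Draw one negative uniformly from the k highest-scoring non-target items."""
    # Rank all non-target item indices by the current model score, highest first.
    candidates = [i for i in range(len(scores)) if i != target]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    # The pool is rebuilt from fresh scores at every call, so it tracks the model.
    pool = candidates[:k]
    # Uniform draw: no single item in the pool is penalized at every step.
    return rng.choice(pool)

scores = [0.1, 0.9, 0.8, 0.3, 0.7]   # hypothetical model scores for five items
target = 1                            # index of the observed next item
negative = sample_topk_negative(scores, target, k=3, rng=random.Random(0))
```

Setting `k=1` reduces the procedure to deterministic hard negative mining, while setting `k` to the catalog size reduces it to uniform random sampling over non-target items.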

Figure 2: RoDPO high-level architecture—multimodal encoding (left) and stochastic preference pair sampling with DPO training (right).

Direct Preference Optimization Objective

RoDPO leverages DPO to align the current model $\pi_\theta$ with a reference policy $\pi_{\text{ref}}$ derived by supervised pretraining. For each preference pair $(y_w, y_l)$, where $y_w$ is the observed target item and $y_l$ is the sampled negative for context $x$, the DPO loss is computed as:

$\mathcal{L}_\mathrm{DPO}(x, y_w, y_l) = -\log \sigma\left( \beta \, \left[ \Delta s_\theta(x, y_w, y_l) - \Delta s_\mathrm{ref}(x, y_w, y_l) \right] \right)$

where $\Delta s_\theta(x, y_w, y_l) = s_\theta(x, y_w) - s_\theta(x, y_l)$, and $\beta$ controls the strength of preference enforcement. Training proceeds in two stages: warm-up via standard likelihood, followed by RoDPO fine-tuning with a frozen reference model.
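To make the loss above concrete, the following sketch evaluates it for a single preference pair. It assumes the scores are plain floats produced elsewhere by the policy and the frozen reference model; the function name `dpo_loss` and the numeric inputs are illustrative only.

```python
import math

def dpo_loss(s_theta_w, s_theta_l, s_ref_w, s_ref_l, beta):
    """DPO loss for one pair: -log(sigmoid(beta * (delta_theta - delta_ref)))."""
    delta_theta = s_theta_w - s_theta_l   # policy margin between winner and loser
    delta_ref = s_ref_w - s_ref_l         # reference margin (frozen model)
    z = beta * (delta_theta - delta_ref)
    # -log(sigmoid(z)) equals log(1 + exp(-z)); this form avoids overflow.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy identical to the reference: z = 0, so the loss is log(2).
loss_at_reference = dpo_loss(1.0, 0.0, 1.0, 0.0, beta=1.0)
# Policy widens the margin beyond the reference: the loss decreases.
loss_wider_margin = dpo_loss(2.0, 0.0, 1.0, 0.0, beta=1.0)
```

Writing $z = \beta[\Delta s_\theta - \Delta s_\mathrm{ref}]$, the derivative of the loss with respect to $s_\theta(x, y_l)$ is $\beta\,\sigma(-z) > 0$, so every gradient step lowers the score of the chosen negative. If that negative is in fact an unobserved positive, the update suppresses an item the user would have liked, which is the false-negative failure mode discussed earlier.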

Empirical Evaluation

Comprehensive experiments were conducted on Amazon-Toys, Beauty, and Home & Kitchen benchmarks. RoDPO demonstrates robust and significant improvements across all evaluation metrics (NDCG@5, MRR@5) relative to diverse time-aware, multimodal, and state-of-the-art baselines.

Strong Numerical Results:

RoDPO yields up to +5.25% NDCG@5 and +7.67% MRR@5 improvement over HM4SR and other leading baselines.
Performance gains remain consistent across cutoffs and are observed in all tested domains.

Key ablations isolating negative sampling strategies confirm that hard negative mining is only marginally beneficial over random sampling, a direct consequence of false negative risk. Stochastic Top-$K$ sampling robustly outperforms both.

Analysis and Ablations

Negative Sampling Size ($K$): Small $K$ degenerates to hard mining and is sensitive to noise; large $K$ resembles random sampling and attenuates gradients. Optimal performance is realized at an intermediate $K$, balancing hardness and stochastic smoothing.
DPO Coefficient ($\beta$): Best results occur at a $\beta$ higher than typical NLP settings, reflecting the higher false-negative risk in RecSys and the need for steadfast alignment.
Sparse MoE: When enabled, this module yields further (albeit incremental) improvements in fine-grained preference discrimination without inflating inference cost.
Efficiency: RoDPO matches the inference and training efficiency of competitive methods due to sharing frozen reference weights and partial expert activation in MoE layers.
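The partial expert activation mentioned above can be illustrated with a small routing sketch. It assumes standard top-k gating with a softmax over the selected experts' gate logits; the source does not specify the routing scheme, so the function `sparse_moe`, the parameter `num_active`, and the toy experts are all hypothetical.

```python
import math

def sparse_moe(x, gate_weights, experts, num_active):
    """Route x through only the num_active experts with the highest gate logits."""
    # One gate logit per expert: dot product of the expert's gate vector with x.
    logits = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in gate_weights]
    # Keep only the highest-scoring experts; the rest are never evaluated.
    active = sorted(range(len(logits)), key=lambda e: logits[e],
                    reverse=True)[:num_active]
    # Softmax restricted to the active experts gives their mixing weights.
    m = max(logits[e] for e in active)
    exps = {e: math.exp(logits[e] - m) for e in active}
    total = sum(exps.values())
    out = [0.0] * len(x)
    for e in active:
        y = experts[e](x)                 # expert output, same size as x
        for j, v in enumerate(y):
            out[j] += exps[e] / total * v
    return out, active

# Three toy experts acting on 2-dimensional inputs (illustrative only).
experts = [lambda v: [2 * a for a in v],
           lambda v: [a + 1 for a in v],
           lambda v: [-a for a in v]]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
out, active = sparse_moe([1.0, 0.5], gate_weights, experts, num_active=2)
```

Because only `num_active` experts run per input, adding experts grows the parameter count without a proportional increase in per-input compute, which is consistent with the nearly unchanged inference cost reported in the source.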

Preference distribution visualizations show that RoDPO effectively suppresses the model’s overconfidence in hard negatives, redistributing the logit density and yielding a more robust preference margin.

Case Study

A representative user preference trace demonstrates that RoDPO avoids the semantic similarity trap inherent to strong baselines, surfacing relevant but semantically diverse items (matching latent intent present in historical behavior), attributable to robust preference optimization and effective negative candidate balancing.

Theoretical and Practical Implications

This work underscores the necessity of domain-aware adaptation of preference optimization: DPO’s efficacy in LLMs does not directly translate to RecSys due to divergent negative set semantics. Correctly resolving false-negative sensitivity is critical for practical deployment, especially as RecSys models scale to multimodal and foundation architectures.

The light-touch nature of Top-$K$ sampling, which requires only modifications to the negative selection procedure, makes RoDPO applicable as a plug-in to a wide spectrum of backbone architectures and market domains.

Limitations and Future Work

While empirical gains are decisive on e-commerce datasets, generalization to other domains (e.g., short-video, news, or personalized content streams) merits further validation. Scaling training to extremely large item catalogs and multimodal feature sets may incur heightened cost, motivating further research in efficient candidate selection and reference model distillation.

Conclusion

RoDPO demonstrates that robust stochastic negative sampling is a crucial enabler for effective DPO-based alignment in sequential recommendation with implicit feedback. By harmonizing model capacity, optimization stability, and sensitivity to the structural properties of RecSys data, RoDPO closes a critical gap in preference alignment methodologies. The framework paves the way for principled large-scale preference-driven alignment under weak signals and offers immediate applicability to multimodal, sequential recommendation foundations.
