Similarity-Guided Diffusion for Sequential Recommendation
- The paper introduces a similarity-guided diffusion framework that integrates semantic noise with contrastive learning to improve sequential recommendation.
- It employs deterministic noise injection based on item similarity and confidence-guided augmentation to preserve the contextual integrity of user sequences.
- Empirical results show significant gains in metrics like HR@K and NDCG@K, outperforming several state-of-the-art models on standard datasets.
Similarity-Guided Diffusion for Contrastive Sequential Recommendation describes a framework that integrates semantic similarity-driven augmentation within a diffusion-based generative process to enhance contrastive learning in sequential recommendation systems. The approach, exemplified by SimDiffRec (Choi et al., 16 Jul 2025), prioritizes semantic and contextual preservation during sequence augmentation by leveraging item-item similarity and model confidence scores, addressing key limitations of random augmentation methods and improving the discriminative quality of positive and negative samples for contrastive representation learning.
1. Motivation and Conceptual Overview
Traditional sequential recommendation (SR) models, particularly those utilizing Transformer architectures, grapple with severe data sparsity and the challenge of encoding long-range dependencies in user interaction histories. To counteract sparsity and enrich representation learning, recent advances have embraced data augmentation and contrastive learning. However, many such methods employ stochastic or random strategies (e.g., random noise injection, masking, or cropping), which often disturb the contextual structure of the original sequence and degrade recommendation quality.
Similarity-Guided Diffusion for Contrastive Sequential Recommendation introduces a principled augmentation mechanism in which only semantically consistent noise—computed based on learned item embedding similarities—is diffused through user sequences. Further, the model strategically localizes augmentation to positions in the sequence where the denoising model is highly confident, thereby preserving the underlying behavioral context while maximizing the utility of contrastive learning.
2. Framework and Methodological Components
2.1 Semantic Similarity-Based Noise
SimDiffRec identifies, for each item embedding $e_i$ in the input sequence, the top-$k$ most similar items in the embedding matrix (excluding the item itself), based on dot-product similarity:

$$\mathcal{S}_i = \underset{j \neq i}{\operatorname{top-}k}\,\big(e_i^\top e_j\big), \qquad e_j \in \mathbf{E},$$

where $\mathbf{E}$ is the complete item embedding matrix.
The selected similar embeddings produce a deterministic noise vector for diffusion:

$$\epsilon_i = \frac{1}{k} \sum_{j \in \mathcal{S}_i} e_j,$$

where $j \in \mathcal{S}_i$ indexes the top-$k$ most similar items. This method ensures that the noise injected during the diffusion process remains semantically close to the original item, minimizing the risk of corrupting contextual information.
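A minimal PyTorch sketch of these two steps; the function name and the mean-pooling aggregation of the top-$k$ embeddings are assumptions for illustration, not details confirmed by the paper:

```python
import torch

def similarity_noise(seq_emb: torch.Tensor, item_emb: torch.Tensor,
                     seq_ids: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Deterministic noise built from each item's top-k most similar items.

    seq_emb:  (L, d) embeddings of the items in the user sequence
    item_emb: (V, d) full item embedding matrix E
    seq_ids:  (L,)   item ids of the sequence (to exclude self-matches)
    """
    scores = seq_emb @ item_emb.T                            # (L, V) dot-product similarity
    scores.scatter_(1, seq_ids.unsqueeze(1), float("-inf"))  # exclude the item itself
    topk_idx = scores.topk(k, dim=1).indices                 # (L, k) similar-item indices
    # Aggregate the selected embeddings into one noise vector per position
    # (mean pooling is an assumed choice of aggregation).
    return item_emb[topk_idx].mean(dim=1)                    # (L, d)
```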
2.2 Diffusion Process
The forward process incorporates this similarity-guided noise in a deterministic manner:

$$x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1-\alpha_t}\, \epsilon.$$

With repeated application (for $t$ steps), this yields:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s,$$

where $x_0$ denotes the original input, and $\sqrt{\bar{\alpha}_t}$, $\sqrt{1-\bar{\alpha}_t}$ are schedule parameters controlling the retention of original information and the injection of noise at each step.
In the reverse process, a parameterized denoising network predicts a normal distribution over the previous step:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t)\big).$$

This denoises the sequence embedding iteratively, culminating in a recovered sequence representation $\hat{x}_0$ that is then mapped back to discrete items using a learned rounding function.
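The closed-form forward step can be implemented directly; the linear beta schedule below is a standard DDPM-style assumption, not a value taken from the paper:

```python
import torch

def forward_diffuse(x0: torch.Tensor, noise: torch.Tensor,
                    alpha_bar: torch.Tensor, t: int) -> torch.Tensor:
    """Closed-form step x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps.

    x0:        (L, d) original sequence embeddings
    noise:     (L, d) similarity-guided noise (deterministic, not Gaussian)
    alpha_bar: (T,)   cumulative products of the alpha schedule
    """
    a = alpha_bar[t]
    return a.sqrt() * x0 + (1.0 - a).sqrt() * noise

# Example schedule: linear betas, a common DDPM-style default (assumed here).
T = 200
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)
```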
2.3 Confidence-Guided Position Selection
SimDiffRec uses the model's own denoising confidence to select augmentation positions. For each position $i$ in the sequence, the denoising model outputs logits:

$$z_i = h_i \mathbf{E}^\top \in \mathbb{R}^{|\mathcal{V}|},$$

yielding a softmax probability for each candidate item $v$:

$$p_i(v) = \frac{\exp(z_{i,v})}{\sum_{v'=1}^{|\mathcal{V}|} \exp(z_{i,v'})},$$

with $|\mathcal{V}|$ denoting the vocabulary size and $h_i$ the hidden representation at position $i$. The confidence score is computed as

$$c_i = \max_{v \in \mathcal{V}} p_i(v).$$
Positions with the highest confidence values are chosen for augmentation, so that perturbations are applied where the model's understanding (as reflected by high restoration probability) is robust, mitigating the risk of introducing misleading noise.
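A sketch of this selection step under the definitions above, assuming the output projection reuses the item embedding matrix:

```python
import torch

def select_positions(h: torch.Tensor, item_emb: torch.Tensor,
                     num_positions: int) -> torch.Tensor:
    """Pick the positions where denoising confidence is highest.

    h:        (L, d) hidden states from the denoising network
    item_emb: (V, d) item embedding matrix used as the output projection
    """
    logits = h @ item_emb.T                        # (L, V) per-position item logits
    probs = logits.softmax(dim=-1)                 # distribution over candidate items
    confidence = probs.max(dim=-1).values          # c_i = max_v p_i(v)
    return confidence.topk(num_positions).indices  # positions to augment
```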
2.4 Loss Objectives
The overall training objective integrates three loss components:
- Sequential Recommendation Loss ($\mathcal{L}_{\mathrm{rec}}$): Typically a standard next-item prediction loss.
- Contrastive Loss ($\mathcal{L}_{\mathrm{CL}}$): Employing the InfoNCE formulation with hard negative sampling,

$$\mathcal{L}_{\mathrm{CL}} = -\log \frac{\exp(\operatorname{sim}(z, z^{+})/\tau)}{\exp(\operatorname{sim}(z, z^{+})/\tau) + \sum_{z^{-} \in \mathcal{N}} \exp(\operatorname{sim}(z, z^{-})/\tau)},$$

where $\mathcal{N}$ contains hard negatives selected by similarity to the positive samples.
- Diffusion Loss ($\mathcal{L}_{\mathrm{diff}}$): Measures the deviation between the original and recovered (denoised) sequences:

$$\mathcal{L}_{\mathrm{diff}} = \left\lVert x_0 - \hat{x}_0 \right\rVert_2^2.$$
The total loss is given by:

$$\mathcal{L} = \mathcal{L}_{\mathrm{rec}} + \lambda_1 \mathcal{L}_{\mathrm{CL}} + \lambda_2 \mathcal{L}_{\mathrm{diff}},$$

where $\lambda_1$ and $\lambda_2$ are weighting parameters.
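A sketch of the combined objective; `info_nce` follows the standard InfoNCE form with explicit hard negatives, and the temperature value is an assumption:

```python
import torch
import torch.nn.functional as F

def info_nce(z: torch.Tensor, z_pos: torch.Tensor, z_neg: torch.Tensor,
             tau: float = 0.1) -> torch.Tensor:
    """InfoNCE with explicit hard negatives.

    z, z_pos: (B, d) anchor and augmented-view representations
    z_neg:    (B, n, d) hard negatives per anchor
    """
    z, z_pos, z_neg = (F.normalize(x, dim=-1) for x in (z, z_pos, z_neg))
    pos = (z * z_pos).sum(-1, keepdim=True) / tau       # (B, 1) positive similarity
    neg = torch.einsum("bd,bnd->bn", z, z_neg) / tau    # (B, n) negative similarities
    logits = torch.cat([pos, neg], dim=1)               # positive sits at class 0
    labels = torch.zeros(z.size(0), dtype=torch.long, device=z.device)
    return F.cross_entropy(logits, labels)

def total_loss(l_rec, l_cl, l_diff, lam1: float, lam2: float):
    """L = L_rec + lam1 * L_CL + lam2 * L_diff (weights tuned per dataset)."""
    return l_rec + lam1 * l_cl + lam2 * l_diff
```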
3. Contrastive Learning with Discriminative Sample Generation
The similarity-guided diffusion mechanism generates augmented sequences that retain contextual accuracy and structural coherence, resulting in higher-quality positive pairs for contrastive learning. Hard negative sampling is carried out by drawing negatives that are semantically similar in the embedding space, based on the denoised output's item probability distribution. This practice encourages the model to distinguish between subtle variations in user behavior and to learn more granular representations.
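A sketch of rank-based hard negative selection from the denoised item distribution; taking the item at a fixed rank after masking the positive is an illustrative reading of the "ranking index" parameter, not the paper's exact rule:

```python
import torch

def hard_negatives(probs: torch.Tensor, target: torch.Tensor, rank: int) -> torch.Tensor:
    """Pick, per position, the item at a given similarity rank as a hard negative.

    probs:  (L, V) item probabilities from the denoised output
    target: (L,)   ground-truth item ids, excluded from candidates
    rank:   which ranked candidate to use (higher rank = easier negative)
    """
    probs = probs.clone()
    probs.scatter_(1, target.unsqueeze(1), float("-inf"))  # never sample the positive
    ranked = probs.argsort(dim=1, descending=True)         # (L, V) candidates by score
    return ranked[:, rank]                                 # (L,) hard negative ids
```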
The empirical evidence in the paper (Choi et al., 16 Jul 2025) demonstrates that this strategy yields state-of-the-art results in HR@K and NDCG@K across multiple standard datasets.
4. Practical Considerations and Implementation Details
4.1 Model Architecture
SimDiffRec employs a Transformer-based architecture with two attention layers for the recommendation encoder and a one-layer Transformer for the diffusion model. Item and sequence embedding dimensions are harmonized, and optimization is performed with Adam (learning rate 0.001), batch size 256, and maximum sequence lengths adapted for the dataset in use (typically 50–200).
Augmentation parameters, namely the number of similar items $k$ used for noise construction and the ranking index for negative sampling, are tuned to accommodate item sparsity across domains.
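The reported settings, gathered into a single illustrative config; dataset-dependent or assumed values are flagged in the comments:

```python
# Hyperparameters reported in the paper, collected into one place.
CONFIG = {
    "encoder_layers": 2,      # Transformer layers in the recommendation encoder
    "diffusion_layers": 1,    # Transformer layers in the denoising network
    "optimizer": "adam",
    "learning_rate": 1e-3,
    "batch_size": 256,
    "max_seq_len": 50,        # 50-200 depending on the dataset
    "top_k_similar": 5,       # k for noise construction; 1-50 depending on sparsity
    "neg_rank": 1,            # ranking index for hard negatives (assumed default)
}
```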
4.2 Efficiency and Scalability
The deterministic nature of the similarity-guided forward diffusion reduces stochastic variance and improves reproducibility when generating augmentations. Confidence-based augmentation ensures computational focus on regions of the sequence most beneficial for training, enhancing sample efficiency. Ablation studies suggest that all submodules—semantic similarity-based noise, confidence-based position selection, and hard negative sampling—contribute materially to performance.
5. Empirical Evaluation
5.1 Datasets and Evaluation Protocol
Experiments are conducted on five public datasets: Amazon Beauty, Toys, Sports, Yelp, and MovieLens-1M. Metrics include Hit Ratio (HR@K) and Normalized Discounted Cumulative Gain (NDCG@K). The evaluation protocol is leave-one-out, with the last user interaction reserved for testing.
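A compact sketch of the leave-one-out evaluation metrics, given the rank of each user's held-out item:

```python
import numpy as np

def hr_ndcg_at_k(ranks: np.ndarray, k: int) -> tuple[float, float]:
    """Leave-one-out HR@K and NDCG@K from the rank of each held-out item.

    ranks: 1-based rank of the true next item per test user
    """
    hits = ranks <= k
    hr = hits.mean()                                        # fraction of hits in top-k
    ndcg = np.where(hits, 1.0 / np.log2(ranks + 1.0), 0.0).mean()
    return float(hr), float(ndcg)

# Example: ranks of the held-out item for four users
print(hr_ndcg_at_k(np.array([1, 3, 12, 7]), k=10))  # HR@10 ~ 0.75, NDCG@10 ~ 0.46
```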
5.2 Results
SimDiffRec consistently outperforms nine baseline models, representing classical (BERT4Rec, SASRec), contrastive learning (CL4SRec, DuoRec, MCLRec, ECL-SR), and diffusion-based (DiffuASR, DreamRec, CaDiRec) recommendation frameworks. Relative improvements over the next-best, such as CaDiRec, range from 4.16% to 13.36% on HR@K and NDCG@K, depending on the dataset.
Ablation studies show that removing any of the core similarity-guided mechanisms leads to a notable drop in performance, highlighting their necessity.
Further analysis reveals that optimal settings for $k$ vary with domain sparsity: low-sparsity domains favor values in the 25–50 range, while high-sparsity domains favor lower values (1–5). Sensitivity to loss weighting parameters is dataset-dependent, suggesting that small-scale hyperparameter tuning is essential for deployment.
Visualization using t-SNE confirms that confidence-based augmentation produces tighter, more coherent clusters in the learned embedding space, affirming semantic consistency and improved representation structure.
6. Broader Context and Future Directions
Similarity-Guided Diffusion for Contrastive Sequential Recommendation establishes a link between item similarity-driven augmentation and robust contrastive representation learning. By directly integrating structure from the item embedding space and leveraging model confidence, this approach overcomes key weaknesses in random augmentation and enhances the fidelity of learned user preferences.
The potential applicability of this framework extends to other recommendation scenarios, including session-based recommendation and contexts where side information or multi-modal data is available, provided suitable similarity metrics are devised for noise generation. Further exploration of adaptive and context-dependent similarity computation and augmentation position selection, as well as integration with explicit intent or semantic retrieval signals, is a plausible direction arising from this work.
7. Comparison with Related Approaches
SimDiffRec (Choi et al., 16 Jul 2025) distinguishes itself from earlier random-augmentation-based diffusion or contrastive methods (Li et al., 2023, Du et al., 2023, Liu et al., 2023) by explicitly aligning noise and augmentation strategy with semantic relationships derived from embeddings. This reduces the risk of context loss and improves representation learning, as evidenced by significant empirical gains. Related contrastive frameworks relying on global graph-based augmentation (Zhang et al., 2022, Wang et al., 2022), explanation-guided augmentation (Wang et al., 2022), or latent intent-aware guidance (Qu et al., 22 Apr 2025) can inform the continued refinement of similarity-guided diffusion models.
In summary, similarity-guided diffusion for contrastive sequential recommendation provides an effective, semantically coherent augmentation strategy for SR models, leveraging both item similarity and confidence-driven guidance to improve both training stability and recommendation quality.