Does It Look Sequential? An Analysis of Datasets for Evaluation of Sequential Recommendations

Published 21 Aug 2024 in cs.IR, cs.AI, and cs.LG | (2408.12008v1)

Abstract: Sequential recommender systems are an important and demanded area of research. Such systems aim to use the order of interactions in a user's history to predict future interactions. The premise is that the order of interactions and sequential patterns play an essential role. Therefore, it is crucial to use datasets that exhibit a sequential structure to evaluate sequential recommenders properly. We apply several methods based on the random shuffling of the user's sequence of interactions to assess the strength of sequential structure across 15 datasets, frequently used for sequential recommender systems evaluation in recent research papers presented at top-tier conferences. As shuffling explicitly breaks sequential dependencies inherent in datasets, we estimate the strength of sequential patterns by comparing metrics for shuffled and original versions of the dataset. Our findings show that several popular datasets have a rather weak sequential structure.

Summary

The paper presents three analytical approaches, including sequential rule mining and model-based evaluation, to assess how strongly a dataset depends on the order of interactions.
It finds that key metrics drop sharply after shuffling on datasets such as ML-20m and 30Music, which indicates strong sequential patterns in those datasets.
The study questions the suitability of popular datasets for sequential recommendations, urging improved dataset-task alignment in future research.

An Analysis of Datasets for Sequential Recommendations Evaluation

The paper "Does It Look Sequential? An Analysis of Datasets for Evaluation of Sequential Recommendations," authored by Anton Klenitskiy, Anna Volodkevich, Anton Pembek, and Alexey Vasilev, studies the presence and strength of sequential structure in datasets commonly used for evaluating Sequential Recommender Systems (SRSs). SRSs use the order of user interactions to predict future actions, so a dataset must actually contain sequential patterns for the evaluation to be meaningful. The study applies several shuffling-based methods to measure the strength of sequential structure in 15 widely used datasets.

Methodological Approaches

Three primary approaches are proposed to assess the strength of sequential patterns within these datasets:

Sequential Rules:
- This model-agnostic approach relies on mining sequential association rules in the form of 2-grams and 3-grams.
- Analyzing the discrepancy in rule counts before and after shuffling user sequences serves as an indicator of the dataset’s sequential structure.
Model-based Approaches:
- Performance Degradation:
  - Sequential models (SASRec and GRU4Rec) are trained on original datasets, and their performance in terms of HitRate@10 and NDCG@10 is evaluated before and after shuffling test sequences.
  - A significant performance drop post-shuffling would imply a strong sequential structure.
- Top-K Jaccard Score:
  - The Jaccard similarity between top-K recommendation lists for original and shuffled sequences is measured.
  - Higher Jaccard scores mean the recommendations barely change when the order is destroyed, which suggests weaker reliance on sequential patterns.
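
The sequential-rules idea can be illustrated with a minimal sketch. It assumes a rule is an n-gram of consecutive items supported by at least `min_support` user sequences; that threshold, the function names, and the relative-drop formula are illustrative assumptions, not the paper's exact definitions.

```python
import random
from collections import Counter


def count_frequent_ngrams(sequences, n, min_support):
    """Count distinct n-grams of consecutive items that appear in at
    least `min_support` user sequences (assumed rule definition)."""
    support = Counter()
    for seq in sequences:
        # Use a set so each user contributes at most once per n-gram.
        grams = {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}
        support.update(grams)
    return sum(1 for c in support.values() if c >= min_support)


def shuffle_sequences(sequences, seed=0):
    """Return a copy of the data with each user's sequence shuffled."""
    rng = random.Random(seed)
    shuffled = []
    for seq in sequences:
        items = list(seq)  # copy, so the input is not modified
        rng.shuffle(items)
        shuffled.append(items)
    return shuffled


def relative_rule_drop(sequences, n=2, min_support=2, seed=0):
    """Fraction of frequent n-grams lost after shuffling.
    Returns None when the original data has no frequent n-grams."""
    original = count_frequent_ngrams(sequences, n, min_support)
    if original == 0:
        return None
    shuffled = count_frequent_ngrams(
        shuffle_sequences(sequences, seed), n, min_support
    )
    return (original - shuffled) / original
```

A drop close to 1 would mean almost all frequent n-grams disappear once the order is randomized. Because a single shuffle is random, averaging the result over several seeds is a reasonable precaution.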

Data and Preprocessing

The datasets include a mix of academic benchmarks and recent industrial datasets across varied domains such as e-commerce, music, gaming, and social networking. To preserve the sequential nature of the data and avoid data leakage, the study combines a global temporal split with leave-one-out evaluation. Preprocessing retains interactions corresponding to item views or clicks and applies 5-core filtering (keeping only users and items with at least five interactions) to ensure consistent evaluation conditions.
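
As a rough illustration of the filtering step, the sketch below applies k-core filtering to a list of `(user, item, timestamp)` tuples. It assumes the filter is applied iteratively until no more rows are removed; the summary does not say whether the authors iterate or apply a single pass, and the function name and data layout are hypothetical.

```python
from collections import Counter


def k_core_filter(interactions, k=5):
    """Repeatedly drop users and items with fewer than k interactions.

    `interactions` is a list of (user, item, timestamp) tuples.
    Iterating to a fixed point is an assumption of this sketch.
    """
    current = list(interactions)
    while True:
        user_counts = Counter(user for user, _, _ in current)
        item_counts = Counter(item for _, item, _ in current)
        kept = [
            row for row in current
            if user_counts[row[0]] >= k and item_counts[row[1]] >= k
        ]
        # Stop once a full pass removes nothing.
        if len(kept) == len(current):
            return kept
        current = kept
```

Removing a sparse item can push a user below the threshold and vice versa, which is why the loop repeats until the data stops shrinking.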

Results and Findings

Sequential Rules Analysis

The analysis revealed that datasets like Beauty, Sports, Games, Steam, and Yelp showed a large decline in the number of 2-gram and 3-gram rules after shuffling, which points to sequential patterns in the original order. However, the initial rule counts for some of these datasets were quite low, so these declines should be interpreted with caution.

Model-based Results

Both the degradation in SRS performance after shuffling and the Jaccard similarity vary widely across datasets. Datasets such as MegaMarket, ML-20m, 30Music, and Zvuk showed a strong sequential structure, with significant accuracy drops and low Jaccard similarity after shuffling. Conversely, datasets like Foursquare, Gowalla, RetailRocket, Steam, and Yelp exhibited minimal performance drops and higher Jaccard similarity, suggesting weaker sequential structures.
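
The two model-based measures can be sketched for a single user as follows. The sketch assumes one held-out target item per user (as in leave-one-out evaluation) and takes recommendation lists as plain Python lists; the function names are illustrative and not taken from the paper's code.

```python
import math


def jaccard_at_k(recs_original, recs_shuffled, k=10):
    """Jaccard similarity between two top-k recommendation lists."""
    a, b = set(recs_original[:k]), set(recs_shuffled[:k])
    if not a and not b:
        return 1.0  # two empty lists are treated as identical
    return len(a & b) / len(a | b)


def hit_rate_at_k(recs, target, k=10):
    """1.0 if the held-out item is in the top-k list, else 0.0."""
    return 1.0 if target in recs[:k] else 0.0


def ndcg_at_k(recs, target, k=10):
    """NDCG@k with a single relevant item, so the ideal DCG is 1."""
    for rank, item in enumerate(recs[:k]):
        if item == target:
            return 1.0 / math.log2(rank + 2)
    return 0.0


def relative_change(metric_original, metric_shuffled):
    """Relative metric change after shuffling (negative means a drop)."""
    return (metric_shuffled - metric_original) / metric_original
```

In practice each per-user value would be averaged over all test users before the original and shuffled versions are compared.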

Implications and Future Directions

These findings question the suitability of several popular datasets for evaluating SRSs, because their sequential structure is weak. Conclusions about model performance drawn in previous studies that used these datasets may therefore be less robust than assumed. Future work may include a deeper investigation into dataset-task alignment and refined criteria for assessing sequential patterns, both of which would make SRS evaluations more reliable.

Conclusion

This paper underscores the importance of checking whether a dataset is suitable for sequential recommendation tasks. By measuring the strength of sequential patterns before choosing a benchmark, researchers can avoid a mismatch between dataset and task and obtain more credible evaluation results for SRSs.