- The paper presents three analytical approaches, including sequential rule mining and model-based evaluation, to assess dataset order integrity.
- It finds that shuffling causes significant drops in HitRate@10 and NDCG@10 on datasets such as ML-20m and 30Music, which is evidence of strong sequential patterns in those datasets.
- The study questions the suitability of popular datasets for sequential recommendations, urging improved dataset-task alignment in future research.
An Analysis of Datasets for Sequential Recommendations Evaluation
The paper "Does It Look Sequential? An Analysis of Datasets for Evaluation of Sequential Recommendations," authored by Anton Klenitskiy, Anna Volodkevich, Anton Pembek, and Alexey Vasilev, presents an in-depth study on the presence and strength of sequential structures in datasets commonly used for evaluating Sequential Recommender Systems (SRSs). SRSs leverage the order of user interactions to predict future actions, rendering the sequential patterns in datasets crucial for accurate evaluation. This study applies a range of analytical methods to scrutinize the sequential integrity of 15 widely used datasets.
Methodological Approaches
Three primary approaches are proposed to assess the strength of sequential patterns within these datasets:
- Sequential Rules:
  - This model-agnostic approach relies on mining sequential association rules in the form of 2-grams and 3-grams.
  - Analyzing the discrepancy in rule counts before and after shuffling user sequences serves as an indicator of the dataset’s sequential structure.
- Model-based Approaches:
  - Performance Degradation:
    - Sequential models (SASRec and GRU4Rec) are trained on original datasets, and their performance in terms of HitRate@10 and NDCG@10 is evaluated before and after shuffling test sequences.
    - A significant performance drop post-shuffling would imply a strong sequential structure.
  - Top-K Jaccard Score:
    - The Jaccard similarity between top-K recommendation lists for original and shuffled sequences is measured.
    - Higher Jaccard scores suggest weaker reliance on sequential patterns.
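The performance-degradation check can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `recommend` callable stands in for a trained model such as SASRec or GRU4Rec, the function names are hypothetical, and the sketch assumes a single held-out target per user with a non-empty list of users.

```python
import math
import random


def hit_rate_at_k(ranked_items, target, k=10):
    """Return 1.0 if the held-out target is among the top-k items, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0


def ndcg_at_k(ranked_items, target, k=10):
    """NDCG@k with one relevant item, so the ideal DCG equals 1."""
    top_k = list(ranked_items[:k])
    if target not in top_k:
        return 0.0
    rank = top_k.index(target)  # 0-based position of the target
    return 1.0 / math.log2(rank + 2)


def evaluate(recommend, histories, targets, k=10, shuffle=False, seed=0):
    """Average HitRate@k and NDCG@k, optionally shuffling each input history.

    `recommend` is a hypothetical callable mapping a history (list of item
    ids) to a ranked list of item ids. Assumes at least one user.
    """
    rng = random.Random(seed)
    hits, ndcgs = [], []
    for history, target in zip(histories, targets):
        model_input = list(history)
        if shuffle:
            rng.shuffle(model_input)  # destroy the order, keep the items
        ranked = recommend(model_input)
        hits.append(hit_rate_at_k(ranked, target, k))
        ndcgs.append(ndcg_at_k(ranked, target, k))
    return sum(hits) / len(hits), sum(ndcgs) / len(ndcgs)


def relative_change(original, shuffled):
    """Relative metric change after shuffling; negative values mean a drop."""
    return (shuffled - original) / original
```

Calling `evaluate` once with `shuffle=False` and once with `shuffle=True`, then passing the two metric values to `relative_change`, gives the kind of before/after comparison described above.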
Data and Preprocessing
The datasets include a mix of academic benchmarks and recent industrial datasets across varied domains such as e-commerce, music, gaming, social networking, and more. To preserve the sequential nature and avoid data leakage, the study combines global temporal splits with leave-one-out constraints. Preprocessing involved retaining interactions corresponding to item views or clicks and applying a 5-core filtering to ensure consistent evaluation conditions.
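As an illustration of the 5-core step, the sketch below repeatedly removes users and items with fewer than `k` interactions until nothing changes. Whether the study applies the filter iteratively in this way is an assumption, and the `(user, item, timestamp)` tuple layout and the function name are hypothetical.

```python
from collections import Counter


def k_core_filter(interactions, k=5):
    """Keep only users and items that each have at least k interactions.

    `interactions` is a list of (user, item, timestamp) tuples. Filtering is
    repeated because removing an item can push a user below k, and vice versa.
    """
    current = list(interactions)
    while True:
        user_counts = Counter(user for user, _, _ in current)
        item_counts = Counter(item for _, item, _ in current)
        kept = [
            row for row in current
            if user_counts[row[0]] >= k and item_counts[row[1]] >= k
        ]
        if len(kept) == len(current):  # fixed point reached
            return kept
        current = kept
```

The loop terminates because each pass either removes at least one interaction or returns.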
Results and Findings
Sequential Rules Analysis
The analysis revealed that datasets like Beauty, Sports, Games, Steam, and Yelp showed a steep decline in the number of 2-gram and 3-gram rules after shuffling, indicating strong sequential patterns in the original order. However, the initial rule counts for some of these datasets were quite low, so these results should be interpreted with caution.
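A rule-count comparison of this kind could look roughly like the sketch below. It counts n-grams of consecutive items that appear in at least `min_support` user sequences, then compares the count before and after shuffling each sequence. The support threshold of 5, the once-per-user counting, and all function names are assumptions made for illustration, not details taken from the paper.

```python
import random
from collections import Counter


def count_frequent_ngrams(sequences, n=2, min_support=5):
    """Count distinct n-grams of consecutive items seen in >= min_support sequences."""
    support = Counter()
    for seq in sequences:
        # A set ensures each n-gram is counted at most once per user sequence.
        grams = {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}
        support.update(grams)
    return sum(1 for count in support.values() if count >= min_support)


def shuffle_sequences(sequences, seed=0):
    """Return copies of the sequences with the item order randomized."""
    rng = random.Random(seed)
    shuffled = []
    for seq in sequences:
        copy = list(seq)
        rng.shuffle(copy)
        shuffled.append(copy)
    return shuffled


def relative_rule_change(sequences, n=2, min_support=5, seed=0):
    """Relative change in the frequent n-gram count after shuffling."""
    before = count_frequent_ngrams(sequences, n, min_support)
    after = count_frequent_ngrams(shuffle_sequences(sequences, seed), n, min_support)
    return (after - before) / before if before else float("nan")
```

A value near -1 would mean almost all frequent n-grams disappear after shuffling. A small `before` count makes the ratio unstable, which matches the caution about low initial rule counts.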
Model-based Results
Both the degradation in SRS performance after shuffling and the Jaccard score vary widely across datasets. Datasets such as MegaMarket, ML-20m, 30Music, and Zvuk showed a robust sequential structure, with significant accuracy drops and low Jaccard similarity after shuffling. Conversely, datasets like Foursquare, Gowalla, RetailRocket, Steam, and Yelp exhibited minimal performance drops and higher Jaccard similarity, suggesting weaker sequential structures.
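For reference, the top-K Jaccard score between two recommendation lists can be computed as in this small sketch. The function names are hypothetical, and averaging over users is an assumption about how per-user scores are aggregated.

```python
def jaccard_at_k(recs_original, recs_shuffled, k=10):
    """Jaccard similarity between the top-k items of two ranked lists."""
    a, b = set(recs_original[:k]), set(recs_shuffled[:k])
    union = a | b
    # Two empty lists are treated as identical.
    return len(a & b) / len(union) if union else 1.0


def mean_jaccard_at_k(original_lists, shuffled_lists, k=10):
    """Average the per-user Jaccard score; assumes at least one user."""
    scores = [
        jaccard_at_k(orig, shuf, k)
        for orig, shuf in zip(original_lists, shuffled_lists)
    ]
    return sum(scores) / len(scores)
```

A score of 1.0 means shuffling the input left the top-K set unchanged, while a score near 0 means the recommendations depend heavily on item order.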
Implications and Future Directions
These findings question the suitability of several popular datasets for evaluating SRSs due to their weak sequential structures. The robustness of conclusions drawn about model performance in previous studies using these datasets may thereby be undermined. Future work may include a deeper investigation into dataset-task alignment and refining criteria for sequential pattern assessment, enhancing the reliability of SRS evaluations.
Conclusion
This paper underscores the importance of properly assessing dataset suitability for sequential recommendation tasks. By applying methods that measure the strength of sequential patterns, researchers can avoid dataset-task misalignment and obtain more accurate and credible evaluation results for SRSs.