DataRec: A Python Library for Standardized and Reproducible Data Management in Recommender Systems
Abstract: Recommender systems have demonstrated significant impact across diverse domains, yet ensuring the reproducibility of experimental findings remains a persistent challenge. A primary obstacle lies in the fragmented and often opaque data management strategies employed during the preprocessing stage, where decisions about dataset selection, filtering, and splitting can substantially influence outcomes. To address these limitations, we introduce DataRec, an open-source Python library designed to unify and streamline data handling in recommender system research. By providing reproducible routines for dataset preparation, data versioning, and seamless integration with other frameworks, DataRec promotes methodological standardization, interoperability, and comparability across different experimental setups. Our design is informed by an in-depth review of 55 state-of-the-art recommendation studies, ensuring that DataRec adopts best practices while addressing common pitfalls in data management. Ultimately, our contribution facilitates fair benchmarking, enhances reproducibility, and fosters greater trust in experimental results within the broader recommender systems community. The DataRec library, documentation, and examples are freely available at https://github.com/sisinflab/DataRec.
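To make the preprocessing decisions named in the abstract concrete, the following is a minimal standard-library sketch of a reproducible filter-and-split step. It is not DataRec's API: the function names `k_core`, `random_holdout`, and `fingerprint`, the toy interactions, and the choice of k-core filtering with a seeded random hold-out are all assumptions made for illustration.

```python
import hashlib
import random
from collections import Counter


def k_core(interactions, k):
    """Iteratively drop users and items with fewer than k interactions."""
    data = list(interactions)
    while True:
        user_counts = Counter(u for u, _ in data)
        item_counts = Counter(i for _, i in data)
        kept = [(u, i) for u, i in data
                if user_counts[u] >= k and item_counts[i] >= k]
        if len(kept) == len(data):  # nothing removed: the k-core is stable
            return kept
        data = kept


def random_holdout(interactions, test_ratio, seed):
    """Split (user, item) pairs into train/test with a fixed seed."""
    data = sorted(interactions)          # canonical order: input order is irrelevant
    random.Random(seed).shuffle(data)    # local RNG, does not touch global state
    n_test = int(len(data) * test_ratio)
    return data[n_test:], data[:n_test]


def fingerprint(interactions):
    """SHA-256 over the sorted pairs, usable as a cheap version identifier."""
    payload = "\n".join(f"{u}\t{i}" for u, i in sorted(interactions))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical toy data: (user, item) pairs.
raw = [("u1", "i1"), ("u1", "i2"), ("u2", "i1"), ("u2", "i2"), ("u3", "i3")]
filtered = k_core(raw, k=2)
train, test = random_holdout(filtered, test_ratio=0.25, seed=42)
print(len(filtered), len(train), len(test), fingerprint(train)[:12])
```

Reporting `k`, `test_ratio`, `seed`, and the fingerprints alongside results lets another researcher check that they have rebuilt exactly the same training and test sets.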