The Role of Data Filtering in Open Source Software Ranking and Selection (2401.10136v1)
Abstract: Faced with over 100M open source projects, most empirical investigations select a subset. Most of the research papers we investigated in leading venues filter projects by some measure of popularity, with explicit or implicit arguments that unpopular projects are not of interest, may not even represent "real" software projects, or are not worthy of study. However, such filtering may have enormous effects on the results of the studies if, and precisely because, the sought-out response or prediction is in any way related to the filtering criteria. We exemplify the impact of this practice on research outcomes by showing how filtering of projects listed on GitHub affects the assessment of their popularity. We randomly sample over 100,000 repositories and use multiple regression to model the number of stars (a proxy for popularity) based on the number of commits, the duration of the project, the number of authors, and the number of core developers. Comparing a control model fitted to the entire dataset with a model fitted to a filtered sample (projects having ten or more authors), we find that while certain characteristics of the repository consistently predict popularity, the filtering process significantly alters the relationships between these characteristics and the response. The number of commits exhibited a positive correlation with popularity in the control sample but showed a negative correlation in the filtered sample. These findings highlight the potential biases introduced by data filtering and emphasize the need for careful sample selection in empirical research on mining software repositories. We recommend that empirical work either analyze complete datasets such as World of Code, or employ stratified random sampling from a complete dataset, to ensure that filtering does not bias the results.
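The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' code or data: the repository table below is a hypothetical toy grid, and the star formulas and their numbers are assumptions chosen only so that small and large projects relate commits to stars differently. The helper names (`ols`, `toy_repos`, `fit`) are also invented for illustration. The sketch fits the same multiple regression once on all repositories and once on those with ten or more authors.

```python
from itertools import product


def ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    An intercept column is prepended; returns [intercept, b1, b2, ...].
    """
    X = [[1.0, *r] for r in rows]
    k = len(X[0])
    # Build the augmented system [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * p for a, p in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def toy_repos():
    """Hypothetical data: (commits, months, authors, core, stars) tuples.

    The two star formulas are invented for illustration only.
    """
    repos = []
    for commits, months, authors, core in product(
            [10, 50, 200, 1000], [1, 12, 36], [2, 5, 10, 25], [1, 2]):
        if authors < 10:
            stars = 1 + 0.08 * commits + 0.5 * months + 2 * authors + core
        else:
            stars = 500 - 0.05 * commits + 0.5 * months + 3 * authors + core
        repos.append((commits, months, authors, core, stars))
    return repos


def fit(repos):
    """Regress stars on commits, months, authors, and core developers."""
    return ols([r[:4] for r in repos], [r[4] for r in repos])


repos = toy_repos()
full = fit(repos)                                   # control: every repository
filtered = fit([r for r in repos if r[2] >= 10])    # filter: ten or more authors

# Index 1 is the coefficient on commits (index 0 is the intercept).
print("commits coefficient, full sample:   ", round(full[1], 4))
print("commits coefficient, authors >= 10: ", round(filtered[1], 4))
```

On this toy grid the commits coefficient is positive in the full sample (about 0.015) and negative after filtering (about -0.05). The flip is built into the invented data, so the sketch only demonstrates the mechanism by which a filter related to the response can reverse a coefficient's sign; it says nothing about the magnitudes reported in the paper.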