Boidae: Your Personal Mining Platform (2401.11092v1)
Abstract: Mining software repositories is a useful technique for researchers and practitioners to see what software developers actually do when developing software. Tools like Boa let users easily mine open-source software repositories at a very large scale, with datasets containing hundreds of thousands of projects. The trade-off is that users must use the provided infrastructure, query language, runtime, and datasets, which might not fit all analysis needs. In this work, we present Boidae: a family of Boa installations controlled and customized by users. Boidae uses automation tools such as Ansible and Docker to facilitate the deployment of a customized Boa installation. In particular, Boidae allows the creation of custom datasets generated from any set of Git repositories, with helper scripts to aid in finding and cloning repositories from GitHub and SourceForge. In this paper, we briefly describe the architecture of Boidae and how researchers can use the infrastructure to generate custom datasets. Boidae's scripts and all the infrastructure it builds upon are open source. A video demonstration of Boidae's installation and extension is available at https://go.unl.edu/boidae.
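As a rough illustration of the repository-cloning step mentioned in the abstract, the sketch below turns a list of GitHub repository names into `git clone` commands. It is not Boidae's helper script: the function names `clone_command` and `plan_clones`, the `repos` directory layout, and the `owner/name` input format are all assumptions made for this example.

```python
from pathlib import Path


def clone_command(repo: str, dest_root: str = "repos") -> list[str]:
    """Build a `git clone` command for a GitHub repository named 'owner/name'.

    Hypothetical helper for illustration only; not part of Boidae.
    Raises ValueError if the name does not have exactly one '/'.
    """
    owner, name = repo.strip().split("/")
    url = f"https://github.com/{owner}/{name}.git"
    # Assumed layout: one directory per owner under dest_root.
    target = Path(dest_root) / owner / name
    return ["git", "clone", url, target.as_posix()]


def plan_clones(repos: list[str], dest_root: str = "repos") -> list[list[str]]:
    """Return one clone command per distinct repository, skipping blank entries."""
    seen = set()
    commands = []
    for repo in repos:
        key = repo.strip().lower()  # treat names as case-insensitive
        if not key or key in seen:
            continue
        seen.add(key)
        commands.append(clone_command(repo, dest_root))
    return commands
```

Each returned command could then be executed with `subprocess.run(cmd, check=True)`. Building the plan separately from running it keeps the sketch testable without network access.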