CAM: A Collection of Snapshots of GitHub Java Repositories Together with Metrics (2403.08488v1)
Abstract: Although many researchers need stable datasets that bundle source code with basic metrics computed on it, neither GitHub nor any other code hosting platform provides such a resource. Consequently, each researcher must download their own data, compute the necessary metrics, and then publish the dataset somewhere to ensure it remains accessible indefinitely. Our CAM (short for "Classes and Metrics") project addresses this need. It is open-source software capable of cloning Java repositories from GitHub, filtering out unnecessary files, parsing Java classes, and computing metrics such as Cyclomatic Complexity, Halstead Effort and Volume, C&K metrics, Maintainability Metrics, LCOM5 and HND, as well as some Git-based metrics. At least once a year, we execute the entire script, a process that requires at least ten days on a very powerful server, to generate a new dataset. We then publish it on Amazon S3, ensuring its availability as a reference for researchers. The latest archive, 2.2 GB in size and published on March 2, 2024, includes 532K Java classes with 48 metrics for each class.
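To make the "parse Java classes and compute metrics" step more concrete, here is a minimal Python sketch of one such metric, an approximation of Cyclomatic Complexity. It is not CAM's actual implementation: the function names, the regular expressions, and the choice to treat a whole source file as a single unit (McCabe's measure is normally computed per method) are all assumptions made for illustration. It counts branching constructs after stripping comments and string literals.

```python
import re

# Constructs that each add one decision point (a common approximation;
# a bare "?" also matches generic wildcards such as List<?>, which is a known limitation).
_DECISION_PATTERN = re.compile(r"\b(if|for|while|case|catch)\b|&&|\|\||\?")

# Comments and string/char literals, removed so keywords inside them are not counted.
_NOISE_PATTERN = re.compile(
    r'//[^\n]*|/\*.*?\*/|"(?:\\.|[^"\\])*"|\'(?:\\.|[^\'\\])*\'',
    re.DOTALL,
)


def strip_comments_and_strings(java_source: str) -> str:
    """Replace comments and literals with a space (hypothetical helper)."""
    return _NOISE_PATTERN.sub(" ", java_source)


def approximate_cyclomatic_complexity(java_source: str) -> int:
    """Return 1 plus the number of decision points found in the source text."""
    cleaned = strip_comments_and_strings(java_source)
    return 1 + len(_DECISION_PATTERN.findall(cleaned))


# A made-up Java class used only to exercise the sketch.
SAMPLE = """
class A {
  int f(int x) {
    // if this comment were counted, the result would be wrong
    if (x > 0 && x < 10) { return 1; }
    for (int i = 0; i < x; i++) { x--; }
    return 0;
  }
}
"""

if __name__ == "__main__":
    print(approximate_cyclomatic_complexity(SAMPLE))
```

A text-based count like this is only a rough stand-in; a tool that processes hundreds of thousands of classes would more plausibly rely on a real Java parser so that per-method boundaries and syntax edge cases are handled correctly.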
Author: Yegor Bugayenko