CAM: A Collection of Snapshots of GitHub Java Repositories Together with Metrics
Abstract: Even though numerous researchers require stable datasets along with source code and basic metrics calculated on them, neither GitHub nor any other code hosting platform provides such a resource. Consequently, each researcher must download their own data, compute the necessary metrics, and then publish the dataset somewhere to ensure it remains accessible indefinitely. Our CAM (short for "Classes and Metrics") project addresses this need. It is open-source software capable of cloning Java repositories from GitHub, filtering out unnecessary files, parsing Java classes, and computing metrics such as Cyclomatic Complexity, Halstead Effort and Volume, C&K metrics, Maintainability Metrics, LCOM5 and HND, as well as some Git-based metrics. At least once a year, we run the entire script to generate a new dataset, a process that takes at least ten days on a very powerful server. We then publish the dataset on Amazon S3, ensuring its availability as a reference for researchers. The latest archive, 2.2 GB in size and published on March 2, 2024, includes 532K Java classes with 48 metrics for each class.
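To make the "computing metrics" step more concrete, here is a minimal sketch of how one of the listed metrics, Cyclomatic Complexity, could be approximated for a single Java method. It is not CAM's actual implementation. The function names, the token list, and the regex-based approach are all assumptions made for illustration, and a real tool would work on a parsed syntax tree rather than on raw text.

```python
import re

# Tokens treated as opening an extra execution path in Java source.
# This list is an assumption of the sketch, not CAM's definition.
BRANCH_PATTERN = re.compile(r"\b(?:if|for|while|case|catch)\b|&&|\|\||\?")

# Matches line comments, block comments, string literals, and char literals.
NOISE_PATTERN = re.compile(
    r'//[^\n]*|/\*.*?\*/|"(?:\\.|[^"\\])*"|\'(?:\\.|[^\'\\])*\'',
    re.DOTALL,
)


def strip_comments_and_strings(source: str) -> str:
    """Blank out comments and literals so keywords inside them are not counted."""
    return NOISE_PATTERN.sub(" ", source)


def approx_cyclomatic_complexity(method_source: str) -> int:
    """Approximate McCabe complexity as 1 plus the number of branch tokens."""
    cleaned = strip_comments_and_strings(method_source)
    return 1 + len(BRANCH_PATTERN.findall(cleaned))


# Hypothetical Java method used only to exercise the sketch.
SAMPLE = """
int sign(int x) {
    // if this comment were counted, the result would be wrong
    if (x > 0 && x < 100) {
        return 1;
    } else if (x < 0) {
        return -1;
    }
    return 0;
}
"""

if __name__ == "__main__":
    print(approx_cyclomatic_complexity(SAMPLE))  # two `if` tokens plus one `&&`, plus 1 → 4
```

A text-based count like this miscounts some constructs (for example, the `?` in a generic wildcard such as `List<?>`), which is one reason a dataset with metrics computed once by a parser-based tool is more dependable than ad hoc scripts.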