
CAM: A Collection of Snapshots of GitHub Java Repositories Together with Metrics (2403.08488v1)

Published 13 Mar 2024 in cs.SE

Abstract: Even though numerous researchers require stable datasets along with source code and basic metrics calculated on them, neither GitHub nor any other code hosting platform provides such a resource. Consequently, each researcher must download their own data, compute the necessary metrics, and then publish the dataset somewhere to ensure it remains accessible indefinitely. Our CAM (short for "Classes and Metrics") project addresses this need. It is open-source software capable of cloning Java repositories from GitHub, filtering out unnecessary files, parsing Java classes, and computing metrics such as Cyclomatic Complexity, Halstead Effort and Volume, C&K metrics, Maintainability Metrics, LCOM5 and HND, as well as some Git-based metrics. At least once a year, we execute the entire script, a process that requires a minimum of ten days on a very powerful server, to generate a new dataset. Subsequently, we publish it on Amazon S3, thereby ensuring its availability as a reference for researchers. The latest archive, 2.2 GB in size and published on March 2, 2024, includes 532K Java classes with 48 metrics for each class.
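
To make the kind of per-file metric mentioned in the abstract concrete, here is a minimal sketch that approximates Cyclomatic Complexity for a piece of Java source. It is not CAM's implementation: the function names, the token list, and the regex-based approach are all assumptions made for illustration. A real tool would parse the syntax tree and report values per class or per method, and this sketch deliberately ignores the ternary operator and `switch` expression forms.

```python
import re

# Tokens treated as decision points. This list is an assumption of the sketch;
# it omits the ternary operator to avoid confusion with generic wildcards.
_DECISION_PATTERN = re.compile(r"\b(?:if|for|while|case|catch)\b|&&|\|\|")

# Comments, string literals, and char literals, removed before counting so
# that keywords appearing inside them are not mistaken for code.
_NOISE_PATTERN = re.compile(
    r'//[^\n]*|/\*.*?\*/|"(?:\\.|[^"\\])*"|\'(?:\\.|[^\'\\])*\'',
    re.DOTALL,
)


def strip_comments_and_literals(java_source: str) -> str:
    """Replace comments and string/char literals with a single space."""
    return _NOISE_PATTERN.sub(" ", java_source)


def approximate_cyclomatic_complexity(java_source: str) -> int:
    """Return 1 plus the number of decision tokens found in the source."""
    cleaned = strip_comments_and_literals(java_source)
    return 1 + len(_DECISION_PATTERN.findall(cleaned))


SAMPLE = """
class A {
  int f(int x) {
    // if this comment were counted, the result would be wrong
    if (x > 0 && x < 10) {
      return 1;
    }
    for (int i = 0; i < x; i++) {
      x--;
    }
    return 0;
  }
}
"""

if __name__ == "__main__":
    print(approximate_cyclomatic_complexity(SAMPLE))
```

For the sample class, the sketch counts one `if`, one `&&`, and one `for`, giving a value of 4 for the whole file. Token counting like this is only a rough stand-in for a parser-based measurement.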

Authors (1)
  1. Yegor Bugayenko