Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
175 tokens/sec
GPT-4o
8 tokens/sec
Gemini 2.5 Pro Pro
47 tokens/sec
o3 Pro
5 tokens/sec
GPT-4.1 Pro
38 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

A Tool to Extract Structured Data from GitHub (2012.03453v1)

Published 7 Dec 2020 in cs.SE

Abstract: GitHub repositories consist of various detailed information about the project contributors, the number of commits and its contributors, releases, pull requests, programming languages, and issues. However, no systematic dataset of open source projects exists which features detailed information about the repositories on GitHub for knowledge acquisition and mining. In this paper, we developed tool support, named GitRepository, which helps in creating a data-set of repositories based on the proposed schema. Out of initial 1680 repositories, the dataset hosts 620 repositories (with applied basic filters of stars and forks), and 247 repositories (after applying all pre-defined filters). The tool extracts the information of GitHub repositories and saves the data in the form of CSV. files and a database (.DB) file.

Citations (4)

Summary

We haven't generated a summary for this paper yet.