KGTorrent: Kaggle Notebook Corpus
- KGTorrent is an offline database of 248,761 Kaggle Python notebooks and metadata that supports reproducible research in data science workflows.
- It employs a comprehensive MySQL relational schema linking notebooks, competitions, and user profiles to enable detailed analyses of coding patterns and library usage.
- The dataset spans 2015 to 2020, offering robust temporal coverage for exploring the evolution of notebook practices and competition strategies.
KGTorrent is an offline corpus and metadata database of Python Jupyter notebooks sourced from Kaggle, designed to facilitate systematic academic study of computational notebook practices, user behaviors, and data science workflows at scale. Developed by Quaranta, Calefato, and Lanubile, KGTorrent provides both the full content of publicly available Kaggle Python notebooks (known as "kernels") and a reconstructed relational database of Kaggle entities, supporting research into notebook usage patterns, reproducibility, library trends, and broader issues in computational literacies (Quaranta et al., 2021).
1. Scope, Purpose, and Coverage
KGTorrent’s primary objective is to enable researchers to analyze at scale how data scientists and practitioners utilize Jupyter Notebook in the context of Kaggle competitions and datasets, eliminating the need for repeated crawling, schema inference, or partial data extractions. The dataset’s construction provides strong guarantees of reproducibility and coverage, including raw notebooks and normalized, referentially complete metadata.
Key attributes of the collection:
- Temporal coverage: November 2015 (debut of Python “kernels” on Kaggle) through October 2020
- Breadth: 248,761 Python Jupyter notebooks, 2,910 distinct competitions, and 5,598,921 Kaggle user accounts
- Storage footprint: 175 GB (uncompressed), 76 GB (compressed) for notebooks; MySQL metadata dump at 8.31 GB (uncompressed), ≈1 GB (gzipped)
- Metadata scope: Comprehensive relational schema covering competitions, datasets, user tiering, submissions, tags, and more
The intended use cases include, but are not limited to, analysis of winning notebook strategies, library usage in high-scoring notebooks, and tracking evolution of coding and documentation practices in large-scale data science environments (Quaranta et al., 2021).
2. Metadata Structure and Schema
KGTorrent’s metadata derives from a reverse-engineered version of Kaggle’s “Meta Kaggle” CSV dump, mapped into a MySQL relational schema comprising 29 tables. The schema affords precise linking between notebook files and their associated Kaggle objects, enabling multifaceted queries and data integration.
Selected schema elements:
| Table | Key Fields | Description |
|---|---|---|
| Kernels | KernelId (PK), UserId, CompetitionId, Title, Language, KernelType, CreationDate, License, IsPrivate, TotalComments, Upvotes, Downvotes, PublicScore, PrivateScore, Score | Each row is a notebook (kernel); allows filtering by user, competition, language, privacy, etc. |
| KernelVersions | VersionId (PK), KernelId, SourceCodeSize, NumCodeCells, NumMarkdownCells, ExecutionTime, NumOutputBytes | Multiple historical versions per notebook; enables longitudinal/temporal analysis |
| Users | UserId (PK), UserName, Country, Organization, MedalCounts, Tier fields (CompetitionTier, NotebookTier, etc.) | User profile, expertise stratification, and achievement summary |
| Competitions | CompetitionId (PK), Title, HostSegment, StartDate, EndDate, Reward | Contextualizes notebook efforts within competitive events |
| Tag tables | KernelTags, CompetitionTags | TagId links to KernelId or CompetitionId; support topical and domain filtering |
Controlled vocabularies enforce internal consistency, e.g., Language ∈ {‘python’, ‘r’}; Tier ∈ {‘Novice’, ‘Contributor’, ‘Expert’, ‘Master’, ‘Grandmaster’}; KernelType ∈ {‘notebook’, ‘script’}.
3. Construction Pipeline and Integrity
KGTorrent’s construction process involves several rigorously defined stages:
- Meta Kaggle consumption: The daily “Meta Kaggle” CSV export is the canonical source. Reverse engineering leverages Kaggle’s naming conventions (“Id” suffix for keys; “Date” for timestamps) and community documentation to restore foreign-key relationships deterministically.
- Relational import: SQLAlchemy scripts establish all tables, primary and foreign key constraints. Referential integrity is strictly enforced, resulting in ≈8.8% average data loss (e.g., 7.7% of kernels) due to unresolvable keys.
- Notebook acquisition: For Python notebooks, the definitive download query retrieves the latest version’s URL and user association, filtering by
k.Language='python'. Two modes are supported:- HTTP ("Download" emulation): Preserves output cells, desired for studies of result-reproducibility; throttled to ≈50 requests/s, negligible rate-limit errors (0.004% notebooks fail).
- API (
kaggle kernels pull): Faster, but omits outputs.
- File organization: Each notebook is stored as
<UserName>_<UrlSlug>.ipynbunder a common root directory, simplifying programmatic access.
4. Refresh, Consistency, and Update Mechanisms
KGTorrent includes a built-in automated refresh and synchronization utility, kgtorrent.py, providing:
- Initialization (
init): One-time dump of all notebooks and metadata. - Refresh (
refresh): Delta-based update, aligned with newer Meta Kaggle snapshots, with atomic consistency for both file and database state. The pseudocode logic entails:- Identifying new and deleted KernelIds
- Downloading added notebooks
- Removing obsolete files
- Swapping MySQL dumps and notebook trees in a single atomic step
This process allows for efficient repeated updates (weekly/monthly), suitable for integration into CI pipelines or cron jobs. Explicit safeguards against duplicated or partial data states are enforced (Quaranta et al., 2021).
5. Example Usage Patterns
KGTorrent is amenable to standard workflow tools favored by the data science community.
Example: Querying for competition-specific notebooks using SQLAlchemy and Pandas
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
from sqlalchemy import create_engine, MetaData, Table import pandas as pd engine = create_engine('mysql+pymysql://user:pw@localhost/kgtorrent') metadata = MetaData(bind=engine) Kernels = Table('Kernels', metadata, autoload_with=engine) sql = """ SELECT k.KernelId, u.UserName, k.CurrentUrlSlug, k.Score FROM Kernels k JOIN Users u ON k.UserId = u.UserId WHERE k.CompetitionId = :cid " df = pd.read_sql(sql, engine, params={'cid':1234}) df['notebook_path'] = df.apply( lambda row: f"notebooks/{row.UserName}_{row.CurrentUrlSlug}.ipynb", axis=1) |
Example: Analyzing import frequencies for sklearn usage
1 2 3 4 5 6 7 8 9 10 11 12 13 |
import nbformat, re from collections import Counter counter = Counter() for path in df['notebook_path'][:100]: nb = nbformat.read(path, as_version=4) for cell in nb.cells: if cell.cell_type=='code': for line in cell.source.splitlines(): m = re.match(r'\s*from\s+sklearn\.([^ ]+)\s+import', line) if m: counter[m.group(1)] += 1 print(counter.most_common(10)) |
Dynamic notebook downloading, if pre-fetching is omitted, is supported by a concise Python function using HTTP GET requests to Kaggle endpoints.
6. Dataset Statistics and Analytical Insights
While Quaranta et al. do not provide field-level statistics, they supply procedures to compute key aggregations. For example, the mean and median number of code cells per notebook (using Pandas over the KernelVersions table for latest versions):
Let be the number of notebooks and their number of code cells:
Trends established on similar corpora and verifiable within KGTorrent include:
- Python dominates (≈96% of notebooks), overtaking R.
- Median notebook size exhibits periodic growth, correlating with increasing narrative/visualization content.
- High-medal-tier users tend to author notebooks characterized by modularity (fewer, larger cells) and reduced external library dependence.
Distributional analyses are also supported for fields such as SourceCodeSize and ExecutionTime (Quaranta et al., 2021).
7. Research Applications and Extensions
KGTorrent’s scale, granularity, and linkage capabilities enable diverse inquiry:
- Reproducibility studies: Applying methodologies such as Osiris [2] to assess the re-execution reliability of notebooks with embedded outputs.
- Behavioral analytics: Mining correlations between notebook structures and user tier progression or competition outcomes.
- Librarian studies: Charting adoption rates and usage patterns for frameworks (e.g., TensorFlow, PyTorch), and AutoML systems.
- Software quality evaluation: Detection of code duplication [3], technical debt [4], code smells, and patterns of reuse within a competition-centric context.
- Tool enhancement: Construction of recommenders for data preparation [5] or environment restoration [6], leveraging observed “ground-truth” practices.
- Versioned analysis: Studying temporal evolution of notebooks akin to Stack Overflow’s SOTorrent dataset [7].
A plausible implication is that KGTorrent, by analogy to GHTorrent [7], serves as a foundational infrastructure for empirical “data science on data science.”
References:
(Quaranta et al., 2021) Quaranta et al., "KGTorrent: A Dataset of Python Jupyter Notebooks from Kaggle"