Papers
Topics
Authors
Recent
Search
2000 character limit reached

KGTorrent: Kaggle Notebook Corpus

Updated 23 June 2026
  • KGTorrent is an offline database of 248,761 Kaggle Python notebooks and metadata that supports reproducible research in data science workflows.
  • It employs a comprehensive MySQL relational schema linking notebooks, competitions, and user profiles to enable detailed analyses of coding patterns and library usage.
  • The dataset spans 2015 to 2020, offering robust temporal coverage for exploring the evolution of notebook practices and competition strategies.

KGTorrent is an offline corpus and metadata database of Python Jupyter notebooks sourced from Kaggle, designed to facilitate systematic academic study of computational notebook practices, user behaviors, and data science workflows at scale. Developed by Quaranta, Calefato, and Lanubile, KGTorrent provides both the full content of publicly available Kaggle Python notebooks (known as "kernels") and a reconstructed relational database of Kaggle entities, supporting research into notebook usage patterns, reproducibility, library trends, and broader issues in computational literacies (Quaranta et al., 2021).

1. Scope, Purpose, and Coverage

KGTorrent’s primary objective is to enable researchers to analyze at scale how data scientists and practitioners utilize Jupyter Notebook in the context of Kaggle competitions and datasets, eliminating the need for repeated crawling, schema inference, or partial data extractions. The dataset’s construction provides strong guarantees of reproducibility and coverage, including raw notebooks and normalized, referentially complete metadata.

Key attributes of the collection:

  • Temporal coverage: November 2015 (debut of Python “kernels” on Kaggle) through October 2020
  • Breadth: 248,761 Python Jupyter notebooks, 2,910 distinct competitions, and 5,598,921 Kaggle user accounts
  • Storage footprint: 175 GB (uncompressed), 76 GB (compressed) for notebooks; MySQL metadata dump at 8.31 GB (uncompressed), ≈1 GB (gzipped)
  • Metadata scope: Comprehensive relational schema covering competitions, datasets, user tiering, submissions, tags, and more

The intended use cases include, but are not limited to, analysis of winning notebook strategies, library usage in high-scoring notebooks, and tracking evolution of coding and documentation practices in large-scale data science environments (Quaranta et al., 2021).

2. Metadata Structure and Schema

KGTorrent’s metadata derives from a reverse-engineered version of Kaggle’s “Meta Kaggle” CSV dump, mapped into a MySQL relational schema comprising 29 tables. The schema affords precise linking between notebook files and their associated Kaggle objects, enabling multifaceted queries and data integration.

Selected schema elements:

Table Key Fields Description
Kernels KernelId (PK), UserId, CompetitionId, Title, Language, KernelType, CreationDate, License, IsPrivate, TotalComments, Upvotes, Downvotes, PublicScore, PrivateScore, Score Each row is a notebook (kernel); allows filtering by user, competition, language, privacy, etc.
KernelVersions VersionId (PK), KernelId, SourceCodeSize, NumCodeCells, NumMarkdownCells, ExecutionTime, NumOutputBytes Multiple historical versions per notebook; enables longitudinal/temporal analysis
Users UserId (PK), UserName, Country, Organization, MedalCounts, Tier fields (CompetitionTier, NotebookTier, etc.) User profile, expertise stratification, and achievement summary
Competitions CompetitionId (PK), Title, HostSegment, StartDate, EndDate, Reward Contextualizes notebook efforts within competitive events
Tag tables KernelTags, CompetitionTags TagId links to KernelId or CompetitionId; support topical and domain filtering

Controlled vocabularies enforce internal consistency, e.g., Language ∈ {‘python’, ‘r’}; Tier ∈ {‘Novice’, ‘Contributor’, ‘Expert’, ‘Master’, ‘Grandmaster’}; KernelType ∈ {‘notebook’, ‘script’}.

3. Construction Pipeline and Integrity

KGTorrent’s construction process involves several rigorously defined stages:

  • Meta Kaggle consumption: The daily “Meta Kaggle” CSV export is the canonical source. Reverse engineering leverages Kaggle’s naming conventions (“Id” suffix for keys; “Date” for timestamps) and community documentation to restore foreign-key relationships deterministically.
  • Relational import: SQLAlchemy scripts establish all tables, primary and foreign key constraints. Referential integrity is strictly enforced, resulting in ≈8.8% average data loss (e.g., 7.7% of kernels) due to unresolvable keys.
  • Notebook acquisition: For Python notebooks, the definitive download query retrieves the latest version’s URL and user association, filtering by k.Language='python'. Two modes are supported:
    • HTTP ("Download" emulation): Preserves output cells, desired for studies of result-reproducibility; throttled to ≈50 requests/s, negligible rate-limit errors (0.004% notebooks fail).
    • API (kaggle kernels pull): Faster, but omits outputs.
  • File organization: Each notebook is stored as <UserName>_<UrlSlug>.ipynb under a common root directory, simplifying programmatic access.

4. Refresh, Consistency, and Update Mechanisms

KGTorrent includes a built-in automated refresh and synchronization utility, kgtorrent.py, providing:

  • Initialization (init): One-time dump of all notebooks and metadata.
  • Refresh (refresh): Delta-based update, aligned with newer Meta Kaggle snapshots, with atomic consistency for both file and database state. The pseudocode logic entails:
    • Identifying new and deleted KernelIds
    • Downloading added notebooks
    • Removing obsolete files
    • Swapping MySQL dumps and notebook trees in a single atomic step

This process allows for efficient repeated updates (weekly/monthly), suitable for integration into CI pipelines or cron jobs. Explicit safeguards against duplicated or partial data states are enforced (Quaranta et al., 2021).

5. Example Usage Patterns

KGTorrent is amenable to standard workflow tools favored by the data science community.

Example: Querying for competition-specific notebooks using SQLAlchemy and Pandas

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
from sqlalchemy import create_engine, MetaData, Table
import pandas as pd

engine = create_engine('mysql+pymysql://user:pw@localhost/kgtorrent')
metadata = MetaData(bind=engine)
Kernels = Table('Kernels', metadata, autoload_with=engine)

sql = """
SELECT k.KernelId, u.UserName, k.CurrentUrlSlug, k.Score
FROM Kernels k
  JOIN Users u ON k.UserId = u.UserId
WHERE k.CompetitionId = :cid
"
df = pd.read_sql(sql, engine, params={'cid':1234})
df['notebook_path'] = df.apply(
  lambda row: f"notebooks/{row.UserName}_{row.CurrentUrlSlug}.ipynb", axis=1)

Example: Analyzing import frequencies for sklearn usage

1
2
3
4
5
6
7
8
9
10
11
12
13
import nbformat, re
from collections import Counter

counter = Counter()
for path in df['notebook_path'][:100]:
  nb = nbformat.read(path, as_version=4)
  for cell in nb.cells:
    if cell.cell_type=='code':
      for line in cell.source.splitlines():
        m = re.match(r'\s*from\s+sklearn\.([^ ]+)\s+import', line)
        if m: counter[m.group(1)] += 1

print(counter.most_common(10))

Dynamic notebook downloading, if pre-fetching is omitted, is supported by a concise Python function using HTTP GET requests to Kaggle endpoints.

6. Dataset Statistics and Analytical Insights

While Quaranta et al. do not provide field-level statistics, they supply procedures to compute key aggregations. For example, the mean and median number of code cells per notebook (using Pandas over the KernelVersions table for latest versions):

Let NN be the number of notebooks and xix_i their number of code cells: μ=1Ni=1Nxi\mu = \frac{1}{N}\sum_{i=1}^N x_i

s2=1N1i=1N(xiμ)2s^2 = \frac{1}{N-1}\sum_{i=1}^N (x_i - \mu)^2

Trends established on similar corpora and verifiable within KGTorrent include:

  • Python dominates (≈96% of notebooks), overtaking R.
  • Median notebook size exhibits periodic growth, correlating with increasing narrative/visualization content.
  • High-medal-tier users tend to author notebooks characterized by modularity (fewer, larger cells) and reduced external library dependence.

Distributional analyses are also supported for fields such as SourceCodeSize and ExecutionTime (Quaranta et al., 2021).

7. Research Applications and Extensions

KGTorrent’s scale, granularity, and linkage capabilities enable diverse inquiry:

  • Reproducibility studies: Applying methodologies such as Osiris [2] to assess the re-execution reliability of notebooks with embedded outputs.
  • Behavioral analytics: Mining correlations between notebook structures and user tier progression or competition outcomes.
  • Librarian studies: Charting adoption rates and usage patterns for frameworks (e.g., TensorFlow, PyTorch), and AutoML systems.
  • Software quality evaluation: Detection of code duplication [3], technical debt [4], code smells, and patterns of reuse within a competition-centric context.
  • Tool enhancement: Construction of recommenders for data preparation [5] or environment restoration [6], leveraging observed “ground-truth” practices.
  • Versioned analysis: Studying temporal evolution of notebooks akin to Stack Overflow’s SOTorrent dataset [7].

A plausible implication is that KGTorrent, by analogy to GHTorrent [7], serves as a foundational infrastructure for empirical “data science on data science.”


References:

(Quaranta et al., 2021) Quaranta et al., "KGTorrent: A Dataset of Python Jupyter Notebooks from Kaggle"

Definition Search Book Streamline Icon: https://streamlinehq.com
References (1)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to KGTorrent.