Kaggle Meta Datasets Overview
- Kaggle Meta Datasets are comprehensive archives capturing Kaggle’s competitions, notebooks, discussions, and user interactions with detailed temporal and relational metadata.
- They integrate ETL pipelines, regex-based code parsing, and monthly snapshot updates using CSV, Parquet, and JSON formats to ensure data reliability and support longitudinal analysis.
- The datasets empower investigations into technology adoption, user behavior, and competition dynamics, facilitating advanced studies like recommender system benchmarking and topic modeling.
Kaggle Meta Datasets are comprehensive, openly available archives capturing the full temporal and relational landscape of Kaggle’s data science activity, including competitions, public notebooks (kernels), discussion threads, datasets, and leaderboard histories. Their purpose is to enable rigorous empirical research on data science phenomena, workflow evolution, technological adoption trends, and community dynamics using real-world data. These datasets are distinguished by their granularity, scope, and the presence of cross-linked entity tables and code artifacts spanning 2010 through 2025.
1. Historical Context and Origins
Kaggle, established in 2010, developed a robust ecosystem centered around competitive data analysis. Over 15 years, the platform expanded beyond competitions to include user-contributed datasets, executable notebooks, forums, and models. In 2024–2025, Kaggle released two archival resources—referred to as “Meta Kaggle” and “Meta Kaggle Code”—intended to support systematic investigations into platform growth, competition strategies, code evolution, and user engagement (Bönisch et al., 9 Nov 2025). These releases mark a shift toward open, large-scale community data enabled by periodic ETL jobs on the production backend and the preservation of kernel execution traces.
2. Structure, Scope, and Schema
Kaggle Meta Datasets provide a unified relational schema for the entire platform. Entities and their relationships are encoded in CSV or Parquet files, and, for notebooks, in raw JSON. The core tables and their primary attributes are as follows:
| Table (File) | Key Entity | Example Fields |
|---|---|---|
| competitions.csv | Competition | competition_id, title, host_segment, tags |
| kernels.csv | Notebook | kernel_id, author_user_id, competition_id, code |
| discussions.csv | Discussion Thread | discussion_id, title, body, votes |
| leaderboards.csv | Submission | submission_id, kernel_id, score, rank, date |
| datasets.csv | Dataset | dataset_id, title, author_user_id, tags |
| user_profiles.csv | User | user_id, join_date, country, total_kernels |
Cross-table foreign key relationships (competitions→kernels, kernels→leaderboards, discussions→competitions, user_profiles→all) form the basis for entity linking and longitudinal studies. All date fields use the ISO 8601 standard in UTC. Data normalization includes flattening nested arrays (tags, etc.) and dropping orphaned records to enforce referential integrity (Quaranta et al., 2021).
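A minimal pandas sketch, assuming the file and column names from the schema table above, of resolving the competitions→kernels foreign key for entity linking:

```python
import pandas as pd

# Load two entity tables; file and column names follow the schema table above.
competitions = pd.read_csv("competitions.csv")
kernels = pd.read_csv("kernels.csv")

# Attach competition metadata to each notebook. An inner join mirrors the
# dropping of orphaned records for referential integrity.
linked = kernels.merge(
    competitions[["competition_id", "title", "host_segment"]],
    on="competition_id",
    how="inner",
)
print(linked[["kernel_id", "author_user_id", "title"]].head())
```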
3. Data Collection, Curation, and Release Methodology
Meta Datasets are extracted from Kaggle’s production PostgreSQL database and the Dockerized kernel execution subsystem. The process includes:
- Raw export of entity tables and code artifacts (latest kernel snapshots only).
- Parsing of code cells for import statements via regex, enabling technological trend analysis (see the sketch after this list).
- Normalization of user IDs, timestamps, and array fields.
- Continuous update pipeline: monthly re-snapshots with a `last_snapshot_date` marker, cleaning of duplicate or deleted records, and versioning support for reproducibility (DVC or Git LFS is recommended for local archives).
- For code and notebooks, the full .ipynb JSON is preserved, with metadata linked by notebook identifiers and user IDs (Bönisch et al., 9 Nov 2025; Quaranta et al., 2021).
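A minimal sketch of regex-based import extraction from a preserved .ipynb JSON file, as described above; the regex is illustrative rather than Kaggle's actual pipeline:

```python
import json
import re
from collections import Counter

# Match top-level package names in "import x" / "from x.y import z" lines.
# Capturing only the first dotted segment collapses submodule imports
# (e.g., xgboost.core -> xgboost).
IMPORT_RE = re.compile(r"^\s*(?:from|import)\s+([A-Za-z_]\w*)", re.MULTILINE)

def count_imports(ipynb_path: str) -> Counter:
    """Count top-level package imports across all code cells of a notebook."""
    with open(ipynb_path, encoding="utf-8") as f:
        nb = json.load(f)
    counts = Counter()
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            source = "".join(cell.get("source", []))
            counts.update(IMPORT_RE.findall(source))
    return counts
```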
4. Extended Specific Meta Datasets and Derived Resources
Noteworthy derived datasets include KGTorrent and simulation subsets used in tripartite analysis:
- KGTorrent (Quaranta et al., 2021): Contains 248,761 Jupyter notebooks (Python only), spanning five years and 2,910 competitions. Metadata is stored in a relational MySQL schema. Average record retention after cleaning is ≈91.23%. Enforced foreign-key constraints link notebooks to users and competitions (a hedged query sketch follows this list). The corpus is refreshed via periodic HTTP fetches and manifest validation; orphaned records are filtered out. Coverage includes user and kernel statistics, tags, code/markdown cell counts, dependency lists, and notebook execution counts.
- Tripartite Graph Subset (Kowald et al., 2019): Extracted from Meta Kaggle, this subset models a tripartite interaction graph over Users (U), Datasets (D), and Services (S). Three sets of bipartite interaction tables (UserDatasetInteractions, UserServiceInteractions, DatasetServiceInteractions) serve as the basis for evaluating popularity and collaborative filtering recommenders in data markets. The paper applies interaction-count filters and withholds ten links per entity for test splits, resulting in tailored training/test partitions for each use case (see the split sketch after this list).
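A hedged sketch of querying a locally restored KGTorrent MySQL dump; the connection string and the table and column names below are assumptions for illustration, not KGTorrent's exact schema:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection to a locally restored KGTorrent dump.
engine = create_engine("mysql+pymysql://user:password@localhost/kgtorrent")

# Follow the enforced foreign keys from notebooks to authors and competitions.
# Table and column names are illustrative, not the published schema.
query = """
SELECT k.kernel_id, k.code_cells, k.markdown_cells,
       u.user_id, c.competition_id
FROM kernels AS k
JOIN users AS u ON k.author_user_id = u.user_id
JOIN competitions AS c ON k.competition_id = c.competition_id
"""
notebooks = pd.read_sql(query, engine)
```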
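A minimal sketch of the leave-ten-out split protocol described above, assuming a bipartite interaction table with user_id and dataset_id columns; the file and column names are illustrative:

```python
import pandas as pd

# Hypothetical bipartite interaction table: one row per user-dataset link.
interactions = pd.read_csv("UserDatasetInteractions.csv")

# Withhold up to ten links per user for the test split; the rest is training.
test = (interactions
        .groupby("user_id", group_keys=False)
        .apply(lambda g: g.sample(n=min(10, len(g)), random_state=42)))
train = interactions.drop(test.index)
```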
5. Analytical Use Cases and Representative Workflows
Meta Datasets support diverse analyses across platform growth, technology adoption, user behavior, and competition outcomes:
- Longitudinal Competition Analysis: Annual competition counts and year-over-year growth rates, facilitating time-series studies of platform expansion (see the sketch after this list).
- User Onboarding and Surge Detection: Time-series of user signups analyzed via 90-day sliding-window Z-scores to identify event-driven spikes.
- Kernel Technology Trends: Extraction of import statements from notebook source code enables trend plots for library adoption, collapsing submodule imports (e.g., `xgboost.core` → `xgboost`).
- Discussion Topic Evolution: Topic modeling (e.g., BERTopic) on discussion bodies, enabling quantification of recurring themes over 14+ years (a hedged sketch follows this list).
- Code Complexity Evolution: Aggregation of code and markdown cell counts per year provides proxy metrics for notebook sophistication.
- Recommender System Benchmarking: Tripartite and bipartite adjacency projections from Meta Kaggle enable formal accuracy evaluation of CF/popularity algorithms, with use-case-specific accuracy observations (Kowald et al., 2019).
- Dataset2Vec Meta-Feature Learning: Construction of meta-feature spaces for Kaggle tabular datasets (hierarchical set representation; DeepSet aggregation; supervised/auxiliary losses; cluster and recommendation strategies) (Jomaa et al., 2019).
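A pandas sketch of the first two workflows above. It assumes competitions.csv carries a start_date column (an assumption; only join_date for user_profiles.csv appears in the schema table) and uses a spike threshold of three standard deviations, also an assumption:

```python
import pandas as pd

competitions = pd.read_csv("competitions.csv", parse_dates=["start_date"])
users = pd.read_csv("user_profiles.csv", parse_dates=["join_date"])

# Annual competition counts and year-over-year growth rates.
annual = competitions.groupby(competitions["start_date"].dt.year).size()
growth = annual.pct_change()

# Daily signups with a 90-day sliding-window Z-score; flag event-driven
# spikes above three standard deviations (threshold is an assumption).
daily = users.set_index("join_date").resample("D").size()
rolling = daily.rolling(window=90)
zscores = (daily - rolling.mean()) / rolling.std()
spikes = zscores[zscores > 3.0]
```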
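A minimal BERTopic sketch for the discussion-topic workflow above; it assumes discussion bodies have already been stripped of HTML/markdown markup (see the best practices in Section 6):

```python
import pandas as pd
from bertopic import BERTopic

# Discussion bodies, assumed pre-cleaned of HTML/markdown markup.
discussions = pd.read_csv("discussions.csv")
docs = discussions["body"].dropna().tolist()

# Fit BERTopic and inspect the dominant recurring themes.
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())
```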
6. Limitations, Caveats, and Best Practices
Several limitations are documented in the literature:
- Referential integrity enforcement excludes orphaned records (i.e., notebooks with deleted authors or competitions) (Quaranta et al., 2021).
- Coverage is bounded by public visibility; private or deleted content is not included.
- KGTorrent v1 includes only Python notebooks; R kernels are omitted.
- HTTP download failures are rare but present (0.004%).
- No explicit time-based slicing, normalization beyond interaction filters, or validation splits in recommender simulations (Kowald et al., 2019).
- Cleaning requires respect for deletion and rename markers (`last_snapshot_date`) (Bönisch et al., 9 Nov 2025).
Recommended practices: version datasets locally, employ partitioned Parquet storage for large tables, and cite the exact release snapshot/date for reproducibility. Text analysis should remove HTML/markdown tags prior to NLP processing (a hedged sketch follows).
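A short sketch of these practices, assuming a creation_date column on discussions.csv (not shown in the schema table) and a per-year partitioning scheme:

```python
import re
import pandas as pd

discussions = pd.read_csv("discussions.csv", parse_dates=["creation_date"])

# Strip HTML tags and common markdown markers before NLP processing.
tag_re = re.compile(r"<[^>]+>")
discussions["clean_body"] = (discussions["body"]
                             .fillna("")
                             .str.replace(tag_re, " ", regex=True)
                             .str.replace(r"[*_`#>]+", " ", regex=True))

# Partitioned Parquet keeps per-year reads cheap for longitudinal queries
# (requires pyarrow).
discussions["year"] = discussions["creation_date"].dt.year
discussions.to_parquet("discussions_parquet", partition_cols=["year"])
```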
7. Significance and Impact on Research
Kaggle Meta Datasets constitute a primary empirical resource for the study of real-world data science, enabling researchers to:
- Investigate the adoption timeline of algorithms, libraries, and techniques at individual, competition, and community level.
- Systematically analyze knowledge transfer, reproducibility, and code quality, using millions of cross-linked records.
- Benchmark recommender system strategies, meta-learning pipelines, and data market simulations in authentic user-item-service contexts.
- Perform large-scale topic modeling, anomaly detection, and temporal trend analysis to uncover latent dynamics and technological shifts.
A plausible implication is that these datasets foster research beyond simulated or toy domains, offering an avenue for grounded, scale-sensitive hypothesis testing and methodological refinement. They are regularly updated, richly documented, and directly link code to competition outcomes, facilitating reproducibility and longitudinal analysis at unprecedented scale.