
KodCode: Synthetic Coding Dataset

Updated 2 February 2026
  • KodCode is a fully synthetic dataset offering 447K validated Python prompt-solution-test triplets for training code LLMs.
  • It spans a wide range of coding difficulties, from simple exercises to contest-level challenges verified through automated testing.
  • Fine-tuning with KodCode demonstrates state-of-the-art gains on major coding benchmarks, highlighting its research significance.

KodCode is a large-scale, fully synthetic dataset purpose-built for training and evaluating LLMs on coding tasks of varying difficulty and domains. Each data point consists of a triplet: a natural-language or Python-completion prompt, a solution function in Python, and a corresponding pytest-style unit test suite, all systematically validated via automated self-verification. KodCode addresses enduring bottlenecks in code-focused LLM research, namely the need for both breadth—spanning simple tasks to algorithmic challenges—and verifiable correctness through explicit test cases. The dataset comprises 447,000 such triplets, organized across “easy,” “medium,” and “hard” difficulty labels based on empirical test-passing statistics, with negligible contamination relative to standard evaluation sets. Fine-tuning with KodCode demonstrably advances open-source state of the art on major coding benchmarks, and the dataset is primed for both supervised instruction fine-tuning and reinforcement learning optimization paradigms (Xu et al., 4 Mar 2025).

1. Dataset Construction and Structure

KodCode consists of 447,000 “question–solution–test” triplets, each verified to ensure both semantic diversity and code correctness. Each triplet is defined as:

  • a prompt q (either in natural language or Python-completion format),
  • a solution function sol (Python),
  • a pytest-style unit test suite test.

Solution and test blocks are demarcated by explicit delimiters (“<Solution Begin>/<Solution End>” and “<Test Begin>/<Test End>”), ensuring the LLM models the joint distribution

$$P(q,\ \mathrm{sol},\ \mathrm{test})\,.$$

The triplets span a spectrum from basic one‐liner Python exercises to complex, contest-level algorithmic tasks, and are evenly distributed across three empirical difficulty levels.

| Component  | Content                                           | Delimiter                         |
|------------|---------------------------------------------------|-----------------------------------|
| Prompt     | Natural-language or Python-completion instruction | (none)                            |
| Solution   | Python function implementing the solution         | <Solution Begin> / <Solution End> |
| Test Suite | Pytest-style unit tests verifying correctness     | <Test Begin> / <Test End>         |

KodCode’s breadth is enabled by diverse sourcing (see Section 2), structured post-processing, and systematic verification procedures.

2. Data Sources and Synthesis Pipeline

KodCode synthesizes its content from twelve diverse input streams, including:

  • Magpie-Prefill: simple tasks initiated via “Write a function to...” prompts and Qwen2.5-Coder-7B sampling.
  • Seed corpora: assessment-style problems mirrored from LeetCode, Codeforces, APPS, TACO, and Code Contests using GPT-4o.
  • Data structure and algorithmic snippets: drawn from public Python DSA repositories, paired with LLM-generated novel challenges.
  • Docs questions: generated from real package documentation (Flask, Pandas, PyTorch, scikit-learn, Seaborn) with an abstention signal (“BAD_DOCUMENT”) for insufficient coverage.
  • Magpie-filtered LLM outputs: synthesis using seven open-source LLMs; prompt tagging for “Algorithm Implementation” or “Function Generation” utilizes Llama-3.1-8B.

Redundant or near-duplicate questions are aggressively removed by semantic deduplication using all-mpnet-base-v2 embeddings and FAISS, reducing volume by over 25%. The synthesis pipeline is organized in three primary steps:

Step 1: Question Generation. Diversity is achieved through the sources listed above.
Step 2: Solution and Test Case Generation. GPT-4o-0513 generates candidate solutions and unit tests, which are verified via sandboxed execution.
Step 3: Post-Training Data Synthesis. Verified triplets are converted into Python-completion prompts or instruction–response pairs (with chain-of-thought traces from DeepSeek-R1), increasing format and reasoning diversity.
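The semantic-deduplication step can be sketched as a greedy cosine-similarity filter. The paper uses all-mpnet-base-v2 embeddings indexed with FAISS; the NumPy version below operates on pre-computed embedding vectors, and the 0.95 threshold is an assumption for illustration:

```python
import numpy as np

def deduplicate(embeddings, threshold=0.95):
    """Greedy near-duplicate filter: keep an item only if its cosine
    similarity to every already-kept item stays below `threshold`.
    (KodCode uses all-mpnet-base-v2 embeddings with FAISS at scale;
    the threshold value here is an illustrative assumption.)"""
    # Normalize rows so that dot products equal cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept = []
    for i, vec in enumerate(normed):
        if all(vec @ normed[j] < threshold for j in kept):
            kept.append(i)
    return kept
```

On a toy batch, `deduplicate(np.array([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]]))` keeps the first and third vectors, dropping the near-duplicate second one.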

3. Verification, Difficulty Labeling, and Quality Control

KodCode implements rigorous self-verification for correctness:

  • Each question q and proposed (sol, test) pair is executed in a sandbox environment.
  • Unsuccessful triplets are re-sampled up to n = 10 times, a rejection-sampling technique that increases Pass@k rates. For example, increasing the attempt count from 1 to 5 boosts Pass@k by ≈ 20%, with a further ≈ 4% gain from 5 to 10.
  • The Pass@k metric is employed:

$$\mathrm{Pass@}k \;=\; \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\!\left(\bigvee_{j=1}^{k} \mathrm{passes}_{ij}\right)$$

  • Examples with irrecoverable test failures (≈ 1.5% in MBPP-validated set) are excluded.
  • MBPP validation: 80 of 90 generated solutions passed self-verification; 78/80 (97.5%) then passed human test suites, yielding an end-to-end error rate < 2.5%.
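The rejection-sampling loop and the Pass@k metric can be sketched as follows; `generate_candidate` and `run_tests` are hypothetical stand-ins for the LLM sampler and the sandboxed pytest execution:

```python
def self_verify(question, generate_candidate, run_tests, max_attempts=10):
    """Rejection sampling: re-generate (solution, test) pairs until one
    passes in the sandbox, up to max_attempts (n = 10 in KodCode).
    `generate_candidate` and `run_tests` are hypothetical stand-ins."""
    for attempt in range(1, max_attempts + 1):
        solution, tests = generate_candidate(question)
        if run_tests(solution, tests):
            return solution, tests, attempt
    return None  # irrecoverable failure: the triplet is discarded

def pass_at_k(results, k):
    """Pass@k: fraction of questions solved by at least one of the first
    k attempts. results[i][j] is True if attempt j on question i passed."""
    return sum(any(r[:k]) for r in results) / len(results)
```

The `None` return corresponds to the ≈ 1.5% of examples with irrecoverable test failures that are excluded from the dataset.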

Difficulty labels are derived empirically:

  • “easy” if Pass@10 > 2/3,
  • “medium” if 1/3 < Pass@10 ≤ 2/3,
  • “hard” if Pass@10 ≤ 1/3.
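These thresholds map directly to a small labeling function (treating a Pass@10 of exactly 1/3 as hard, so the three tiers partition the range):

```python
def difficulty_label(pass_at_10):
    """Map an empirical Pass@10 rate to KodCode's difficulty tier."""
    if pass_at_10 > 2 / 3:
        return "easy"
    if pass_at_10 > 1 / 3:
        return "medium"
    return "hard"
```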

Contamination with HumanEval, MBPP, BigCodeBench, and LiveCodeBench is negligible: only 94 of the 447K examples have cosine similarity above 0.95 with any benchmark item, and those examples are excluded.

4. Supervised and Reinforcement Learning Usability

KodCode is engineered for compatibility with both supervised fine-tuning and reinforcement learning (RL) regimes:

  • Supervised Fine-Tuning: The paired question–solution–test structure enables standard SFT training. Post-synthesis conversion generates a sub-corpus KodCode-SFT with Python completion and instruction–response (chain-of-thought) formats.
  • Reinforcement Learning: Each example’s solution is coupled with machine-executable unit tests, permitting RL algorithms (e.g., PPO, GRPO, REINFORCE++) to utilize test pass rates as reward signals for policy optimization.

A plausible implication is that KodCode directly supports research into self-improving and verifiably robust code LLMs at scale through fully automated validation pipelines.

5. Benchmarking and Performance Outcomes

Fine-tuning competitive open-source models (e.g., Qwen2.5-Coder-32B-Instruct) with KodCode-SFT sets new state-of-the-art results across several coding benchmarks:

| Benchmark            | KodCode-32B-50K (%) | Qwen2.5-Coder-32B (%) | Margin (%) |
|----------------------|---------------------|-----------------------|------------|
| HumanEval (base)     | 92.7                | 90.9                  | +1.8       |
| BigCodeBench-C Full  | 59.8                | 57.6                  | +2.2       |
| BigCodeBench-C Hard  | 37.8                | 31.1                  | +6.7       |
| BigCodeBench-I Full  | 51.1                | 49.4                  | +1.7       |
| BigCodeBench-I Hard  | 32.4                | 25.7                  | +6.7       |
| LiveCodeBench-Easy   | 87.8                | 86.7                  | +1.1       |
| LiveCodeBench-Medium | 35.9                | 35.9                  | 0          |
| LiveCodeBench-Hard   | 6.7                 | 8.5                   | -1.8       |

Averaged across all sub-metrics, KodCode-32B-50K attains 61.22%, outperforming all prior open-source and even larger reasoning models on these tasks.

6. Limitations and Future Directions

KodCode’s design facilitates extensive experimentation with diverse, verifiable, and instructive coding data, but some constraints remain:

  • The “hardest” tiers, especially in LiveCodeBench-Hard, are not yet fully matched—suggesting opportunities for scaling toward even higher-difficulty, contest-level problem synthesis.
  • Post-training RL routines, while enabled by unit tests, present open challenges in optimal reward shaping and curriculum design.
  • Future efforts are slated to address (1) the synthesis of even harder, contest-grade problems, (2) development of automated data-selection heuristics to optimize post-training mixture, and (3) the assembly of curriculum-style, repository-level datasets.

KodCode’s empirical validation, scale, and systematic correctness verification support its ongoing use and extension in research on robust, self-certifying code LLMs (Xu et al., 4 Mar 2025).
