KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding (2503.02951v1)

Published 4 Mar 2025 in cs.LG, cs.AI, and cs.CL

Abstract: We introduce KodCode, a synthetic dataset that addresses the persistent challenge of acquiring high-quality, verifiable training data across diverse difficulties and domains for training LLMs for coding. Existing code-focused resources typically fail to ensure either breadth of coverage (e.g., spanning simple coding tasks to advanced algorithmic problems) or verifiable correctness (e.g., unit tests). In contrast, KodCode comprises question-solution-test triplets that are systematically validated via a self-verification procedure. Our pipeline begins by synthesizing a broad range of coding questions, then generates solutions and test cases, allocating additional attempts to challenging problems. Finally, we synthesize post-training data by rewriting questions into diverse formats and generating responses from a reasoning model (DeepSeek R1) under a test-based reject-sampling procedure. This pipeline yields a large-scale, robust, and diverse coding dataset. KodCode is suitable for supervised fine-tuning, and the paired unit tests also provide strong potential for RL tuning. Fine-tuning experiments on coding benchmarks (HumanEval(+), MBPP(+), BigCodeBench, and LiveCodeBench) demonstrate that KodCode-tuned models achieve state-of-the-art performance, surpassing models like Qwen2.5-Coder-32B-Instruct and DeepSeek-R1-Distill-Llama-70B.
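
The abstract describes the self-verification step only at a high level. The Python sketch below illustrates one plausible reading of test-based reject sampling: each candidate solution is accepted only if its generated unit tests pass when executed. Here `generate_solution` and `generate_tests` are hypothetical LLM wrappers (not APIs from the paper), and the subprocess-based test runner is an illustrative assumption.

```python
import subprocess
import sys
import tempfile


def generate_solution(question: str) -> str:
    """Hypothetical wrapper around a solution-writing LLM (e.g., DeepSeek R1)."""
    raise NotImplementedError


def generate_tests(question: str) -> str:
    """Hypothetical wrapper around a test-writing LLM."""
    raise NotImplementedError


def passes_tests(solution: str, tests: str, timeout: int = 30) -> bool:
    """Run solution + tests in a fresh interpreter; passing == exit code 0."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution + "\n\n" + tests)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def self_verify(question: str, max_attempts: int = 10):
    """Test-based reject sampling: resample until the generated tests pass.

    The paper allocates additional attempts to harder questions; 10 is the
    cap cited in the summary below.
    """
    for _ in range(max_attempts):
        solution = generate_solution(question)
        tests = generate_tests(question)
        if passes_tests(solution, tests):
            return question, solution, tests  # verified triplet
    return None  # discard questions that never verify
```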


Summary

  • The paper introduces KodCode, a synthetic dataset of 447K verified question-solution-test triplets generated via a three-step pipeline for training coding language models.
  • KodCode features extensive diversity, systematic verification using iterative attempts, low contamination, and analyses covering difficulty and deduplication.
  • Fine-tuning experiments show KodCode-tuned models achieve state-of-the-art results on major coding benchmarks like HumanEval and MBPP, outperforming existing large models.

This paper presents a synthetic dataset for coding LLMs that systematically combines diverse question generation, solution synthesis, and unit test verification.

  • The authors design a three-step pipeline—question synthesis from multiple sources, self-verification with up to 10 iterative attempts, and post-training style conversion—to produce 447K verified question-solution-test triplets.
  • Comprehensive analyses cover token-length distributions, near-duplicate removal with FAISS (a minimal sketch follows this list), difficulty categorization, and contamination checks that find minimal overlap with established benchmarks.
  • Fine-tuning experiments show that KodCode-tuned models achieve state-of-the-art scores on benchmarks like HumanEval(+), MBPP(+), BigCodeBench, and LiveCodeBench, surpassing models such as Qwen2.5-Coder and DeepSeek-R1-Distill-Llama-70B.
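
The summary mentions FAISS-based deduplication without giving details; the following is a minimal sketch of one common approach, greedy near-duplicate filtering over question embeddings with an exact inner-product index. The 0.95 cosine threshold and the `greedy_dedup` name are illustrative assumptions rather than the paper's settings, and the embeddings are assumed to be precomputed.

```python
import faiss  # pip install faiss-cpu
import numpy as np


def greedy_dedup(embeddings: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Keep a question only if no previously kept question is too similar.

    embeddings: float32 array of shape (n, d), one vector per question.
    Returns the indices of the questions that survive deduplication.
    """
    embeddings = np.ascontiguousarray(embeddings, dtype=np.float32)
    faiss.normalize_L2(embeddings)  # unit norm -> inner product == cosine
    index = faiss.IndexFlatIP(embeddings.shape[1])
    kept: list[int] = []
    for i in range(len(embeddings)):
        query = embeddings[i : i + 1]
        if index.ntotal > 0:
            scores, _ = index.search(query, 1)  # most similar kept vector
            if scores[0, 0] >= threshold:
                continue  # near-duplicate of a kept question: drop it
        index.add(query)
        kept.append(i)
    return kept
```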