
CodeExercises: LeetCode Benchmark

Updated 9 December 2025
  • CodeExercises Dataset is a curated repository of 2,869 Python problems from LeetCode, enriched with detailed metadata and balanced difficulty tiers.
  • It uses a rigorous multi-stage pipeline to generate over 100 test cases per problem, ensuring robust and contamination-free evaluations.
  • The dataset supports both small-scale and large-scale supervised fine-tuning, achieving competitive pass rates and highlighting data efficiency in modern LLM benchmarks.

The CodeExercises dataset is a large-scale, high-quality collection of Python programming problems and solutions, accompanied by synthetic or human-generated metadata and extensive test-case coverage. Developed to benchmark and accelerate research on code generation models, especially LLMs, it is designed with a primary focus on robust evaluation and contamination-free supervised fine-tuning. CodeExercises, under the name "LeetCodeDataset," is curated specifically from LeetCode Python problems and addresses the methodological limitations of prior code-generation benchmarks in coverage, metadata, and training/test separation (Xia et al., 20 Apr 2025).

1. Dataset Construction and Scope

CodeExercises is built through a rigorous multi-stage pipeline starting with the extraction and validation of LeetCode's Python problems. The initial candidate pool comprised 3,115 problems (as of March 2025), from which 2,869 problems were retained after filtering for complete processability. Key steps include:

  • Metadata harvesting: Problem statements, starter code, difficulty tiers, problem IDs, slugs, and topic tags are extracted via the LeetCode GraphQL API.
  • Solution validation: Canonical solutions are obtained from vetted public repositories (doocs/leetcode, walkccc/LeetCode), then verified for 100% acceptance via the live LeetCode judge.
  • Test input identification: Problems not conforming to a tractable function entry point or not yielding ground-truth outputs in a sandboxed setting are excluded.
  • Difficulty and topic coverage: Distribution is balanced, comprising 686 (23.9%) Easy, 1,498 (52.2%) Medium, and 686 (23.9%) Hard problems, spread across more than twenty algorithmic and data-structure categories.

Annual addition of new LeetCode problems (≈350/year between 2020–2025) ensures contemporaneity and breadth (Xia et al., 20 Apr 2025).
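The filtering stage described above can be sketched as a simple predicate over candidate records. This is a minimal illustration, not the pipeline's actual implementation: the field names (`entry_point`, `judge_accepted`) are hypothetical stand-ins for whatever the real schema uses.

```python
# Hypothetical sketch of the filtering stage: problems lacking a tractable
# function entry point, or whose canonical solution was not fully accepted
# by the live LeetCode judge, are dropped from the candidate pool.

def filter_candidates(candidates):
    """Keep only problems that are fully processable."""
    kept = []
    for problem in candidates:
        if not problem.get("entry_point"):    # no tractable function to test
            continue
        if not problem.get("judge_accepted"): # solution not 100% accepted
            continue
        kept.append(problem)
    return kept

pool = [
    {"slug": "two-sum", "entry_point": "twoSum", "judge_accepted": True},
    {"slug": "design-skiplist", "entry_point": None, "judge_accepted": True},
]
print(len(filter_candidates(pool)))  # only problems passing both checks remain
```

In the real pipeline this corresponds to the reduction from 3,115 candidate problems to the 2,869 retained.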

2. Metadata Annotation and Test-Case Generation

Each instance in CodeExercises is richly annotated with both structural and functional metadata:

  • Fields covered: slug (string), question_id (int), difficulty (Easy/Medium/Hard), problem_description, starter_code, topic_tags (list), release_date, canonical_solution, and entry_point (the function under test).
  • Test harness: For every problem, a two-stage input generation pipeline is employed:
    1. An LLM proposes valid (often simple) inputs.
    2. Additional LLM prompts elicit complex and edge-case inputs.
  • Test suite scale: On average, each problem is associated with over 100 distinct test-case inputs, including stress and corner cases.
  • Output curation: All test inputs are executed in a secure Python sandbox (with data structures such as ListNode and TreeNode handled via helper classes), and the outputs are stored to facilitate post hoc automatic evaluation.

This high-density test suite reduces the incidence of false positives by requiring accurate handling of a large, challenging space of scenarios (Xia et al., 20 Apr 2025).
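The evaluation step this enables can be sketched as follows. This is an illustrative simplification: the record layout is hypothetical, and the real harness executes untrusted code inside an isolated sandbox rather than via a bare `exec` in-process.

```python
# Minimal sketch of post hoc evaluation: run a candidate solution on the
# stored test inputs and compare against the precomputed ground-truth outputs.
# (Real harnesses isolate execution; exec() here is for illustration only.)

def evaluate(candidate_source, entry_point, test_cases):
    """Return True iff the candidate matches every stored output."""
    namespace = {}
    exec(candidate_source, namespace)
    func = namespace[entry_point]
    return all(func(*inputs) == expected for inputs, expected in test_cases)

code = "def add(a, b):\n    return a + b\n"
cases = [((1, 2), 3), ((-1, 1), 0), ((10**6, 1), 1000001)]
print(evaluate(code, "add", cases))  # True
```

With more than 100 stored cases per problem, a candidate must match the canonical solution across the full input space, which is what suppresses false positives.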

3. Temporal Splitting: Contamination-Free Evaluation

To rigorously avoid any potential overlap between training and evaluation, CodeExercises employs a strict temporal split, based on the official LeetCode problem release date:

  • Training split: Problems released before July 1, 2024.
  • Test split: Problems released on or after July 1, 2024.

Letting $\mathrm{date}(p)$ denote the release date of problem $p$:

$$T_{\mathrm{train}} = \{\, p \mid \mathrm{date}(p) < \text{2024-07-01} \,\}$$

$$T_{\mathrm{test}} = \{\, p \mid \mathrm{date}(p) \ge \text{2024-07-01} \,\}$$
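The partition amounts to a one-line filter on the release date. A minimal sketch, assuming records expose an illustrative `release_date` field:

```python
from datetime import date

CUTOFF = date(2024, 7, 1)  # official temporal boundary

def temporal_split(problems):
    """Partition problems by release date; the cutoff day itself is test."""
    train = [p for p in problems if p["release_date"] < CUTOFF]
    test = [p for p in problems if p["release_date"] >= CUTOFF]
    return train, test

problems = [
    {"slug": "a", "release_date": date(2023, 5, 1)},
    {"slug": "b", "release_date": date(2024, 7, 1)},
]
train, test = temporal_split(problems)
print(len(train), len(test))  # 1 1
```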

This temporal barrier is empirically more robust than random or content-based splits, effectively mitigating training-test contamination from pretraining data harvested before mid-2024 (Xia et al., 20 Apr 2025).

4. Evaluation Framework and SFT Protocols

CodeExercises underpins both evaluation and supervised fine-tuning (SFT) via:

  • Automated evaluation harness: for each test problem, $k$ samples are generated at inference time and executed against the >100 ground-truth test cases in a sandboxed Python environment.
  • Metrics:
    • pass@1 and pass@5: the proportion of test problems solved by at least one of the $k$ generated samples, for $k = 1$ and $k = 5$ respectively.
    • Exact-match accuracy on canonical test outputs.
  • SFT regimes:
    • Small-scale SFT: 2.6K model-generated solutions filtered through test cases and hints.
    • Large-scale SFT: 110K human-authored samples (e.g., Magicoder-Evol-Instruct-110K).
  • Training parameters: Qwen2.5-Coder-7B as the base model; three epochs; learning rate $\eta_0 = 1 \times 10^{-5}$; batch size 32; cosine decay schedule; standard token-wise cross-entropy loss.
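The pass@k metric, as defined above, can be computed directly from per-sample verdicts. A minimal sketch (the input layout is illustrative):

```python
def pass_at_k(results, k):
    """results: one entry per problem, each a list of per-sample booleans
    (True = the sample passed all ground-truth test cases). A problem
    counts as solved if any of its first k samples succeeds."""
    solved = sum(any(samples[:k]) for samples in results)
    return solved / len(results)

results = [
    [True, False, False, False, False],   # solved on the first sample
    [False, False, True, False, False],   # solved only at sample 3
    [False, False, False, False, False],  # never solved
]
print(pass_at_k(results, 1))  # fraction solved by the first sample
print(pass_at_k(results, 5))  # fraction solved by any of five samples
```

Note that some benchmarks instead report the unbiased pass@k estimator over n > k samples; the simple "any of the first k" form shown here matches the definition given in this section.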

The evaluation toolkit is publicly available, including all scripts necessary for reproducibility (Xia et al., 20 Apr 2025).
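The cosine decay schedule named in the training parameters follows the standard form $\eta(t) = \tfrac{1}{2}\eta_0\,(1 + \cos(\pi t / T))$, decaying from $\eta_0$ to 0 over $T$ steps. A sketch (no warmup phase; whether the actual recipe uses one is not stated here):

```python
import math

def cosine_lr(step, total_steps, eta0=1e-5):
    """Standard cosine decay from eta0 down to 0 over total_steps."""
    return 0.5 * eta0 * (1.0 + math.cos(math.pi * step / total_steps))

print(cosine_lr(0, 100))    # starts at eta0 = 1e-5
print(cosine_lr(100, 100))  # decays to 0 at the final step
```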

5. Experimental Results and Data Efficiency Findings

Empirical studies conducted with CodeExercises demonstrate:

  • Superior data efficiency: SFT with as few as 2.6K model-generated problems achieves parity or improves upon SFT with 110K human samples—e.g., HumanEval: 79.9% vs. 77.4%; MBPP: 77.5% vs. 74.1%.
  • Benchmarking performance: On the post-July 2024 test split (256 problems), reasoning-enabled architectures achieve significantly higher pass@1 scores (e.g., DeepSeek-R1 at 65.23%) than non-reasoning models (e.g., GPT-4o at 35.55%). Gaps become pronounced for medium and hard problems (up to 41.86 points difference at the hard tier).
  • Robustness of split: Monthly pass rate plots confirm negligible overlap between training and test sets, validating the temporal-split methodology.

This suggests that a hybrid methodology combining broad pretraining with sample-efficient, temporally validated SFT is optimal for current LLMs (Xia et al., 20 Apr 2025).

6. Dataset Access, Practical Usage, and Extensions

The complete CodeExercises (LeetCodeDataset) is openly available under a permissive license and includes the source problems, metadata, starter code, canonical solutions, full test suites, and the temporal splits. The bundled sandbox also supports reinforcement learning and preference optimization, using the integrated test suite as a reward function.

7. Context, Limitations, and Comparative Position

CodeExercises addresses prior limitations of closed-domain or insufficiently-curated code datasets by providing:

  • Large scale (2,869 curated problems)
  • Dense test coverage (>100 test cases/problem)
  • Rich metadata spanning difficulty, topics, and historical context
  • Strict contamination controls via temporal splits

Limitations acknowledged include the exclusion of problems without tractable function entry points, and that SFT using smaller data volumes may not fully address extremely hard or highly out-of-distribution tasks.

Relative to earlier datasets, CodeExercises advances the standard by enabling high-fidelity, temporally robust benchmarks and data-efficient LLM training for code (Xia et al., 20 Apr 2025).
