CodeContests Dataset Overview
- CodeContests is a large-scale dataset of competitive programming problems featuring thousands of authentic tasks and millions of real-world submissions from platforms like Codeforces and AtCoder.
- The dataset includes multiple variants—CodeContests, CodeContests+, CodeContests-O, and ConPlag—each employing distinct methodologies such as mutation-based tests, LLM-driven generation, and closed-loop refinement.
- Researchers use CodeContests for benchmarking code generation, test-case synthesis, plagiarism detection, and solution explanation, driving advances in program synthesis and automated evaluation.
CodeContests is a family of large-scale datasets focused on competitive programming tasks, widely used for benchmarking, training, and analysis of code generation, program synthesis, plagiarism detection, and robust test-case evaluation. The dataset—and its various successors—aggregate thousands of authentic competition problems and millions of real-world submissions primarily from Codeforces, AtCoder, and similar online judges. These datasets are specifically constructed to challenge automated solver models, particularly LLMs, and to facilitate research in test-case generation, solution similarity, solution explanation, and contest-specific code analysis.
1. Dataset Composition and Variants
The original CodeContests corpus, as cited in multiple works, consists of 13,610 competitive-programming problems sourced mainly from Codeforces, AtCoder, and related platforms. Each problem is accompanied by:
- The full natural-language statement (multi-paragraph descriptions, explicit input/output sections, constraints, sample I/O).
- Public (“sample”) test cases and a larger pool of private (hidden) test cases.
- Extensive pools of correct and incorrect contestant submissions (solutions).
The initial dataset has been used directly or further refined in subsequent derivatives:
| Variant | # Problems | Description | Notable Features | Source / Reference |
|---|---|---|---|---|
| CodeContests | 13,610 | Original mutation-based test suites, crowd solutions | Mutation-based test generation, scale | (Wang et al., 6 Jun 2025, Cai et al., 20 Jan 2026) |
| CodeContests+ | 11,690 | LLM-generated Generator/Validator test cases | High-quality, constraint-valid test suites | (Wang et al., 6 Jun 2025, Xu et al., 7 Jan 2026) |
| CodeContests-O | 13,610 | Feedback-driven, iterative test-case refinement | Highest TPR/TNR, closed-loop synthesis | (Cai et al., 20 Jan 2026) |
| ConPlag | 21 | Task-specific contest plagiarism dataset (Java) | 911 labeled code pairs (raw and template-free) | (Slobodkin et al., 2023) |
This table summarizes the main CodeContests family datasets and their core properties.
2. Problem and Submission Structure
In all major variants, CodeContests problems adhere to the canonical competitive-programming specification:
- Statement format: Multi-paragraph problem description, explicit Input section, Output section, formal Constraints, and Sample Input/Output blocks. Structure is preserved in both human and synthetic data generations (Kuznia et al., 2022).
- Programming languages: Candidate submissions are available in C++, Python, Java. Datasets such as CodeContests+ are multilingual; most code-generation benchmarks target C++ or Python (Ridnik et al., 2024, Xu et al., 7 Jan 2026).
- Submissions: Large pools of correct and incorrect solutions (over 13 million submissions in CodeContests+), labeled by the original contest judge's outcome (Xu et al., 7 Jan 2026).
3. Test Case Generation and Evaluation Methodologies
Baseline Mutation Methods
The original dataset relied on mutation-based strategies to expand contest test suites: perturbing observed cases to synthetically grow coverage. However, only approximately 67.1% of mutated inputs adhered to published constraints, resulting in misclassifications (e.g., high rates of both false positives and false negatives) (Wang et al., 6 Jun 2025).
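The gap between mutation and constraint adherence can be illustrated with a minimal sketch. The bounds, seed input, and perturbation scheme below are toy assumptions, not the dataset's actual mutation operators; the point is that nothing in the mutation step itself enforces the published constraints, so validity must be checked after the fact:

```python
import random

def mutate(tokens):
    """Perturb one numeric token of a seed input. Nothing here forces
    the mutant to respect the problem's published constraints."""
    i = random.randrange(len(tokens))
    out = tokens[:]
    out[i] = out[i] + random.choice([-1, 1]) * random.randint(1, 10)
    return out

def satisfies_constraints(tokens, lo=1, hi=100):
    """Post-hoc constraint check (toy bound: 1 <= x <= 100)."""
    return all(lo <= t <= hi for t in tokens)

random.seed(0)
seed_case = [1, 50, 100]   # hypothetical observed test input near the bounds
mutants = [mutate(seed_case) for _ in range(1000)]
valid_fraction = sum(satisfies_constraints(m) for m in mutants) / len(mutants)
```

Because the seed values sit at the boundary, a sizable fraction of mutants falls outside the valid range, mirroring the ~67% validity rate reported above.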
LLM-Based Generator-Validator (CodeContests+)
CodeContests+ introduced an LLM-driven test-case synthesis architecture using paired generator and validator agents:
- The generator parses and formalizes all numeric/value constraints, designs adversarial (corner, boundary, stress) scenarios, and emits testlib-based C++ generators.
- The validator independently encodes constraint validation, rejecting any generated inputs that violate problem specs.
- An iterative loop ensures that all test cases in the suite are “verified correct,” i.e., all generated inputs pass validation (Wang et al., 6 Jun 2025).
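The generator-validator interplay can be sketched as a retry loop over two callables. Here `generate` and `validate` are stand-ins for the LLM-emitted testlib programs; the bounds and the leaky generator are toy assumptions used only to exercise the loop:

```python
import random

def gv_synthesize(generate, validate, n_cases, max_rounds=50):
    """Sketch of the generator-validator loop: keep only inputs the
    validator accepts and resample until the suite is fully verified."""
    suite = []
    for _ in range(max_rounds):
        candidates = [generate() for _ in range(n_cases - len(suite))]
        suite += [x for x in candidates if validate(x)]
        if len(suite) == n_cases:
            return suite          # every case verified correct
        # a real pipeline would feed rejection diagnostics back to the
        # LLM to repair the generator; here we simply resample
    return suite

rng = random.Random(0)
generator = lambda: rng.randint(-10, 110)   # deliberately leaky bounds
validator = lambda x: 1 <= x <= 100         # independently encodes the constraint
suite = gv_synthesize(generator, validator, n_cases=20)
```

The key design point mirrored here is independence: the validator re-derives the constraint from the problem statement rather than trusting the generator, so a buggy generator cannot contaminate the final suite.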
Feedback-Driven Iterative Refinement (CodeContests-O)
CodeContests-O formalizes test case improvement as a closed feedback loop:
- An LLM generates initial generators and parameter lists.
- Inputs are executed against the full pool of correct/incorrect submissions, and failures (false positives, false negatives) are distilled into diagnostic feedback.
- The LLM incorporates this feedback to revise the generator or commands, repeating until both True Positive Rate (TPR) and True Negative Rate (TNR) meet thresholds (Cai et al., 20 Jan 2026).
Empirical results indicate that CodeContests-O test suites (iteration 3) achieve TPR = 89.37% and TNR = 90.89%, outperforming both CodeContests and CodeContests+ by 4–9% in discriminating faulty solutions (Cai et al., 20 Jan 2026).
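The TPR/TNR objective that drives the feedback loop can be sketched directly. The toy task, solutions, and judge function below are illustrative assumptions; the structure shows why a suite without stress cases inflates false positives:

```python
def tpr_tnr(suite, correct_pool, incorrect_pool, run):
    """TPR: fraction of known-correct submissions passing every test.
    TNR: fraction of known-incorrect submissions rejected by at least
    one test. `run(solution, case)` abstracts the contest judge."""
    passes_all = lambda sol: all(run(sol, case) for case in suite)
    tpr = sum(passes_all(s) for s in correct_pool) / len(correct_pool)
    tnr = sum(not passes_all(s) for s in incorrect_pool) / len(incorrect_pool)
    return tpr, tnr

# Toy task: output 2*n. The buggy solution only fails on large n,
# so the suite needs a stress case to reject it.
oracle = lambda n: 2 * n
good = lambda n: n + n
buggy = lambda n: 2 * n if n < 1000 else 2 * n + 1
run = lambda sol, case: sol(case) == oracle(case)

tpr, tnr = tpr_tnr([1, 7, 10**6], [good], [buggy], run)
_, tnr_weak = tpr_tnr([1, 7], [good], [buggy], run)  # buggy slips through
```

In the closed loop described above, a low `tnr_weak`-style outcome would be distilled into feedback ("add stress inputs with large n") and fed back to the generator-producing LLM.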
4. Data Splits, Metadata, and Schema
CodeContests and its variants offer splits for benchmarking and training:
- Train/validation/test splits: e.g., 10,000 train, 107 validation, and 165 test problems (AlphaCodium/CodeContests); subsets are selected to avoid overlap and to ensure recency for LLM benchmarking (Ridnik et al., 2024, Li et al., 2023).
- Schema: Each problem entry typically features:
- Unique identifier
- Statement text (“description”)
- Input/Output specifications and constraints
- Sets of public, private, and (for some versions) LLM-generated test cases
- Solution pools (segregated by correctness, language)
- For some extensions, formal grammars (CCFGs) or explanation annotations (Sung et al., 21 May 2025, Li et al., 2023).
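The schema above can be made concrete as a set of record types. Field names here mirror the bullet list but are illustrative, not the dataset's exact column names:

```python
from dataclasses import dataclass, field

@dataclass
class TestCase:
    input: str
    output: str

@dataclass
class Submission:
    language: str
    code: str
    correct: bool   # original judge outcome

@dataclass
class Problem:
    problem_id: str
    description: str
    public_tests: list
    private_tests: list
    generated_tests: list = field(default_factory=list)  # LLM-made suites
    solutions: list = field(default_factory=list)        # pooled submissions

p = Problem("cf_1234_A", "Sum the given integers...",
            public_tests=[TestCase("3\n1 2 3\n", "6\n")],
            private_tests=[TestCase("5\n1 1 1 1 1\n", "5\n")])
```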
5. Benchmarking, Metrics, and Empirical Outcomes
Primary Evaluation Metrics
- pass@k: Fraction of problems where at least one solution in k samples passes all private tests (Ridnik et al., 2024, Li et al., 2023).
- Strict accuracy: For a single generation, proportion of problems passing all tests (Kuznia et al., 2022).
- TPR/TNR: Ability of test suites to accept correct and reject incorrect submissions, respectively (Wang et al., 6 Jun 2025, Cai et al., 20 Jan 2026).
- Validity/effectiveness: Fraction of generated test cases that conform to the problem's constraints (validity) or that expose flaws in candidate solutions (effectiveness), used in test-generation studies (Sung et al., 21 May 2025).
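For reference, pass@k is typically computed with the standard unbiased estimator over n generated samples of which c pass all tests, rather than by literally drawing k samples:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations (c correct), passes.
    Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0   # too few failures to fill a k-sample with
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Strict accuracy is the degenerate single-sample case: with one generation per problem, it is simply the fraction of problems whose generation passes all tests.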
Benchmark Comparisons
AlphaCodium (tested on CodeContests) demonstrated that multi-stage, test-based LLM workflows substantially increase pass@5 scores: GPT-4 increased from 19% (direct prompt) to 44% (AlphaCodium flow) on validation, with similar effects across models (Ridnik et al., 2024). RL-trained models on CodeContests+ and CodeContests-O yield higher success rates on LiveCodeBench (Pass@1 up to 34.57%) compared to pretraining-only baselines (Cai et al., 20 Jan 2026).
6. Notable Subsets and Use Cases
CodeContests+ for Code Similarity Research
CSSG's experiments sample problem-language triplets (e.g., two accepted and one rejected solution per problem/language), providing testbeds for discriminative code similarity models under both monolingual and cross-lingual evaluation regimes (Xu et al., 7 Jan 2026).
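The triplet-sampling setup can be sketched as follows; the record layout and selection logic are illustrative of the described protocol (two accepted solutions as anchor/positive, one rejected as negative), not CSSG's exact implementation:

```python
import random

def sample_triplets(problems, language, rng):
    """For each problem, sample two accepted and one rejected solution
    in the given language; skip problems without enough submissions."""
    triplets = []
    for prob in problems:
        acc = [s["code"] for s in prob["solutions"]
               if s["language"] == language and s["correct"]]
        rej = [s["code"] for s in prob["solutions"]
               if s["language"] == language and not s["correct"]]
        if len(acc) >= 2 and rej:
            anchor, positive = rng.sample(acc, 2)
            triplets.append((anchor, positive, rng.choice(rej)))
    return triplets

# Toy problem pool with hypothetical submission records.
problems = [{"solutions": [
    {"language": "python", "code": "a1", "correct": True},
    {"language": "python", "code": "a2", "correct": True},
    {"language": "python", "code": "w1", "correct": False},
    {"language": "cpp",    "code": "c1", "correct": True},
]}]
triplets = sample_triplets(problems, "python", random.Random(0))
```

Swapping the `language` filter for a pair of languages yields the cross-lingual variant of the same testbed.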
Program Synthesis with Summarization
Studies such as “Less is More” leverage CodeContests as a source for program synthesis on long-form, competition-style specifications, demonstrating that summarized problem statements significantly boost strict accuracy for Codex (from 12.50% to 25.00%) (Kuznia et al., 2022).
Solution Explanation and Stepwise LLM Annotation
Using the CodeContests test split, solution explanations are automatically generated following a seven-step schema (problem summary, algorithm, solution steps, explanation, one-sentence summary, complexity, proof), with expert ratings and measurable gains in solve@10 when the explanations serve as LLM prompts (Li et al., 2023).
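The seven explanation fields can be laid out as an ordered prompt template; the labels below paraphrase the schema named above and the prompt wording is a hypothetical sketch:

```python
# Paraphrased labels for the seven-step explanation schema.
EXPLANATION_STEPS = [
    "problem_summary",
    "algorithm",
    "solution_steps",
    "explanation",
    "one_sentence_summary",
    "complexity",
    "proof",
]

def build_explanation_prompt(problem_statement, solution_code):
    """Assemble an annotation prompt asking an LLM to fill each section."""
    sections = "\n".join(f"- {s}" for s in EXPLANATION_STEPS)
    return (f"Problem:\n{problem_statement}\n\n"
            f"Solution:\n{solution_code}\n\n"
            f"Explain the solution, filling each section:\n{sections}")
</imports>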
Plagiarism Detection: ConPlag Subset
ConPlag offers contest-specific plagiarism data for 21 Codeforces tasks, with 911 labeled Java code pairs each in raw and template-free form. Removal of fast I/O and algorithmic templates is vital for accurate benchmarking; results show token-based detectors (e.g., JPlag, MOSS) outperform others, especially on template-free data (Slobodkin et al., 2023).
7. Limitations and Future Prospects
While CodeContests variants have continuously improved test suite coverage, discriminability, and multilingual support, certain deficiencies persist:
- Mutation-based or incomplete constraint extraction can permit invalid tests or insufficient adversarial cases (Wang et al., 6 Jun 2025).
- Annotated subsets (e.g., with CCFGs for test-case generation (Sung et al., 21 May 2025)) filter out problems with complex or ambiguous specifications, biasing analyses toward more tractable instances.
- Validator blindness and reliance on C++/testlib limit immediate extension to other languages and interactive/graphical contest problems (Wang et al., 6 Jun 2025).
Recent works aim to automate constraint validation, enrich corner-case diversity, and scale LLM-driven generator-validator frameworks to vastly larger corpora (100,000+ public problems) (Wang et al., 6 Jun 2025). Closed-loop, feedback-driven generation (CodeContests-O) represents a direction in which test-case fidelity and discriminative utility can be optimized in tandem with large-scale applied LLM training (Cai et al., 20 Jan 2026).