
CodeContests Dataset Overview

Updated 18 February 2026
  • CodeContests is a large-scale dataset of competitive programming problems featuring thousands of authentic tasks and millions of real-world submissions from platforms like Codeforces and AtCoder.
  • The dataset includes multiple variants—CodeContests, CodeContests+, CodeContests-O, and ConPlag—each employing distinct methodologies such as mutation-based tests, LLM-driven generation, and closed-loop refinement.
  • Researchers use CodeContests for benchmarking code generation, test-case synthesis, plagiarism detection, and solution explanation, driving advances in program synthesis and automated evaluation.

CodeContests is a family of large-scale datasets focused on competitive programming tasks, widely used for benchmarking, training, and analysis of code generation, program synthesis, plagiarism detection, and robust test-case evaluation. The dataset—and its various successors—aggregate thousands of authentic competition problems and millions of real-world submissions primarily from Codeforces, AtCoder, and similar online judges. These datasets are specifically constructed to challenge automated solver models, particularly LLMs, and to facilitate research in test-case generation, solution similarity, solution explanation, and contest-specific code analysis.

1. Dataset Composition and Dataset Variants

The original CodeContests corpus, as cited in multiple works, consists of 13,610 competitive-programming problems sourced mainly from Codeforces, AtCoder, and related platforms. Each problem is accompanied by:

  • The full natural-language statement (multi-paragraph descriptions, explicit input/output sections, constraints, sample I/O).
  • Public (“sample”) test cases and a larger pool of private (hidden) test cases.
  • Extensive pools of correct and incorrect contestant submissions (solutions).

The initial dataset has been used directly or further refined in subsequent derivatives:

| Variant | # Problems | Description | Notable Features | Source / Reference |
| --- | --- | --- | --- | --- |
| CodeContests | 13,610 | Original corpus: mutation-based test suites, crowd-sourced solutions | Mutation-based test generation, scale | (Wang et al., 6 Jun 2025; Cai et al., 20 Jan 2026) |
| CodeContests+ | 11,690 | LLM-generated Generator/Validator test cases | High-quality, constraint-valid test suites | (Wang et al., 6 Jun 2025; Xu et al., 7 Jan 2026) |
| CodeContests-O | 13,610 | Feedback-driven, iterative test-case refinement | Highest TPR/TNR, closed-loop synthesis | (Cai et al., 20 Jan 2026) |
| ConPlag | 21 | Contest-specific plagiarism dataset (Java) | 911 labeled code pairs (raw and template-free) | (Slobodkin et al., 2023) |

This table summarizes the main CodeContests family datasets and their core properties.

2. Problem and Submission Structure

In all major variants, CodeContests problems adhere to the canonical competitive-programming specification:

  • Statement format: Multi-paragraph problem description, explicit Input section, Output section, formal Constraints, and Sample Input/Output blocks. Structure is preserved in both human and synthetic data generations (Kuznia et al., 2022).
  • Programming languages: Candidate submissions are available in C++, Python, and Java. Datasets such as CodeContests+ are multilingual; most code-generation benchmarks target C++ or Python (Ridnik et al., 2024, Xu et al., 7 Jan 2026).
  • Submissions: Large pools of correct and incorrect solutions (over 13 million submissions in CodeContests+), labeled by the original contest judge outcome (Xu et al., 7 Jan 2026).

3. Test Case Generation and Evaluation Methodologies

Baseline Mutation Methods

The original dataset relied on mutation-based strategies to expand contest test suites: perturbing observed test cases to synthetically grow coverage. However, only approximately 67.1% of mutated inputs adhered to the published constraints; the remaining invalid inputs misclassify solutions, producing both false positives and false negatives (Wang et al., 6 Jun 2025).
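The failure mode above can be illustrated with a minimal sketch of mutation-based input generation. The mutation operator, seed case, and constraint bounds are all invented for illustration; the point is that blind numeric perturbation regularly drifts outside published constraints unless a validity check filters the results:

```python
import random

def mutate_input(tokens, rng):
    """Perturb one numeric token of an existing test case (a common
    mutation strategy); the result may drift outside the constraints."""
    out = list(tokens)
    i = rng.randrange(len(out))
    out[i] += rng.choice([-1, 1]) * rng.randint(1, 10)
    return out

def satisfies_constraints(tokens, lo=1, hi=100):
    # Hypothetical published constraint: every value must lie in [lo, hi].
    return all(lo <= t <= hi for t in tokens)

# Grow a suite by mutation, counting how many mutants are invalid.
rng = random.Random(0)
seed_case = [5, 17, 99]
suite, rejected = [], 0
for _ in range(1000):
    cand = mutate_input(seed_case, rng)
    if satisfies_constraints(cand):
        suite.append(cand)
    else:
        rejected += 1
```

Without the `satisfies_constraints` filter, every rejected mutant would instead enter the suite as an out-of-spec input, which is precisely the source of the misclassifications reported for the original corpus.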

LLM-Based Generator-Validator (CodeContests+)

CodeContests+ introduced an LLM-driven test-case synthesis architecture using paired generator and validator agents:

  • The generator parses and formalizes all numeric/value constraints, designs adversarial (corner, boundary, stress) scenarios, and emits testlib-based C++ generators.
  • The validator independently encodes constraint validation, rejecting any generated inputs that violate problem specs.
  • An iterative loop ensures that all test cases in the suite are “verified correct”, i.e., all generated inputs pass validation (Wang et al., 6 Jun 2025).
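The generator-validator architecture can be sketched as follows. The `generator` and `validator` functions here are Python stand-ins for the LLM-emitted testlib C++ programs, and the toy constraint (a list of 1-100 values with bounded length) is invented for illustration:

```python
import random

def generator(params, rng):
    """Stand-in for an LLM-emitted generator: may occasionally
    overshoot the length constraint, as real generators can."""
    n = rng.randint(params["n_min"], params["n_max"] + 5)
    return [rng.randint(1, 100) for _ in range(n)]

def validator(case, params):
    """Independent constraint check, mirroring a testlib validator."""
    return (params["n_min"] <= len(case) <= params["n_max"]
            and all(1 <= v <= 100 for v in case))

def build_suite(params, size, max_tries=10_000, seed=0):
    """Iterate until `size` verified-correct cases are collected:
    every case violating the spec is rejected by the validator."""
    rng = random.Random(seed)
    suite, tries = [], 0
    while len(suite) < size and tries < max_tries:
        tries += 1
        case = generator(params, rng)
        if validator(case, params):
            suite.append(case)
    return suite

suite = build_suite({"n_min": 1, "n_max": 10}, size=20)
```

The key design point is independence: the validator encodes the constraints separately from the generator, so a buggy or over-eager generator cannot contaminate the final suite.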

Feedback-Driven Iterative Refinement (CodeContests-O)

CodeContests-O formalizes test case improvement as a closed feedback loop:

  • An LLM generates initial generators and parameter lists.
  • Inputs are executed against the full pool of correct/incorrect submissions, and failures (false positives, false negatives) are distilled into diagnostic feedback.
  • The LLM incorporates this feedback to revise the generator or commands, repeating until both True Positive Rate (TPR) and True Negative Rate (TNR) meet thresholds (Cai et al., 20 Jan 2026).
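The closed loop above can be sketched with toy solutions standing in for the submission pool. The task, solutions, and candidate inputs are all invented; a "positive" here is a correct solution the suite accepts, a "negative" an incorrect solution it rejects:

```python
def reference(x):
    """Toy reference task: sum a list of integers."""
    return sum(x)

def passes(solution, suite):
    """A solution passes if it matches the reference on every input."""
    return all(solution(x) == reference(x) for x in suite)

correct = [lambda x: sum(x), lambda x: sum(sorted(x))]
incorrect = [lambda x: max(x, default=0),   # wrong on multi-element inputs
             lambda x: sum(x[:-1])]          # drops the last element

def rates(suite):
    tpr = sum(passes(s, suite) for s in correct) / len(correct)
    tnr = sum(not passes(s, suite) for s in incorrect) / len(incorrect)
    return tpr, tnr

# Iterative refinement: grow the suite until both rates hit the target.
suite = [[1]]                        # weak initial suite: max([1]) == sum([1])
pool = [[2, 3], [5, 5, 5], [7]]      # inputs a revised generator might emit
for cand in pool:
    tpr, tnr = rates(suite)
    if tpr >= 1.0 and tnr >= 1.0:
        break
    suite.append(cand)               # "feedback": add a discriminating input
```

In the full pipeline the feedback is richer (diagnostic summaries of which submissions were misjudged, fed back to the LLM to rewrite the generator), but the stopping criterion is the same: iterate until TPR and TNR clear their thresholds.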

Empirical results indicate that CodeContests-O test suites (iteration 3) achieve TPR = 89.37% and TNR = 90.89%, outperforming both CodeContests and CodeContests+ by 4–9% in discriminating faulty solutions (Cai et al., 20 Jan 2026).

4. Data Splits, Metadata, and Schema

CodeContests and its variants offer splits for benchmarking and training:

  • Train/validation/test: e.g., 10,000 train, 107 validation, 165 test (AlphaCodium/CodeContests); subsets are selected to avoid overlap and ensure recency for LLM benchmarking (Ridnik et al., 2024, Li et al., 2023).
  • Schema: Each problem entry typically features:
    • Unique identifier
    • Statement text (“description”)
    • Input/Output specifications and constraints
    • Sets of public, private, and (for some versions) LLM-generated test cases
    • Solution pools (segregated by correctness, language)
    • For some extensions, formal grammars (CCFGs) or explanation annotations (Sung et al., 21 May 2025, Li et al., 2023).
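The schema above might be modeled as follows. The field names are illustrative and are not the dataset's official column names:

```python
from dataclasses import dataclass, field

@dataclass
class Problem:
    """Illustrative record shape for a CodeContests-style entry."""
    problem_id: str          # unique identifier
    description: str         # full natural-language statement
    constraints: str         # formal constraints text
    public_tests: list       # (input, expected_output) pairs shown to contestants
    private_tests: list      # hidden judge tests
    generated_tests: list = field(default_factory=list)       # LLM-synthesized cases
    correct_solutions: dict = field(default_factory=dict)     # language -> [source]
    incorrect_solutions: dict = field(default_factory=dict)   # language -> [source]

p = Problem(
    problem_id="cf_1234_A",
    description="Sum two integers a and b.",
    constraints="1 <= a, b <= 10^9",
    public_tests=[("1 2", "3")],
    private_tests=[("5 7", "12")],
)
```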

5. Benchmarking, Metrics, and Empirical Outcomes

Primary Evaluation Metrics

Test suites are scored by their ability to discriminate solutions: True Positive Rate (TPR, the fraction of correct solutions a suite accepts) and True Negative Rate (TNR, the fraction of incorrect solutions it rejects). Solver models are scored by pass@k (e.g., pass@1, pass@5), solve@10, or strict accuracy, depending on the study.
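The pass@k figures reported in this section are conventionally computed with the standard unbiased estimator used in code-generation benchmarking, from n sampled solutions of which c pass (the function name and usage below are illustrative):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of
    k samples, drawn without replacement from n generations of which
    c are correct, passes all tests."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g., 2 generations, 1 correct: pass@1 is 0.5
rate = pass_at_k(2, 1, 1)
```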

Benchmark Comparisons

AlphaCodium (tested on CodeContests) demonstrated that multi-stage, test-based LLM workflows substantially increase pass@5 scores: GPT-4 increased from 19% (direct prompt) to 44% (AlphaCodium flow) on validation, with similar effects across models (Ridnik et al., 2024). RL-trained models on CodeContests+ and CodeContests-O yield higher success rates on LiveCodeBench (Pass@1 up to 34.57%) compared to pretraining-only baselines (Cai et al., 20 Jan 2026).

6. Notable Subsets and Use Cases

CodeContests+ for Code Similarity Research

CSSG's experiments sample problem-language triplets (e.g., two accepted and one rejected solution per problem/language), providing testbeds for discriminative code similarity models under both monolingual and cross-lingual evaluation regimes (Xu et al., 7 Jan 2026).
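Such triplet sampling can be sketched as follows; the pool layout and keys are hypothetical, not CSSG's actual data format:

```python
import random

def sample_triplets(pool, rng):
    """pool: {(problem_id, language): {"accepted": [...], "rejected": [...]}}
    Yield (key, accepted_a, accepted_b, rejected) where a triplet exists:
    two distinct accepted solutions plus one rejected one."""
    for key, subs in sorted(pool.items()):
        if len(subs["accepted"]) >= 2 and subs["rejected"]:
            a, b = rng.sample(subs["accepted"], 2)
            neg = rng.choice(subs["rejected"])
            yield key, a, b, neg

pool = {
    ("p1", "python"): {"accepted": ["s1", "s2", "s3"], "rejected": ["bad1"]},
    ("p1", "cpp"):    {"accepted": ["t1"], "rejected": ["bad2"]},  # skipped: <2 accepted
}
triplets = list(sample_triplets(pool, random.Random(0)))
```

Keying by (problem, language) supports the monolingual regime directly; cross-lingual evaluation instead pairs solutions across different language keys for the same problem.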

Program Synthesis with Summarization

Studies such as “Less is More” leverage CodeContests as a source for program synthesis on long-form, competition-style specifications, demonstrating that summarized problem statements significantly boost strict accuracy for Codex (from 12.50% to 25.00%) (Kuznia et al., 2022).

Solution Explanation and Stepwise LLM Annotation

Using the CodeContests test split, solution explanations are automatically generated in seven-step schemas (problem summary, algorithm, steps, explanation, one-sentence, complexity, proof), with expert ratings and measurable gains in solve@10 when explanations serve as LLM prompts (Li et al., 2023).
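The seven-step schema can be rendered as a stepwise prompt skeleton; the step identifiers and prompt wording below are illustrative, not those used by Li et al.:

```python
# The seven explanation steps named in the text, as illustrative identifiers.
EXPLANATION_STEPS = [
    "problem summary", "algorithm", "solution steps",
    "detailed explanation", "one-sentence summary",
    "complexity analysis", "proof of correctness",
]

def build_prompt(problem_statement: str, solution_code: str) -> str:
    """Assemble a stepwise annotation prompt (hypothetical wording)."""
    sections = "\n".join(f"{i + 1}. {step}:" for i, step in enumerate(EXPLANATION_STEPS))
    return (f"Problem:\n{problem_statement}\n\n"
            f"Solution:\n{solution_code}\n\n"
            f"Explain the solution in these steps:\n{sections}")

prompt = build_prompt("Sum two integers a and b.", "print(a + b)")
```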

Plagiarism Detection: ConPlag Subset

ConPlag offers contest-specific plagiarism data for 21 Codeforces tasks, with 911 labeled Java code pairs each in raw and template-free form. Removal of fast I/O and algorithmic templates is vital for accurate benchmarking; results show token-based detectors (e.g., JPlag, MOSS) outperform others, especially on template-free data (Slobodkin et al., 2023).
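A greatly simplified sketch of the token-based approach with template stripping follows; the boilerplate markers and the k-gram Jaccard measure are toy stand-ins for the fingerprinting actually used by JPlag or MOSS:

```python
import re

# Hypothetical markers of Java fast-I/O boilerplate shared across contestants.
TEMPLATE_MARKERS = ("BufferedReader", "StringTokenizer", "PrintWriter")

def strip_template(source: str) -> str:
    """Drop lines containing generic contest boilerplate (toy heuristic);
    ConPlag's template-free variant motivates this preprocessing step."""
    return "\n".join(line for line in source.splitlines()
                     if not any(m in line for m in TEMPLATE_MARKERS))

def kgrams(source: str, k: int = 4) -> set:
    """Token k-grams: identifiers/keywords or single punctuation chars."""
    toks = re.findall(r"[A-Za-z_]\w*|\S", source)
    return {tuple(toks[i:i + k]) for i in range(len(toks) - k + 1)}

def similarity(a: str, b: str, k: int = 4) -> float:
    """Jaccard similarity over token k-grams: high overlap between two
    submissions (after template removal) flags a candidate pair."""
    ga, gb = kgrams(a, k), kgrams(b, k)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0
```

Stripping shared templates first matters because fast-I/O boilerplate inflates k-gram overlap between unrelated submissions, which is the effect the template-free ConPlag variant is designed to control for.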

7. Limitations and Future Prospects

While CodeContests variants have continuously improved test suite coverage, discriminability, and multilingual support, certain deficiencies persist:

  • Mutation-based or incomplete constraint extraction can permit invalid tests or insufficient adversarial cases (Wang et al., 6 Jun 2025).
  • Annotated subsets (e.g., with CCFGs for test-case generation (Sung et al., 21 May 2025)) filter out problems with complex or ambiguous specifications, biasing analyses toward more tractable instances.
  • Validator blindness and reliance on C++/testlib limit immediate extension to other languages and interactive/graphical contest problems (Wang et al., 6 Jun 2025).

Recent works aim to automate constraint validation, enrich corner-case diversity, and scale LLM-driven generator-validator frameworks to vastly larger corpora (100,000+ public problems) (Wang et al., 6 Jun 2025). Closed-loop, feedback-driven generation (CodeContests-O) represents a direction in which test-case fidelity and discriminative utility can be optimized in tandem with large-scale applied LLM training (Cai et al., 20 Jan 2026).
