CoDiQ-Corpus: Controlled-Difficulty Q&A Dataset
- CoDiQ-Corpus is a large-scale dataset of over 44K math and coding question sequences with controllable difficulty progression and rigorous solvability checks.
- The dataset employs a two-step generation methodology, using test-time token scaling and human evaluation to ensure both increasing complexity and practical solvability.
- Benchmark comparisons demonstrate that training models with CoDiQ leads to significant performance gains, underscoring its value for reasoning-model curricula.
CoDiQ-Corpus is a large-scale, open-source dataset of competition-grade question sequences spanning mathematical and coding domains, generated using the CoDiQ (Controllable Difficult Question Generation) framework. The corpus addresses the need for high-difficulty, precisely calibrated, and reliably solvable questions to train and evaluate Large Reasoning Models (LRMs). CoDiQ-Corpus, built via test-time scaling with explicit difficulty progression and human-verified solvability, enables curriculum learning and robust benchmarking of advanced reasoning capabilities in machine learning systems (Peng et al., 2 Feb 2026).
1. Dataset Composition and Structure
CoDiQ-Corpus comprises 44,453 question sequences, systematically split between mathematics and programming domains. Each sequence is generated from one of eight established seed corpora, incorporating content diversity and broad topical coverage:
| Seed Corpus | Domain | Sequence Count |
|---|---|---|
| Math12K | Math | 11,764 |
| GSM8K | Math | 8,685 |
| SVAMP | Math | 804 |
| ASDiv | Math | 1,480 |
| CodeAlpaca20K | Code | 17,845 |
| LeetCodeDataset | Code | 2,027 |
| MBPP | Code | 876 |
| DS-1000 | Code | 972 |
Each record in the corpus adheres to a standardized structure:
"seed": The initial question ."upgrades": A list of iteratively generated, more difficult versions ()."difficulty_scores": Monotonically increasing values with , as determined by the framework's scoring function (see Section 3)."solvability": Boolean flag; always true for retained instances.
Sequences average approximately 4.2 rounds, and each upgrade is guaranteed to be strictly more difficult than its predecessor while retaining solvability. Sequence length is capped by a maximum number of upgrade rounds or by a token limit (approximately 12,590 tokens).
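Based on the schema above, a record can be checked with a short sketch (the field names come from the documented structure; the monotonicity check mirrors the guarantee that each upgrade is strictly harder than its predecessor):

```python
import json

def validate_record(line: str) -> bool:
    """Check one CoDiQ-Corpus JSONL record: required fields present,
    difficulty scores strictly increasing, one score per upgrade,
    and solvability retained."""
    rec = json.loads(line)
    scores = rec["difficulty_scores"]
    has_fields = {"seed", "upgrades", "difficulty_scores"} <= rec.keys()
    monotonic = all(a < b for a, b in zip(scores, scores[1:]))
    same_length = len(rec["upgrades"]) == len(scores)
    return has_fields and monotonic and same_length and rec.get("solvability", True)

# Illustrative record, not taken from the corpus itself.
record = json.dumps({
    "seed": "Compute 2+2.",
    "upgrades": ["Compute 2+2+2.", "Compute the sum of fifty copies of 2."],
    "difficulty_scores": [0.31, 0.58],
    "solvability": True,
})
print(validate_record(record))  # True
```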
2. Generation Methodology
CoDiQ employs a two-step process to ensure both granularity of difficulty and rigorous solvability:
Test-Time Scaling:
By allocating progressively larger reasoning-token budgets at inference, the system increases question complexity and required solution depth. This scaling introduces a trade-off: as the token budget grows, difficulty rises but solvability may decline. Each generated question is assigned a normalized difficulty score and grouped into a difficulty cluster using LLMs-Ranking or the ValueNetwork.
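The grouping step can be illustrated with a minimal sketch; equal-width bucketing over the normalized score is an assumption here, since the source performs this grouping with LLMs-Ranking or the ValueNetwork:

```python
def difficulty_cluster(score: float, n_clusters: int = 3) -> int:
    """Map a normalized [0, 1] difficulty score to a cluster index.
    Equal-width bucketing is an illustrative assumption; the source
    groups questions via LLMs-Ranking or the ValueNetwork."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("difficulty scores are [0, 1]-scaled")
    # Scale into n_clusters buckets; clamp score == 1.0 into the top bucket.
    return min(int(score * n_clusters), n_clusters - 1)

print([difficulty_cluster(s) for s in (0.1, 0.5, 0.95)])  # [0, 1, 2]
```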
Intrinsic Upper Bound:
The generation process exhibits a saturation effect. Without explicit solvability checks, models plateau at a maximum attainable difficulty, termed the "theorem ceiling." Empirically, ablations that omit the solvability step produce, for instance, a DR-AVG of 69.8% for Qwen3-32B versus 62.4% for CoDiQ-Gen-8B. All retained sequences strictly enforce solvability at each stage.
Training the Generator:
The backbone is Qwen3-8B, further optimized via RL on a dataset of 1,173 "boundary-failure" trajectories, in which the model failed at a given upgrade round but passed the preceding one. The reward function penalizes invalid outputs, assigns a positive reward when the generated upgrade is both strictly harder and still solvable, and applies a smaller penalty otherwise. Optimization uses the GRPO algorithm within the VeRL framework.
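The qualitative shape of this reward can be sketched as follows; the numeric constants are illustrative placeholders (the paper's exact values did not survive extraction here), only the case structure follows the description above:

```python
def upgrade_reward(valid: bool, solvable: bool, harder: bool,
                   r_invalid: float = -1.0, r_success: float = 1.0,
                   r_fail: float = -0.5) -> float:
    """Reward shaping for the upgrade generator (GRPO training).
    Constants are placeholders, not the paper's values: invalid
    outputs get the largest penalty; an upgrade that is strictly
    harder AND still solvable earns the positive reward; anything
    else gets a smaller penalty."""
    if not valid:
        return r_invalid
    if harder and solvable:
        return r_success
    return r_fail

print(upgrade_reward(valid=True, solvable=True, harder=True))  # 1.0
```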
3. Difficulty Calibration and Solvability Assessment
Difficulty is quantified with two independent, [0,1]-scaled estimators:
- DS-LLM: Listwise ranking via the Doubao-Seed-1.8 LLM.
- DS-VN: ValueNetwork-based failure probability estimator (lower value indicates higher difficulty).
Difficulty strongly correlates with question token count.
Solvability of the retained dataset is ensured with:
- Strict solver-verification: Only instances that pass the solver check at every round and exhibit strictly monotonic difficulty scores are preserved.
- Human evaluation (N=200 stratified sample, 3 PhD raters): Clarity, Completeness, and Validity assessed; Fleiss' κ of 0.76, precision@accepted = 82%, NPV@rejected = 90%.
This methodology aims to exclude "fake hard" questions—problems that are unsolvable but appear difficult to scoring models.
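One plausible reading of the dual-scorer setup is sketched below. Since DS-VN estimates failure probability with lower values indicating higher difficulty, it must be flipped before combining; treating DR-AVG as the plain mean of the two aligned scores is an assumption, not the paper's stated formula:

```python
def dr_avg(ds_llm: float, ds_vn: float) -> float:
    """Combine the two [0, 1] difficulty estimators into one score.
    DS-VN is inverted first (lower DS-VN = higher difficulty);
    averaging the two aligned scores is an illustrative assumption."""
    dr_vn = 1.0 - ds_vn
    return (ds_llm + dr_vn) / 2.0

# With DS-LLM = 0.914 and DS-VN = 0.172 (so DR-VN = 0.828), this
# reproduces the DR-AVG of ~0.871 reported for CoDiQ-Corpus.
print(round(dr_avg(0.914, 0.172), 3))
```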
4. Benchmark Comparison and Empirical Validation
Relative to prior datasets, CoDiQ-Corpus achieves substantially higher difficulty scores while retaining high solvability:
| Dataset | DR-LLM | DR-VN | DR-AVG |
|---|---|---|---|
| AIME (1983–2024) | 57.9 | 45.1 | 51.5 |
| LiveCodeBench | 39.4 | 45.2 | 42.3 |
| Code-Contests | 47.2 | 41.0 | 44.1 |
| CoDiQ-Corpus | 91.4 | 82.8 | 87.1 |
Training LRMs on CoDiQ-Corpus delivers marked gains in reasoning benchmarks. Using curriculum stages (CoDiQ-L1/L2/L3-4B):
| Model | MATH-500 | AIME 2024 |
|---|---|---|
| Qwen3-4B | 94.4 | 63.1 |
| Qwen3-RL-4B | 95.2 | 64.3 |
| CoDiQ-L1-4B | 96.0 | 65.0 |
| CoDiQ-L2-4B | 94.8 | 66.7 |
| CoDiQ-L3-4B | 96.0 | 70.6 |
Statistically significant improvements are observed (up to +7.5 points on AIME 2024 under a paired bootstrap test), validating the effectiveness of controlled-difficulty curricula.
5. File Format, Licensing, and Usage Guidelines
CoDiQ-Corpus is distributed in JSONL format, with each line recording:
```json
{
  "seed": ...,
  "upgrades": [...],
  "difficulty_scores": [...],
  "domain": "math" or "code"
}
```
The resource is released under the Apache 2.0 license. Implementation code and data are accessible at https://github.com/ALEX-nlp/CoDiQ.
Recommended usage practices include:
- Fine-tune target models with progressive token budgets aligned to desired difficulty;
- Use the provided scores to stratify datasets for curriculum learning;
- Apply both DS-LLM and DS-VN during evaluation for robust difficulty assignment;
- Ensure adherence to the solver-verification protocol to filter out unsolvable, artificially difficult samples.
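The stratification practice above can be sketched as follows; the L1/L2/L3 naming mirrors the CoDiQ-L*-4B curriculum stages, but the thresholds and the choice to key on the final difficulty score are illustrative assumptions:

```python
def curriculum_stage(final_score: float) -> str:
    """Assign a sequence to a curriculum stage by its final
    difficulty score. Thresholds are illustrative assumptions,
    not values from the paper."""
    if final_score < 0.6:
        return "L1"
    if final_score < 0.85:
        return "L2"
    return "L3"

# Hypothetical sequences keyed by their last (hardest) upgrade score.
seqs = [{"difficulty_scores": [0.3, 0.5]},
        {"difficulty_scores": [0.4, 0.7, 0.8]},
        {"difficulty_scores": [0.6, 0.9]}]
print([curriculum_stage(s["difficulty_scores"][-1]) for s in seqs])
# ['L1', 'L2', 'L3']
```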
6. Applications and Research Significance
CoDiQ-Corpus addresses key deficits in automated question generation—difficulty controllability, computational scalability, and the assurance of solution validity. It serves as a high-fidelity resource for:
- Developing and benchmarking reasoning models in mathematics and programming;
- Investigating the effects of controlled-difficulty curricula on learning dynamics;
- Establishing new baselines for competition-level question-answering robustness.
This suggests the corpus is pivotal for advancing LRM research where fine-grained calibration of question difficulty and empirically validated solvability are paramount. Its dual scorer system, human-verified quality control, and rigorous construction pipeline collectively set a new standard for question-generation corpora (Peng et al., 2 Feb 2026).