
CoDiQ-Corpus: Controlled-Difficulty Q&A Dataset

Updated 9 February 2026
  • CoDiQ-Corpus is a large-scale dataset of over 44K math and coding question sequences with controllable difficulty progression and rigorous solvability checks.
  • The dataset employs a two-step generation methodology, using test-time token scaling and human evaluation to ensure both increasing complexity and practical solvability.
  • Benchmark comparisons show that training models on CoDiQ-Corpus yields significant performance gains, underscoring its value for building reasoning-model curricula.

CoDiQ-Corpus is a large-scale, open-source dataset of competition-grade question sequences spanning mathematical and coding domains, generated using the CoDiQ (Controllable Difficult Question Generation) framework. The corpus addresses the need for high-difficulty, precisely calibrated, and reliably solvable questions to train and evaluate Large Reasoning Models (LRMs). CoDiQ-Corpus, built via test-time scaling with explicit difficulty progression and human-verified solvability, enables curriculum learning and robust benchmarking of advanced reasoning capabilities in machine learning systems (Peng et al., 2 Feb 2026).

1. Dataset Composition and Structure

CoDiQ-Corpus comprises 44,453 question sequences, systematically split between mathematics and programming domains. Each sequence is generated from one of eight established seed corpora, incorporating content diversity and broad topical coverage:

| Seed Corpus | Domain | Sequence Count |
|---|---|---|
| Math12K | Math | 11,764 |
| GSM8K | Math | 8,685 |
| SVAMP | Math | 804 |
| ASDiv | Math | 1,480 |
| CodeAlpaca20K | Code | 17,845 |
| LeetCodeDataset | Code | 2,027 |
| MBPP | Code | 876 |
| DS-1000 | Code | 972 |

Each record in the corpus adheres to a standardized structure:

  • "seed": The initial question Q_0.
  • "upgrades": A list of iteratively generated, more difficult versions Q_1, …, Q_t.
  • "difficulty_scores": Monotonically increasing values [d_0, …, d_t] with each d_i ∈ [0, 1], as determined by the framework's scoring function (see Section 3).
  • "solvability": Boolean flag; always true for retained instances.

Sequences average approximately 4.2 rounds, and each upgrade Q_i is guaranteed to be strictly more difficult than Q_{i−1} while retaining solvability. Sequence length is capped at T_max = 8 rounds or a token limit of approximately 12,590 tokens.
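The invariants above (one score per round, scores in [0, 1], strict monotonic increase, solvability always true) can be checked mechanically. A minimal sketch in Python, with a made-up example record; the function name `check_record` is illustrative, not part of the release:

```python
def check_record(record, t_max=8):
    """Validate a CoDiQ-Corpus record against the invariants described above."""
    scores = record["difficulty_scores"]
    # One score per question: the seed plus each upgrade.
    assert len(scores) == 1 + len(record["upgrades"])
    # At most T_max upgrade rounds.
    assert len(record["upgrades"]) <= t_max
    # Scores lie in [0, 1] and increase strictly with each round.
    assert all(0.0 <= d <= 1.0 for d in scores)
    assert all(a < b for a, b in zip(scores, scores[1:]))
    # Retained instances are always marked solvable.
    assert record["solvability"] is True
    return True

# Hypothetical record for illustration only (not from the corpus).
record = {
    "seed": "Compute 2 + 2.",
    "upgrades": ["Compute 2^10.", "Compute 2^10 mod 7."],
    "difficulty_scores": [0.0, 0.4, 0.7],
    "solvability": True,
}
check_record(record)  # passes silently
```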

2. Generation Methodology

CoDiQ employs a two-step process to ensure both granularity of difficulty and rigorous solvability:

Test-Time Scaling:

By allocating progressively larger reasoning-token budgets α at inference time, the system increases question complexity and required solution depth. However, this scaling introduces a trade-off: as α grows, difficulty rises but solvability may decline. The normalized difficulty assigned to each question Q_i is:

d_i = (j − 1) / (G − 1),  for j = 1, …, G,

where question Q_i has been assigned to difficulty cluster g_j by LLM-based ranking or by the ValueNetwork.
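As a sketch of the normalization above (the function name is illustrative; the cluster assignment itself comes from the LLM ranker or ValueNetwork and is not reproduced here):

```python
def normalized_difficulty(j, G):
    """Map a 1-indexed difficulty cluster j out of G clusters to d in [0, 1]."""
    if G < 2 or not 1 <= j <= G:
        raise ValueError("need G >= 2 and 1 <= j <= G")
    return (j - 1) / (G - 1)

# With G = 5 clusters, cluster indices map to evenly spaced difficulties:
print([normalized_difficulty(j, 5) for j in range(1, 6)])
# → [0.0, 0.25, 0.5, 0.75, 1.0]
```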

Intrinsic Upper Bound:

The generation process exhibits a saturation effect. Without explicit solvability checks, models plateau at a maximum attainable difficulty d_max, termed the "theorem ceiling." Empirically, ablations that omit the solvability step yield, for instance, a DR-AVG of 69.8% for Qwen3-32B versus 62.4% for CoDiQ-Gen-8B. All retained sequences enforce solvability at every stage.

Training the Generator:

The backbone is Qwen3-8B, further optimized via RL on a dataset of 1,173 "boundary-failure" trajectories, in which the model failed at round i but passed at round i − 1. The reward function is:

  • r = 0 if the generated upgrade is invalid;
  • r = 0.6 · conf if ΔD = 0;
  • r = 0.2 · conf + 0.8 · (0.8 + 0.2 · ΔD) if ΔD > 0,

with conf = max(0.5, confidence) and ΔD = d_i − d_{i−1}. Optimization uses the GRPO algorithm within the VeRL framework.
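The piecewise reward transcribes into a few lines. This sketch follows the three cases listed above; the handling of ΔD < 0 is my assumption, since the source lists only these cases:

```python
def codiq_reward(valid, confidence, d_i, d_prev):
    """Reward for one upgrade round, following the cases listed above."""
    if not valid:
        return 0.0
    conf = max(0.5, confidence)
    delta_d = d_i - d_prev
    if delta_d < 0:
        # Assumption: a difficulty regression is treated like an invalid
        # upgrade; this case is not specified in the source.
        return 0.0
    if delta_d == 0:
        return 0.6 * conf
    # delta_d > 0: reward confidence plus the difficulty gain.
    return 0.2 * conf + 0.8 * (0.8 + 0.2 * delta_d)
```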

3. Difficulty Calibration and Solvability Assessment

Difficulty is quantified with two independent, [0,1]-scaled estimators:

  • DS-LLM: Listwise ranking via the Doubao-Seed-1.8 LLM.
  • DS-VN: ValueNetwork-based failure probability estimator (lower value indicates higher difficulty).

Difficulty strongly correlates with token count (Pearson r = 0.8299 for DS-LLM vs. tokens and r = 0.8545 for DS-VN vs. tokens, both p < 0.001).
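The reported values are Pearson correlation coefficients between per-question difficulty scores and token counts. For reference, a self-contained computation on toy data (the numbers below are made up, not corpus statistics):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy example: difficulty score vs. token count for five questions.
difficulty = [0.1, 0.3, 0.5, 0.7, 0.9]
tokens = [900, 2400, 4100, 7800, 12100]
print(round(pearson_r(difficulty, tokens), 3))
```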

Solvability of the retained dataset is ensured with:

  • Strict solver verification: only instances with solvability_confidence ≥ 0.8 and strictly monotonically increasing d_i are preserved.
  • Human evaluation (stratified sample of N = 200, 3 PhD raters): clarity, completeness, and validity were assessed; Fleiss’ κ = 0.76, precision@accepted = 82%, NPV@rejected = 90%.

This methodology aims to exclude "fake hard" questions—problems that are unsolvable but appear difficult to scoring models.

4. Benchmark Comparison and Empirical Validation

Relative to prior datasets, CoDiQ-Corpus achieves substantially higher difficulty scores while retaining high solvability:

| Dataset | DR-LLM | DR-VN | DR-AVG |
|---|---|---|---|
| AIME (1983–2024) | 57.9 | 45.1 | 51.5 |
| LiveCodeBench | 39.4 | 45.2 | 42.3 |
| Code-Contests | 47.2 | 41.0 | 44.1 |
| CoDiQ-Corpus | 91.4 | 82.8 | 87.1 |

Training LRMs on CoDiQ-Corpus delivers marked gains in reasoning benchmarks. Using curriculum stages (CoDiQ-L1/L2/L3-4B):

| Model | MATH-500 | AIME 2024 |
|---|---|---|
| Qwen3-4B | 94.4 | 63.1 |
| Qwen3-RL-4B | 95.2 | 64.3 |
| CoDiQ-L1-4B | 96.0 | 65.0 |
| CoDiQ-L2-4B | 94.8 | 66.7 |
| CoDiQ-L3-4B | 96.0 | 70.6 |

Statistically significant improvements are observed (up to +7.5 points on AIME 2024, p < 0.01, paired bootstrap), validating the effectiveness of controlled-difficulty curricula.
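A paired bootstrap compares two models on the same problem set by resampling problems with replacement and checking how often the baseline matches or beats the treated model. A sketch with synthetic 0/1 outcomes (the real per-problem results are not given in this summary):

```python
import random

def paired_bootstrap_p(baseline, treated, n_resamples=10_000, seed=0):
    """One-sided p-value that `treated` outscores `baseline` on paired items."""
    rng = random.Random(seed)
    n = len(baseline)
    not_better = 0
    for _ in range(n_resamples):
        # Resample problem indices with replacement, keeping pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(treated[i] for i in idx) <= sum(baseline[i] for i in idx):
            not_better += 1
    return not_better / n_resamples

# Synthetic paired outcomes on 30 problems (1 = solved, 0 = failed).
base = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0] * 3
trtd = [1, 1, 1, 0, 1, 1, 0, 1, 1, 0] * 3
print(paired_bootstrap_p(base, trtd))
```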

5. File Format, Licensing, and Usage Guidelines

CoDiQ-Corpus is distributed in JSONL format, with each line recording:

{
  "seed": ...,
  "upgrades": [...],
  "difficulty_scores": [...],
  "solvability": true,
  "domain": "math" or "code"
}
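Because each line is an independent JSON object, the corpus can be streamed record by record rather than loaded whole. A minimal reader sketch (the file name below is a placeholder, not the released file's actual name):

```python
import json

def iter_codiq(path):
    """Yield CoDiQ-Corpus records one at a time from a JSONL file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines defensively
                yield json.loads(line)

# Example: count math vs. code sequences without loading the whole file.
# counts = {}
# for rec in iter_codiq("codiq_corpus.jsonl"):
#     counts[rec["domain"]] = counts.get(rec["domain"], 0) + 1
```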

The resource is released under the Apache 2.0 license. Implementation code and data are accessible at https://github.com/ALEX-nlp/CoDiQ.

Recommended usage practices include:

  • Fine-tune target models with progressive token budgets aligned to desired difficulty;
  • Use the provided d_i scores to stratify datasets for curriculum learning;
  • Apply both DS-LLM and DS-VN during evaluation for robust difficulty assignment;
  • Ensure adherence to the solver-verification protocol to filter out unsolvable, artificially difficult samples.
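The second recommendation, stratifying by d_i for curriculum learning, might look like the following sketch. The three-stage L1/L2/L3 naming follows the paper's curriculum models, but the cut points here are illustrative assumptions:

```python
def curriculum_stage(record, cuts=(0.33, 0.66)):
    """Assign a record to stage L1/L2/L3 by its final difficulty score.

    The cut points are hypothetical; the paper does not publish its thresholds.
    """
    d_final = record["difficulty_scores"][-1]
    if d_final < cuts[0]:
        return "L1"
    if d_final < cuts[1]:
        return "L2"
    return "L3"

def stratify(records, cuts=(0.33, 0.66)):
    """Bucket records into curriculum stages for staged fine-tuning."""
    stages = {"L1": [], "L2": [], "L3": []}
    for rec in records:
        stages[curriculum_stage(rec, cuts)].append(rec)
    return stages
```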

6. Applications and Research Significance

CoDiQ-Corpus addresses key deficits in automated question generation—difficulty controllability, computational scalability, and the assurance of solution validity. It serves as a high-fidelity resource for:

  • Developing and benchmarking reasoning models in mathematics and programming;
  • Investigating the effects of controlled-difficulty curricula on learning dynamics;
  • Establishing new baselines for competition-level question-answering robustness.

This suggests the corpus is pivotal for advancing LRM research where fine-grained calibration of question difficulty and empirically validated solvability are paramount. Its dual scorer system, human-verified quality control, and rigorous construction pipeline collectively set a new standard for question-generation corpora (Peng et al., 2 Feb 2026).
