CoDiQ-Corpus: Controlled-Difficulty Q&A Dataset
- CoDiQ-Corpus is a large-scale dataset of over 44K math and coding question sequences with controllable difficulty progression and rigorous solvability checks.
- The dataset employs a two-step generation methodology, using test-time token scaling and human evaluation to ensure both increasing complexity and practical solvability.
- Benchmark comparisons demonstrate that training models with CoDiQ leads to significant performance gains, underscoring its value for reasoning-model curricula.
CoDiQ-Corpus is a large-scale, open-source dataset of competition-grade question sequences spanning mathematical and coding domains, generated using the CoDiQ (Controllable Difficult Question Generation) framework. The corpus addresses the need for high-difficulty, precisely calibrated, and reliably solvable questions to train and evaluate Large Reasoning Models (LRMs). CoDiQ-Corpus, built via test-time scaling with explicit difficulty progression and human-verified solvability, enables curriculum learning and robust benchmarking of advanced reasoning capabilities in machine learning systems (Peng et al., 2 Feb 2026).
1. Dataset Composition and Structure
CoDiQ-Corpus comprises 44,453 question sequences, systematically split between mathematics and programming domains. Each sequence is generated from one of eight established seed corpora, incorporating content diversity and broad topical coverage:
| Seed Corpus | Domain | Sequence Count |
|---|---|---|
| Math12K | Math | 11,764 |
| GSM8K | Math | 8,685 |
| SVAMP | Math | 804 |
| ASDiv | Math | 1,480 |
| CodeAlpaca20K | Code | 17,845 |
| LeetCodeDataset | Code | 2,027 |
| MBPP | Code | 876 |
| DS-1000 | Code | 972 |
Each record in the corpus adheres to a standardized structure:
"seed": The initial question ."upgrades": A list of iteratively generated, more difficult versions ()."difficulty_scores": Monotonically increasing values with , as determined by the framework's scoring function (see Section 3)."solvability": Boolean flag; always true for retained instances.
Sequences average approximately 4.2 rounds, and each upgrade is guaranteed to be strictly more difficult than its predecessor while retaining solvability. Sequence length is capped by a maximum number of upgrade rounds or by a token limit (approximately 12,590 tokens).
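Based on the schema above, a record can be checked with a short sketch (the field names come from the documented structure; the monotonicity check mirrors the guarantee that each upgrade is strictly harder than its predecessor):

```python
import json

def validate_record(line: str) -> bool:
    """Check one CoDiQ-Corpus JSONL record: required fields present,
    difficulty scores strictly increasing, one score per upgrade,
    and solvability retained."""
    rec = json.loads(line)
    scores = rec["difficulty_scores"]
    has_fields = {"seed", "upgrades", "difficulty_scores"} <= rec.keys()
    monotonic = all(a < b for a, b in zip(scores, scores[1:]))
    same_length = len(rec["upgrades"]) == len(scores)
    return has_fields and monotonic and same_length and rec.get("solvability", True)

# Illustrative record, not taken from the corpus itself.
record = json.dumps({
    "seed": "Compute 2+2.",
    "upgrades": ["Compute 2+2+2.", "Compute the sum of fifty copies of 2."],
    "difficulty_scores": [0.31, 0.58],
    "solvability": True,
})
print(validate_record(record))  # True
```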
2. Generation Methodology
CoDiQ employs a two-step process to ensure both granularity of difficulty and rigorous solvability:
Test-Time Scaling:
By allocating progressively larger reasoning-token budgets at inference, the system increases question complexity and required solution depth. This scaling introduces a trade-off: as the token budget grows, difficulty rises but solvability may decline. Each generated question is assigned a normalized difficulty score and grouped into a difficulty cluster using LLMs-Ranking or the ValueNetwork.
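The grouping step can be illustrated with a minimal sketch; equal-width bucketing over the normalized score is an assumption here, since the source performs this grouping with LLMs-Ranking or the ValueNetwork:

```python
def difficulty_cluster(score: float, n_clusters: int = 3) -> int:
    """Map a normalized [0, 1] difficulty score to a cluster index.
    Equal-width bucketing is an illustrative assumption; the source
    groups questions via LLMs-Ranking or the ValueNetwork."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("difficulty scores are [0, 1]-scaled")
    # Scale into n_clusters buckets; clamp score == 1.0 into the top bucket.
    return min(int(score * n_clusters), n_clusters - 1)

print([difficulty_cluster(s) for s in (0.1, 0.5, 0.95)])  # [0, 1, 2]
```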
Intrinsic Upper Bound:
The generation process exhibits a saturation effect. Without explicit solvability checks, models plateau at a maximum attainable difficulty, termed the "theorem ceiling." Empirically, ablations that omit the solvability step produce, for instance, a DR-AVG of 69.8% for Qwen3-32B versus 62.4% for CoDiQ-Gen-8B. All retained sequences strictly enforce solvability at each stage.
Training the Generator:
The backbone is Qwen3-8B, further optimized via RL on a dataset of 1,173 "boundary-failure" trajectories, in which the model failed at a given upgrade round but passed the preceding one. The reward function penalizes invalid outputs, assigns a positive reward when the generated upgrade is both strictly harder and still solvable, and applies a smaller penalty otherwise. Optimization uses the GRPO algorithm within the VeRL framework.
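The qualitative shape of this reward can be sketched as follows; the numeric constants are illustrative placeholders (the paper's exact values did not survive extraction here), only the case structure follows the description above:

```python
def upgrade_reward(valid: bool, solvable: bool, harder: bool,
                   r_invalid: float = -1.0, r_success: float = 1.0,
                   r_fail: float = -0.5) -> float:
    """Reward shaping for the upgrade generator (GRPO training).
    Constants are placeholders, not the paper's values: invalid
    outputs get the largest penalty; an upgrade that is strictly
    harder AND still solvable earns the positive reward; anything
    else gets a smaller penalty."""
    if not valid:
        return r_invalid
    if harder and solvable:
        return r_success
    return r_fail

print(upgrade_reward(valid=True, solvable=True, harder=True))  # 1.0
```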
3. Difficulty Calibration and Solvability Assessment
Difficulty is quantified with two independent, [0,1]-scaled estimators:
- DS-LLM: Listwise ranking via the Doubao-Seed-1.8 LLM.
- DS-VN: ValueNetwork-based failure probability estimator (lower value indicates higher difficulty).
Difficulty strongly correlates with question token count.
Solvability of the retained dataset is ensured with:
- Strict solver-verification: Only instances that pass the solver check at every round and exhibit strictly monotonic difficulty scores are preserved.
- Human evaluation (N=200 stratified sample, 3 PhD raters): Clarity, Completeness, and Validity assessed; Fleiss' κ of 0.76, precision@accepted = 82%, NPV@rejected = 90%.
This methodology aims to exclude "fake hard" questions—problems that are unsolvable but appear difficult to scoring models.
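One plausible reading of the dual-scorer setup is sketched below. Since DS-VN estimates failure probability with lower values indicating higher difficulty, it must be flipped before combining; treating DR-AVG as the plain mean of the two aligned scores is an assumption, not the paper's stated formula:

```python
def dr_avg(ds_llm: float, ds_vn: float) -> float:
    """Combine the two [0, 1] difficulty estimators into one score.
    DS-VN is inverted first (lower DS-VN = higher difficulty);
    averaging the two aligned scores is an illustrative assumption."""
    dr_vn = 1.0 - ds_vn
    return (ds_llm + dr_vn) / 2.0

# With DS-LLM = 0.914 and DS-VN = 0.172 (so DR-VN = 0.828), this
# reproduces the DR-AVG of ~0.871 reported for CoDiQ-Corpus.
print(round(dr_avg(0.914, 0.172), 3))
```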
4. Benchmark Comparison and Empirical Validation
Relative to prior datasets, CoDiQ-Corpus achieves substantially higher difficulty scores while retaining high solvability:
| Dataset | DR-LLM | DR-VN | DR-AVG |
|---|---|---|---|
| AIME (1983–2024) | 57.9 | 45.1 | 51.5 |
| LiveCodeBench | 39.4 | 45.2 | 42.3 |
| Code-Contests | 47.2 | 41.0 | 44.1 |
| CoDiQ-Corpus | 91.4 | 82.8 | 87.1 |
Training LRMs on CoDiQ-Corpus delivers marked gains in reasoning benchmarks. Using curriculum stages (CoDiQ-L1/L2/L3-4B):
| Model | MATH-500 | AIME 2024 |
|---|---|---|
| Qwen3-4B | 94.4 | 63.1 |
| Qwen3-RL-4B | 95.2 | 64.3 |
| CoDiQ-L1-4B | 96.0 | 65.0 |
| CoDiQ-L2-4B | 94.8 | 66.7 |
| CoDiQ-L3-4B | 96.0 | 70.6 |
Statistically significant improvements are observed (up to +7.5 points on AIME 2024 under a paired bootstrap test), validating the effectiveness of controlled-difficulty curricula.
5. File Format, Licensing, and Usage Guidelines
CoDiQ-Corpus is distributed in JSONL format, with each line recording:
```json
{
  "seed": ...,
  "upgrades": [...],
  "difficulty_scores": [...],
  "domain": "math" or "code"
}
```
The resource is released under the Apache 2.0 license. Implementation code and data are accessible at https://github.com/ALEX-nlp/CoDiQ.
Recommended usage practices include:
- Fine-tune target models with progressive token budgets aligned to desired difficulty;
- Use the provided scores to stratify datasets for curriculum learning;
- Apply both DS-LLM and DS-VN during evaluation for robust difficulty assignment;
- Ensure adherence to the solver-verification protocol to filter out unsolvable, artificially difficult samples.
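The stratification practice above can be sketched as follows; the L1/L2/L3 naming mirrors the CoDiQ-L*-4B curriculum stages, but the thresholds and the choice to key on the final difficulty score are illustrative assumptions:

```python
def curriculum_stage(final_score: float) -> str:
    """Assign a sequence to a curriculum stage by its final
    difficulty score. Thresholds are illustrative assumptions,
    not values from the paper."""
    if final_score < 0.6:
        return "L1"
    if final_score < 0.85:
        return "L2"
    return "L3"

# Hypothetical sequences keyed by their last (hardest) upgrade score.
seqs = [{"difficulty_scores": [0.3, 0.5]},
        {"difficulty_scores": [0.4, 0.7, 0.8]},
        {"difficulty_scores": [0.6, 0.9]}]
print([curriculum_stage(s["difficulty_scores"][-1]) for s in seqs])
# ['L1', 'L2', 'L3']
```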
6. Applications and Research Significance
CoDiQ-Corpus addresses key deficits in automated question generation—difficulty controllability, computational scalability, and the assurance of solution validity. It serves as a high-fidelity resource for:
- Developing and benchmarking reasoning models in mathematics and programming;
- Investigating the effects of controlled-difficulty curricula on learning dynamics;
- Establishing new baselines for competition-level question-answering robustness.
This suggests the corpus is pivotal for advancing LRM research where fine-grained calibration of question difficulty and empirically validated solvability are paramount. Its dual scorer system, human-verified quality control, and rigorous construction pipeline collectively set a new standard for question-generation corpora (Peng et al., 2 Feb 2026).