CodeSimpleQA-Instruct: Factual QA for Code LLMs

Updated 29 December 2025
  • CodeSimpleQA-Instruct is a large-scale, bilingual corpus and evaluation framework focused on factual question answering in code domains, covering more than 15 programming languages and 21 computer science domains.
  • It employs a robust multi-stage pipeline including document retrieval, content cleaning, clustering, and LLM-based quality filtering to generate verifiable and time-invariant QA pairs.
  • The framework integrates supervised fine-tuning and reinforcement learning optimization, significantly improving factual consistency and mitigating LLM hallucinations in code-related queries.

CodeSimpleQA-Instruct is a large-scale, instruction-tuned corpus and evaluation framework targeting factual question answering in code-related domains. Developed to address the persistent factuality gaps in code LLMs, it provides a comprehensive resource for aligning model outputs with verifiable, authoritative programming knowledge across multiple languages and subfields (Yang et al., 22 Dec 2025).

1. Motivation, Scope, and Objectives

CodeSimpleQA-Instruct was created in response to the observation that most code LLM benchmarks (such as HumanEval, APPS, CodeXGLUE, and MultiPL-E) primarily measure code execution correctness, offering little insight into the factual accuracy of answers about programming concepts, API specifications, or best practices. Existing factual QA resources (e.g., SimpleQA, Chinese SimpleQA) lack code-specific coverage. The design goal is twofold: to massively scale factual code question-answer (QA) data in English and Chinese, and to provide a post-training alignment recipe that directly mitigates LLM hallucination and knowledge imprecision. Coverage spans at least 15 programming languages and 21 major computer science (CS) domains, with an explicit focus on time-invariant, document-verifiable answers (Yang et al., 22 Dec 2025).

2. Corpus Construction Pipeline

The CodeSimpleQA-Instruct corpus comprises approximately 66 million bilingual (EN/ZH) instruction pairs and is built via a multi-stage pipeline to maximize quality and coverage.

  • Document Retrieval: The pipeline mines Common Crawl web pages, using a fastText classifier to identify "code-related" documents. Pages are then ranked by domain reliability (favoring official documentation over Q&A forums, then tutorials and blogs) and filtered with LLM-based heuristics for technical adequacy and appropriate code/text ratios.
  • Content Cleaning: HTML artifacts, advertisements, and boilerplate are stripped via rule-based methods.
  • Clustering: Passages are embedded with a code-specialized BERT model (code-text BERT), and DBSCAN is applied to form knowledge clusters emphasizing both diversity and depth (e.g., API usage, debugging, implementation idioms).
  • Uniform Sampling: Clusters are uniformly subsampled to prevent overrepresentation of popular topics (the clustering and sampling steps are sketched in code after this list).
  • QA Generation: DeepSeek-V3.1 (temperature=0.1, few-shot template) generates question-answer pairs in a structured JSON format, enforcing constraints: objectivity, ambiguity avoidance, and temporal invariance.
  • Quality Filtering: Each candidate (question, answer, supporting passage) is filtered via an LLM-as-a-Judge module, retaining only those labeled "CORRECT" (Yang et al., 22 Dec 2025).
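
The clustering and uniform-sampling steps can be illustrated with a short sketch. It uses sentence-transformers as a stand-in for the paper's code-text BERT embedder; the embedding model name, DBSCAN parameters, and per-cluster quota are illustrative assumptions, not the pipeline's actual settings.

```python
# Sketch of the clustering + uniform-sampling stages described above.
# Assumptions: sentence-transformers stands in for the code-text BERT
# embedder; eps/min_samples and the per-cluster quota are illustrative.
import random
from collections import defaultdict

from sentence_transformers import SentenceTransformer
from sklearn.cluster import DBSCAN


def cluster_and_sample(passages, per_cluster=2, seed=0):
    """Embed passages, form DBSCAN knowledge clusters, and uniformly
    subsample each cluster so popular topics are not overrepresented."""
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedder
    embeddings = embedder.encode(passages, normalize_embeddings=True)

    # Cosine-distance DBSCAN; points labeled -1 are noise (unclustered).
    labels = DBSCAN(eps=0.3, min_samples=3, metric="cosine").fit_predict(embeddings)

    clusters = defaultdict(list)
    for passage, label in zip(passages, labels):
        clusters[label].append(passage)

    rng = random.Random(seed)
    sampled = []
    for label, members in clusters.items():
        # Keep all unclustered passages so rare topics are not discarded;
        # otherwise draw a fixed quota per cluster.
        k = len(members) if label == -1 else min(per_cluster, len(members))
        sampled.extend(rng.sample(members, k))
    return sampled
```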

Corpus Statistics

| Dataset split | English | Chinese |
|---|---|---|
| CodeSimpleQA-Instruct (train) | 53,571,094 | 13,359,625 |
| CodeSimpleQA (test) | 782 | 716 |

The data spans domains such as Web Technologies, Software Engineering, Operating Systems, Databases, Machine Learning, and Bioinformatics, as reflected in the paper's domain distribution statistics (Yang et al., 22 Dec 2025).

3. Annotation, Supervision, and Verification

To ensure the highest fidelity in evaluation, the test set (CodeSimpleQA) is curated by human annotators: eight annotators extract and rewrite authoritative documentation into candidate pairs, which are then reviewed by senior engineers to produce 312 high-fidelity bilingual QA pairs. For the 66M training corpus, an LLM-judge automatically discards any instance not labeled “CORRECT.”

The supervised fine-tuning (SFT) prompt format is standardized: each question is paired with a target response, and both the English and Chinese templates explicitly instruct the model to keep answers under 64 words. Rejection sampling is used to prune inconsistent samples from the SFT data, further enhancing alignment, as sketched below (Yang et al., 22 Dec 2025).
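
The sketch below shows what such a bounded-answer template and rejection-sampling filter might look like. The template wording, the word-count check, and the judge() callable are illustrative assumptions, not the paper's exact prompts or filtering code.

```python
# Illustrative SFT prompt templates and rejection-sampling filter.
# The wording, the 64-word bound check, and the judge() helper are
# assumptions for illustration, not the paper's implementation.
EN_TEMPLATE = (
    "Answer the following programming question factually and concisely, "
    "in at most 64 words.\nQuestion: {question}\nAnswer:"
)
ZH_TEMPLATE = "请用不超过64个词、准确简洁地回答下面的编程问题。\n问题:{question}\n回答:"


def build_prompt(question: str, lang: str = "en") -> str:
    template = EN_TEMPLATE if lang == "en" else ZH_TEMPLATE
    return template.format(question=question)


def rejection_sample(candidates, reference, judge, max_words=64):
    """Keep only sampled responses that respect the length bound and that
    the judge labels consistent with the reference answer."""
    kept = []
    for response in candidates:
        if len(response.split()) > max_words:
            continue
        if judge(response, reference) == "CORRECT":
            kept.append(response)
    return kept
```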

4. Post-Training: SFT and Reinforcement Learning

The alignment framework for CodeSimpleQA-Instruct integrates supervised fine-tuning followed by RL-based optimization.

  • SFT Stage: Models (e.g., Qwen2.5-Coder-32B-Instruct) are trained under a cross-entropy objective on (question, answer) pairs, using large batch sizes (1024), a cosine-decay learning rate schedule, and context windows up to 8192 tokens. The reference model after SFT is termed CodeSimpleQA-RFT (a configuration sketch follows this list).
  • RL Stage: GRPO (Group Relative Policy Optimization) is then applied, with the reward $r_k$ defined as exact match to the ground truth and the group-normalized advantage $\hat{A}_{k,t} = \frac{r_k - \mathrm{mean}(\{r_j\}_{j=1}^{K})}{\mathrm{std}(\{r_j\}_{j=1}^{K})}$. The policy update uses a clipped importance ratio and a KL penalty, with distributed training over 64 GPUs and trajectory grouping. The resulting model is denoted CodeSimpleQA-RL. This two-stage regime systematically improves factuality-aware alignment beyond what SFT alone achieves.
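
A minimal configuration sketch of the SFT stage, assuming a Hugging Face Trainer setup: the learning rate, epoch count, and batch/accumulation split are illustrative; only the effective batch size (1024), cosine-decay schedule, and 8192-token context come from the description above.

```python
# Hedged sketch of the SFT (CodeSimpleQA-RFT) training configuration.
# Learning rate, epochs, and the batch/accumulation split are assumptions;
# effective batch 1024, cosine decay, and 8192-token context follow the text.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "Qwen/Qwen2.5-Coder-32B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype="bfloat16")


def tokenize(example):
    # Concatenate prompt and target answer; truncate to the 8192-token window.
    text = example["question"] + "\n" + example["answer"] + tokenizer.eos_token
    enc = tokenizer(text, truncation=True, max_length=8192)
    enc["labels"] = enc["input_ids"].copy()  # standard causal-LM cross-entropy
    return enc


args = TrainingArguments(
    output_dir="codesimpleqa-rft",
    per_device_train_batch_size=8,     # 8 x 128 accumulation = effective 1024
    gradient_accumulation_steps=128,
    learning_rate=1e-5,                # illustrative value
    lr_scheduler_type="cosine",        # cosine-decay schedule
    num_train_epochs=1,
    bf16=True,
    logging_steps=50,
)

# `train_dataset` is assumed to be a tokenized CodeSimpleQA-Instruct split,
# e.g. raw_dataset.map(tokenize, remove_columns=raw_dataset.column_names).
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()
```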

Mathematical Formulation

Let $L_{\mathrm{SFT}}(\theta) = -\sum_i \log p_\theta(y_i \mid x_i)$ be the SFT loss; the GRPO objective combines clipped policy improvement with KL regularization:

$$L_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{(q,a),\,\{o_k\}_{k=1}^{K} \sim \pi_{\theta_{\mathrm{old}}}} \left[ \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|o_k|} \sum_{t=1}^{|o_k|} \Big( \min(A_1, A_2) + L_{\mathrm{KL}} \Big) \right],$$

with $A_1 = r_{k,t}(\theta)\,\hat{A}_{k,t}$, $A_2 = \operatorname{clip}\big(r_{k,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_{k,t}$, and $L_{\mathrm{KL}} = -\beta\, D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\mathrm{ref}})$, where $r_{k,t}(\theta) = \pi_\theta(o_{k,t} \mid q, o_{k,<t}) / \pi_{\theta_{\mathrm{old}}}(o_{k,t} \mid q, o_{k,<t})$ is the token-level importance ratio and $\hat{A}_{k,t}$ is the group-normalized advantage defined above.
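
A minimal PyTorch sketch of this objective follows, assuming per-token log-probabilities have already been gathered for the K sampled completions; the tensor shapes, exact-match reward helper, KL estimator, and masking scheme are illustrative assumptions rather than the paper's implementation.

```python
# Sketch of the group-normalized advantage and clipped GRPO objective above.
# Shapes, the exact-match reward, the KL estimator, and the masking scheme
# are illustrative assumptions, not the paper's implementation.
import torch


def exact_match_rewards(pred_answers, gold_answers):
    """r_k = 1 if the k-th sampled answer exactly matches the ground truth."""
    return torch.tensor(
        [float(p.strip() == g.strip()) for p, g in zip(pred_answers, gold_answers)]
    )


def grpo_objective(logp_new, logp_old, logp_ref, rewards, mask, eps=0.2, beta=0.04):
    """logp_*: (K, T) per-token log-probs for K sampled completions;
    rewards: (K,) scalar rewards; mask: (K, T), 1 for real tokens."""
    # Group-normalized advantage, broadcast to every token of completion k.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    adv = adv.unsqueeze(1)                          # (K, 1)

    ratio = torch.exp(logp_new - logp_old)          # r_{k,t}(theta)
    a1 = ratio * adv
    a2 = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
    policy_term = torch.minimum(a1, a2)

    # Per-token KL penalty against the frozen reference policy
    # (the nonnegative k3 estimator commonly paired with GRPO).
    log_ref_ratio = logp_ref - logp_new
    kl = torch.exp(log_ref_ratio) - log_ref_ratio - 1.0

    per_token = policy_term - beta * kl
    # Average over the valid tokens of each completion, then over the group.
    per_completion = (per_token * mask).sum(dim=1) / mask.sum(dim=1)
    return per_completion.mean()                    # maximize this quantity
```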

5. Evaluation Methodology and Empirical Results

Metrics

Evaluation uses the CodeSimpleQA test set in both English and Chinese. The key metrics, following Chinese SimpleQA, are:

  • CO (Correct): the model's answer fully contains the reference answer and does not contradict it.
  • IN (Incorrect): the answer contradicts the reference answer.
  • NA (Not Attempted): the model declines to answer, or its response neither fully contains nor contradicts the reference.
  • CGA (Correct Given Attempted): CO / (CO + IN).
  • F-score: the harmonic mean of CO and CGA, i.e., 2·CO·CGA / (CO + CGA) (a helper computing these metrics is sketched after this list).
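
A small helper that computes CGA and the F-score from raw counts; the example counts are illustrative only.

```python
# Compute the CO rate, CGA, and F-score defined above from raw counts of
# correct (co), incorrect (inc), and not-attempted (na) answers.
def factuality_metrics(co: int, inc: int, na: int) -> dict:
    total = co + inc + na
    correct_rate = co / total if total else 0.0     # CO as a fraction
    attempted = co + inc
    cga = co / attempted if attempted else 0.0       # Correct Given Attempted
    denom = correct_rate + cga
    f_score = 2 * correct_rate * cga / denom if denom else 0.0
    return {"CO": correct_rate, "CGA": cga, "F-score": f_score}


# Example with illustrative counts over a 782-question split.
print(factuality_metrics(co=423, inc=301, na=58))
```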

Benchmark Results

Key results for CodeSimpleQA-RL, CodeSimpleQA-RFT, and comparison LLMs:

| Model | English F-score (%) | Chinese F-score (%) |
|---|---|---|
| CodeSimpleQA-RL | 42.3 | 45.2 |
| Qwen2.5-Coder-32B-Instruct | 40.0 | 42.3 |
| GPT-5 (frontier LLM) | 62.9 | 67.2 |
| GLM-4.5 (open-source leader) | ≈50.0 | ≈50.0 |

Notably, even advanced open and proprietary LLMs fall well below perfect factuality. CodeSimpleQA-RL outperforms its SFT-only reference and previous baselines by 2–3 percentage points across both languages (Yang et al., 22 Dec 2025).

Additional observations:

  • Chain-of-thought reasoning mode ("Thinking") consistently provides a +5–15 percentage point increase relative to direct "Chat" mode.
  • Domain coverage: Models perform best in Web Technologies, Software Engineering, and Programming Languages; lowest in Bioinformatics, Theory of Computation, and Graphics.
  • RAG vs SFT: Retrieval-augmented generation (RAG) achieves ~71% but is dependent on external documentation, while SFT achieves greater stability and inference speed.

6. Interpretation, Limitations, and Applications

The CodeSimpleQA-Instruct corpus and framework directly confront the inadequacy of code execution benchmarks for knowledge-intensive QA. Through web-scale curation, cluster-based sampling, and LLM-verification, the dataset supports evaluation and alignment at factual granularity. The integration of RL fine-tuning (GRPO) further bridges the factuality gap left by supervised-only training.

Limitations include restricted test set size, support only for English and Chinese, a focus on short-form QA without multi-step reasoning, and reliance on LLM-based correctness judgments which may introduce bias or obsolescence as APIs evolve. The post-training framework may require periodic corpus refreshes to counteract drift in documentation (Yang et al., 22 Dec 2025).

A plausible implication is that large bilingual instruction-tuning datasets, backed by rigorous filtering and RL-based alignment, constitute a critical foundation for developing code LLMs that meet production requirements for factual reliability—especially in automated software engineering, technical search, and code documentation scenarios.

7. Relationship to Prior and Emerging Work

Unlike prior CodeQA datasets focusing on free-form or span-based code answer generation (Liu et al., 2021), and unlike synthetic instruction datasets emphasizing code synthesis or program repair (e.g., Infinite-Instruct, Semi-Instruct) (Xing et al., 29 May 2025, Luo et al., 1 Mar 2024), CodeSimpleQA-Instruct is uniquely targeted at factual code QA at scale in both English and Chinese, across a broader language and topical set. Evaluation protocols and category distributions are explicitly aligned with general factual QA (SimpleQA, Chinese SimpleQA) but enforce code-specific provenance and verifiability.

This suggests that as code LLMs mature, benchmarks like CodeSimpleQA-Instruct will be essential for robustly quantifying and improving factual knowledge, not just code correctness, significantly influencing both model development and applied deployment in code-intensive domains (Yang et al., 22 Dec 2025).
