CodeSimpleQA: Factual Code QA Framework
- CodeSimpleQA is a comprehensive framework and benchmark designed to assess factual accuracy in code-related QA using 1,498 human-curated examples across 15 programming languages and 21 CS domains.
- It incorporates a large-scale instruction corpus of 66.9 million samples generated via document recall, clustering, and templated QA generation to ensure objective and concise responses.
- The post-training strategy combines supervised fine-tuning and reinforcement learning, using an LLM-as-a-Judge for binary rewards, to significantly enhance factual consistency.
CodeSimpleQA is a comprehensive framework and benchmark for evaluating and improving the factual accuracy of LLMs in code-related question answering. It encompasses a bilingual QA benchmark, a massive instruction corpus, and a post-training methodology that jointly utilizes supervised fine-tuning and reinforcement learning. The system addresses critical gaps in the assessment and enhancement of factual knowledge in code LLMs, distinct from traditional code execution correctness evaluations (Yang et al., 22 Dec 2025).
1. Composition and Structure of CodeSimpleQA Benchmark
CodeSimpleQA defines a benchmark with 1,498 rigorously human-curated question–answer pairs, split between 782 English and 716 Chinese examples. The benchmark is explicitly designed to evaluate the factuality of LLM responses on programming knowledge questions that are objective, time-invariant, and admit a single correct answer. The questions span over 15 programming languages—such as Java (83 QAs), Python (76), C#, C/C++, SQL, PHP, and JavaScript—supplemented by a substantial set (911 QAs) of “General” software engineering questions. The test suite systematically samples 21 major computer science domains, including but not limited to Software Engineering (180), Web Technologies (277), Programming Languages (252), Databases (93), and Cybersecurity (63). Each answer is concise (≤ 64 words).
QA pairs are derived via manual extraction of factual statements from sources including Stack Overflow, GitHub documentation, official language documentation (e.g., Python.org, MDN), and high-frequency platforms in the Chinese developer ecosystem (CSDN, SegmentFault). The curation pipeline rewrites these statements into QAs with eight trained annotators, followed by independent review from three senior engineers. For verification, an LLM-as-a-Judge labels each pair as CORRECT, INCORRECT, or NOT_ATTEMPTED; only QAs that receive a CORRECT label and agreement from at least three reviewers are retained, ensuring high annotation quality (Yang et al., 22 Dec 2025).
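This retention rule can be stated compactly. The sketch below assumes hypothetical record fields (`judge_label`, `reviewer_approvals`), since the paper does not publish a record schema:

```python
# Sketch of the retention rule: keep a candidate QA only if the LLM-as-a-Judge
# label is CORRECT and at least three reviewers agree. Field names are
# illustrative assumptions, not the authors' actual schema.
def retain_qa(candidate: dict) -> bool:
    return (
        candidate.get("judge_label") == "CORRECT"
        and candidate.get("reviewer_approvals", 0) >= 3
    )

candidates = [
    {"question": "Which Python keyword turns a function into a generator?",
     "answer": "yield", "judge_label": "CORRECT", "reviewer_approvals": 3},
    {"question": "...", "answer": "...",
     "judge_label": "NOT_ATTEMPTED", "reviewer_approvals": 2},
]
benchmark = [qa for qa in candidates if retain_qa(qa)]  # keeps only the first pair
```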
2. CodeSimpleQA-Instruct: Large-Scale Instruction Corpus
CodeSimpleQA-Instruct extends the benchmark into a large-scale instruction corpus comprising 66.9 million samples (53.6M English, 13.4M Chinese), enabling robust post-training. Its construction involves four phases: (1) document recall—identifying code-related texts in the Common Crawl using a fastText-based classifier, then scoring sources by domain reliability; (2) knowledge clustering—embedding all documents using a code-text BERT and clustering topics (e.g., API usage, debugging) via DBSCAN; (3) QA generation—uniform sampling of documents per cluster and templated QA generation by Deepseek-V3.1 at low temperature (0.1), under strict prompt constraints for objectivity and brevity; (4) LLM-guided verification, retaining only samples labeled “CORRECT”.
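To make the knowledge-clustering phase concrete, the sketch below uses sentence-transformers and scikit-learn's DBSCAN as stand-ins for the paper's code-text BERT encoder; the model name and the eps/min_samples values are assumptions, not the authors' configuration.

```python
# Illustrative knowledge-clustering step (phase 2): embed code-related documents
# and group them into topical clusters with DBSCAN. The encoder and DBSCAN
# parameters below are placeholders rather than the paper's actual setup.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import DBSCAN

docs = [
    "How to parse JSON in Python with the json module",
    "Fixing a NullPointerException in Java streams",
    "Difference between INNER JOIN and LEFT JOIN in SQL",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # stand-in for the code-text BERT
embeddings = encoder.encode(docs, normalize_embeddings=True)

# Cosine-based density clustering; eps and min_samples would be tuned on real data.
labels = DBSCAN(eps=0.4, min_samples=2, metric="cosine").fit_predict(embeddings)
print(labels)  # -1 marks noise; equal labels mark documents in the same topic cluster
```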
The final corpus is distributed in a consistent format: each record contains an instruction (a programming question with an explicit 64-word response limit) and a concise, ground-truth factual answer (Yang et al., 22 Dec 2025).
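Concretely, a record might look like the following Python sketch; only the overall structure (constrained instruction plus concise factual answer) follows the description above, and the field names are illustrative assumptions.

```python
# Hypothetical shape of a single CodeSimpleQA-Instruct record. The field names
# are illustrative; the structure mirrors the paper's description.
record = {
    "instruction": (
        "Which HTTP status code indicates that the requested resource was not "
        "found? Answer factually in at most 64 words."
    ),
    "answer": "404 (Not Found).",
    "language": "en",
}
```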
3. Post-Training Framework: Supervised Fine-Tuning and Reinforcement Learning
The CodeSimpleQA post-training regime operates in two stages:
- Supervised Fine-Tuning (SFT): The baseline model, Qwen2.5-Coder-32B-Instruct, is fine-tuned on the 66.9M CodeSimpleQA-Instruct corpus. Hyperparameters: Adam optimizer, cosine learning-rate decay with 100 warmup steps, global batch size 1,024, maximum sequence length 8,192, tensor parallelism of size 4, and 32 NVIDIA H20 GPUs.
- Reinforcement Learning (GRPO): Group Relative Policy Optimization, a PPO variant, uses an LLM-as-a-Judge as the reward signal. The reward is binary: $r(\hat{a}, a^{*}) = 1$ if the judge deems the model output $\hat{a}$ consistent with the reference answer $a^{*}$, and $0$ otherwise. Advantages are normalized within each group of sampled trajectories, and policy updates use a clipped objective (a minimal sketch of this computation follows the list). RL training uses a constant learning rate, a batch size of 1,024 queries, 8 trajectories per group, and 64 GPUs under FSDP with a vLLM inference backend. The post-training objective combines SFT and RL updates to optimize accuracy and factual consistency (Yang et al., 22 Dec 2025).
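The following sketch shows how a binary judge reward and GRPO-style group-normalized advantages can be computed; the string-matching judge is a deliberately trivial stand-in for the LLM-as-a-Judge, and the rest is an illustration rather than the authors' implementation.

```python
import statistics

def judge_is_correct(model_answer: str, reference: str) -> bool:
    """Trivial stand-in for the LLM-as-a-Judge call."""
    return reference.lower() in model_answer.lower()

def group_advantages(model_answers: list[str], reference: str) -> list[float]:
    """Binary rewards (1 if judged correct, else 0), normalized within the
    group of trajectories sampled for the same query, in the style of GRPO."""
    rewards = [1.0 if judge_is_correct(a, reference) else 0.0 for a in model_answers]
    mean, std = statistics.mean(rewards), statistics.pstdev(rewards)
    return [(r - mean) / (std + 1e-6) for r in rewards]

# Example: 8 trajectories sampled for one query, 3 of them judged correct.
answers = ["404 Not Found"] * 3 + ["500 Internal Server Error"] * 5
print(group_advantages(answers, "404"))
```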
4. Evaluation Metrics, Baselines, and Quantitative Results
CodeSimpleQA introduces custom factuality metrics:
- Correct (CO): Generated answer matches reference with no contradiction.
- Not Attempted (NA): Incomplete answer without contradiction.
- Incorrect (IN): Contains contradictions to reference.
- Correct Given Attempted (CGA): $\mathrm{CGA} = \mathrm{CO} / (\mathrm{CO} + \mathrm{IN})$, the fraction of attempted answers that are correct.
- F-score: Harmonic mean of CO and CGA, $F = 2 \cdot \mathrm{CO} \cdot \mathrm{CGA} / (\mathrm{CO} + \mathrm{CGA})$.
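These definitions translate directly into a few lines of Python; the label strings mirror the judge labels described in Section 1, and the computation is a straightforward transcription of the formulas above.

```python
from collections import Counter

def factuality_metrics(labels: list[str]) -> dict[str, float]:
    """Compute CO, NA, IN rates, CGA = CO / (CO + IN), and the F-score
    (harmonic mean of CO and CGA) from per-question judge labels."""
    counts = Counter(labels)
    n = len(labels)
    co, na, inc = counts["CORRECT"] / n, counts["NOT_ATTEMPTED"] / n, counts["INCORRECT"] / n
    cga = co / (co + inc) if (co + inc) > 0 else 0.0
    f = 2 * co * cga / (co + cga) if (co + cga) > 0 else 0.0
    return {"CO": co, "NA": na, "IN": inc, "CGA": cga, "F": f}

print(factuality_metrics(["CORRECT", "INCORRECT", "NOT_ATTEMPTED", "CORRECT"]))
# CO=0.50, NA=0.25, IN=0.25, CGA=0.667, F=0.571
```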
Performance comparison encompasses both proprietary and open-source LLMs. GPT-5 leads with 67.2% (Chinese F1) and 62.9% (English F1). The best open-source baseline, GLM-4.5, scores 50.9% (Zh) and 45.0% (En); Qwen2.5-Coder-32B-Instruct reports a 40% F1 in both languages. Post-training with SFT alone yields a marginal gain (+2.3pp Zh, -0.8pp En), while SFT+RL increases performance to 45.2% Zh and 42.3% En (+5.2pp and +2.3pp, respectively) (Yang et al., 22 Dec 2025).
Additional analyses demonstrate that retrieval-augmented generation (RAG) substantially outperforms SFT (pre-2024/post-2024 documentation: 71.0%/68.0% vs. 42.0%/24.0%) but requires an external knowledge base. Fact recall and factual consistency improve markedly in domains with robust documentation, while highly specialized areas (e.g., Bioinformatics, Theory of Computation) remain challenging (<20% F1). Model scale and “thinking” mode both correlate strongly with performance (correlation magnitude up to 0.96).
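As a rough illustration of the RAG setting compared against SFT above, the sketch below retrieves the most similar document from a small in-memory store and prepends it to the question; the encoder, document store, and prompt format are all assumptions, not the paper's pipeline.

```python
# Minimal retrieval-augmented answering sketch over an in-memory document store.
# The encoder and prompt template are illustrative stand-ins.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
docs = [
    "Python 3.12 removed the distutils module from the standard library.",
    "HTTP status 404 means the requested resource was not found.",
]
doc_vecs = encoder.encode(docs, normalize_embeddings=True)

def build_prompt(question: str, k: int = 1) -> str:
    q_vec = encoder.encode([question], normalize_embeddings=True)[0]
    top = np.argsort(doc_vecs @ q_vec)[::-1][:k]          # cosine-similarity ranking
    context = "\n".join(docs[i] for i in top)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer in at most 64 words."

print(build_prompt("Which Python version dropped distutils?"))
```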
5. Comparative Context: CodeSimpleQA versus Code Search QA Systems
While CodeSimpleQA targets factual QA over code-related topics, earlier research such as CoSQA (Huang et al., 2021) addresses semantic matching between natural language queries and code snippets for code retrieval and simple QA. CoSQA consists of 20,604 Python query–code pairs annotated by three Python-experienced raters (average Krippendorff’s α of 0.63). Annotation proceeds in two steps (intent filtering, then answer labeling), with positive/negative labels grounded in whether a function implements the query. The code side is drawn primarily from CodeSearchNet (function plus docstring).
Contrastive learning via CoCLR integrates in-batch negative sampling and simple query rewrites (delete/swap/copy words) as data augmentation. Fine-tuning CodeBERT on CoSQA yields a 5.1pp QA accuracy gain; CoCLR-style contrastive training increases accuracy by an additional 10.5pp (CodeXGLUE Python WebQueryTest: 63.4% accuracy, 64.7% MRR, new SOTA). Analysis reveals documentation in code (docstring, signature) is critical to retrieval effectiveness (>6pp MRR drop without docstring).
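The in-batch negative sampling and query-rewriting augmentation can be pictured with a short PyTorch sketch in the spirit of CoCLR; the encoders are omitted, the InfoNCE-style loss and the augmentation operations are simplified stand-ins, and the temperature value is an assumption.

```python
# Illustrative in-batch contrastive objective: the i-th code snippet is the
# positive for the i-th query, and all other snippets in the batch are negatives.
import random
import torch
import torch.nn.functional as F

def rewrite_query(query: str) -> str:
    """Simple word-level augmentation (delete / swap / copy), as used in CoCLR."""
    words = query.split()
    if len(words) < 2:
        return query
    op = random.choice(["delete", "swap", "copy"])
    i = random.randrange(len(words) - 1)
    if op == "delete":
        del words[i]
    elif op == "swap":
        words[i], words[i + 1] = words[i + 1], words[i]
    else:  # copy
        words.insert(i, words[i])
    return " ".join(words)

def in_batch_contrastive_loss(query_emb: torch.Tensor, code_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE over a batch of (query, code) embedding pairs."""
    sims = F.normalize(query_emb, dim=-1) @ F.normalize(code_emb, dim=-1).T
    targets = torch.arange(sims.size(0))
    return F.cross_entropy(sims / temperature, targets)

loss = in_batch_contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
```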
A plausible implication is that factual QA (CodeSimpleQA) and semantic code retrieval QA (CoSQA, CoCLR) require distinct evaluation targets and model adaptations. CodeSimpleQA’s factual evaluation and massive instruction corpus provide critical coverage and alignment mechanisms for LLM code models, while CoSQA and related contrastive pretraining frameworks remain effective for high-precision code retrieval (Huang et al., 2021, Yang et al., 22 Dec 2025).
6. Limitations and Future Directions
CodeSimpleQA’s benchmark covers only English and Chinese, restricting its cross-lingual generality. The test set, while more extensive than its predecessors, remains limited at 1,498 QAs and may miss long-tail or edge-case factuality failures. The benchmark focuses on short-form, single-answer factual QA rather than multi-turn interaction or generative code synthesis. The use of LLM-as-a-Judge verification introduces a dependency on the implicit biases and knowledge scope of current models, and documentation and ground-truth answers may become outdated, especially in fast-evolving technical domains (Yang et al., 22 Dec 2025).
Proposed future directions include expanding to more languages and computer science subfields, integrating dynamic/continually updated benchmarks, curating multi-hop and code-generation tasks, involving human-in-the-loop judgment, and fusing retrieval-augmented and parametric memory LLM systems for improved factuality. This suggests a hybrid direction for factual code QA evaluation and model alignment, leveraging both scalable instruction synthesis and retrieval-augmented methodologies.
References:
- "CoSQA: 20,000+ Web Queries for Code Search and Question Answering" (Huang et al., 2021)
- "CodeSimpleQA: Scaling Factuality in Code LLMs" (Yang et al., 22 Dec 2025)