SWE-QA Benchmark for Code Repository QA

Updated 31 March 2026

The SWE-QA benchmark is a suite of large-scale, repository-level code QA tasks that assess language models' ability to perform multi-file reasoning and semantic comprehension.
It comprises 576 curated QA pairs from diverse open-source Python repositories, built with a systematic taxonomy and parameterized instantiation for robust evaluation.
It evaluates models in both direct and agentic workflows, using metrics such as exact match, token F1, and LLM-as-Judge ratings.

The SWE-QA benchmark is a suite of large-scale, repository-level code question answering (QA) tasks designed to rigorously evaluate the ability of language models (LMs) and agentic frameworks to understand whole software repositories. Unlike prior benchmarks focused on single-function, snippet-based tasks, SWE-QA targets realistic questions requiring navigation, semantic comprehension, and cross-file reasoning over complex, open-source codebases. SWE-QA, together with its extension SWE-QA-Pro, forms the foundation for research in agentic repository-level QA, including robust dataset construction, novel evaluation protocols, and scalable training strategies for both closed and open-source LMs (Peng et al., 18 Sep 2025, Cai et al., 17 Mar 2026).

1. Motivation, Scope, and Benchmark Positioning

Repository-level software engineering questions often entail non-local reasoning, architectural synthesis, and identification of control/data flows that span multiple files. Existing QA datasets—such as CoSQA, CodeQA, and related benchmarks—predominantly address constrained, self-contained code snippets or singular API calls, omitting the complexities encountered in authentic codebase maintenance, debugging, and evolution. SWE-QA was developed to fill this critical gap by evaluating whether LMs can synthesize answers that require understanding interactions among disparate code elements, traversing implicit dependencies, and discerning design intent rooted in the distributed structure of modern repositories (Peng et al., 18 Sep 2025).

2. Dataset Construction: SWE-QA and SWE-QA-Pro

SWE-QA

SWE-QA contains 576 high-quality question–answer pairs curated from 12 widely used, open-source Python repositories (e.g., Astropy, Django, Flask, pytest, sqlfluff, sympy). The construction pipeline comprises four key stages:

Crawling and Question Extraction: 77,100 GitHub issues from 11 SWE-bench repositories were filtered to 41,955 issues (≥1,000 chars), and 127,415 explicit developer questions were extracted using an LLM-based parser.
Taxonomy Development and Template Design: A two-level taxonomy emerged from open-coding 1,000 sampled questions. Level-1 is the interrogative (What, Why, Where, How); Level-2 comprises 12 fine-grained intentions (architecture exploration, dependency tracing, algorithm implementation, etc.). Seed templates (e.g., “How does <Module> implement <Feature>?”) were developed per Level-2 category.
Parameterized Instantiation: Repository graphs (using tree-sitter) indexed files, classes, functions, imports, and call edges. For each repository, 48 context-specific questions were synthesized and manually verified for clarity and non-compound construction.
Answer Collection and Validation: Retrieval-augmented semantic search indexed code elements and documentation. GPT-5 produced candidate answers, which were fact-checked, clarified, and filtered by expert developers until only high-quality pairs remained.
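
The instantiation stage can be illustrated with a minimal sketch. It is not the authors' pipeline: it swaps tree-sitter for Python's standard-library ast module, uses a hypothetical two-file in-memory repository, and only the first template wording comes from the example above (the second template and all function names are assumptions made for illustration).

```python
import ast

# Hypothetical two-file "repository", held in memory so the sketch is self-contained.
REPO = {
    "app/cache.py": (
        "import time\n"
        "\n"
        "class LRUCache:\n"
        "    def get(self, key):\n"
        "        return None\n"
    ),
    "app/views.py": (
        "from app.cache import LRUCache\n"
        "\n"
        "def render(request):\n"
        "    return LRUCache().get(request)\n"
    ),
}


def index_repository(files):
    """Collect classes, functions, and imports per file (stand-in for a tree-sitter graph)."""
    index = {}
    for path, source in files.items():
        entry = {"classes": [], "functions": [], "imports": []}
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.ClassDef):
                entry["classes"].append(node.name)
            elif isinstance(node, ast.FunctionDef):
                entry["functions"].append(node.name)
            elif isinstance(node, ast.Import):
                entry["imports"].extend(alias.name for alias in node.names)
            elif isinstance(node, ast.ImportFrom):
                # node.module is None for "from . import x"
                entry["imports"].append(node.module or ".")
        index[path] = entry
    return index


# (Level-1, Level-2, template). The second template is an illustrative assumption.
TEMPLATES = [
    ("How", "algorithm implementation", "How does {module} implement {feature}?"),
    ("What", "dependency tracing", "What does {module} depend on from {dependency}?"),
]


def instantiate(index):
    """Fill each seed template with concrete names taken from the repository index."""
    questions = []
    for path in sorted(index):
        entry = index[path]
        for func in entry["functions"]:
            level1, level2, template = TEMPLATES[0]
            text = template.format(module=path, feature=func)
            questions.append({"level1": level1, "level2": level2, "question": text})
        for dep in entry["imports"]:
            level1, level2, template = TEMPLATES[1]
            text = template.format(module=path, dependency=dep)
            questions.append({"level1": level1, "level2": level2, "question": text})
    return questions


index = index_repository(REPO)
questions = instantiate(index)
for q in questions:
    print(q["level1"], "|", q["level2"], "|", q["question"])
```

In the benchmark itself, generated candidates are additionally verified by hand; the sketch covers only the mechanical template-filling step.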

Key Dataset Statistics:

Aspect	SWE-QA Value
Repositories	12
Files	11,335
Classes	19,544
Functions	128,835
Lines of code	3,093,338
Questions	576
Avg. Q tokens	14.4
Avg. Answer tokens	142.9

SWE-QA-Pro

SWE-QA-Pro targets reasoning beyond canonical, memorized repositories, stress-testing LMs on diverse, under-studied Python projects with executable environments. Dataset construction drew on 1.6 million issues from 3,468 repositories (excluding the “top-100”) with executable sandboxes. Issue-driven clustering using k-means over Qwen3-Embedding-8B embeddings ensured balanced topical coverage across 48 distinct semantic clusters, and a difficulty calibration process filtered out questions that can be answered without agentic codebase exploration. Final dataset statistics:

Aspect	SWE-QA-Pro Value
Repositories	26 (long-tail)
Questions	260
Question balance	Where 77, How 67, What 51, Why 65
QA clusters	48 semantic
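
The two selection ideas described for SWE-QA-Pro, difficulty calibration and cluster-balanced coverage, can be sketched as follows. This is an illustration under stated assumptions rather than the published procedure: cluster ids are taken as given (the k-means step is not reproduced), and the field names, the score threshold, and the round-robin rule are all hypothetical.

```python
from collections import defaultdict
from itertools import zip_longest


def calibrate_and_balance(candidates, max_direct_score, budget):
    """Drop questions a model already answers well without tools, then
    pick round-robin across clusters so that no single topic dominates."""
    by_cluster = defaultdict(list)
    for cand in candidates:
        # Difficulty calibration: keep only questions with a low direct-answer score.
        if cand["direct_score"] <= max_direct_score:
            by_cluster[cand["cluster"]].append(cand)

    # Inside each cluster, prefer the hardest questions first.
    for items in by_cluster.values():
        items.sort(key=lambda c: c["direct_score"])

    selected = []
    ordered = [by_cluster[key] for key in sorted(by_cluster)]
    for one_round in zip_longest(*ordered):
        for cand in one_round:
            if cand is not None and len(selected) < budget:
                selected.append(cand)
    return selected


# Hypothetical candidates: cluster ids and direct-answer scores are made up.
candidates = [
    {"id": "q1", "cluster": 0, "direct_score": 12.0},
    {"id": "q2", "cluster": 0, "direct_score": 45.0},  # too easy, filtered out
    {"id": "q3", "cluster": 0, "direct_score": 20.0},
    {"id": "q4", "cluster": 1, "direct_score": 18.0},
    {"id": "q5", "cluster": 2, "direct_score": 30.0},
]

selected = calibrate_and_balance(candidates, max_direct_score=35.0, budget=3)
print([c["id"] for c in selected])
```

With a budget of three, the sketch takes one question from each of the three clusters before returning to any cluster for a second pick.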

3. Formal Task Definitions and Evaluation Protocols

Taxonomy and Annotation

Questions in both SWE-QA and SWE-QA-Pro are annotated according to a two-level taxonomy:

$T: Q \rightarrow \{\text{What, Why, Where, How}\} \times \{\text{Intent}_1, \ldots, \text{Intent}_3\}$

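where each “Intent” is a fine-grained question category beneath its top-level interrogative. With three intents under each of the four interrogatives, the mapping is consistent with the 12 Level-2 categories described in the dataset construction section.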

Metrics

Exact Match (EM):

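$\mathrm{EM}(q, m) = \mathbb{1}\left[\hat{A}_m(q) \equiv A^*(q)\right]$, where $\hat{A}_m(q)$ is model $m$'s answer to question $q$, $A^*(q)$ is the reference answer, and the indicator is 1 when the two match and 0 otherwise.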

Token F1 Score:

$\text{F1}(q,m) = 2\cdot\mathrm{Precision}\cdot\mathrm{Recall}/(\mathrm{Precision}+\mathrm{Recall})$

with precision and recall defined in terms of token overlap between model output and reference.
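
A minimal sketch of both metrics follows. The normalization (lowercasing and whitespace tokenization) is an assumption, since the text does not specify the tokenizer; the overlap is counted as a multiset intersection, which is the usual convention for token F1 in QA evaluation.

```python
from collections import Counter


def normalize(text):
    """Assumed normalization: lowercase and split on whitespace."""
    return text.lower().split()


def exact_match(prediction, reference):
    """Return 1 if the normalized answers are identical, else 0."""
    return int(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


prediction = "the cache is cleared in reset"
reference = "reset clears the cache"
print(exact_match(prediction, reference))
print(round(token_f1(prediction, reference), 3))
```

In this example three of the six predicted tokens appear in the four-token reference, so precision is 0.5, recall is 0.75, and F1 is 0.6, while exact match is 0. Long free-form answers rarely match exactly, which is one reason a rubric-based judge is used alongside these metrics.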

LLM-as-Judge rubric:

Correctness, completeness, relevance, clarity, and reasoning (1–5 scale in SWE-QA, 1–10 in SWE-QA-Pro); aggregated to composite scores.
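
The text does not state how the five rubric dimensions are combined, so the sketch below assumes an unweighted sum per answer and a mean over questions. The class and function names are hypothetical, and the scale bounds are passed in so the same code covers either scoring range.

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class JudgeScores:
    """One judge rating per rubric dimension."""
    correctness: int
    completeness: int
    relevance: int
    clarity: int
    reasoning: int


def composite(scores, low=1, high=10):
    """Assumed aggregation: validate the range, then sum the five dimensions."""
    values = astuple(scores)
    for value in values:
        if not low <= value <= high:
            raise ValueError(f"score {value} outside [{low}, {high}]")
    return sum(values)


def benchmark_score(per_question, low=1, high=10):
    """Mean composite score over all judged answers."""
    totals = [composite(s, low, high) for s in per_question]
    return sum(totals) / len(totals)


# Hypothetical judge outputs for two answers on a 1-10 scale.
ratings = [JudgeScores(9, 7, 10, 8, 8), JudgeScores(6, 5, 8, 7, 6)]
print([composite(r) for r in ratings])
print(benchmark_score(ratings))
```

Under this assumption a 1–10 scale yields a composite out of 50 per answer.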

Performance gap Δ:

For SWE-QA-Pro, the agentic versus direct answering delta:

$\Delta = \mathrm{Acc}_{\mathrm{agentic}} - \mathrm{Acc}_{\mathrm{direct}}$

Human studies confirm that LLM-as-Judge assessments are aligned with expert developer ratings.

4. Evaluation Paradigms and Baseline Model Performance

SWE-QA and SWE-QA-Pro explicitly distinguish between direct and agentic evaluation setups:

Direct: Models receive only the question and/or relevant context, no code tools, and provide a single-turn answer.
Retrieval-Augmented Generation (RAG): Variants include function-chunking or sliding window, where code elements are retrieved via semantic search.
Agentic ReAct-Style Workflow (“SWE-QA-Agent”, “SWE-QA-Pro Agent”): LMs interact with the codebase via a discrete action space (e.g., GetRepoStructure, ReadFile, SearchContent, ExecuteCommand, etc.) over multiple turns, building context iteratively.
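
The agentic setup can be pictured as a loop in which a policy picks an action, a tool returns an observation, and the observation is appended to the context for the next turn. The sketch below is an assumption-laden illustration, not the SWE-QA-Agent implementation: the repository is a hypothetical in-memory dict, the language model is replaced by a scripted policy, only three of the named actions are implemented, and ExecuteCommand is deliberately left out.

```python
# Hypothetical in-memory repository used in place of a real checkout.
REPO = {
    "pkg/__init__.py": "",
    "pkg/config.py": "DEFAULT_TIMEOUT = 30\n",
    "pkg/client.py": (
        "from pkg.config import DEFAULT_TIMEOUT\n"
        "\n"
        "def connect(timeout=DEFAULT_TIMEOUT):\n"
        "    return timeout\n"
    ),
}


def get_repo_structure(repo, _arg=None):
    return "\n".join(sorted(repo))


def read_file(repo, path):
    return repo.get(path, f"error: no such file {path}")


def search_content(repo, pattern):
    hits = [
        f"{path}:{lineno}: {line}"
        for path in sorted(repo)
        for lineno, line in enumerate(repo[path].splitlines(), start=1)
        if pattern in line
    ]
    return "\n".join(hits) or "no matches"


# Action names follow the text; the implementations are illustrative.
TOOLS = {
    "GetRepoStructure": get_repo_structure,
    "ReadFile": read_file,
    "SearchContent": search_content,
}


def scripted_policy(question, history):
    """Stand-in for the LM: returns (action, argument), or ("Finish", answer)."""
    if not history:
        return ("SearchContent", "DEFAULT_TIMEOUT")
    if len(history) == 1:
        return ("ReadFile", "pkg/config.py")
    last_observation = history[-1][2]
    return ("Finish", f"Defined in pkg/config.py: {last_observation.strip()}")


def run_agent(question, repo, policy, max_turns=5):
    """Alternate between policy decisions and tool observations until Finish."""
    history = []
    for _ in range(max_turns):
        action, arg = policy(question, history)
        if action == "Finish":
            return arg, history
        observation = TOOLS[action](repo, arg)
        history.append((action, arg, observation))
    return "no answer within turn budget", history


answer, history = run_agent("Where is the default timeout defined?", REPO, scripted_policy)
print(answer)
print([step[0] for step in history])
```

A direct baseline corresponds to calling the model once with no tool turns at all, which is why locational questions like this one are hard to answer without exploration.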

Performance Highlights:

Model	Direct Score	Agentic Score	Δ (agentic − direct)
Claude Sonnet 4.5	27.69	40.67	~13 (SWE-QA-Pro)
GPT-4o	26.58	33.08	~6.5 (SWE-QA-Pro)
Qwen3-8B (SFT+RL)	–	35.39	–
Qwen2.5-Coder-32B	12.8–41.41	–	–
Claude 3.7 Sonnet	38.18	47.82	+9.64 (SWE-QA)

Agentic workflows consistently outperform direct answer baselines; Δ scores in SWE-QA-Pro confirm that repository exploration—rather than API recall or shallow pattern matching—is required for most tasks (Peng et al., 18 Sep 2025, Cai et al., 17 Mar 2026).
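
The gaps in the table can be recomputed directly from the reported direct and agentic scores, using the Δ definition from the metrics section. The snippet only restates numbers already given above; rows without both scores are omitted.

```python
# (model, direct score, agentic score) as reported in the table above.
RESULTS = [
    ("Claude Sonnet 4.5", 27.69, 40.67),
    ("GPT-4o", 26.58, 33.08),
    ("Claude 3.7 Sonnet", 38.18, 47.82),
]


def agentic_gain(direct, agentic):
    """Delta = agentic score minus direct score, rounded to two decimals."""
    return round(agentic - direct, 2)


gains = {name: agentic_gain(direct, agentic) for name, direct, agentic in RESULTS}
for name, gain in gains.items():
    print(f"{name}: +{gain:.2f}")
```

The computed gaps are 12.98, 6.50, and 9.64 points, matching the rounded values in the table.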

5. Agentic Training Recipes and Model Adaptation

To address the scarcity and difficulty of repository-level QA data, SWE-QA-Pro introduces a scalable synthetic data pipeline with a two-stage training regimen:

Supervised Fine-Tuning (SFT): Models are taught explicit tool-use sequences (e.g., Search("..."), ViewCodebase) and grounding strategies using cross-entropy loss over 1,000 synthetic conversational trajectories.
Reinforcement Learning from AI Feedback (RLAIF): Final answers are scored via a learned reward model (vector $s \in [1, 10]^5$), aggregated to a scalar via a weighted sum, then optimized with PPO-style or GRPO policy gradient updates. The RL phase rewards correctness (+2.5 points) and completeness, and notably leads to more judicious tool use, with fewer and more targeted codebase interactions.
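
The reward path in the RL stage can be sketched in two steps: collapse the five-dimensional score vector into a scalar with a weighted sum, then turn the scalars for a group of sampled answers into group-relative advantages as GRPO does. The weights are not given in the text, so uniform weights are assumed here; the mapping of the five entries to the rubric dimensions, the use of the population standard deviation, and the epsilon term are likewise assumptions.

```python
from statistics import mean, pstdev

# Assumed to correspond to the five rubric dimensions; uniform weights are hypothetical.
DIMENSIONS = ("correctness", "completeness", "relevance", "clarity", "reasoning")
WEIGHTS = (0.2, 0.2, 0.2, 0.2, 0.2)


def scalar_reward(scores, weights=WEIGHTS):
    """Weighted sum of a score vector s in [1, 10]^5."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    if any(not 1 <= s <= 10 for s in scores):
        raise ValueError("each score must lie in [1, 10]")
    return sum(w * s for w, s in zip(weights, scores))


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: center and scale rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward-model outputs for three sampled answers to one question.
group = [(9, 8, 9, 8, 9), (5, 4, 6, 7, 5), (7, 7, 7, 7, 7)]
rewards = [scalar_reward(s) for s in group]
advantages = group_relative_advantages(rewards)
print([round(r, 2) for r in rewards])
print([round(a, 3) for a in advantages])
```

Answers scoring above the group mean receive a positive advantage and are reinforced, while those below it are discouraged, so no separate value network is needed for the baseline.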

Post-SFT+RLAIF, open models such as Qwen3-8B approach or surpass closed LMs such as GPT-4o in agentic SWE-QA-Pro performance, indicating that agentic training workflows are essential for high-level repository understanding (Cai et al., 17 Mar 2026).

6. Representative Insights, Open Challenges, and Future Directions

Category and Repository-Level Analysis:

“What” and “Why” questions score ~5–6 points higher than “Where” and “How” due to persistent challenges in procedural and locational reasoning.
Tasks focused on “Concept/Definition” or “Design Rationale” yield the best performance; “Data/Control-flow” and “Feature Location” remain most challenging.
Smaller models (e.g., Qwen2.5-Coder-32B) show brittle generalization across repositories; state-of-the-art proprietary LMs (Claude Sonnet series) offer robust, consistent performance.

Identified Challenges:

Multi-hop and multi-file reasoning remains a failure mode, especially where documentation is absent or control flow is implicit.
Data contamination and training/test leakage persist as a threat; evaluation on long-tail repositories in SWE-QA-Pro partly mitigates memorization effects.
Benchmark evolution and re-instantiation strategies are required to track codebase and API churn.
Current benchmarks focus exclusively on Python; extending QA tasks to other languages and domain-specific systems is a priority.
Human and LLM-rubric evaluation of open-ended, multi-step reasoning tasks retains inherent subjectivity; avenues for integrating dynamic code execution or formal verification are under active exploration (Peng et al., 18 Sep 2025, Cai et al., 17 Mar 2026).

7. Relevance to Broader Software Engineering Benchmarks

SWE-QA and SWE-QA-Pro complement and extend prior code understanding benchmarks (e.g., SWE-bench Verified, CoSQA, CodeQueries) by targeting long-range, repository-wide reasoning rather than atomic patch generation or file-level classification. The design philosophy underlying SWE-QA—emphasizing complex, authentic developer information needs as revealed by real-world issues—has influenced downstream benchmarks and training recipes. For example, Kimi-Dev employs skill priors derived from agentless training on tasks related to localization, code edit, and self-reflection, and efficiently adapts to agentic frameworks via SFT on multi-turn trajectories, demonstrating performance on par with leading proprietary models (Yang et al., 27 Sep 2025).

In sum, SWE-QA and its variants establish a rigorous, extensible standard for evaluating and advancing LMs' and agentic frameworks' ability to serve as intelligent assistants for real-world repository-level code comprehension and automation tasks.

References (3)

SWE-QA: Can Language Models Answer Repository-level Code Questions? (2025)

SWE-QA-Pro: A Representative Benchmark and Scalable Training Recipe for Repository-Level Code Understanding (2026)

Kimi-Dev: Agentless Training as Skill Prior for SWE-Agents (2025)
