INFCODE-C++: Autonomous C++ Issue Resolution
- INFCODE-C++ is an autonomous, C++-aware multi-agent framework that resolves complex issues in large repositories by combining semantic retrieval and AST queries.
- It leverages a dual-retrieval architecture with intent-guided semantic matching and structured AST querying to effectively localize bugs in intricate C++ codebases.
- Evaluation on the MultiSWE-bench-CPP benchmark shows state-of-the-art performance: a 25.58% resolution rate, 10.85 percentage points above the best prior system and more than double that of MSWE-agent.
INFCODE-C++ is an autonomous, C++-aware multi-agent framework designed for end-to-end issue resolution in large, statically typed C++ repositories. To address the limitations of lexical retrieval and shallow code navigation, both commonplace in Python-oriented LLM agents, INFCODE-C++ integrates intent-guided semantic retrieval with deterministic abstract syntax tree (AST) queries. This dual-retrieval architecture targets the semantic and structural complexity characteristic of C++ projects, such as overloaded identifiers, nested namespaces, and deep template instantiations, which significantly hinder traditional LLM-driven agents. Evaluated on the MultiSWE-bench-CPP benchmark, INFCODE-C++ achieves state-of-the-art results, resolving 25.58% of issues: 10.85 percentage points above the best prior system and more than double the rate of MSWE-agent (Dong et al., 20 Nov 2025).
1. System Architecture and Workflow
INFCODE-C++ is structured as a multi-agent pipeline with four principal stages:
- Repository Parsing: Full-project code artifacts are indexed and embedded.
- Issue Reproduction (Reproducer Agent): Given a natural-language issue description, the agent synthesizes a reproducible test $t_D$.
- Patch Generation (Patch Agent): Employs semantic intent retrieval and AST-structured search to localize defects and synthesize candidate patches $\{p_1, \dots, p_n\}$.
- Patch Selection (Selector Agent): Prunes, behaviorally tests, and votes on the candidate patches to select the final patch $p_{\text{final}}$.
Textual Collaboration Flow:
```
[Reproducer Agent] → generates t_D →
[Patch Agent] → {QueryCodeIntent, FindClass/FindFunction/GetInheritanceChain…} → {p_1…p_n} →
[Selector Agent] → Prune → Behavioral Test → Vote → p_final
```
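The Selector Agent's prune, behavioral-test, and vote sequence can be sketched in a few lines. This is only an illustration under stated assumptions: the source does not say how pruning or voting is implemented, so the compile check, the exact-match vote, and the names `select_patch`, `compiles`, and `passes_repro_test` are all hypothetical.

```python
from collections import Counter
from typing import Callable, List, Optional


def select_patch(
    patches: List[str],
    compiles: Callable[[str], bool],
    passes_repro_test: Callable[[str], bool],
) -> Optional[str]:
    """Prune -> behavioral test -> vote; returns the chosen patch or None."""
    # Prune (assumed here to be a cheap compile check).
    pruned = [p for p in patches if compiles(p)]
    # Behavioral test: keep patches that pass the reproduction test t_D.
    survivors = [p.strip() for p in pruned if passes_repro_test(p)]
    if not survivors:
        return None
    # Vote (assumed: identical patches after whitespace stripping count
    # as votes for the same candidate).
    votes = Counter(survivors)
    return max(votes, key=votes.get)
```

A real selector would run the build and the test in a sandbox; the callables stand in for those steps so the control flow stays visible.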
2. Semantic Code-Intent Retrieval
Intent Representation and Embedding:
All files, classes, and functions are embedded into a dense vector space using a pretrained C++-specific encoder during repository parsing. For a given issue description, a query is derived and embedded into the same space. An intent index stores these code artifacts alongside their embeddings.
Similarity Scoring and Retrieval:
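Artifacts are ranked by cosine similarity. Writing $\mathbf{e}_q$ for the query embedding and $\mathbf{e}_a$ for the embedding of artifact $a$:

$$\mathrm{sim}(q, a) = \frac{\mathbf{e}_q \cdot \mathbf{e}_a}{\lVert \mathbf{e}_q \rVert \, \lVert \mathbf{e}_a \rVert}$$

An approximate nearest-neighbor index (e.g., FAISS) enables retrieval of the top-$k$ relevant artifacts. By design, $k$ is much smaller than the number of indexed artifacts, sharply narrowing the context for downstream processing.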
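As a minimal sketch of this ranking step, the following code scores a toy index by cosine similarity and keeps the top-k entries. It uses an exhaustive scan in place of an approximate nearest-neighbor index, and the three-dimensional vectors and artifact names are made up for illustration; a real system would obtain embeddings from the encoder.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(
    query: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Rank every artifact against the query and keep the k best."""
    scored = [(name, cosine(query, vec)) for name, vec in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical artifact embeddings, for illustration only.
toy_index = {
    "ui::update": [1.0, 0.0, 0.0],
    "db::insert": [0.0, 1.0, 0.0],
    "ui::render": [0.9, 0.1, 0.0],
}
hits = top_k([1.0, 0.0, 0.0], toy_index, k=2)
```

The exhaustive scan is linear in the number of artifacts, which is why large repositories would rely on an approximate index instead.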
3. AST-Structured Querying
Global C++ AST Representation:
The codebase is parsed into a graph $G = (V, E)$, where $V$ denotes syntactic constructs (e.g., NamespaceDecl, ClassDecl, TemplateInstantiation), and $E$ captures relations such as containment, inheritance, overload sets, and function calls.
Deterministic Querying:
Within the subset of artifacts returned by semantic retrieval, the agent performs graph traversals and pattern matches to precisely locate defects, issuing queries such as:
- FindClass("Search")
- FindFunction({scope:"UI", name:"update", params:["int"]})
- GetInheritanceChain("DerivedClass")
- GetFunctionCalls("Database","insert")
Pseudocode Example:
```
// Find all overloads of foo in namespace Bar
nodes = AST.FindFunction(
    scope = "Bar",
    name = "foo",
    allowOverloads = true
)
for each n in nodes:
    print(n.sourceLocation, n.signature)
```
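To make the query semantics concrete, here is a toy in-memory index that answers two of the queries listed above. It is a sketch only: the class `ToyAst`, its method names, and the single-base inheritance map are assumptions made for illustration (C++ permits multiple inheritance), and a real implementation would populate the index from a compiler front end rather than by hand.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class FunctionDecl:
    scope: str
    name: str
    params: Tuple[str, ...]
    location: str


@dataclass
class ToyAst:
    functions: List[FunctionDecl] = field(default_factory=list)
    # class name -> its base class (single inheritance assumed)
    bases: Dict[str, str] = field(default_factory=dict)

    def find_function(
        self, scope: str, name: str, params: Optional[Sequence[str]] = None
    ) -> List[FunctionDecl]:
        """Return every overload, or only the one matching `params`."""
        return [
            f for f in self.functions
            if f.scope == scope and f.name == name
            and (params is None or f.params == tuple(params))
        ]

    def get_inheritance_chain(self, cls: str) -> List[str]:
        """Walk base-class links from `cls` upward, guarding against cycles."""
        chain: List[str] = []
        current: Optional[str] = cls
        while current is not None and current not in chain:
            chain.append(current)
            current = self.bases.get(current)
        return chain


ast = ToyAst(
    functions=[
        FunctionDecl("Bar", "foo", ("int",), "bar.cpp:10"),
        FunctionDecl("Bar", "foo", ("double",), "bar.cpp:20"),
        FunctionDecl("Baz", "foo", ("int",), "baz.cpp:5"),
    ],
    bases={"DerivedClass": "Middle", "Middle": "Base"},
)
overloads = ast.find_function("Bar", "foo")
```

Because the lookups are exact matches on scope, name, and parameter types, the same query always returns the same nodes, which is the deterministic property the section relies on.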
4. Context Construction and Localization
Integration Process:
The candidate modules from semantic retrieval are refined by AST queries to yield a subgraph implicated in the reported defect. Source locations are determined from the nodes of this subgraph. Full source text for these locations is extracted and concatenated, forming an input window for LLM-based patch synthesis.
Localization Strategy:
By limiting semantic retrieval to the top-5 artifacts and restricting AST spans to a narrow line window, the approach avoids both over-approximating and under-approximating the context, making efficient use of the token budget for downstream synthesis.
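One way to picture the span-restriction step is as padding each located line range by a small margin, merging ranges that touch, and concatenating the resulting text. The sketch below does exactly that; the window size is not stated here, so `margin` is a placeholder parameter, and the function names and the `...` separator are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # 1-based, inclusive (start_line, end_line)


def expand_and_merge(spans: List[Span], margin: int, n_lines: int) -> List[Span]:
    """Pad each span by `margin` lines, clamp to the file, merge overlaps."""
    padded = sorted(
        (max(1, start - margin), min(n_lines, end + margin))
        for start, end in spans
    )
    merged: List[Span] = []
    for start, end in padded:
        if merged and start <= merged[-1][1] + 1:
            # Overlapping or adjacent: extend the previous span.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def build_context(source: str, spans: List[Span], margin: int = 2) -> str:
    """Concatenate the text of the merged spans, separated by a marker."""
    lines = source.splitlines()
    chunks = [
        "\n".join(lines[start - 1:end])
        for start, end in expand_and_merge(spans, margin, len(lines))
    ]
    return "\n...\n".join(chunks)


demo_source = "\n".join(f"line{i}" for i in range(1, 21))
context = build_context(demo_source, [(5, 6), (8, 8), (18, 19)], margin=1)
```

A larger margin gives the model more surrounding code at the cost of tokens, which is the trade-off between over- and under-approximation described above.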
5. Empirical Evaluation and Benchmark Results
Benchmark and Evaluation Protocol:
INFCODE-C++ is evaluated on 129 C++ issues from five major GitHub repositories (MultiSWE-bench-CPP), each accompanied by an issue description and a regression test suite. A solution is valid if:
- The regression test passes for the candidate patch.
- The patch introduces no behavioral regressions.
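The two validity conditions can be expressed as a small predicate over test results. This sketch assumes a SWE-bench-style reading in which "no behavioral regressions" means every test that passed before the patch still passes afterwards; that interpretation, the function name `is_valid_patch`, and the dictionary layout are assumptions, not details taken from the benchmark's harness.

```python
from typing import Dict, Iterable


def is_valid_patch(
    patched: Dict[str, bool],
    baseline: Dict[str, bool],
    target_tests: Iterable[str],
) -> bool:
    """True if the issue's tests pass and nothing that passed now fails."""
    targets = list(target_tests)
    if not targets:
        return False  # nothing demonstrates that the issue was fixed
    # Condition 1: the tests tied to the issue pass with the patch applied.
    fixes_issue = all(patched.get(name, False) for name in targets)
    # Condition 2: every previously passing test still passes.
    no_regressions = all(
        patched.get(name, False)
        for name, passed_before in baseline.items()
        if passed_before
    )
    return fixes_issue and no_regressions
```

Tests missing from the patched results are treated as failures, so a patch that breaks the build cannot be counted as valid by omission.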
Resolution Rates:
| System | Resolution Rate (%) |
|---|---|
| INFCODE-C++ + GPT-5 | 25.58 |
| MOpenHands + Claude-3.7 Sonnet | 14.73 |
| MSWE-agent + Claude-3.7 Sonnet | 11.63 |
| MAgentless + Claude-3.7 Sonnet | 3.88 |
INFCODE-C++ exceeds the best prior system by 10.85 percentage points and more than doubles the resolution rate of MSWE-agent. Stratified results (easy/medium/hard) show improvement in all categories (e.g., 50.00% on easy vs. 32.14% for the next-best system). The source characterizes the improvement on 129 items as statistically significant under a binomial confidence interval analysis (Dong et al., 20 Nov 2025).
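For readers who want to inspect the uncertainty behind these percentages, the sketch below converts each rate back to a count out of 129 and computes a Wilson score interval. The Wilson interval is one common binomial interval and is used here as an assumption; the source does not say which interval method was applied.

```python
import math
from typing import Dict, Tuple

N_ISSUES = 129

RATES_PERCENT: Dict[str, float] = {
    "INFCODE-C++ + GPT-5": 25.58,
    "MOpenHands + Claude-3.7 Sonnet": 14.73,
    "MSWE-agent + Claude-3.7 Sonnet": 11.63,
    "MAgentless + Claude-3.7 Sonnet": 3.88,
}


def resolved_count(rate_percent: float, n: int) -> int:
    """Recover the integer number of resolved issues from a percentage."""
    return round(rate_percent / 100.0 * n)


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


counts = {name: resolved_count(rate, N_ISSUES) for name, rate in RATES_PERCENT.items()}
intervals = {name: wilson_interval(c, N_ISSUES) for name, c in counts.items()}
```

Printing `counts` and `intervals` side by side shows how wide each interval is at this sample size, which is useful context when comparing systems evaluated on the same 129 issues.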
6. Ablation and Behavioral Analysis
Ablation Study:
Each major system component was removed in isolation. The following results quantify their marginal contributions:
| Configuration | Resolution Rate (%) | Δ vs. Full |
|---|---|---|
| Full system (GPT-5) | 25.58 | — |
| w/o semantic code-intent retrieval | 19.37 | -6.21 |
| w/o AST-structured querying | 17.05 | -8.53 |
| w/o Reproducer Agent | 20.16 | -5.42 |
| w/o Selector Agent | 22.48 | -3.10 |
The largest performance drop occurs when AST querying is removed, indicating its importance for defect localization and patch synthesis in C++. Semantic retrieval also provides a substantial boost. Removing either retrieval step also increases the number of LLM reasoning turns (from 28.1 to 35.3 or 45.3), indicating that both steps improve efficiency as well as accuracy.
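The Δ column can be reproduced directly from the resolution rates, as in this short check. The dictionary simply transcribes the table; the variable and function names are illustrative.

```python
from typing import Dict

FULL_RATE = 25.58

ABLATION_RATES: Dict[str, float] = {
    "w/o semantic code-intent retrieval": 19.37,
    "w/o AST-structured querying": 17.05,
    "w/o Reproducer Agent": 20.16,
    "w/o Selector Agent": 22.48,
}


def deltas_vs_full(full: float, rates: Dict[str, float]) -> Dict[str, float]:
    """Percentage-point change of each ablation relative to the full system."""
    return {name: round(rate - full, 2) for name, rate in rates.items()}


deltas = deltas_vs_full(FULL_RATE, ABLATION_RATES)
# The most negative delta identifies the component whose removal hurts most.
largest_drop = min(deltas, key=deltas.get)
```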
Behavioral Breakdown:
- Reproduction success: 28.81%
- File-level localization: 55.10%
- Function-level localization: 42.10%
- End-to-end resolution: 25.58%
The majority of failures are attributable to reproduction and localization, underscoring the need for combined semantic and structural retrieval.
7. Significance and Implications
INFCODE-C++ is presented as the first system to combine semantic code-intent embeddings with explicit AST-structured querying for C++ repair. Its two-pronged retrieval pipeline overcomes challenges characteristic of C++, such as deeply nested templates, overloaded identifiers, and intricate scoping, that degrade the effectiveness of approaches tuned for dynamically typed languages. The architecture suggests that language-aware retrieval and structural analysis are prerequisites for effective LLM-driven repair of complex, statically typed ecosystems.
Results on MultiSWE-bench-CPP set a new benchmark for autonomous C++ issue resolution, and ablation studies isolate the specific technical advances underlying this improvement (Dong et al., 20 Nov 2025). A plausible implication is that future work in multi-language LLM agents for code repair should adopt similar retrieval and defect localization strategies to extend state-of-the-art performance across statically typed domains.