
RepoNavigator: Efficient Issue Localization in OSS

Updated 30 December 2025
  • RepoNavigator is a repository-level LLM agent designed for precise issue localization in large open-source software repositories.
  • It employs a single, execution-aware jump tool that leverages static analysis to mimic Python’s dynamic name resolution while reducing tool complexity.
  • Empirical benchmarks demonstrate that RepoNavigator, optimized with reinforcement learning, outperforms multi-tool agents and larger LLMs in localization metrics.

RepoNavigator is a repository-level LLM agent architected for efficient and precise issue localization in large open-source software (OSS) repositories. It is distinguished by its exclusive use of a single, execution-aware "jump" tool that statically resolves symbol definitions, combined with end-to-end training via reinforcement learning (RL) directly from pretrained open-source LLMs. RepoNavigator achieves superior localization performance relative to multi-tool agents and even to much larger closed-source LLMs, owing to its integrated RL objective and a minimal, code-execution-native tool interface (Zhang et al., 24 Dec 2025).

1. Design Principles and Execution-Aware "Jump" Tool

RepoNavigator is motivated by the observation that real Python execution is guided by the dynamic resolution of names to their definitions, not by retrieval abstractions such as "SearchClass". Accordingly, RepoNavigator eliminates auxiliary retrieval tools in favor of a single JSON-constrained tool, jump(symbol, file_path), which internally leverages Pyright’s static analysis for symbol resolution. The tool operates as follows:

  • Abstract Syntax Tree (AST) parsing locates each syntactic occurrence of a symbol.
  • Lexical lookup adheres to Python’s LEGB (Local, Enclosing, Global, Builtin) resolution chain.
  • Static type inference constructs a union type $T(a)$ for receiver expressions and dispatches member lookups via the method resolution order (MRO): $\text{resolve}(a.b) = \bigcup_{t \in T(a)} \text{lookup}(b, \text{MRO}(t))$.
  • The import graph is traversed to resolve cross-file bindings, re-exports, and to account for __all__ filtering.

This jump tool deterministically returns the exact file(s) and code span(s) implementing the queried symbol, closely mirroring actual runtime behavior.
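
To make the tool's contract concrete, the toy sketch below shows only the AST-parsing step of a jump-style resolver in pure Python. It is not Pyright and omits LEGB lookup, type inference, MRO dispatch, and import-graph traversal; the function name and return shape are illustrative assumptions.

```python
# Toy sketch of a jump-style definition resolver, for illustration only.
# This is NOT Pyright: it covers only the AST-parsing step (locating
# syntactic bindings of a symbol in a single file) and omits LEGB lookup,
# type inference, MRO dispatch, and import-graph traversal.
import ast

def jump(symbol: str, file_path: str) -> list[dict]:
    """Return the file and line span of each binding of `symbol`."""
    with open(file_path, encoding="utf-8") as f:
        tree = ast.parse(f.read(), filename=file_path)

    spans = []
    for node in ast.walk(tree):
        # Function and class definitions bind `symbol` directly.
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if node.name == symbol:
                spans.append({"file": file_path,
                              "lines": (node.lineno, node.end_lineno)})
        # Assignments such as `symbol = ...` also bind it.
        elif isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id == symbol:
                    spans.append({"file": file_path,
                                  "lines": (node.lineno, node.end_lineno)})
    return spans
```

A call such as jump("resolve", "pkg/resolver.py") would return the span(s) defining resolve in that file; the real tool additionally follows imports and attribute accesses across the repository.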

2. Reinforcement Learning Algorithm and Reward Signal

RepoNavigator is trained via Group Relative Policy Optimization (GRPO), with the state at each timestep $s_t$ comprising the complete dialogue history $(q, o_{1:t-1}, a_{1:t-1})$, where $q$ is the issue description, $o_i$ are observations, and $a_i$ are prior actions (reasoning tokens or tool calls). Actions $a_t$ are selected from natural-language reasoning or syntactically constrained jump calls.

The reward at episode end is defined as

$$R(\hat{Y}, Y^*, \tau) = \mathrm{DICE}(\hat{Y}, Y^*) + S(\tau)$$

where $\hat{Y}$ is the set of proposed locations, $Y^*$ is the ground truth, $\mathrm{DICE}(\hat{Y}, Y^*) = \frac{2\,|\hat{Y} \cap Y^*|}{|\hat{Y}| + |Y^*|}$ is the DICE score, and $S(\tau)$ is the fraction of successful jump calls (penalizing malformed or failed attempts).
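
A minimal sketch of this reward, assuming predicted and gold locations are encoded as "file.py::func" strings (an assumed encoding) and each jump call's success is recorded as a boolean:

```python
# Minimal sketch of the episode-level reward: DICE overlap between the
# predicted and gold location sets plus S(tau), the fraction of
# successful jump calls in the trajectory.

def dice(predicted: set[str], gold: set[str]) -> float:
    denom = len(predicted) + len(gold)
    return 2 * len(predicted & gold) / denom if denom else 0.0

def reward(predicted: set[str], gold: set[str], jump_ok: list[bool]) -> float:
    # S(tau): share of well-formed, successful tool calls;
    # malformed or failed jumps lower this term.
    s_tau = sum(jump_ok) / len(jump_ok) if jump_ok else 0.0
    return dice(predicted, gold) + s_tau

# Example: one of two gold functions found, two of three jumps succeeded.
print(reward({"a.py::f"}, {"a.py::f", "b.py::g"}, [True, True, False]))
# DICE = 2/3 ≈ 0.667, S(tau) = 2/3 ≈ 0.667, total ≈ 1.333
```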

GRPO’s objective incorporates per-step advantage estimates $\hat{A}_t$ and a trust-region KL penalty:

$$L_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta_{\mathrm{old}}}} \left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)} \, \hat{A}_t - \beta \, D_{\mathrm{KL}}\!\left( \pi_{\theta_{\mathrm{old}}}(\cdot \mid s_t) \,\big\|\, \pi_\theta(\cdot \mid s_t) \right) \right]$$
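
An illustrative rendering of this objective as a loss function, assuming per-action log-probabilities under the current and old policies and precomputed advantages; the KL term uses a simple single-sample estimator, which may differ from the paper's exact implementation:

```python
# Illustrative GRPO-style surrogate loss over a batch of (state, action)
# pairs sampled from the old policy. Advantages \hat{A}_t might be, e.g.,
# group-normalized episode rewards.
import torch

def grpo_loss(logp_new: torch.Tensor,    # log pi_theta(a_t | s_t)
              logp_old: torch.Tensor,    # log pi_theta_old(a_t | s_t)
              advantages: torch.Tensor,  # advantage estimates \hat{A}_t
              beta: float = 0.01) -> torch.Tensor:
    ratio = torch.exp(logp_new - logp_old)  # importance weight
    kl = logp_old - logp_new                # single-sample KL(pi_old || pi_theta)
    objective = ratio * advantages - beta * kl
    return -objective.mean()                # negate: maximize via a minimizer
```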

3. Training Regimen and Inference Workflow

RepoNavigator is initialized from pretrained Qwen2.5-Instruct models (7B, 14B, and 32B parameters) without supervised fine-tuning for tool use or distillation from closed-source models. Training uses approximately 4,000 Python issues from the SWE-smith benchmark, filtered to those with nonzero reward across up to 16 random rollouts per sample. For each sample, eight rollouts are generated at temperature 1.0 to promote exploration. Optimization runs for one epoch on 8×A100-80GB GPUs (7B) or 16×A100 GPUs (14B/32B), with batch size 128, prompt-plus-response length capped at 10,240 tokens, and a learning rate of $10^{-6}$. During inference, greedy decoding is used to ensure prediction stability.
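
For reference, the reported hyperparameters can be collected into an illustrative configuration; the field names are hypothetical and not taken from any released code:

```python
# Illustrative training configuration collecting the reported settings;
# field names are hypothetical, not from the paper's released code.
train_config = {
    "backbone": "Qwen2.5-Instruct",          # 7B / 14B / 32B variants
    "algorithm": "GRPO",
    "epochs": 1,
    "batch_size": 128,
    "rollouts_per_sample": 8,
    "rollout_temperature": 1.0,
    "max_prompt_plus_response_tokens": 10_240,
    "learning_rate": 1e-6,
    "train_data": "SWE-smith, ~4,000 Python issues (nonzero-reward filtered)",
}
inference_config = {"decoding": "greedy"}    # for prediction stability
```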

4. Agent Architecture and Execution Loop

RepoNavigator operates on a prompt containing the issue $q$, the code repository entry point, and the tool schema. The core execution loop alternates between reasoning (emitting natural language), acting (issuing a jump call), and observing (receiving tool outputs) until termination, signaled by a final boxed output of the form $\boxed{\text{file.py::func}}$. The agent's context window supports approximately 10,000 tokens; per jump, only the minimal code region for the symbol definition is included, keeping the context tightly focused. The language server surfaces precise code spans, avoiding the injection of syntactically or semantically irrelevant code.
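
A schematic of this reason-act-observe loop, with llm_step and run_jump as hypothetical stand-ins for the policy model and the Pyright-backed jump tool; the JSON and boxed-answer formats are simplified assumptions:

```python
# Schematic reason-act-observe loop; placeholders stand in for the
# policy model and the Pyright-backed jump tool.
import json
import re

def llm_step(history: list[str]) -> str:
    """Placeholder: returns reasoning text, a JSON jump call, or a final
    \\boxed{file.py::func} answer."""
    raise NotImplementedError

def run_jump(symbol: str, file_path: str) -> str:
    """Placeholder: returns the minimal code span defining `symbol`."""
    raise NotImplementedError

def localize(issue: str, repo_entry: str, max_jumps: int = 12) -> str:
    history = [f"Issue: {issue}", f"Repository entry point: {repo_entry}"]
    for _ in range(max_jumps):
        step = llm_step(history)
        final = re.search(r"\\boxed\{(.+?)\}", step)
        if final:                            # terminal answer, e.g. file.py::func
            return final.group(1)
        history.append(step)
        if step.lstrip().startswith("{"):    # JSON-constrained jump call
            call = json.loads(step)
            observation = run_jump(call["symbol"], call["file_path"])
            history.append(observation)      # only the minimal relevant span
    return ""                                # jump budget exhausted
```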

5. Empirical Benchmarks and Quantitative Performance

RepoNavigator is evaluated on multiple benchmarks: SWE-smith (training), SWE-bench-Verified (human-annotated, validation), and SWE-bench-Pro (generalization to newer issues). Baselines encompass both open-source and closed-source agents, including LocAgent, CoSIL, Agentless, Orcaloca, RepoSearcher (open-source, with distillation+RL on Qwen2.5), and Claude-3.7-Sonnet, Claude-4.5, GPT5-chat (closed-source up to ~70B parameters).

Key function-level benchmark results on SWE-bench-Verified (Sample-F1/IoU) are displayed in the following table:

| Model | Training-Free (S-F1 / IoU) | +GRPO (S-F1 / IoU) |
|---|---|---|
| Qwen2.5-7B | 16.19 / 15.46 | 27.49 / 26.43 |
| Qwen2.5-14B | 25.58 / 23.00 | 29.23 / 26.84 |
| Qwen2.5-32B | 27.12 / 25.16 | 34.09 / 32.30 |

RepoNavigator surpasses same-size baselines and outperforms larger closed-source models on function- and file-level localization metrics. File-level S-F1/IoU improve from approximately 42/41 (training-free) to approximately 67/65 for the 32B model with GRPO.

6. Qualitative Analysis and Ablation Studies

RepoNavigator demonstrates high precision (30–37%) and balanced recall, yielding superior F1 and IoU compared to multi-tool agents prone to false positives. Using RepoNavigator+RL (Qwen2.5-14B) as the localization front end for an Agentless repair backend increases test-passing patches from 10.12% to 15.03% and function-level IoU from 5.28% to 14.58%. A clear scaling trend holds: increasing the permitted number of jump calls (up to 12) improves performance both before and after RL.

Ablation studies reveal that direct RL (pure GRPO) from the pretrained backbone yields better outcomes than supervised learning (RFT-only) or hybrid RFT+GRPO. Reward functions augmented with the tool-call success rate $S(\tau)$ are more effective than those based solely on localization. Adding extra tools (e.g., GetClass, GetFunc, GetStruc) degrades or fails to significantly improve localization metrics (e.g., IoU drops from 24.28% with jump only to 21.44%–24.00% with additional tools). Theoretical analysis attributes these results to a reduced action/observation space, which mitigates compounding errors and enforces higher per-call success ($P_{\mathrm{succ}} = \prod_i p_i$ for $k=1$); see the numeric sketch below. The reachable scope via jumps is strictly smaller and more focused than full-repository retrieval, enhancing overlap with the gold localization set.
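
A brief numeric illustration of the compounding-success argument, using assumed per-call success rates:

```python
import math

# Hypothetical per-call success rates, for illustration only: error
# compounds multiplicatively across calls, so a single reliable tool
# sustains higher trajectory-level success than a mixed toolset.
p_jump = 0.95                        # one well-trained jump tool
p_multi = [0.95, 0.85, 0.80, 0.75]   # four heterogeneous tools

print(p_jump ** 4)         # ≈ 0.815: four consecutive jump calls
print(math.prod(p_multi))  # ≈ 0.485: one call through each of four tools
```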

7. Scope, Limitations, and Future Directions

RepoNavigator currently supports Python exclusively through static analysis; its resolution system does not accommodate dynamic imports or monkey-patched state. Ground-truth localization is limited to a single golden patch per issue, disregarding alternative valid edit sites. The agent’s context window constrains the exploration depth, with extremely deep call chains susceptible to truncation.

Anticipated future developments include the extension of RepoNavigator’s methodology to other programming languages via language-specific servers, the incorporation of hybrid static–dynamic analysis for runtime features, and application of the one-tool+RL paradigm to software repair and test-generation tasks (Zhang et al., 24 Dec 2025).
