FuseSearch: Hybrid Retrieval & Code Localization
- FuseSearch names two frameworks that use fused objective formulations to jointly optimize retrieval precision and computational efficiency: adaptive parallel code exploration and attribute-vector filtered ANN retrieval.
- For code localization, it employs supervised fine-tuning and reinforcement learning, dynamically adjusting tool calls and reducing redundancy.
- For ANN retrieval, FuseSearch embeds content and attribute features into a unified space, achieving high throughput and recall in hybrid search tasks.
FuseSearch refers to two distinct frameworks addressing joint optimization in high-throughput, information-dense search: (1) adaptive parallel code localization in automated software development pipelines (Xu et al., 27 Jan 2026) and (2) attribute-vector filtered approximate nearest neighbor (ANN) retrieval via convexified fusion (Heidari et al., 24 Sep 2025). Both leverage “fused” objective formulations to overcome speed-quality trade-offs in their domains, optimizing across both retrieval precision and computational or informational efficiency.
1. Problem Definitions and Motivations
Code Localization
FuseSearch for code localization targets the task: Given an issue description and a large codebase, identify the minimal set of files and functions to modify to resolve the issue. Traditional sequential agents make a single tool call per step (e.g., grep, glob, read_file), leading to information starvation under tight interaction budgets—agents quickly deplete allowed turns without sufficient evidence, causing severe accuracy degradation. Naive parallelism—issuing a fixed number of tool calls per turn—alleviates context starvation but exhibits a 34.9% redundant invocation rate due to duplicate explorations, squandering compute and introducing noise that harms localization quality (Xu et al., 27 Jan 2026).
Hybrid Attribute-Vector ANN
FuseSearch in hybrid ANN settings addresses filtered nearest neighbor queries, where both content vector similarity and attribute constraints (e.g., category, date) are imposed: for a database $D = \{o_1, \dots, o_n\}$ in which each object $o_i$ is represented by a content vector $v_i \in \mathbb{R}^d$ and attributes $a_i$, find the top-$k$ objects closest to the query vector $v_q$ in vector space while satisfying $a_i = a_q$ (categorical) or $a_i \in [l, u]$ (range). Current solutions resort to staged filtering or index “hacks” (disjoint attribute and ANN indices), incurring recall/speed overhead and failing under high filter cardinality or low-selectivity queries (Heidari et al., 24 Sep 2025).
2. Formalization of Joint Quality–Efficiency Objectives
Tool Efficiency and Reward Structures
The code localization FuseSearch defines:
- Precision, Recall, F1:
  $P = \frac{|\hat{Y} \cap Y|}{|\hat{Y}|}, \qquad R = \frac{|\hat{Y} \cap Y|}{|Y|}, \qquad F_1 = \frac{2PR}{P + R},$
  where $\hat{Y}$ is the predicted set, $Y$ the ground truth.
- Per-call Information Gain $IG_t = \frac{|O_t \setminus S_{t-1}|}{|O_t|}$,
  with $S_{t-1}$ as the set of code entities previously seen and $O_t$ the $t$-th tool call's output.
- Tool Efficiency $E(\tau) = \frac{1}{|\tau|} \sum_t IG_t$, where $|\tau|$ is the number of tool calls in trajectory $\tau$.
- Reward: FuseSearch employs
  $R_{\text{total}} = \alpha \, F1_w + \beta \, E(\tau),$
  with weights $\alpha$, $\beta$, and $F1_w$ the weighted sum of file-level and function-level $F_1$.
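As a concrete illustration, these reward components can be sketched in plain Python. This is a minimal sketch: the information-gain normalization and the weight values (`alpha`, `beta`, `w_file`) are illustrative assumptions, not values taken from the paper.

```python
# Illustrative sketch of a FuseSearch-style localization reward.
# The IG normalization and the weights alpha/beta/w_file are assumptions.

def f1(pred: set, truth: set) -> float:
    """Set-based F1 between predicted and ground-truth code entities."""
    if not pred or not truth:
        return 0.0
    tp = len(pred & truth)
    p, r = tp / len(pred), tp / len(truth)
    return 2 * p * r / (p + r) if p + r else 0.0

def tool_efficiency(call_outputs: list) -> float:
    """Mean per-call information gain: share of output entities not seen before."""
    seen, gains = set(), []
    for out in call_outputs:
        gains.append(len(out - seen) / len(out) if out else 0.0)
        seen |= out
    return sum(gains) / len(gains) if gains else 0.0

def reward(pred_files, truth_files, pred_funcs, truth_funcs,
           call_outputs, alpha=0.8, beta=0.2, w_file=0.5):
    """Hybrid reward: weighted file/function F1 plus tool efficiency."""
    f1_w = (w_file * f1(pred_files, truth_files)
            + (1 - w_file) * f1(pred_funcs, truth_funcs))
    return alpha * f1_w + beta * tool_efficiency(call_outputs)
```

A fully redundant tool call (all entities already seen) contributes zero gain, so trajectories with duplicate explorations are penalized through the efficiency term.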
Hybrid Attribute-Vector Fusion Objective
The filtered ANN formulation proceeds from a lexicographic multicriteria selection, then relaxes it to a penalized scalar objective:
$D_\lambda(q, o) = d(v_q, v_o) + \lambda \, \delta(a_q, a_o),$
where $\delta$ is the attribute comparison (e.g., 0/1 for categorical) and $d$ the vector distance. The fused nearest neighbor is approximated by minimizing $D_\lambda$, and filters/vectors are embedded such that the fused Euclidean norm reflects this penalized sum.
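A minimal sketch of this penalized objective, assuming Euclidean content distance and a 0/1 categorical attribute comparison (the penalty weight `lam` is illustrative):

```python
import math

def fused_distance(q_vec, o_vec, q_attr, o_attr, lam=10.0):
    """Penalized fused objective: content distance plus lambda-weighted mismatch."""
    d = math.dist(q_vec, o_vec)               # vector (Euclidean) distance
    delta = 0.0 if q_attr == o_attr else 1.0  # 0/1 categorical attribute comparison
    return d + lam * delta

# With a large enough penalty, attribute-matching objects outrank any mismatch,
# even a mismatch that is much closer in content space.
db = [((5.0, 0.0), "a"), ((0.1, 0.0), "b"), ((0.0, 0.0), "a")]
q_vec, q_attr = (0.0, 0.0), "a"
ranked = sorted(db, key=lambda o: fused_distance(q_vec, o[0], q_attr, o[1]))
```

Here the mismatched object at content distance 0.1 still ranks last, because its attribute penalty dominates; shrinking `lam` toward zero recovers pure vector search.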
3. Training and Algorithmic Foundations
Adaptive Parallel Execution for Localization
FuseSearch’s code localization policy is learned in two phases (Xu et al., 27 Jan 2026):
- Supervised Fine-Tuning (SFT): Trajectories are synthesized via a teacher model guiding 2–8 tool calls per turn across ~21K GitHub issue–patch pairs. Joint filtering on localization quality and tool efficiency yields ~6K demonstrations, over which cross-entropy is optimized on JSON-structured tool predictions.
- Reinforcement Learning (RL): The Group Relative Policy Optimization (GRPO) algorithm maximizes the hybrid reward, regularized toward the SFT policy by a KL penalty to prevent catastrophic drift.
The adaptive inference policy dynamically determines both the number and type of tool calls: in early turns it executes broad parallel exploration, tapering to focused refinement as uncertainty collapses.
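The parallel-dispatch mechanics can be illustrated with a toy turn executor; the in-memory repo, tool bodies, and thread-pool details are illustrative stand-ins for the paper's shell-level grep/glob/read_file tooling.

```python
from concurrent.futures import ThreadPoolExecutor
import fnmatch

# Toy in-memory repository (stand-in for a real codebase on disk).
REPO = {
    "src/app.py": "def handle_issue():\n    raise NotImplementedError\n",
    "src/util.py": "def helper():\n    return 42\n",
}

# Toy read-only tools; real agents shell out to grep/glob/read_file.
def grep(pattern):
    return [p for p, text in REPO.items() if pattern in text]

def glob(pat):
    return [p for p in REPO if fnmatch.fnmatch(p, pat)]

def read_file(path):
    return REPO[path]

TOOLS = {"grep": grep, "glob": glob, "read_file": read_file}

def run_turn(calls):
    """Execute one turn's tool calls in parallel, no inter-call synchronization."""
    with ThreadPoolExecutor(max_workers=len(calls)) as pool:
        futures = [pool.submit(TOOLS[name], *args) for name, args in calls]
        return [f.result() for f in futures]

# Early turn: broad parallel exploration (several calls issued at once).
results = run_turn([("grep", ("handle_issue",)),
                    ("glob", ("src/*.py",)),
                    ("read_file", ("src/util.py",))])
```

Because the tools are read-only, the calls within a turn commute and need no coordination, which is what makes naive per-turn parallelism safe in the first place.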
Fused ANN Embedding and Indexing
For hybrid ANN, FuseSearch constructs a fused-space embedding via blockwise or neural composition. In the blockwise case, each object is encoded by concatenating its content vector with a scaled attribute embedding, so that squared Euclidean distance in the fused space decomposes into the content distance plus the $\lambda$-weighted attribute penalty. For general joint embedding, neural encoders for content and attributes are concatenated and processed through transformer layers, with a penalized NN loss enforcing margin ordering in the fused objective, e.g. a hinge penalty of the form $\max(0,\; m + D(q, o^+) - D(q, o^-))$, where $D$ is the penalized fused distance.
Indices (HNSW, IVF) are built on these fused vectors for both attribute and vector search, with candidate-set sizing regulated by theoretical bounds to preserve top-$k$ and recall guarantees.
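A toy example of such a margin-ordering penalty on precomputed fused distances (the margin `m` and the hinge form are assumptions about the loss shape, not a quoted formula):

```python
def margin_loss(d_pos, d_neg, m=1.0):
    """Hinge penalty on fused distances: zero once the filter-satisfying
    positive candidate beats the negative by at least margin m."""
    return max(0.0, m + d_pos - d_neg)

satisfied = margin_loss(0.5, 2.0)  # ordering holds with margin to spare: no penalty
violated = margin_loss(0.5, 1.0)   # gap smaller than m: positive penalty
```

Summed over triplets, penalties of this shape push the encoder to place filter-satisfying near neighbors strictly ahead of violating or distant ones in the fused space.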
4. Implementation Details
Code Localization Implementation
- Backbones: Qwen3-4B-Instruct, Qwen3-30B-A3B-Instruct.
- Tools: grep, glob, and read_file—language-agnostic and read-only.
- Infrastructure: SFT utilizes 8 × NVIDIA H20 GPUs; RL utilizes 32 × NVIDIA H20 GPUs leveraging vLLM and RLLM/veRL. Parallel tool invocation is implemented via shell/backend concurrency without inter-call synchronization.
ANN Indexing Procedures
- Fused Single-Attribute Index:
- Each object $o_i$ is encoded as $z_i = [\, v_i \;\|\; \sqrt{\lambda}\,\phi(a_i) \,]$, the content vector concatenated with a scaled attribute embedding $\phi(a_i)$.
- Insert into HNSW/IVF index.
- Query-Time:
- Query embedded as $z_q = [\, v_q \;\|\; \sqrt{\lambda}\,\phi(a_q) \,]$; retrieve the top candidates, then rerank by the penalized fused distance for the hybrid objective.
- Range Filtering: Queries over attribute intervals are handled by embedding query/attribute pairs as line segments in fused space, with a two-level index structure (angular direction and midpoint) supporting efficient cylinder-based search and precise radius adjustments.
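The single-attribute index and query path can be sketched end-to-end with a brute-force stand-in for the HNSW/IVF index; the fusion weight `lam` and the one-hot attribute encoding are illustrative choices.

```python
import math

def encode(vec, attr_onehot, lam=10.0):
    """Blockwise fusion: concatenate the content vector with a scaled attribute
    block, so fused squared Euclidean distance adds a lam-weighted mismatch term."""
    return list(vec) + [math.sqrt(lam) * a for a in attr_onehot]

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

# Toy database: 2-D content vectors with a one-hot categorical attribute.
db = [((0.0, 0.0), (1, 0)), ((0.2, 0.0), (0, 1)), ((3.0, 0.0), (1, 0))]
fused = [encode(v, a) for v, a in db]

# Query-time: embed the query the same way, retrieve candidates by fused
# distance (an HNSW/IVF index would do this approximately), then rerank.
q = encode((0.0, 0.0), (1, 0))
order = sorted(range(len(db)), key=lambda i: sq_dist(q, fused[i]))
```

Note how the attribute-mismatched object at content distance 0.2 falls behind the matching object at distance 3.0: the single fused index answers the filtered query without a separate attribute pass.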
5. Empirical Results and Analysis
Code Localization
On SWE-bench Verified (386 issues, "new file/function" patches excluded):
| Model | File | Function | Efficiency | Turns | Time (s) | Tokens (k) |
|---|---|---|---|---|---|---|
| Base, parallel | 64.50% | 38.91% | 59.5% | 4.24 | 6.12 | 47.9 |
| FuseSearch | 84.65% | 56.43% | 69.0% | 4.78 | 5.43 | 30.9 |
| RepoSearcher | 38.12% | — | — | — | — | — |
FuseSearch demonstrates state-of-the-art performance, improving file-level by +20 percentage points and function-level by +17 points. Search time drops by 93.6%, turns by 67.7%, and token consumption by 68.9% relative to strong baselines.
Ablation studies reveal:
- SFT alone: increases quality but is less efficient.
- RL alone: modest additional gains without the SFT prior.
- Sequential vs. parallel: parallelized SFT+RL halves search cost and token usage relative to sequential execution.
- Reward structure: the combined quality-plus-efficiency reward yields the best joint quality and efficiency.
The learned policy exhibits high initial parallel breadth (≈5 calls/turn), transitioning to refinement (≈2 calls/turn).
Hybrid ANN Search
Benchmarks across SIFT1M, GloVe-1.2M, DEEP, YouTube-Audio, and WIT-Image:
| Benchmark | Task | QPS Improvement | Recall@10 |
|---|---|---|---|
| SIFT1M, single attr | Ann-Hybrid | 4.2× over NHQ | ≈0.95 |
| SIFT1M, multi-attr | Ann-Hybrid | 3.2× over NHQ | ≈0.95 |
| DEEP, range filter | Ann-Range | 4–6× over SeRF | ≈0.95 |
| YouTube-Audio, range | Ann-Range | 7–13× over ANNS-first | ≈0.95 |
FuseSearch maintains high throughput in multi-attribute scenarios where alternatives' QPS collapses (Heidari et al., 24 Sep 2025).
6. Theoretical Guarantees and Parameterization
- Order-Preserving: Fused space preserves exact content ranking among objects with matching attributes.
- Exact-Filter Limit: As the attribute penalty $\lambda \to \infty$ with bounded content distances, FuseSearch reduces to strict attribute filtering.
- Approximation: Given an ANN engine returning candidates within a $(1+\epsilon)$ factor of the optimal fused distance, the same relative error holds for the penalized hybrid objective.
- Parameter Selection: To maintain intra-cluster compactness and inter-cluster separation, set the fusion weight $\lambda$ and candidate-set size per the derived bounds, where $d_{\max}$ and $\delta_{\min}$ are the maximal content and minimal attribute distances, respectively.
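A quick numerical check of a bound in this style; the specific form $\lambda > d_{\max} / \delta_{\min}$ is an illustrative reading of the parameterization, not a quoted result.

```python
import itertools
import math

# Toy dataset: 2-D content vectors with a 0/1 categorical attribute.
points = [((0.0, 0.0), "a"), ((1.0, 1.0), "a"), ((0.1, 0.0), "b")]

# Maximal pairwise content distance and minimal nonzero attribute distance.
d_max = max(math.dist(u, v) for (u, _), (v, _) in itertools.combinations(points, 2))
delta_min = 1.0  # 0/1 categorical comparison

lam = 1.01 * d_max / delta_min  # just above the assumed bound

def fused(q, o):
    return math.dist(q[0], o[0]) + lam * (0.0 if q[1] == o[1] else delta_min)

q = ((0.0, 0.0), "a")
match_scores = [fused(q, o) for o in points if o[1] == q[1]]
mismatch_scores = [fused(q, o) for o in points if o[1] != q[1]]
# With lam above the bound, every attribute match outranks every mismatch.
```

Choosing `lam` this way makes the penalty for a single attribute mismatch exceed any possible content distance, which is exactly the separation the exact-filter limit formalizes.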
7. Limitations and Future Perspectives
- Code localization ground truth comprises a single “golden patch”; supporting multiple valid solutions is an open evaluation challenge.
- Current datasets are biased toward Python; performance on statically typed languages requires further study.
- FuseSearch in code localization settings employs a minimal set of read-only tools; extension to semantic/static analysis (AST, type inference) is plausible for future gains.
- For ANN, the approach assumes sufficient embedding capacity and ANN index scalability; range-filtering and multi-filter heuristics can be further refined for extreme-scale, low-selectivity, or high-dimensional scenarios.
FuseSearch establishes a unified, theoretically principled paradigm for fusing parallel exploration and constraint satisfaction, achieving high retrieval quality and efficiency across both automated software localization (Xu et al., 27 Jan 2026) and hybrid similarity search (Heidari et al., 24 Sep 2025).