BinSeek-Reranker: Enhanced Binary Code Retrieval
- The paper introduces BinSeek-Reranker, a reranking module that leverages an 18-layer transformer and explicit calling-context augmentation to enhance ranking accuracy in binary analysis.
- It adopts RMSNorm, SwiGLU, grouped-query attention, and rotary position encoding to process long pseudocode sequences and bridge the semantic gap of stripped binaries.
- Empirical evaluations show gains of up to 27% in recall over comparable rerankers and roughly 10× faster inference than large-model baselines.
BinSeek-Reranker is the reranking module in the BinSeek framework, a state-of-the-art cross-modal retrieval system optimized for stripped binary code analysis. Designed in response to the challenge of mapping natural language (NL) security queries to relevant stripped binary functions—where most symbolic and surface-level cues are absent—BinSeek-Reranker leverages deep transformer-based cross-encoding and explicit calling-context augmentation to significantly improve both recall and ranking accuracy of candidate matches. The system demonstrates substantial gains in retrieval precision over prior dual-encoder and general-purpose LLM rerankers, while preserving practical inference latency and scalability for software security workflows (Chen et al., 11 Dec 2025).
1. Problem Setting and Motivation
The core application for BinSeek-Reranker is cross-modal retrieval in stripped binary analysis, where the objective is to match NL queries—such as vulnerability signatures, malware indicators, or functional descriptions—to functions within large binary codebases that lack debug and symbolic information. In this setting, single-function decompilation yields pseudocode with limited recoverable semantics, owing to the obfuscation and loss of names and strings typical after compilation and stripping. While high-throughput retrieval architectures (e.g., dual-encoder models) can cover thousands of functions, they frequently mis-rank semantically relevant but contextually ambiguous candidates. BinSeek-Reranker is designed as the second stage of BinSeek, explicitly addressing the semantic gap left by the embedding stage through fine-grained, context-aware judgment (Chen et al., 11 Dec 2025).
2. Model Architecture and Context Augmentation
BinSeek-Reranker is implemented as an 18-layer transformer-based cross-encoder, with architectural innovations to accommodate both binary pseudocode and NL tokens. The model architecture includes RMSNorm, SwiGLU, grouped-query attention, and rotary position encoding (RoPE), and is deepened from the 8-layer BinSeek-Embedding base to enhance context fusion and alignment (§ 3.2).
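For orientation, these choices can be summarized as a configuration sketch. In the snippet below, the depth, vocabulary size, and context window are stated in the paper; the component names follow the text, while any widths or head counts are left unspecified because the paper does not give them:

```python
from dataclasses import dataclass

# Illustrative configuration sketch, not the authors' actual config class.
# Depth (18 layers), vocabulary (151,669), and the 16,384-token window are
# from the paper; the remaining fields name components described in the text.
@dataclass
class BinSeekRerankerConfig:
    num_layers: int = 18            # deepened from the 8-layer embedding base
    vocab_size: int = 151_669       # Qwen3-compatible byte-level BPE
    max_seq_len: int = 16_384
    norm: str = "rmsnorm"
    mlp: str = "swiglu"
    attention: str = "grouped_query"
    pos_encoding: str = "rope"
```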
Input tokenization uses byte-level BPE (vocabulary size 151,669, Qwen3-compatible). The packed input consists of:
- [CLS] Query tokens [SEP]
- Decompiled pseudocode of the target function [SEP]
- Pseudocode of up to five selected callee functions [SEP]
The total sequence length may reach 16,384 tokens, accommodating substantial context.
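A minimal sketch of this packing, assuming a tokenizer object that exposes `cls_id`/`sep_id` special tokens and an `encode` method (these helper names are hypothetical):

```python
MAX_LEN = 16_384  # 16,384-token context window

def pack_input(tokenizer, query, target_pseudocode, callee_pseudocodes):
    """Build [CLS] query [SEP] function [SEP] callee_1 [SEP] ... callee_k [SEP]."""
    ids = [tokenizer.cls_id] + tokenizer.encode(query) + [tokenizer.sep_id]
    ids += tokenizer.encode(target_pseudocode) + [tokenizer.sep_id]
    for callee in callee_pseudocodes:      # up to five selected callees
        ids += tokenizer.encode(callee) + [tokenizer.sep_id]
    return ids[:MAX_LEN]                   # clip to the context window
```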
Context selection is governed by an informativeness score $\mathcal{I}(f)$ aggregating name presence, string-literal density, and callee information. Name presence is

$$\mathcal{N}(f) = \begin{cases} 1, & \text{if } f \text{ has an unstripped name} \\ 0, & \text{otherwise,} \end{cases}$$

and the aggregate score takes the form

$$\mathcal{I}(f) = \mathcal{N}(f) + \sigma\!\big(\alpha \cdot \mathrm{str}(f)\big) + \frac{1}{|C(f)|} \sum_{c \in C(f)} \mathcal{N}(c),$$

where $\sigma$ denotes the sigmoid, $\mathrm{str}(f)$ is the string-literal fraction of $f$, $\alpha$ scales the string fraction so that approximately 7.3% strings yields a bonus, and $C(f)$ is the set of callees, whose name presence is averaged. Callees are sorted by $\mathcal{I}$ and the top-5 are included.
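In code, selection could look as follows. The scaling constant `alpha` and the dict-based function records are assumptions; the paper fixes only the roughly 7.3% string-fraction bonus point:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Sketch of the informativeness score I(f); `alpha` is an assumed constant.
def informativeness(fn: dict, alpha: float = 9.5) -> float:
    name_score = 1.0 if fn.get("has_unstripped_name") else 0.0      # N(f)
    string_score = sigmoid(alpha * fn.get("string_fraction", 0.0))  # sigma(alpha * str(f))
    callees = fn.get("callees", [])
    callee_score = (sum(1.0 for c in callees if c.get("has_unstripped_name")) / len(callees)
                    if callees else 0.0)                            # averaged callee names
    return name_score + string_score + callee_score

def select_top5_callees(fn: dict) -> list:
    # Callees sorted by informativeness; the five highest-scoring are kept.
    return sorted(fn.get("callees", []), key=informativeness, reverse=True)[:5]
```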
This context augmentation allows the cross-encoder to directly attend to both query and relevant code tokens across function boundaries, overcoming the information deficit characteristic in stripped binaries.
3. Scoring Function and Training Objective
The reranker processes the concatenated NL–code sequence through the transformer stack, with all tokens participating in self-attention, enabling token-level cross-modal interaction. The final hidden state of the special [CLS] token (or equivalent pooling) is passed to an LM head that projects it to a scalar logit $\ell$; the relevance score is $s = \sigma(\ell)$, where $\sigma$ is the sigmoid function.
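A minimal PyTorch sketch of this scoring head (the hidden width is an assumption; the paper does not state it):

```python
import torch

hidden_dim = 1024                            # assumed model width, illustrative only
lm_head = torch.nn.Linear(hidden_dim, 1)     # projects the [CLS] state to one logit

cls_hidden = torch.randn(2, hidden_dim)      # stand-in for final [CLS] states (batch of 2)
logits = lm_head(cls_hidden).squeeze(-1)     # scalar relevance logits, shape (2,)
scores = torch.sigmoid(logits)               # s = sigma(logit), in (0, 1)
```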
Training uses binary cross-entropy loss over mixed positive (relevant) and negative (irrelevant) query–function pairs, sampled to include both random and "hard" negatives (the latter have low source-description cosine similarity to the query, as measured by Qwen3-Embedding-8B). The loss for a batch of $B$ samples is

$$\mathcal{L} = -\frac{1}{B} \sum_{i=1}^{B} \big[ y_i \log s_i + (1 - y_i) \log(1 - s_i) \big],$$

where $s_i = \sigma(\ell_i)$ and $y_i \in \{0, 1\}$ is the ground-truth label.
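This objective reduces to standard binary cross-entropy over the scalar logits; a self-contained sketch:

```python
import torch
import torch.nn.functional as F

def reranker_bce_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # logits: scalar LM-head outputs, shape (B,)
    # labels: ground-truth relevance y_i in {0, 1}, shape (B,)
    return F.binary_cross_entropy_with_logits(logits, labels.float())

# Example batch: two relevant pairs and one (hard) negative
loss = reranker_bce_loss(torch.tensor([2.1, 0.3, -1.7]),
                         torch.tensor([1, 1, 0]))
```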
Regularization is limited to standard weight decay (AdamW) and transformer dropout; no additional penalties are applied.
4. Data Synthesis and Training Pipeline
The BinSeek training regime is underpinned by a large-scale LLM-aided data synthesis pipeline:
- 10,555 open-source C/C++ projects are compiled, binaries are stripped, then decompiled via IDA Pro.
- Debug symbols are preserved until after decompilation, allowing correct alignment between pseudocode fragments and source function boundaries.
- For each (binary, source) pair, DeepSeek-V3 (671B-parameter LLM) generates a single-sentence NL description, which is then quality-rated on an A–D scale; only A/B-rated descriptions are retained. Semantically duplicate functions are removed with MinHash deduplication (0.95 similarity threshold; see the sketch after this list).
- Negative examples for the reranker are sourced both randomly and by "hard negative" mining. Each training tuple includes the target code and its top-5 callees (by $\mathcal{I}$), always accompanying the query, yielding a training set exceeding 45.7 million positive and negative pairs.
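A minimal pure-Python sketch of MinHash deduplication; shingle size and permutation count are assumptions, since the paper fixes only the 0.95 threshold:

```python
import hashlib

def minhash_signature(text: str, num_perm: int = 128, shingle: int = 5) -> list:
    # Character shingles approximate the token sets compared for similarity.
    shingles = {text[i:i + shingle] for i in range(max(1, len(text) - shingle + 1))}
    return [
        min(int.from_bytes(hashlib.blake2b(f"{seed}:{s}".encode(),
                                           digest_size=8).digest(), "big")
            for s in shingles)
        for seed in range(num_perm)
    ]

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    # Fraction of matching minima estimates the Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Two functions are treated as duplicates when estimated_jaccard(...) >= 0.95.
```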
This large, highly structured corpus is central to the model’s cross-modal alignment and generalizability in stripped-binary domains.
5. Inference Procedure and System Integration
At inference, BinSeek operates in two stages:
- Candidate Generation (Stage 1):
- The NL query is embedded via BinSeek-Embedding.
- All pseudocode functions in the target codebase, previously embedded and indexed, are ranked by cosine similarity against the query embedding.
- The top-$N$ (commonly $N = 10$) candidates proceed to reranking.
- Contextual Cross-Encoder Reranking (Stage 2):
- For each candidate $f$, its top-5 callees (by $\mathcal{I}$) are selected and included in the input sequence.
- The cross-encoder computes a relevance score $s = \sigma(\ell)$ for each packed [query, function, callees] sequence.
- Candidates are sorted by $s$ in descending order to yield the final ranking.
High-level inference pseudocode (Python-style):

```python
candidates = EmbeddingModel.retrieve(q, N)   # Stage 1: dense retrieval
scores = {}
for f in candidates:
    context = select_top5_callees(f)         # callees ranked by informativeness
    input_seq = pack(q, f, context)          # [CLS] q [SEP] f [SEP] callees [SEP]
    logit = Reranker.forward(input_seq)
    scores[f] = sigmoid(logit)               # Stage 2: cross-encoder relevance
final_ranking = sorted(candidates, key=scores.get, reverse=True)
```
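The `EmbeddingModel.retrieve` step above reduces to cosine-similarity ranking over precomputed function embeddings; a NumPy sketch (function and variable names are illustrative):

```python
import numpy as np

def retrieve_top_n(query_emb: np.ndarray, function_embs: np.ndarray, n: int = 10):
    # Normalize so that dot products equal cosine similarities.
    q = query_emb / np.linalg.norm(query_emb)
    M = function_embs / np.linalg.norm(function_embs, axis=1, keepdims=True)
    sims = M @ q                     # cosine similarity of each function to the query
    return np.argsort(-sims)[:n]     # indices of the top-N candidates
```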
6. Empirical Evaluation and Comparative Performance
BinSeek-Reranker achieves substantial improvements over both same-size and much larger general reranking models:
| Model | Rec@1 (%) | Rec@3 (%) | MRR@3 (%) | Notes |
|---|---|---|---|---|
| BinSeek-Reranker (0.6B reranker, 0.3B embed) | 76.75 | 84.50 | 80.25 | Pipeline result §4, Table 4 |
| General reranker pipeline (embeddinggemma + Qwen3) | -- | 53.08 | 63.08 | BinSeek +31.42% Rec@3, +27.17% MRR@3 |
| SFR-Embedding-Mistral + Qwen3-8B (non-tailored) | -- | 77.75 | 75.62 | BinSeek +6.75% Rec@3, +4.63% MRR@3 |
Reranker-only evaluation (reranking top 10 embedding candidates, Table 3) further demonstrates that BinSeek-Reranker (0.6B) reaches Rec@1 = 61.50% (27% above Qwen3-0.6B and within 1% of Qwen3-8B), Rec@3 = 83.00% (exceeding Qwen3-8B’s 80.50%), and MRR@3 = 70.50% (Chen et al., 11 Dec 2025).
Latency per end-to-end query is ≈1.76 minutes, approximately 10× faster than large-model baselines.
7. Limitations and Insights
The current BinSeek-Reranker is limited in its context window (16k tokens), restricting augmentation to local callees and omitting callers or broader call-graph context. Model capacity is 0.6B parameters, with plausible future improvements anticipated from scaling. The reranking module substantially outperforms dual-encoder methods by leveraging calling-context—particularly valuable due to the patchy distribution of residual names, string literals, and library calls in stripped code. Attention fusion across both query and expanded code fragments appears essential for closing the semantic gap in NL-to-binary matching when single-function pseudocode is insufficient (Chen et al., 11 Dec 2025).
A plausible implication is that cross-modal reranking with local context augmentation can push retrieval performance for binary analysis tasks beyond what is feasible with purely embedding-based or non-cross-encoder pipelines. This design enables accurate, scalable retrieval for security analysts and automated LLM-based agent scenarios.