BinSeek: Two-Stage Cross-Modal Binary Code Retrieval
- BinSeek is a cross-modal retrieval system that maps natural language queries and binary pseudocode into a shared embedding space for semantic search.
- It employs a two-stage architecture with an embedding module for rapid candidate pruning and a context-augmented reranker for refined ranking.
- The framework leverages LLM-synthesized training data to achieve state-of-the-art accuracy and low latency in binary vulnerability detection and malware analysis.
BinSeek is a two-stage, cross-modal retrieval framework for semantic search over stripped binary functions using natural language queries. Its architecture and methodology address the retrieval challenges inherent to binary code, in particular the absence of symbolic information in stripped binaries, which is typical of scenarios such as vulnerability detection and malware analysis. The framework achieves state-of-the-art accuracy and scalability by combining a compact contrastive embedder with a context-augmented reranker, supported by a large-scale, LLM-synthesized training set of natural-language/pseudocode function pairs (Chen et al., 11 Dec 2025).
1. Two-Stage Architecture
BinSeek comprises two sequential models optimized for retrieval speed and semantic ranking.
Stage I: BinSeek-Embedding is a cross-modal Transformer encoder mapping both NL queries and decompiled binary pseudocode into a shared 1,024-dimensional vector space. It employs 8 Transformer layers (hidden size 1,024, 16 heads) with RMSNorm, SwiGLU activations, Grouped-Query Attention, frozen Qwen3 byte-level BPE token embeddings (vocab=151,669), and Rotary Positional Embedding (RoPE). Embeddings are extracted via last-token pooling. This stage rapidly prunes the candidate set from the corpus.
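A minimal sketch of the Stage I flow is given below, assuming a generic Hugging Face-style encoder and precomputed, L2-normalized corpus vectors; the helper names and batching details are illustrative rather than the released BinSeek implementation.

```python
import torch
import torch.nn.functional as F

def last_token_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Return the hidden state of the last non-padding token of each sequence."""
    lengths = attention_mask.sum(dim=1) - 1                      # index of last real token
    batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
    return hidden_states[batch_idx, lengths]                     # (batch, hidden)

def embed(texts, model, tokenizer, device="cpu"):
    """Encode NL queries or decompiled pseudocode into L2-normalized vectors (Stage I)."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model(**batch, output_hidden_states=True)
    pooled = last_token_pool(out.hidden_states[-1], batch["attention_mask"])
    return F.normalize(pooled, dim=-1)                           # unit norm: dot product = cosine

def retrieve_top_k(query_vec: torch.Tensor, corpus_vecs: torch.Tensor, k: int = 10):
    """Stage I candidate pruning: rank a precomputed corpus by cosine similarity."""
    scores = corpus_vecs @ query_vec                             # (N,) cosine similarities
    return torch.topk(scores, k)                                 # values and corpus indices
```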
Stage II: BinSeek-Reranker extends the embedding backbone with 10 additional Transformer layers (total 18), culminating in a linear-sigmoid head providing a scalar relevance score for the query-candidate-context tuple. The input includes the NL query, the target function’s decompiled pseudocode, and its top-5 most “informative” callees (chosen based on a score aggregating the number of meaningful symbols, strings, named callees, and token density). This context augmentation enhances discriminative power for difficult cases. The reranker refines candidate order for the top-K (typically K=10) recalled in stage I.
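The callee-selection heuristic can be sketched as follows; the source lists the ingredients of the informativeness score (meaningful symbols, strings, named callees, token density) but not their weighting, so the equal weights and field names below are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Callee:
    pseudocode: str
    meaningful_symbols: int   # recovered identifiers with descriptive names
    strings: int              # string literals referenced by the callee
    named_callees: int        # calls to functions that still carry names
    token_count: int

def informativeness(c: Callee) -> float:
    """Aggregate score over meaningful symbols, strings, named callees, and token density.
    Equal weighting is an assumption; the source only lists the ingredients."""
    density = c.token_count / max(len(c.pseudocode.splitlines()), 1)
    return c.meaningful_symbols + c.strings + c.named_callees + density

def build_reranker_input(query: str, target_pseudocode: str, callees: list[Callee], k: int = 5) -> str:
    """Concatenate the NL query, the target function, and its top-k most informative callees."""
    top = sorted(callees, key=informativeness, reverse=True)[:k]
    context = "\n\n".join(c.pseudocode for c in top)
    return f"QUERY:\n{query}\n\nFUNCTION:\n{target_pseudocode}\n\nCALLEES:\n{context}"
```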
2. LLM-Based Data Synthesis Pipeline
Training data for BinSeek is generated with an LLM-based pipeline designed for high coverage and quality:
- Data Collection and Alignment: Binary functions are compiled from 10,555 popular C/C++ GitHub repositories with gcc/clang (O0–O3) targeting x86, ARM, MIPS, and more. Stripped binaries (1.8M) are decompiled (IDA Pro) into 183.9M pseudocode functions. About 58% of functions can be mapped to their source counterpart using debug symbols.
- Semantic Description Generation: Function-level descriptions are synthesized with DeepSeek-V3 (671B parameters) using prompts giving the full source file, path, and project info, and requesting concise English and Chinese descriptions.
- Quality Filtering: Functions with ≤10 lines of code, short pseudocode, or “thunk”/wrapper roles are dropped. A DeepSeek-V3-based discriminator assigns quality ranks (A-D); only A/B are kept. MinHash deduplication ensures high sample diversity.
- Negative Sampling: For each positive (query, code) pair, negatives are drawn from different projects and further filtered using Qwen3-Embedding-8B such that sim(query, negative description) ≤ 0.95. This yields on average one high-quality negative per positive, leading to ≈45.7M training tuples.
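As an illustration of the similarity-based negative filter in the last step, the sketch below keeps only cross-project candidates whose description embedding is not too close to the query; the 0.95 threshold follows the paper, while the data layout is assumed.

```python
import torch

SIM_THRESHOLD = 0.95   # candidates with a more similar description are discarded

def filter_negatives(query_vec: torch.Tensor,
                     candidate_desc_vecs: torch.Tensor,
                     candidate_ids: list[str]) -> list[str]:
    """Keep cross-project candidates whose description similarity to the query is <= 0.95.

    Both inputs are assumed to be L2-normalized embeddings (the paper uses
    Qwen3-Embedding-8B for this filtering step)."""
    sims = candidate_desc_vecs @ query_vec        # cosine similarity per candidate
    return [cid for cid, s in zip(candidate_ids, sims.tolist()) if s <= SIM_THRESHOLD]
```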
3. Training Objectives and Model Optimization
Two separate losses optimize the retrieval and ranking stages:
- BinSeek-Embedding: Trained with the InfoNCE loss (a cross-modal contrastive objective):
$$\mathcal{L}_{\mathrm{InfoNCE}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\left(\mathrm{sim}(q_i, c_i)/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(q_i, c_j)/\tau\right)}$$
where $q_i$ and $c_i$ are the normalized NL and pseudocode embeddings, $\mathrm{sim}(\cdot,\cdot)$ denotes cosine similarity, and $\tau$ is a temperature parameter.
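A compact PyTorch sketch of this objective, assuming a batch of aligned query/pseudocode embedding pairs with in-batch negatives and an illustrative temperature:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, code_emb: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """Cross-modal InfoNCE over a batch of aligned (query, pseudocode) pairs.

    Both inputs are (N, d); row i of each tensor forms a positive pair, and every
    other row acts as an in-batch negative. tau = 0.05 is an assumed value."""
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(code_emb, dim=-1)
    logits = (q @ c.T) / tau                            # (N, N) temperature-scaled cosine similarities
    targets = torch.arange(q.size(0), device=q.device)  # positives lie on the diagonal
    return F.cross_entropy(logits, targets)
```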
- BinSeek-Reranker: Optimized using binary cross-entropy:
$$\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\left[ y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \right]$$
where $\hat{y}_i$ is the relevance probability score produced by the linear-sigmoid head and $y_i$ is the true label.
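A minimal sketch of the linear-sigmoid scoring head and its loss, assuming the reranker backbone yields a pooled 1,024-dimensional state per (query, candidate, context) tuple; the training-step details are illustrative:

```python
import torch
import torch.nn as nn

class RerankerHead(nn.Module):
    """Linear-sigmoid head mapping the pooled reranker state to a relevance score in [0, 1]."""
    def __init__(self, hidden_size: int = 1024):
        super().__init__()
        self.linear = nn.Linear(hidden_size, 1)

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(pooled)).squeeze(-1)    # (batch,)

# Training step (sketch): scores for (query, candidate, callee-context) tuples vs. 0/1 labels.
head, loss_fn = RerankerHead(), nn.BCELoss()
pooled_states = torch.randn(8, 1024)          # placeholder for backbone outputs
labels = torch.randint(0, 2, (8,)).float()    # 1 = relevant candidate, 0 = negative
loss = loss_fn(head(pooled_states), labels)
```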
4. Benchmarking, Datasets, and Evaluation Protocol
The framework introduces a dedicated benchmark with two held-out, zero-overlap test sets:
- Embedding Test Set: 400 queries, each paired with a pool of 10,000 functions (1 positive, 9,999 hard negatives, filtered for maximum description similarity <0.95).
- Reranking Test Set: 200 queries for which the correct function appears among the top-10 embedding-stage candidates but is not ranked first.
Standard metrics include Recall@K and MRR@K:
$$\mathrm{Recall@}K = \frac{1}{|Q|}\sum_{i=1}^{|Q|} \mathbb{1}\left[\mathrm{rank}_i \le K\right], \qquad \mathrm{MRR@}K = \frac{1}{|Q|}\sum_{i=1}^{|Q|} \frac{\mathbb{1}\left[\mathrm{rank}_i \le K\right]}{\mathrm{rank}_i}$$
where $\mathrm{rank}_i$ is the rank of the ground-truth function for query $i$ and $|Q|$ is the number of queries.
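With a single ground-truth function per query, both metrics follow directly from the per-query rank of that function, e.g.:

```python
def recall_at_k(ranks: list[int], k: int) -> float:
    """Fraction of queries whose ground-truth function appears in the top-k."""
    return sum(r <= k for r in ranks) / len(ranks)

def mrr_at_k(ranks: list[int], k: int) -> float:
    """Mean reciprocal rank, counting only hits within the top-k."""
    return sum(1.0 / r for r in ranks if r <= k) / len(ranks)

# Example: rank of the correct function for four queries (1 = best).
ranks = [1, 3, 12, 2]
print(recall_at_k(ranks, 10))   # 0.75
print(mrr_at_k(ranks, 10))      # (1 + 1/3 + 0 + 1/2) / 4 ≈ 0.458
```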
5. Performance and Comparative Evaluation
Empirical results establish BinSeek’s superiority over same-size and substantially larger baselines.
Embedding Stage (Recall/MRR):
| Model | Size | Rec@1 | Rec@3 | Rec@10 | MRR@3 | MRR@10 |
|---|---|---|---|---|---|---|
| SFR-Embedding-Mistral | 7 B | 60.5 | 69.5 | 77.5 | 64.7 | 66.2 |
| Qwen3-Embedding-8B | 8 B | 57.5 | 65.0 | 73.5 | 60.8 | 62.1 |
| BinSeek-Embedding | 0.3 B | 67.0 | 80.5 | 93.5 | 72.8 | 75.2 |
Reranker Stage:
| Model | Size | Rec@1 | Rec@3 | MRR@3 |
|---|---|---|---|---|
| Qwen3-Reranker-8B | 8 B | 62.5 | 80.5 | 70.8 |
| BinSeek-Reranker | 0.6 B | 61.5 | 83.0 | 70.5 |
End-to-End Pipeline:
| Pipeline | Rec@1 | Rec@3 | Rec@10 | MRR@3 | MRR@10 | Time/query (m) |
|---|---|---|---|---|---|---|
| gemma-300m + Qwen3-Reranker-0.6B | 49.0 | 53.1 | 58.8 | 53.1 | 53.3 | 1.9 |
| SFR-Embedding-Mistral + Qwen3-8B | 73.8 | 77.8 | 78.3 | 75.6 | 75.7 | 20.9 |
| BinSeek (0.3 B + 0.6 B) | 76.8 | 84.5 | 93.0 | 80.3 | 81.5 | 1.8 |
BinSeek achieves a +31.42 pp gain in Rec@3 and +27.17 pp in MRR@3 over the same-size baseline, and outperforms much larger models by 6–16 pp in Rec@3/Rec@10 at roughly one-tenth the end-to-end latency (Chen et al., 11 Dec 2025).
6. Contextual Augmentation, Ablations, and Future Work
Ablation studies confirm the value of context augmentation: adding the top-5 informative callees raises Rec@3 by ~3 pp and MRR@3 by ~2 pp, and selecting callees by meaningful-symbol density consistently outperforms random callee sampling by more than 4 pp. Context remains limited to local callees; deeper interprocedural context is not yet explored.
Potential avenues include scaling BinSeek to multi-billion-parameter regimes to study scaling behavior in the binary domain and developing richer cross-project negative sampling protocols. Limitations include the purely local context augmentation and the possibility of overlooking "near-miss" functions during negative selection.
7. Connections to Related Binary Retrieval Systems
BinSeek is distinct as a cross-modal (NL ↔ binary) retrieval system specialized for stripped binaries. Existing approaches such as FASER leverage cross-architecture IRs (radare2 ESIL) and long-context transformers for function-to-function retrieval, with strong cross-architecture and cross-optimization robustness but no NL query support (Collyer et al., 2023). Systems like BinSeeker employ two-stage semantic learning (structure2vec graph embeddings on labeled semantic flow graphs) plus lightweight emulation (VEX-IR event signatures), offering high MRR (0.65) and high top-5 accuracy (93.3%) for cross-platform vulnerability search (Gao et al., 2022). By contrast, BinSeek directly addresses the NL–binary retrieval axis and the practical demands of real-time, at-scale deployment in agent frameworks.
A plausible implication is that BinSeek’s architectural model may generalize to tasks such as binary code comment generation or fine-grained vulnerability retrieval once multi-modal benchmarks and training data become available. Integration with IR-based or GNN-based embeddings, and extension to wider call-graph context, remain active research topics.