On the Challenges and Opportunities of Learned Sparse Retrieval for Code

Published 23 Mar 2026 in cs.IR and cs.CL | (2603.22008v1)

Abstract: Retrieval over large codebases is a key component of modern LLM-based software engineering systems. Existing approaches predominantly rely on dense embedding models, while learned sparse retrieval (LSR) remains largely unexplored for code. However, applying sparse retrieval to code is challenging due to subword fragmentation, semantic gaps between natural-language queries and code, diversity of programming languages and sub-tasks, and the length of code documents, which can harm sparsity and latency. We introduce SPLADE-Code, the first large-scale family of learned sparse retrieval models specialized for code retrieval (600M-8B parameters). Despite a lightweight one-stage training pipeline, SPLADE-Code achieves state-of-the-art performance among retrievers under 1B parameters (75.4 on MTEB Code) and competitive results at larger scales (79.0 with 8B). We show that learned expansion tokens are critical to bridge lexical and semantic matching, and provide a latency analysis showing that LSR enables sub-millisecond retrieval on a 1M-passage collection with little effectiveness loss.

Authors (5)

Summary

The paper introduces SPLADE-Code, the first large-scale learned sparse retrieval model for code, enhancing efficiency and interpretability in code search.
The method uses max-pooled projection with KLD distillation and a FLOPs penalty to yield sparse representations and sub-millisecond latency.
Empirical results show robust generalization with nDCG@10 improvements across in-domain and out-of-domain benchmarks, supporting cross-language retrieval.

Learned Sparse Retrieval for Code: SPLADE-Code Analysis

Context and Motivation

Code retrieval over large repositories has become integral to LLM-powered software engineering systems, underpinning tasks such as component reuse, library integration, and agentic workflows. Traditionally, neural code retrieval has favored dense embedding models. However, learned sparse retrieval (LSR)—particularly SPLADE-based approaches—offers advantages in interpretability, latency, and generalization, yet faces unique challenges when applied to code due to subword fragmentation, semantic gaps between queries and code, multi-language diversity, and document length.

SPLADE-Code: Methodology and Model Architecture

SPLADE-Code is the first large-scale LSR model for code retrieval, trained via a lightweight single-stage pipeline and ranging from 600M to 8B parameters. It employs a max-pooled projection onto the backbone LLM vocabulary to generate highly sparse bag-of-words representations for both queries and code documents. Training leverages KLD distillation from cross-encoder teachers, encouraging the student model to match relative ranking distributions rather than binary relevance labels, thus facilitating robustness across text-to-code, code-to-text, and code-to-code paradigms. To control representation density—which correlates strongly with input length and impacts index efficiency—a FLOPs penalty regularizer is introduced, encouraging batch-level sparsity and reducing posting list size.
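The three training ingredients named above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the `log(1 + ReLU(x))` saturation follows the usual SPLADE formulation and is assumed here, and the function names `splade_pool`, `flops_penalty`, and `kld_distill` are hypothetical.

```python
import numpy as np

def splade_pool(logits, mask):
    """Max-pool per-token vocabulary logits into one sparse vector.

    logits: (seq_len, vocab) scores from the LM head.
    mask:   (seq_len,) 1 for real tokens, 0 for padding.
    The log(1 + ReLU(x)) saturation is an assumption (standard SPLADE).
    """
    act = np.log1p(np.maximum(logits, 0.0))
    act = act * mask[:, None]          # padded positions contribute 0
    return act.max(axis=0)             # (vocab,), non-negative, mostly zero

def flops_penalty(reps):
    """FLOPs regularizer: sum over vocabulary of the squared batch-mean weight.

    reps: (batch, vocab) non-negative sparse representations.
    """
    return float(np.sum(np.mean(reps, axis=0) ** 2))

def kld_distill(student_scores, teacher_scores):
    """KL(teacher || student) over one query's candidate list."""
    def log_softmax(x):
        x = np.asarray(x, dtype=float)
        x = x - x.max()
        return x - np.log(np.exp(x).sum())
    lt, ls = log_softmax(teacher_scores), log_softmax(student_scores)
    return float(np.sum(np.exp(lt) * (lt - ls)))
```

Because the FLOPs term squares the batch mean per vocabulary entry, it penalizes terms that fire across many inputs more than terms that fire rarely, which is what keeps posting lists short.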

SPLADE-Code models employ bi-directional attention, which is crucial for maximizing lexical matching and semantic abstraction, particularly when handling heterogeneous query-document pairs across languages and modalities. Model merging is performed using weighted spherical averaging over multiple checkpoints, including context length variations.
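The summary does not spell out the merging procedure, so the following is only one plausible reading of "weighted spherical averaging": checkpoints are folded together pairwise with spherical linear interpolation (slerp), using running weights. The functions `slerp` and `merge_checkpoints` are hypothetical, and treating each checkpoint as a single flat vector is a simplification.

```python
import numpy as np

def slerp(a, b, t, eps=1e-8):
    """Spherical linear interpolation between two weight vectors.

    t=0 returns a, t=1 returns b. Falls back to linear interpolation
    when the vectors are nearly parallel.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    omega = np.arccos(np.clip(cos, -1.0, 1.0))
    if omega < 1e-6:
        return (1.0 - t) * a + t * b
    so = np.sin(omega)
    return (np.sin((1.0 - t) * omega) / so) * a + (np.sin(t * omega) / so) * b

def merge_checkpoints(ckpts, weights):
    """Fold checkpoints together pairwise (an assumed scheme, not the paper's).

    Each new checkpoint is interpolated in with t = w / (running total),
    so equal weights give each checkpoint an equal share.
    """
    merged = np.asarray(ckpts[0], dtype=float)
    total = float(weights[0])
    for ckpt, w in zip(ckpts[1:], weights[1:]):
        total += w
        merged = slerp(merged, ckpt, w / total)
    return merged
```

In practice such merging would be applied per parameter tensor rather than to one concatenated vector; the flat form is used here only to keep the sketch short.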

Empirical Evaluation

SPLADE-Code was evaluated on both in-domain and out-of-domain benchmarks:

In-Domain: CoIR, MTEB-code (spanning >14 languages and multiple retrieval task formulations).
Out-of-Domain: CodeRAG (retrieval-augmented generation), CPRet (competitive programming retrieval).

Retrieval Performance

Reported results indicate that SPLADE-Code matches or exceeds dense retrieval baselines across model scales, particularly when the models are trained on matched data and configurations. SPLADE-Code-8B achieves an nDCG@10 of 79.0 on MTEB Code, outperforming strong dense and baseline LSR variants, and the family reaches 75.4 on MTEB Code among retrievers under 1B parameters. Lexical-only retrieval (BM25, SPLADE-lexical) is consistently inferior because it cannot bridge semantic gaps or leverage expansion tokens. SPLADE-Code also generalizes to out-of-domain benchmarks, often surpassing top dense models by notable margins (up to +7 nDCG@10 points), which supports the claim of domain robustness.
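For readers unfamiliar with the metric, nDCG@10 can be computed as in the short sketch below. It uses the linear-gain form of DCG; some evaluation toolkits use an exponential gain instead, and the function names are illustrative. Benchmark figures such as 79.0 appear to be this value scaled to a 0-100 range.

```python
import math

def dcg_at_k(rels, k=10):
    """Discounted cumulative gain over the first k relevance labels."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg_at_k(ranked_rels, all_rels, k=10):
    """nDCG@k for one query.

    ranked_rels: relevance labels in the order the system returned them.
    all_rels:    relevance labels of every judged document for the query,
                 used to build the ideal ranking.
    """
    ideal = dcg_at_k(sorted(all_rels, reverse=True), k)
    return dcg_at_k(ranked_rels, k) / ideal if ideal > 0 else 0.0
```

The benchmark score is then the mean of `ndcg_at_k` over all queries.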

Latency and Efficiency

With pruning, SPLADE-Code achieves sub-millisecond latency (under 1 ms per query on a collection of 1M code passages) with only a marginal loss in effectiveness. A two-step retrieval paradigm, in which aggressive initial pruning is followed by a less restrictive refinement, balances efficiency and effectiveness and is a practical option for LLM agentic systems with stringent latency constraints. The dense models in the comparison rely on HNSW search (via FAISS), which is scalable but less performant for the long documents encountered in code retrieval.
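The two-step idea can be illustrated with a toy inverted index. The sketch below assumes top-k term pruning on both documents and queries and a simple dot-product rescoring step; the summary does not specify the actual pruning rules, so the functions `prune`, `build_index`, `search`, and `two_step` and their parameters are hypothetical.

```python
from collections import defaultdict

def prune(vec, k):
    """Keep the k highest-weight terms of a sparse vector {term: weight}."""
    return dict(sorted(vec.items(), key=lambda kv: kv[1], reverse=True)[:k])

def build_index(docs, doc_k):
    """Inverted index term -> [(doc_id, weight)] over pruned documents."""
    index = defaultdict(list)
    for doc_id, vec in docs.items():
        for term, w in prune(vec, doc_k).items():
            index[term].append((doc_id, w))
    return index

def search(index, query, top_n):
    """Score documents by sparse dot product, touching only matching postings."""
    scores = defaultdict(float)
    for term, qw in query.items():
        for doc_id, dw in index.get(term, ()):
            scores[doc_id] += qw * dw
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

def two_step(index, docs, query, k1, k2, n_candidates, top_n):
    """Step 1: aggressively pruned query (k1 terms) fetches candidates.
    Step 2: a less pruned query (k2 >= k1 terms) rescores only those candidates.
    """
    candidates = [d for d, _ in search(index, prune(query, k1), n_candidates)]
    q2 = prune(query, k2)
    rescored = [
        (d, sum(w * docs[d].get(t, 0.0) for t, w in q2.items()))
        for d in candidates
    ]
    return sorted(rescored, key=lambda kv: kv[1], reverse=True)[:top_n]
```

The first step traverses few, short posting lists, so its cost is small; the second step touches only `n_candidates` documents, so using more query terms there adds little latency.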

Interpretability and Representation Analysis

LSR models are inherently interpretable: SPLADE-Code's sparse vectors activate only a small subset of vocabulary tokens, each of which maps directly to a semantic or lexical constituent. Expansion tokens, which are absent from the input but learned during training, are pivotal in bridging lexical and semantic matching. For instance, queries about sorting algorithms activate tokens such as "quicksort", "pivot", and "sort" even when the query does not mention them, while code snippets are mapped through shared English pivot tokens, enabling cross-language matching. Approximately 65% of top activations are expansion terms, underscoring their role in retrieval abstraction.
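A statistic like the expansion share of top activations can be computed by comparing the highest-weight vocabulary terms against the tokens actually present in the input. The sketch below is illustrative only: the function names are hypothetical, and a real analysis would use the model's tokenizer rather than plain strings.

```python
def split_activations(rep, input_tokens, top_k=10):
    """Split the top-k activated terms into lexical and expansion terms.

    rep:          sparse representation {token: weight}.
    input_tokens: tokens that literally occur in the input.
    """
    top = sorted(rep.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    present = set(input_tokens)
    lexical = [(t, w) for t, w in top if t in present]
    expansion = [(t, w) for t, w in top if t not in present]
    return lexical, expansion

def expansion_ratio(rep, input_tokens, top_k=10):
    """Fraction of the top-k activations that are expansion terms."""
    lexical, expansion = split_activations(rep, input_tokens, top_k)
    total = len(lexical) + len(expansion)
    return len(expansion) / total if total else 0.0
```

For example, with made-up weights `{"sort": 2.0, "quicksort": 1.5, "pivot": 1.2, "list": 0.9}` for the input tokens `["sort", "a", "list"]`, the terms "quicksort" and "pivot" count as expansion terms and the ratio is 0.5.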

Ablation Studies and Design Choices

Performance is stable across various LLM backbones (generalist, code-specialized), indicating SPLADE-Code's robustness to projection space alterations. Intermediate English fine-tuning, beneficial for dense retrieval, reduces SPLADE-Code's effectiveness, supporting the hypothesis that sparse models are sufficiently equipped for lexical matching without additional bridging. Instruction-tuning has minimal impact, while sparse autoencoder (SPLARE) variants achieve competitive results, suggesting matching spaces are largely governed by shared natural language representations.

Implications and Future Directions

SPLADE-Code sets a precedent for LSR in code retrieval, offering a clear alternative to dense embeddings with strong effectiveness, interpretability, and latency profiles. Its generalization properties and semantic expansion mechanisms are applicable to multilingual, cross-modal, and domain-variant retrieval tasks. Integration into agentic LLM frameworks could enable efficient, interpretable tool interactions, repository exploration, and software development workflows.

Theoretical implications include the viability of vocabulary-based sparse representations for highly structured, semantic domains such as code, challenging the notion that dense embeddings alone are sufficient. Practically, SPLADE-Code's latency and interpretability advantages recommend its adoption in retrieval-heavy AI coding systems, supporting real-time, explainable agent behaviors.

Open challenges include further external validation on real-world developer queries, repository-level idiosyncrasies, and interactive debugging contexts. End-to-end impact assessments involving reranking, generation, and tool use remain essential for comprehensive evaluation. Direct latency comparisons should account for system architecture differences (CPU vs. GPU search paradigms).

Conclusion

SPLADE-Code demonstrates that learned sparse retrieval is both effective and efficient for code retrieval across diverse tasks and benchmarks, with robust generalization and interpretability. Expansion tokens and domain abstraction are central to its performance, and the model presents strong prospects for integration into next-generation AI software engineering systems. Future work will further explore its applicability to complex retrieval contexts and its interplay with downstream agentic operations.
