Learned Sparse Retrieval (LSR)
- Learned Sparse Retrieval (LSR) is a neural retrieval method that generates high-dimensional sparse bag-of-words representations for efficient and semantically accurate search.
- It leverages fine-tuned Transformer models along with learned term reweighting and expansion to balance semantic coverage with explicit sparsity.
- LSR systems integrate neural outputs with traditional inverted index structures, enabling fast, scalable, and interpretable search in production environments.
Learned Sparse Retrieval (LSR) designates a class of neural information retrieval systems that generate high-dimensional, lexically interpreted, and explicitly sparse bag-of-words representations for queries and documents. These representations—typically produced by fine-tuned Transformer models—can be stored and processed efficiently in conventional inverted index structures, thus aligning the advances of deep neural contextualization with the production efficiency of classic IR engines. LSR methods underpin many modern neural search pipelines and are characterized by their reliance on learned term reweighting, term expansion, and explicit regularization procedures. By fusing neural and lexical paradigms, LSR provides both state-of-the-art semantic retrieval accuracy and system-level efficiency for large-scale, real-time search tasks.
1. Architectural Foundations and Sparse Representation
Fundamentally, an LSR system encodes input text (query or document) into a high-dimensional sparse vector whose dimensions correspond to a selected vocabulary—commonly a WordPiece vocabulary or another corpus-specific token inventory. Encoding typically leverages a Transformer backbone, augmented by a sparse projection head. Two principal head types are widely studied:
- MLP head: Produces per-token weights only for tokens present in the input (non-expanding).
- MLM head: Computes output logits for all vocabulary tokens, enabling the model to perform context-dependent expansion (activating additional tokens not present in the surface form).
The resulting vector is strictly sparse—most dimensions are zero—with sparse activations selected by either max-pooling (as in SPLADE-style models) or by explicit expansion/weighting rules. The document's representation can be summarized formulaically as:

$$
w_j = \max_{i \in \{1, \dots, |d|\}} \log\bigl(1 + \mathrm{ReLU}(s_{i,j})\bigr)
$$

where $s_{i,j}$ is the contextual relevance logit for token $j$ at position $i$.
During retrieval, LSR relies on simple, interpretable dot products between sparse query and document vectors:

$$
\mathrm{score}(q, d) = \sum_{j \in V} q_j \, d_j
$$

where $V$ is the (possibly expanded) vocabulary.
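To make the encoding and scoring steps concrete, the following is a minimal sketch of SPLADE-style max-pooled encoding with an MLM head, followed by dot-product scoring. It assumes PyTorch and Hugging Face transformers; the checkpoint name is illustrative, and any MLM-head encoder fine-tuned for LSR could stand in.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Illustrative checkpoint: any MLM-head encoder fine-tuned for LSR could stand in.
MODEL_NAME = "naver/splade-cocondenser-ensembledistil"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()

def encode_sparse(text: str) -> torch.Tensor:
    """Encode text into a |V|-dimensional vector via max-pooled, log-saturated MLM logits."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits               # (1, seq_len, |V|)
    weights = torch.log1p(torch.relu(logits))         # w_{i,j} = log(1 + ReLU(s_{i,j}))
    mask = inputs["attention_mask"].unsqueeze(-1)     # zero out padding positions
    return (weights * mask).max(dim=1).values.squeeze(0)   # max-pool over positions -> (|V|,)

def dot_product_score(query: str, document: str) -> float:
    """Interpretable dot product between sparse query and document vectors."""
    return float(torch.dot(encode_sparse(query), encode_sparse(document)))

print(dot_product_score("what causes tides",
                        "Tides are caused by the gravitational pull of the moon."))
```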
2. Regularization and Sparsification Techniques
A central challenge in LSR is controlling the degree of sparsity, balancing semantic coverage (term expansion) and efficiency. The most common regularization objective is the FLOPS loss, given by:

$$
\mathcal{L}_{\mathrm{FLOPS}} = \sum_{j \in V} \left( \frac{1}{N} \sum_{i=1}^{N} w_j^{(i)} \right)^2
$$

where $w_j^{(i)}$ is the activation of token $j$ in example $i$ of a batch of size $N$. This loss pushes down the activation frequency of tokens across the batch, enforcing global sparsity.
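A minimal sketch of the FLOPS regularizer as computed over a training batch; it assumes the encoder has already produced a (batch_size, |V|) matrix of non-negative term activations, and the ranking loss and weighting coefficients in the usage comment are assumed placeholders.

```python
import torch

def flops_loss(batch_weights: torch.Tensor) -> torch.Tensor:
    """FLOPS regularizer: sum over the vocabulary of the squared mean activation per token.

    batch_weights: (batch_size, vocab_size) tensor of non-negative activations w_j^(i).
    """
    mean_per_token = batch_weights.mean(dim=0)   # average activation of each token in the batch
    return (mean_per_token ** 2).sum()

# Assumed usage inside a training step (ranking_loss, lambda_q, lambda_d are placeholders):
# total_loss = ranking_loss + lambda_q * flops_loss(q_weights) + lambda_d * flops_loss(d_weights)
```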
Variants and augmentations of this regularizer have been proposed:
- DF-FLOPS (Porco et al., 21 May 2025): Adds a document-frequency-conditioned weight to the standard FLOPS loss, penalizing overuse of high-DF terms (which would otherwise produce long posting lists and high latency during retrieval).
- IDF-aware FLOPS (Geng et al., 7 Nov 2024): Penalizes common terms more than rare ones, ensuring the model focuses on semantically informative tokens.
These regularizers are often complemented by static pruning (truncating representations to the top-$k$ tokens after model output), further balancing quality and efficiency.
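A sketch of post-hoc static pruning, under the assumption that the encoder output is a dense |V|-dimensional weight vector; only the k highest-weighted entries are retained before indexing, and the default k is an illustrative choice.

```python
import torch

def prune_top_k(weights: torch.Tensor, k: int = 128) -> dict[int, float]:
    """Static pruning: keep only the k highest-weighted vocabulary entries (k is illustrative)."""
    k = min(k, weights.numel())
    values, indices = torch.topk(weights, k)
    # Drop zero entries so the result stays a genuinely sparse mapping.
    return {int(i): float(v) for i, v in zip(indices, values) if v > 0}
```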
3. Advances in Vocabulary Configuration
The role of vocabulary in LSR models extends beyond lexical coverage to serve as the effective “representational specification” for the model’s output space (Kim et al., 20 Sep 2025). Empirical studies show:
- Expanded vocabularies (e.g., 100K custom token sets via ESPLADE pretraining) confer higher representational granularity, yielding enhanced recall and MRR, especially under aggressive pruning.
- Corpus-specific vocabularies (Yu et al., 12 Jan 2024)—tokenizer and vocabulary learned directly from the target corpus—result in both shorter posting lists (improved efficiency) and higher retrieval effectiveness.
- Dynamic vocabularies (Nguyen et al., 10 Oct 2024): Augment the fixed token set with entities (e.g., Wikipedia concepts), addressing entity fragmentation and enabling tracking of evolving knowledge.
Initialization quality is critical; pretrained embeddings (e.g., EMLM from large web corpora) substantially outperform random weight initializations, even at similar vocabulary sizes.
4. System-Level Retrieval, Hybrid Scoring, and Index Traversal
LSR’s integration with inverted indexes enables the realization of efficient retrieval through both lexical and neural signals:
- Hybrid scoring (Qiao et al., 2022, Mallia et al., 2022): Combines classical BM25 scores with learned neural weights for both early skipping (dynamic pruning) and final ranking. The final score typically takes the form (a minimal sketch follows this list):

$$
S(q, d) = \alpha \cdot \mathrm{BM25}(q, d) + (1 - \alpha) \cdot \sum_{j \in V} q_j \, d_j
$$
- Dual skipping guidance and guided traversal (Qiao et al., 2022, Mallia et al., 2022): Use both BM25 and neural upper bounds/thresholds to efficiently traverse and prune the inverted index, significantly reducing tail and mean latency.
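A minimal sketch of the hybrid score above, assuming the per-term BM25 contributions and learned sparse weights for a candidate document are already available; the guided-traversal machinery (skipping on BM25 or neural upper bounds) is omitted, and alpha is an assumed tuning knob.

```python
def hybrid_score(bm25_scores: dict[str, float],
                 query_weights: dict[str, float],
                 doc_weights: dict[str, float],
                 alpha: float = 0.3) -> float:
    """Interpolate a classical BM25 score with a learned sparse dot product.

    bm25_scores: per-query-term BM25 contributions for the candidate document.
    query_weights / doc_weights: learned sparse term weights.
    alpha: interpolation weight for the BM25 component (an assumed tuning knob).
    """
    bm25 = sum(bm25_scores.values())
    neural = sum(w * doc_weights.get(t, 0.0) for t, w in query_weights.items())
    return alpha * bm25 + (1.0 - alpha) * neural
```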
Dynamic pruning strategies have been tailored for LSR’s unique index statistics:
- Block-Max Pruning (BMP) (Mallia et al., 2 May 2024): Structures the index as blocks, computes block-level impact maxima for each term, and uses block-level upper bounds for safe and approximate early termination (a toy sketch follows this list). BMP achieves 2×–60× acceleration over classic WAND-style pruning while maintaining precision.
- Clustered and hybrid indexing (Bruch et al., 8 Aug 2024): SeismicWave combines block ordering with graph-based k-NN neighbor expansion to approach near-exact recall at reduced computational cost.
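A toy sketch of the block-max idea from the BMP bullet above: documents are grouped into fixed-size blocks, each block carries a per-term impact maximum, and a block is fully scored only if its upper bound can beat the current top-k threshold. The block size, doc-id-based blocking, and data layout are illustrative assumptions; the real systems add block reordering, compression, and approximate modes.

```python
import heapq
from collections import defaultdict

def block_max_search(postings: dict[str, dict[int, float]],
                     query: dict[str, float],
                     k: int = 10,
                     block_size: int = 64) -> list[tuple[float, int]]:
    """Toy block-max pruning over impact-scored postings (term -> {doc_id: impact})."""
    def block_of(doc_id: int) -> int:
        return doc_id // block_size

    # Per-block upper bound: sum over query terms of (query weight * block-max impact).
    block_ub: dict[int, float] = defaultdict(float)
    for term, weight in query.items():
        per_block_max: dict[int, float] = defaultdict(float)
        for doc_id, impact in postings.get(term, {}).items():
            b = block_of(doc_id)
            per_block_max[b] = max(per_block_max[b], impact)
        for b, m in per_block_max.items():
            block_ub[b] += weight * m

    heap: list[tuple[float, int]] = []   # min-heap holding the current top-k (score, doc_id)
    # Visit blocks in decreasing upper-bound order; stop once no block can beat the threshold.
    for b, ub in sorted(block_ub.items(), key=lambda x: -x[1]):
        if len(heap) == k and ub <= heap[0][0]:
            break                        # safe early termination
        scores: dict[int, float] = defaultdict(float)
        for term, weight in query.items():
            for doc_id, impact in postings.get(term, {}).items():
                if block_of(doc_id) == b:
                    scores[doc_id] += weight * impact
        for doc_id, s in scores.items():
            if len(heap) < k:
                heapq.heappush(heap, (s, doc_id))
            elif s > heap[0][0]:
                heapq.heapreplace(heap, (s, doc_id))
    return sorted(heap, reverse=True)

# Example:
# postings = {"tide": {0: 2.1, 70: 1.4}, "moon": {0: 1.7, 130: 0.9}}
# print(block_max_search(postings, {"tide": 1.0, "moon": 0.8}, k=2))
```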
5. Expansion to Long Documents and Context-Aware Scoring
Standard LSR aggregation operates over token-level (windowed) segments, which is suboptimal for long documents due to the lack of position and proximity modeling:
- Sequential Dependence Model adaptations (ExactSDM and SoftSDM) (Nguyen et al., 2023, Lionis et al., 31 Mar 2025): Integrate n-gram phrase and proximity features via weighted max-pooling over segments (a sketch of segment-level aggregation follows this list). The core matching functions combine unigram, ordered-window, and unordered-window evidence:

$$
\mathrm{score}(q, d) = \lambda_T \sum_{t \in q} f_T(t, d) + \lambda_O \sum_{(t_i, t_{i+1}) \in q} f_O(t_i, t_{i+1}, d) + \lambda_U \sum_{(t_i, t_{i+1}) \in q} f_U(t_i, t_{i+1}, d)
$$

where $f_T$, $f_O$, and $f_U$ score single terms, ordered (phrase) windows, and unordered (proximity) windows, respectively; ExactSDM evaluates them over exact term matches, while SoftSDM uses the learned term weights.
- Empirical assessment (Lionis et al., 31 Mar 2025): Demonstrates that, across many datasets, the first segment is disproportionately impactful. Combining early-segment scoring with global term aggregation yields robust long-document retrieval.
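A sketch of segment-level aggregation for long documents, assuming each fixed-length window has already been encoded into a sparse term-weight mapping; term weights are max-pooled across segments, and an optional boost for the first segment reflects the empirical finding above. The boost parameter and the dict-based representation are illustrative assumptions.

```python
def aggregate_segments(segment_vecs: list[dict[str, float]],
                       first_segment_boost: float = 1.0) -> dict[str, float]:
    """Max-pool learned term weights across fixed-length document segments.

    segment_vecs: one sparse term->weight mapping per window, in document order.
    first_segment_boost: multiplier > 1 upweights the first segment (illustrative knob).
    """
    doc_vec: dict[str, float] = {}
    for i, seg in enumerate(segment_vecs):
        scale = first_segment_boost if i == 0 else 1.0
        for term, w in seg.items():
            doc_vec[term] = max(doc_vec.get(term, 0.0), scale * w)
    return doc_vec

def dot_score(query_vec: dict[str, float], doc_vec: dict[str, float]) -> float:
    """Standard sparse dot product between query and aggregated document vectors."""
    return sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())
```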
6. Extending LSR to Multimodal and Conversational Retrieval
Recent work has established the viability of LSR for cross-modal and dialogue-driven search:
- Multimodal LSR (Nguyen et al., 12 Feb 2024, Song et al., 22 Aug 2025): Adapts LSR for text–image retrieval by projecting vision-language representations through a sparse head, using mutual knowledge distillation with dense branches for bi-directional improvement. Performance matches or surpasses dense counterparts while remaining compatible with inverted index engines.
- Entity integration (Nguyen et al., 10 Oct 2024): Employs dynamic vocabularies with entities for improved ambiguity resolution and adaptation to current knowledge, directly integrating candidate entity signals into the output sparse representation.
- Conversational LSR with score distillation (Lupart et al., 18 Oct 2024): DiSCo distills teacher similarity scores directly (rather than representations), enabling the student sparse retriever to more flexibly align with LLM-rewritten conversational queries. Multi-teacher fusion further improves both in-domain and generalization metrics.
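A minimal sketch of score-level distillation in the spirit of the DiSCo bullet above: the student sparse retriever's query-document scores are regressed toward teacher similarity scores. The MSE objective and the averaging-based multi-teacher fusion shown in the comment are assumed simplifications, not DiSCo's exact formulation.

```python
import torch
import torch.nn.functional as F

def score_distillation_loss(student_scores: torch.Tensor,
                            teacher_scores: torch.Tensor) -> torch.Tensor:
    """Regress the sparse student's relevance scores toward teacher similarity scores.

    Both tensors have shape (batch_size, num_candidates).
    """
    return F.mse_loss(student_scores, teacher_scores)

# Assumed multi-teacher fusion: average the teachers' score matrices before distillation.
# teacher_scores = torch.stack([teacher_a_scores, teacher_b_scores]).mean(dim=0)
```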
7. Advances in Model Scaling, Inference-Free Architectures, and Productionization
The practical deployment of LSR in production search applications is advancing rapidly:
- Causal and decoder-only LLMs for LSR (Xu et al., 15 Apr 2025, Doshi et al., 20 Aug 2024, Qiao et al., 25 Apr 2025): Techniques such as adaptation phases, echo embeddings, and bidirectional context injection address the limitations of unidirectional transformers, allowing scaling up to 8B+ parameter LLMs. These yield reduced index sizes (< 8GB vs. > 135GB dense) and strong MRR/nDCG with quantization-tolerant regularization.
- Inference-free LSR (Nardini et al., 30 Apr 2025, Geng et al., 7 Nov 2024): Models like Li-LSR pre-learn a tokenwise score table, replacing the query encoder with a fast lookup (a minimal sketch follows this list). Such models achieve near state-of-the-art effectiveness and run at 1.1× BM25 latency.
- DF-aware sparsification for real-world latency (Porco et al., 21 May 2025): DF-FLOPS penalizes overuse of common terms during training, reducing the posting-list lengths and cutting real-world latency by up to 10× without substantial loss in relevance metrics. This makes LSR feasible for deployment in production-grade engines such as Solr.
- Trade-off control: The introduction of new regularization objectives (e.g., dynamic vocabulary scaling, hybrid dense/sparse distillation) allows system designers to explicitly control the efficiency–effectiveness Pareto frontier.
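A sketch of the inference-free query path referenced in the list above, assuming a tokenwise score table has been learned offline; query encoding then reduces to tokenization plus table lookups, with no Transformer forward pass. The whitespace tokenizer and table format are illustrative assumptions.

```python
def encode_query_inference_free(query: str,
                                token_score_table: dict[str, float]) -> dict[str, float]:
    """Build a sparse query vector from a pre-learned tokenwise score table (no encoder call).

    token_score_table: token -> learned importance score, produced offline at training time.
    A simple whitespace tokenizer stands in for the model's real tokenizer.
    """
    q_vec: dict[str, float] = {}
    for tok in query.lower().split():
        if tok in token_score_table:
            # Repeated query tokens accumulate, mirroring term frequency.
            q_vec[tok] = q_vec.get(tok, 0.0) + token_score_table[tok]
    return q_vec
```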
Summary Table: Key Axes in LSR System Design
| Axis | Representative Technique(s) | System Effect |
|---|---|---|
| Sparse encoding | Transformer + MLM head, MLP head | Explicit term expansion, semantic coverage |
| Regularization | FLOPS, DF-FLOPS, IDF-aware, static pruning | Controls sparsity, latency, efficiency |
| Vocabulary | Corpus-specific, ESPLADE, dynamic-entity | Alters granularity, generalization, latency |
| Index traversal | Dual skipping, Block-Max Pruning, Seismic | Query-time acceleration, top-k safety/recall |
| Multimodal/Entity | MLM/MLP vision-language heads, DyVo | Enables robust retrieval for images/entities |
| Scaling/Prod | Causal LLMs, Li-LSR, quantization | Reduces deployment cost, enables production |
By integrating neural term weighting, learned expansion, and advanced regularization with inverted index structures, LSR systems realize a retrieval framework that is both interpretable and efficient. Recent innovations address the challenges of vocabulary adaptation, document length modeling, multimodal fusion, and latency optimization, constituting a robust basis for modern semantic retrieval infrastructure.