Learned Sparse Retrieval (LSR)

Updated 28 September 2025
  • Learned Sparse Retrieval (LSR) is a neural retrieval method that generates high-dimensional sparse bag-of-words representations for efficient and semantically accurate search.
  • It leverages fine-tuned Transformer models along with learned term reweighting and expansion to balance semantic coverage with explicit sparsity.
  • LSR systems integrate neural outputs with traditional inverted index structures, enabling fast, scalable, and interpretable search in production environments.

Learned Sparse Retrieval (LSR) designates a class of neural information retrieval systems that generate high-dimensional, lexically interpreted, and explicitly sparse bag-of-words representations for queries and documents. These representations—typically produced by fine-tuned Transformer models—can be stored and processed efficiently in conventional inverted index structures, thus aligning the advances of deep neural contextualization with the production efficiency of classic IR engines. LSR methods underpin many modern neural search pipelines and are characterized by their reliance on learned term reweighting, term expansion, and explicit regularization procedures. By fusing neural and lexical paradigms, LSR provides both state-of-the-art semantic retrieval accuracy and system-level efficiency for large-scale, real-time search tasks.

1. Architectural Foundations and Sparse Representation

Fundamentally, an LSR system encodes input text (query or document) into a high-dimensional sparse vector whose dimensions correspond to a selected vocabulary—commonly the WordPiece or other corpus-specific token inventory. Encoding typically leverages a Transformer backbone, augmented by a sparse projection head. Two principal head types are widely studied:

  • MLP head: Produces per-token weights for only tokens present in the input (non-expanding).
  • MLM head: Computes output logits for all vocabulary tokens, enabling the model to perform context-dependent expansion (activating additional tokens not present in the surface form).

The resulting vector is strictly sparse—most dimensions are zero—with sparse activations selected by either max-pooling (as in SPLADE-style models) or by explicit expansion/weighting rules. The document’s representation can be summarized formulaically as:

w_j = \max_{i} \log\left(1 + \operatorname{ReLU}(w_{i,j})\right)

where w_{i,j} is the contextual relevance logit for token j at position i.
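
As a concrete illustration, the max-pooled saturation above can be sketched with a masked-LM backbone in PyTorch/Transformers. The checkpoint name is a placeholder, and a production LSR encoder would additionally be fine-tuned with the sparsity regularizers of Section 2; this is a minimal sketch, not any specific system's released code.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Placeholder checkpoint; a real LSR encoder would be fine-tuned for sparsity.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def encode_sparse(text: str) -> torch.Tensor:
    """SPLADE-style pooling: w_j = max_i log(1 + ReLU(w_ij)) over input positions i."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits               # (1, seq_len, |V|) MLM logits
    sat = torch.log1p(torch.relu(logits))             # log(1 + ReLU(.)) saturation
    mask = inputs["attention_mask"].unsqueeze(-1)     # zero out padding positions
    return (sat * mask).max(dim=1).values.squeeze(0)  # max over positions -> (|V|,)

vec = encode_sparse("learned sparse retrieval with inverted indexes")
print((vec > 0).sum().item(), "active dimensions out of", vec.numel())
```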

During retrieval, LSR relies on simple, interpretable dot products between sparse query and document vectors:

\operatorname{score}(q, d) = f_Q(q) \cdot f_D(d) = \sum_{k \in V} w_q^k \cdot w_d^k

where V is the (possibly expanded) vocabulary.
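
In an inverted-index implementation, this dot product is accumulated by walking the posting lists of the query's non-zero terms. A toy term-at-a-time sketch follows; the data structures are purely illustrative.

```python
from collections import defaultdict

# Toy inverted index over learned impacts: term_id -> list of (doc_id, weight).
# In practice postings are produced offline by the document encoder f_D.
index: dict[int, list[tuple[int, float]]] = defaultdict(list)

def add_document(doc_id: int, doc_vec: dict[int, float]) -> None:
    for term_id, w in doc_vec.items():
        if w > 0.0:
            index[term_id].append((doc_id, w))

def score_all(query_vec: dict[int, float]) -> dict[int, float]:
    """score(q, d) = sum_k w_q^k * w_d^k, accumulated term-at-a-time."""
    scores: dict[int, float] = defaultdict(float)
    for term_id, w_q in query_vec.items():
        for doc_id, w_d in index.get(term_id, []):
            scores[doc_id] += w_q * w_d
    return scores

add_document(0, {7: 1.2, 42: 0.4})
add_document(1, {42: 0.9, 99: 2.1})
print(sorted(score_all({42: 1.0, 99: 0.5}).items(), key=lambda kv: -kv[1]))
```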

2. Regularization and Sparsification Techniques

A central challenge in LSR is controlling the degree of sparsity, balancing semantic coverage (term expansion) and efficiency. The most common regularization objective is the FLOPS loss, given by:

\operatorname{FLOPS} = \sum_{i=1}^{|V|} \left( \frac{1}{N} \sum_{j=1}^{N} w_{ji} \right)^2

where w_{ji} is the activation of token i in example j. This loss pushes down the activation frequency of tokens across the batch, enforcing global sparsity.
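
A minimal PyTorch sketch of this regularizer, assuming a batch of non-negative sparse representations of shape (N, |V|):

```python
import torch

def flops_loss(batch_weights: torch.Tensor) -> torch.Tensor:
    """FLOPS regularizer: sum over the vocabulary of the squared per-term mean activation.

    batch_weights: (N, |V|) non-negative sparse representations for a batch.
    Penalizing the squared mean pushes activation mass off terms that fire
    across many examples, which enforces global sparsity.
    """
    mean_per_term = batch_weights.mean(dim=0)   # (|V|,): (1/N) * sum_j w_ji
    return (mean_per_term ** 2).sum()

# Example: a batch of 4 representations over a 6-term vocabulary.
w = torch.rand(4, 6).clamp_min(0)
print(flops_loss(w))
```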

Variants and augmentations of this regularizer have been proposed:

  • DF-FLOPS (Porco et al., 21 May 2025): Adds a document-frequency-conditioned weight to the standard FLOPS loss, penalizing overuse of high-DF terms (which would otherwise produce long posting lists and high latency during retrieval).

\ell_{\mathrm{DF\text{-}FLOPS}} = \sum_{t \in V} \left[ \frac{w_t}{N} \sum_{i=1}^{N} r_{i,t} \right]^2

  • IDF-aware FLOPS (Geng et al., 7 Nov 2024): Penalizes common terms more than rare ones, ensuring the model focuses on semantically informative tokens.

These regularizers are often complemented by static pruning (truncating representations to the top-k tokens after model output), further balancing quality and efficiency.
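
Static pruning itself is a simple post-processing step; a small sketch, assuming the encoder output is a dense tensor over the vocabulary:

```python
import torch

def prune_top_k(sparse_vec: torch.Tensor, k: int) -> torch.Tensor:
    """Static pruning: keep only the k highest-weight terms, zero out the rest."""
    if (sparse_vec > 0).sum() <= k:
        return sparse_vec
    topk = torch.topk(sparse_vec, k)
    pruned = torch.zeros_like(sparse_vec)
    pruned[topk.indices] = topk.values
    return pruned

vec = torch.tensor([0.0, 2.3, 0.1, 0.0, 1.7, 0.4])
print(prune_top_k(vec, k=2))   # keeps only the two largest weights
```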

3. Advances in Vocabulary Configuration

The role of vocabulary in LSR models extends beyond lexical coverage to serve as the effective “representational specification” for the model’s output space (Kim et al., 20 Sep 2025). Empirical studies show:

  • Expanded vocabularies (e.g., 100K custom token sets via ESPLADE pretraining) confer higher representational granularity, yielding enhanced recall and MRR, especially under aggressive pruning.
  • Corpus-specific vocabularies (Yu et al., 12 Jan 2024)—tokenizer and vocabulary learned directly from the target corpus—result in both shorter posting lists (improved efficiency) and higher retrieval effectiveness.
  • Dynamic vocabularies (Nguyen et al., 10 Oct 2024): Augment the fixed token set with entities (e.g., Wikipedia concepts), addressing entity fragmentation and enabling tracking of evolving knowledge.

Initialization quality is critical; pretrained embeddings (e.g., EMLM from large web corpora) substantially outperform random weight initializations, even at similar vocabulary sizes.

4. System-Level Retrieval, Hybrid Scoring, and Index Traversal

LSR’s integration with inverted indexes enables the realization of efficient retrieval through both lexical and neural signals:

  • Hybrid scoring (Qiao et al., 2022, Mallia et al., 2022): Combines classical BM25 scores with learned neural weights for both early skipping (dynamic pruning) and final ranking; a minimal interpolation sketch follows this list. The final score typically takes the form:

\operatorname{RankScore}(d, \beta) = \beta \cdot \operatorname{RankScore}_B(d) + (1 - \beta) \cdot \operatorname{RankScore}_L(d)

  • Dual skipping guidance and guided traversal (Qiao et al., 2022, Mallia et al., 2022): Use both BM25 and neural upper bounds/thresholds to efficiently traverse and prune the inverted index, significantly reducing tail and mean latency.
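
As referenced above, the interpolation itself is straightforward. A minimal sketch, assuming the BM25 and learned scores have already been normalized to comparable ranges and that β is a tuning knob:

```python
def fused_rank_score(bm25_score: float, learned_score: float, beta: float = 0.5) -> float:
    """RankScore(d, beta) = beta * BM25 score + (1 - beta) * learned sparse score."""
    return beta * bm25_score + (1.0 - beta) * learned_score

# Candidates as (doc_id, BM25 score, learned-sparse dot-product score).
candidates = [(0, 12.4, 7.1), (1, 9.8, 9.6), (2, 14.0, 3.2)]
ranked = sorted(
    ((doc_id, fused_rank_score(b, l, beta=0.3)) for doc_id, b, l in candidates),
    key=lambda kv: -kv[1],
)
print(ranked)
```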

Dynamic pruning strategies have been tailored for LSR’s unique index statistics:

  • Block-Max Pruning (BMP) (Mallia et al., 2 May 2024): Structures the index as blocks, computes block-level impact maxima for each term, and uses block-level upper bounds for safe and approximate early termination; a simplified traversal sketch follows this list. BMP achieves 2×–60× acceleration over classic WAND-style pruning while maintaining precision.
  • Clustered and hybrid indexing (Bruch et al., 8 Aug 2024): SeismicWave combines block ordering with graph-based k-NN neighbor expansion to approach near-exact recall at reduced computational cost.
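
The block-level skipping idea behind BMP can be sketched as follows. The data layout (per-block term maxima and per-block postings as plain dictionaries) is purely illustrative; the actual system uses compressed, impact-ordered structures and supports approximate variants in addition to the safe traversal shown here.

```python
import heapq
from collections import defaultdict

def block_max_search(query: dict[int, float],
                     block_term_max: list[dict[int, float]],
                     block_postings: list[dict[int, dict[int, float]]],
                     k: int = 10) -> list[tuple[float, int]]:
    """Simplified Block-Max-Pruning-style traversal (safe variant).

    block_term_max[b][t] : maximum impact of term t inside block b
    block_postings[b][t] : {doc_id: impact} postings of term t inside block b
    Blocks are visited in decreasing upper-bound order; once a block's bound
    falls below the current k-th best score, the remaining blocks are skipped.
    """
    # Upper bound of each block = sum over query terms of (query weight * block max).
    bounds = []
    for b, term_max in enumerate(block_term_max):
        ub = sum(w_q * term_max.get(t, 0.0) for t, w_q in query.items())
        bounds.append((ub, b))
    bounds.sort(reverse=True)

    top: list[tuple[float, int]] = []   # min-heap of (score, doc_id)
    for ub, b in bounds:
        if len(top) == k and ub <= top[0][0]:
            break                       # no remaining block can beat the threshold
        scores: dict[int, float] = defaultdict(float)
        for t, w_q in query.items():
            for doc_id, w_d in block_postings[b].get(t, {}).items():
                scores[doc_id] += w_q * w_d
        for doc_id, s in scores.items():
            if len(top) < k:
                heapq.heappush(top, (s, doc_id))
            elif s > top[0][0]:
                heapq.heapreplace(top, (s, doc_id))
    return sorted(top, reverse=True)
```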

5. Expansion to Long Documents and Context-Aware Scoring

Standard LSR aggregation operates over token-level (windowed) segments, which is suboptimal for long documents due to the lack of position and proximity modeling:

  • Sequential Dependence Model adaptations (ExactSDM and SoftSDM) (Nguyen et al., 2023, Lionis et al., 31 Mar 2025): Integrate n-gram phrase and proximity features via weighted max-pooling over segments. The core matching functions are given below, followed by a short code sketch of the ordered-window case:

\psi_{SO}(q_i \ldots q_{i+k}, D) = \lambda_{SO} \max_{1 \leq r \leq |D| - k} \sum_{l=0}^{k} w_q^{i+l} \cdot W^D_{r+l}[v(q_{i+l})]

\psi_{SU}(q_i \ldots q_{i+k}, D) = \lambda_{SU} \max_{1 \leq r \leq |D| - p} \sum_{h=i}^{i+k} w_q^h \left[ \max_{r \leq l < r+p} W^D_{l}[v(q_h)] \right]

where W^D is the document's position-by-vocabulary impact matrix, v(\cdot) maps a token to its vocabulary index, p is the proximity window size, and \lambda_{SO}, \lambda_{SU} weight the ordered and unordered window features.
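
A small NumPy sketch of the ordered-window term ψ_SO, assuming a dense positional impact matrix W^D for clarity (real systems keep it sparse); the unordered-window term ψ_SU is analogous, taking a max over a proximity window of width p instead of an exact alignment:

```python
import numpy as np

def psi_so(query_weights: np.ndarray,    # (k+1,) weights w_q^{i..i+k} of the query n-gram
           query_term_ids: np.ndarray,   # (k+1,) vocabulary ids v(q_i..q_{i+k})
           doc_impacts: np.ndarray,      # (|D|, |V|) positional impact matrix W^D
           lam_so: float = 1.0) -> float:
    """Ordered-window (SO) match: best exactly-aligned span of the n-gram in D."""
    k = len(query_term_ids) - 1
    doc_len = doc_impacts.shape[0]
    best = 0.0
    for r in range(doc_len - k):
        span = sum(query_weights[l] * doc_impacts[r + l, query_term_ids[l]]
                   for l in range(k + 1))
        best = max(best, span)
    return lam_so * best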

6. Extending LSR to Multimodal and Conversational Retrieval

Recent work has established the viability of LSR for cross-modal and dialogue-driven search:

  • Multimodal LSR (Nguyen et al., 12 Feb 2024, Song et al., 22 Aug 2025): Adapts LSR for text–image retrieval by projecting vision-language representations through a sparse head, using mutual knowledge distillation with dense branches for bi-directional improvement. Performance matches or surpasses dense counterparts while remaining compatible with inverted index engines.
  • Entity integration (Nguyen et al., 10 Oct 2024): Employs dynamic vocabularies with entities for improved ambiguity resolution and adaptation to current knowledge, directly integrating candidate entity signals into the output sparse representation.
  • Conversational LSR with score distillation (Lupart et al., 18 Oct 2024): DiSCo distills teacher similarity scores directly (rather than representations), enabling the student sparse retriever to more flexibly align with LLM-rewritten conversational queries. Multi-teacher fusion further improves both in-domain and generalization metrics.
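
To make the score-distillation idea concrete, here is a minimal sketch in which the student's sparse dot products are regressed onto precomputed teacher similarities; the actual DiSCo objective and its multi-teacher fusion details may differ from this simplification.

```python
import torch
import torch.nn.functional as F

def score_distillation_loss(student_q: torch.Tensor,      # (B, |V|) sparse query reps
                            student_d: torch.Tensor,      # (B, |V|) sparse doc reps
                            teacher_scores: torch.Tensor  # (B,) teacher similarity scores
                            ) -> torch.Tensor:
    """Distill relevance scores directly, leaving the student free to choose
    whatever sparse representations reproduce the teacher's rankings."""
    student_scores = (student_q * student_d).sum(dim=-1)   # sparse dot products
    return F.mse_loss(student_scores, teacher_scores)
```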

7. Advances in Model Scaling, Inference-Free Architectures, and Productionization

The practical deployment of LSR in production search applications is advancing rapidly:

  • Causal and decoder-only LLMs for LSR (Xu et al., 15 Apr 2025, Doshi et al., 20 Aug 2024, Qiao et al., 25 Apr 2025): Techniques such as adaptation phases, echo embeddings, and bidirectional context injection address the limitations of unidirectional transformers, allowing scaling up to 8B+ parameter LLMs. These yield reduced index sizes (< 8GB vs. > 135GB dense) and strong MRR/nDCG with quantization-tolerant regularization.
  • Inference-free LSR (Nardini et al., 30 Apr 2025, Geng et al., 7 Nov 2024): Models like Li-LSR pre-learn a tokenwise score table, replacing the query encoder with a fast lookup; a toy lookup sketch follows this list. Such models achieve near state-of-the-art effectiveness and run at 1.1× BM25 latency.
  • DF-aware sparsification for real-world latency (Porco et al., 21 May 2025): DF-FLOPS penalizes overuse of common terms during training, reducing the posting-list lengths and cutting real-world latency by up to 10× without substantial loss in relevance metrics. This makes LSR feasible for deployment in production-grade engines such as Solr.
  • Trade-off control: The introduction of new regularization objectives (e.g., dynamic vocabulary scaling, hybrid dense/sparse distillation) allows system designers to explicitly control the efficiency–effectiveness Pareto frontier.
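
As referenced above, the inference-free query side reduces to a table lookup; a toy sketch in which the table values, the whitespace tokenizer, and the summation over repeated terms are all illustrative assumptions:

```python
# Inference-free query encoding: each query term's weight is a precomputed lookup,
# so no neural forward pass is needed at query time. Table values are made up.
token_score_table: dict[str, float] = {
    "learned": 1.8,
    "sparse": 2.1,
    "retrieval": 2.4,
    "the": 0.0,       # uninformative terms end up with (near-)zero weights
}

def encode_query_inference_free(query: str) -> dict[str, float]:
    weights: dict[str, float] = {}
    for tok in query.lower().split():          # stand-in for the model's tokenizer
        w = token_score_table.get(tok, 0.0)
        if w > 0.0:
            weights[tok] = weights.get(tok, 0.0) + w   # repeated terms accumulate
    return weights

print(encode_query_inference_free("the learned sparse retrieval"))
# {'learned': 1.8, 'sparse': 2.1, 'retrieval': 2.4}
```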

Summary Table: Key Axes in LSR System Design

Axis | Representative Technique(s) | System Effect
Sparse encoding | Transformer + MLM head, MLP head | Explicit term expansion, semantic coverage
Regularization | FLOPS, DF-FLOPS, IDF-aware, static pruning | Controls sparsity, latency, efficiency
Vocabulary | Corpus-specific, ESPLADE, dynamic-entity | Alters granularity, generalization, latency
Index traversal | Dual skipping, Block-Max Pruning, Seismic | Query-time acceleration, top-k safety/recall
Multimodal/Entity | MLM/MLP vision-language heads, DyVo | Enables robust retrieval for images/entities
Scaling/Prod | Causal LLMs, Li-LSR, quantization | Reduces deployment cost, enables production

By integrating neural term weighting, learned expansion, and advanced regularization with inverted index machinery, LSR systems realize a retrieval framework that is both interpretable and efficient. Recent innovations address the challenges of vocabulary adaptation, document-length modeling, multimodal fusion, and latency optimization, constituting a robust basis for modern semantic retrieval infrastructure.
