Learned Sparse Models: SPLADE & uniCOIL

Updated 20 March 2026

Learned sparse models are neural architectures that produce high-dimensional, sparse vector representations via end-to-end training, enabling efficient inverted index usage.
SPLADE projects transformer token embeddings with an MLM head for expansion and sparsity, while uniCOIL assigns weights only to observed tokens for fast retrieval.
Advanced regularization and pruning techniques balance effectiveness with efficiency, optimizing index sizes, latency, and retrieval quality in practice.

Learned sparse models in information retrieval, exemplified by SPLADE and uniCOIL, are neural architectures that encode queries and documents as high-dimensional sparse vectors aligned to a fixed vocabulary. Unlike classical bag-of-words models, the term weights are learned end-to-end from large-scale supervision, often harnessing masked language modeling heads or reweighting mechanisms on top of transformer embeddings. These models are designed to leverage the efficiency of inverted indexes while introducing semantic matching and expansion capabilities historically associated with dense approaches.

1. Model Architectures: SPLADE and uniCOIL

SPLADE operates by projecting transformer token embeddings through an MLM head to produce contextual scores for every vocabulary term, then applying log-saturation to the rectified scores and max-pooling over the input tokens. The resulting representation for an input $x$ (either a query or document) is a vector $w[x]\in\mathbb{R}^{|V|}$ whose $k$-th entry is:

$w[k] = \max_{i \in T} \log(1 + \text{ReLU}(W_{i,k})),$

$T$ is the sequence of input tokens, and $W_{i,k}$ is the MLM logit for token $i$ mapping to term $k$ (Formal et al., 2021, Formal et al., 2021, Nguyen et al., 2023). The representations are highly sparse due to explicit regularization.
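
The pooling formula above can be illustrated with a minimal pure-Python sketch. The function name `splade_pool` and the toy logits are assumptions made for illustration; a real model would take the logits from a transformer MLM head rather than a hand-written list.

```python
import math

def splade_pool(logits):
    """Collapse per-token MLM logits into one sparse vocabulary vector.

    logits[i][k] plays the role of W_{i,k}: the MLM logit of input token i
    for vocabulary term k. Returns {term_index: weight}, omitting zeros.
    """
    vocab_size = len(logits[0])
    weights = {}
    for k in range(vocab_size):
        # Log-saturation of the rectified logit, then max over input tokens.
        w_k = max(math.log1p(max(0.0, row[k])) for row in logits)
        if w_k > 0.0:
            weights[k] = w_k
    return weights

# Toy input: 2 tokens, vocabulary of 4 terms (values are made up).
toy_logits = [
    [2.0, -1.0, 0.0, 0.5],
    [0.5, -3.0, 0.0, 1.5],
]
vec = splade_pool(toy_logits)
print(vec)  # only terms 0 and 3 receive a nonzero weight
```

Terms whose logits are non-positive for every token drop out through the ReLU, so the output is stored as a dictionary of nonzero entries rather than a dense vector.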

uniCOIL assigns a single learned weight to each token present in the input, omitting full-vocabulary expansion. The vector $d \in \mathbb{R}^{|V|}$ has non-zero entries only at the indices of observed tokens, and the query-document score is the dot product of the two vectors, which reduces to a sum over the tokens they share (Yang et al., 2021, Nguyen et al., 2023). Tokens absent from the input receive no weight, so no expansion occurs.

Both models enable direct use of traditional inverted indexes due to sparsity.
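
As a rough illustration of why sparsity makes inverted indexes applicable, the sketch below stores term weights in posting lists and scores a query by accumulating products over shared terms. The helper names (`build_inverted_index`, `search`) and the toy weights are hypothetical; production systems use compressed, quantized postings and dynamic pruning rather than plain dictionaries.

```python
from collections import defaultdict

def build_inverted_index(doc_vectors):
    """Map {doc_id: {term: weight}} to {term: [(doc_id, weight), ...]}."""
    index = defaultdict(list)
    for doc_id, vec in doc_vectors.items():
        for term, weight in vec.items():
            index[term].append((doc_id, weight))
    return index

def search(index, query_vec, top_n=10):
    """Score documents by the dot product over terms shared with the query."""
    scores = defaultdict(float)
    for term, q_weight in query_vec.items():
        # Only the posting lists of query terms are touched.
        for doc_id, d_weight in index.get(term, ()):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

# Toy learned weights (made up for illustration).
docs = {
    "d1": {"sparse": 1.2, "retrieval": 0.8},
    "d2": {"dense": 1.0, "retrieval": 0.5},
}
index = build_inverted_index(docs)
results = search(index, {"retrieval": 1.0, "sparse": 0.5})
print(results)  # d1 ranks above d2
```

The work done per query is proportional to the total length of the posting lists visited, which is why the number of nonzero terms and their document frequency matter for latency.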

2. Training Objectives and Regularization

The canonical loss for these models combines retrieval performance with explicit sparsity penalties:

Supervised loss is typically a contrastive or distillation loss (e.g., in-batch negative log-softmax, Margin-MSE to teacher cross-encoder scores) (Formal et al., 2021, Formal et al., 2022, Formal et al., 2021).
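Sparsity induction employs $\ell_1$ or “FLOPS” regularization. For a representation $w$, the $\ell_1$ penalty is

$R_{\ell_1} = \sum_{j \in V} |w_j|,$

while the FLOPS penalty squares the mean activation $\bar{a}_j$ of each term $j$ over a batch of $N$ representations $w^{(1)}, \dots, w^{(N)}$:

$R_{\text{FLOPS}} = \sum_{j \in V} \bar{a}_j^{2} = \sum_{j \in V} \left( \frac{1}{N} \sum_{i=1}^{N} w_j^{(i)} \right)^{2}.$

Both penalties encourage a small number of nonzeros per representation, and FLOPS additionally encourages balanced posting-list lengths (Formal et al., 2021, Formal et al., 2021).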
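
To make the difference between the two regularizers concrete, here is a small sketch that assumes the usual formulation of FLOPS as the sum over vocabulary terms of the squared mean activation across a batch. The function names and the toy batches are illustrative assumptions, not code from any of the cited papers.

```python
def l1_penalty(batch):
    """Mean over the batch of sum_j |w_j|."""
    return sum(sum(abs(w) for w in vec) for vec in batch) / len(batch)

def flops_penalty(batch):
    """Sum over terms of the squared mean absolute activation in the batch."""
    n = len(batch)
    vocab_size = len(batch[0])
    total = 0.0
    for j in range(vocab_size):
        mean_activation = sum(abs(vec[j]) for vec in batch) / n
        total += mean_activation ** 2
    return total

# Two toy batches over a 2-term vocabulary, each with one nonzero per vector.
concentrated = [[1.0, 0.0], [1.0, 0.0]]  # both vectors activate term 0
spread = [[1.0, 0.0], [0.0, 1.0]]        # activations split across terms

print(l1_penalty(concentrated), l1_penalty(spread))        # 1.0 1.0
print(flops_penalty(concentrated), flops_penalty(spread))  # 1.0 0.5
```

The $\ell_1$ penalty cannot tell the two batches apart, whereas the FLOPS penalty is lower when activations are spread over different terms, which corresponds to shorter and more even posting lists.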

Advanced strategies include DF-FLOPS, whereby additional regularization penalizes activations of terms with high document frequency, directly targeting search latency and posting-list length by discouraging frequent terms unless salient (Porco et al., 21 May 2025).

3. Expansion, Weighting, and Vocabulary Design

Learned sparse models integrate lexical expansion, weighting, and controlled vocabulary in various ways:

Expansion: SPLADE performs soft expansion of both queries and documents, assigning weights to vocabulary tokens absent from the input if the MLM head predicts them as contextually relevant (Formal et al., 2021, Formal et al., 2021, Mackenzie et al., 2023); uniCOIL applies only observed-token weighting.
Weighting: Both models learn fine-grained term importance, outperforming binary (presence/absence) weighting as in naive bag-of-words approaches (Nguyen et al., 2023).
Vocabulary: The choice and size of vocabulary directly affect sparsity and effectiveness. Expanded vocabularies (e.g., ESPLADE with 100K natural language tokens) enable greater representational capacity and finer granularity of matching, as established by controlled-vocabulary ablations (Kim et al., 20 Sep 2025, Mackenzie et al., 2023, Yu et al., 2024). Corpus-specific vocabularies and increased vocabulary size can reduce latency and improve recall by shortening average posting lists and preventing over-splitting of tokens.

4. Efficiency–Effectiveness Trade-offs and Indexing

The central challenge in learned sparse retrieval is balancing effectiveness with query/document lengths and latency:

Pruning Techniques: Top-$k$ masking, static thresholding, and hybrid thresholding selectively constrain the number of nonzeros in representations, offering precise trade-offs between index size, latency, and retrieval quality (Yang et al., 2021, Qiao et al., 2023).
Two-Step SPLADE: A cascade inference scheme using an aggressively pruned index for initial retrieval followed by full-SPLADE rescoring yields $12\times$ to $40\times$ speed-ups without statistically significant effectiveness loss on most benchmarks (Lassance et al., 2024).
Dynamic Pruning with Traversal Guidance: Techniques such as BM25-guided traversal, two-level pruning, and block-max index traversal prune candidate lists more efficiently, though aggressive BM25 guidance can harm recall if expansion terms diverge from BM25 distributions (Qiao et al., 2023).
DF-FLOPS Regularization: By penalizing use of high-frequency terms during training, practical average and tail latency can be reduced by an order of magnitude with minimal decrease in effectiveness (Porco et al., 21 May 2025).
Inference-free Retrieval: In settings where query encoding is the latency bottleneck, the Li-LSR framework replaces the transformer query encoder with a learned token score table, yielding sub-millisecond query times with negligible loss in accuracy (Nardini et al., 30 Apr 2025).
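
The pruning and cascade ideas listed above can be sketched in a few lines. This is a simplified illustration under stated assumptions: `prune_top_k`, `prune_threshold`, and `two_step_search` are hypothetical helpers, the first stage scans all documents instead of traversing a pruned inverted index, and the pruning budgets are arbitrary.

```python
def prune_top_k(vec, k):
    """Keep the k highest-weight terms of a sparse vector {term: weight}."""
    kept = sorted(vec.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    return dict(kept)

def prune_threshold(vec, tau):
    """Static thresholding: drop terms whose weight is below tau."""
    return {t: w for t, w in vec.items() if w >= tau}

def dot(query, vec):
    """Sparse dot product over shared terms."""
    return sum(w * vec[t] for t, w in query.items() if t in vec)

def two_step_search(query, docs, k_terms, n_candidates, top_n):
    """Stage 1 scores pruned vectors; stage 2 rescores candidates in full."""
    pruned_query = prune_top_k(query, k_terms)
    # In practice the pruned documents would be indexed offline.
    pruned_docs = {d: prune_top_k(vec, k_terms) for d, vec in docs.items()}
    candidates = sorted(
        docs, key=lambda d: (-dot(pruned_query, pruned_docs[d]), d)
    )[:n_candidates]
    rescored = [(d, dot(query, docs[d])) for d in candidates]
    return sorted(rescored, key=lambda kv: (-kv[1], kv[0]))[:top_n]

# Toy weights (made up for illustration).
query = {"sparse": 1.0, "index": 0.6, "fast": 0.2}
docs = {
    "d1": {"sparse": 1.0, "index": 0.9, "fast": 0.1},
    "d2": {"sparse": 0.4, "fast": 1.5, "index": 0.3},
    "d3": {"dense": 1.0, "vector": 0.8},
}
ranked = two_step_search(query, docs, k_terms=2, n_candidates=2, top_n=2)
print(ranked)  # d1 first, then d2, each with its full-vector score
```

The first stage only needs to be good enough to keep relevant documents in the candidate set, since the final scores always come from the unpruned vectors.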

5. Empirical Performance and Benchmarking

SPLADE and its variants consistently achieve state-of-the-art first-stage retrieval results:

SPLADE-v3 pushes first-stage MRR@10 over 40 on MS MARCO dev and boosts BEIR mean nDCG@10 to 51.7, outperforming BM25 and earlier variants by statistically significant margins (Lassance et al., 2024).
Echo-Mistral-SPLADE, a decoder-only LLM-based variant, outperforms all prior learned sparse and dense retrievers on zero-shot BEIR, achieving an average sparse nDCG@10 of 55.1 (Doshi et al., 2024).
Comparative Table:

Model	MS MARCO MRR@10	BEIR nDCG@10	Typical Latency	Notes
BM25	~18.4	~41–45	~69 ms	Non-learned sparse
uniCOIL	36–37	45–48	10–12 ms	No expansion
SPLADE-v3	40.2	51.7	70–100 ms	Full expansion
DF-FLOPS-SPLADE	~30	>BM25	88–161 ms	BM25-like latency
Two-Step SPLADE	40.0	47.6–70.0	2–2.4 ms	12–40x speedup
Echo-Mistral-SPLADE	—	55.1	—	LLM backbone, SOTA sparse

Empirical findings indicate that document weighting is essential, query weighting yields small but positive gains, and concurrent query and document expansion exhibits a cancellation effect, making single-side expansion generally optimal (Nguyen et al., 2023).

6. Adapting to Document Length and Proximal Scoring

For long documents, naive aggregation of segment-level representations introduces noise and degrades performance. Max-score aggregation, which scores a document by its top-scoring segment for each query, enforces local proximity constraints and performs robustly across document lengths. Incorporating explicit proximity models, such as ExactSDM and SoftSDM, further improves effectiveness. ExactSDM, which uses only exact term dependence within local windows, generally matches or outperforms SoftSDM and does not rely on expansion, which makes it compatible with both SPLADE and uniCOIL-like models (Nguyen et al., 2023).
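
A minimal sketch of max-score aggregation follows, contrasted with a naive alternative that simply sums segment scores. The summed variant is only a stand-in for "naive aggregation" chosen for this illustration, and the function names and toy segments are assumptions.

```python
def dot(query, vec):
    """Sparse dot product over shared terms."""
    return sum(w * vec[t] for t, w in query.items() if t in vec)

def max_segment_score(query, segments):
    """Score a long document by its single best-matching segment."""
    return max(dot(query, seg) for seg in segments)

def summed_segment_score(query, segments):
    """Naive alternative: add up the scores of all segments."""
    return sum(dot(query, seg) for seg in segments)

query = {"neural": 1.0, "ranking": 1.0}
# Both query terms occur together in one segment.
local_doc = [{"neural": 0.9, "ranking": 0.8}, {"cooking": 1.0}]
# The same terms occur far apart, in different segments.
scattered_doc = [{"neural": 0.9}, {"cooking": 1.0}, {"ranking": 0.8}]

print(max_segment_score(query, local_doc))      # about 1.7
print(max_segment_score(query, scattered_doc))  # 0.9
print(summed_segment_score(query, local_doc), summed_segment_score(query, scattered_doc))
```

Summing cannot distinguish the two documents, while max-score aggregation rewards the one where the query terms co-occur in the same segment, which is the local proximity effect described above.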

7. Broader Implications and Design Recommendations

Recent studies clarify that sparse lexical representation in SPLADE-like models is less about traditional lexical semantics and more about introducing sparse high-dimensional "feature buckets." The effectiveness is governed by vocabulary size and regularization, not surface token meaning (Mackenzie et al., 2023). Consequently, future models may combine interpretable lexical features and latent dimensions or exploit expanded vocabularies to achieve enhanced recall and precise efficiency control. The core design guidelines are:

Always learn weightings for both query and document sides, with document weighting critical to effectiveness (Nguyen et al., 2023).
Prefer expansion on one side (typically documents).
Tune sparsity regularizers (FLOPS, $\ell_1$, DF-FLOPS) to balance effectiveness against latency, using corpus-specific vocabularies for additional gains (Yu et al., 2024, Kim et al., 20 Sep 2025, Porco et al., 21 May 2025).
Employ dynamic or static pruning, hybrid thresholding, or cascade retrieval to optimize the efficiency–effectiveness trade-off (Lassance et al., 2024, Qiao et al., 2023).
For production deployment, integrate latency- and cost-aware regularization such as DF-FLOPS, consider model distillation from cross-encoders, and benchmark index sizes and retrieval times comprehensively.

Learned sparse models, especially SPLADE and its modern descendants, provide a scalable, interpretable, and highly competitive alternative to both traditional sparse and modern dense retrievers, with ongoing advances in efficiency, representational power, and practical deployability (Formal et al., 2021, Formal et al., 2021, Lassance et al., 2024, Doshi et al., 2024).