
SPLADE-max: Sparse Max-Pooling IR

Updated 14 December 2025
  • SPLADE-max is a variant of the SPLADE framework that employs max-pooling to derive sparse, context-expansive term representations for both queries and documents.
  • It utilizes a dual-tower BERT-based architecture with token-level scoring and explicit FLOPS regularization to achieve a trade-off between efficiency and ranking effectiveness.
  • Empirical results show that SPLADE-max and its distilled variant deliver state-of-the-art performance on large-scale benchmarks like MS MARCO and BEIR while supporting efficient inverted-index retrieval.

SPLADE-max is a variant of the SPLADE (Sparse Lexical and Expansion Model) framework for information retrieval, specifically designed to produce highly sparse, context-expansive term representations for queries and documents. SPLADE-max modifies SPLADE's term aggregation from sum-pooling to max-pooling, resulting in a sparse embedding derivation that leverages the strongest contextual evidence per term across input tokens. This approach enables efficient inverted-index retrieval while preserving dense-model-level retrieval effectiveness, supporting trade-offs between sparsity (efficiency) and retrieval quality. SPLADE-max has been empirically validated on large-scale retrieval benchmarks and has established state-of-the-art results for both supervised and zero-shot settings (Formal et al., 2021a, 2021b).

1. Mathematical Formulation of SPLADE-max

SPLADE-max operates on token-level contextual representations generated by a transformer backbone (typically DistilBERT or BERT-base). For each input token $i$ and each vocabulary term $j$, a token-level score is computed:

$w_{ij} = \mathrm{transform}(h_i)^\top E_j + b_j$

where $h_i$ is the BERT hidden state at position $i$, $E_j$ is the fixed input embedding of vocabulary term $j$, and $b_j$ is a learned bias. The “transform” layer consists of a linear projection, GeLU activation, and LayerNorm.

For each vocabulary term $j$, SPLADE-max replaces the original sum-pooling with max-pooling over positions, applying log-saturation and a non-negativity constraint:

$w_j = \max_{i}\,\log\bigl(1 + \max(w_{ij}, 0)\bigr)$

This mechanism assigns each term a score reflecting its strongest contextual support in the input sequence, producing a sparse, non-negative vector $w \in \mathbb{R}^{|V|}$, where $|V|$ is the vocabulary size (≈30k).
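In practice, the token-level scores $w_{ij}$ are exactly the logits of the backbone's masked-language-model (MLM) head, whose decoder is tied to the input embeddings $E$. A minimal sketch of the encoder, assuming PyTorch and the HuggingFace transformers library (the checkpoint choice and function name are illustrative, not the authors' released code):

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")

def splade_max_encode(texts):
    """Return sparse (batch, |V|) SPLADE-max vectors for a list of strings."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=256, return_tensors="pt")
    logits = model(**batch).logits                 # (B, L, |V|): the w_ij scores
    sat = torch.log1p(torch.relu(logits))          # log(1 + max(w_ij, 0))
    mask = batch["attention_mask"].unsqueeze(-1)   # zero out padding positions
    w, _ = (sat * mask).max(dim=1)                 # max-pool over positions i
    return w

# Siamese usage: the same encoder serves both towers, and relevance is the
# sparse dot product s(q, d) = sum_j w_j^q w_j^d.
w_q = splade_max_encode(["what causes tides"])
w_d = splade_max_encode(["Tides are caused by the gravitational pull of the moon."])
score = (w_q * w_d).sum(dim=-1)
```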

2. Retrieval Pipeline and Model Architecture

The SPLADE-max pipeline is structured as a siamese dual-tower model with shared transformer parameters for queries and documents. Each branch processes tokenized input via the backbone, generates token-level scores, and produces a final sparse representation through max-pooling, log-saturation, and rectification. Architecture details include:

  • Encoder Backbone: DistilBERT or BERT-base, initialized from standard HuggingFace checkpoints.
  • Two-tower Structure: Separate towers for query and document inputs.
  • Token-level Scoring Layer: Linear + GeLU + LayerNorm transformation.
  • Sparse Aggregation: Max-pooling and log(1 + ReLU) per term.

Training Objectives

SPLADE-max utilizes a composite training loss:

  1. Ranking Loss with In-Batch Negatives: For each query $q_i$, a positive document $d_i^+$, a hard negative $d_i^-$ (sampled with BM25), and batch negatives:

$\mathcal{L}_{\text{rank-IBN}} = -\log \frac{\exp\bigl(s(q_i, d_i^+)\bigr)}{\exp\bigl(s(q_i, d_i^+)\bigr) + \exp\bigl(s(q_i, d_i^-)\bigr) + \sum_{j \ne i} \exp\bigl(s(q_i, d_j^+)\bigr)}$

where $s(q, d) = \sum_j w_j^q w_j^d$ is the sparse dot product between query and document vectors.
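A sketch of this loss, assuming `q`, `d_pos`, and `d_neg` are (batch, $|V|$) SPLADE-max tensors for the queries, positive documents, and BM25 hard negatives (the helper name is illustrative):

```python
import torch
import torch.nn.functional as F

def rank_ibn_loss(q, d_pos, d_neg):
    pos = q @ d_pos.T                        # (B, B): diagonal holds s(q_i, d_i^+),
                                             # off-diagonal entries are in-batch negatives
    neg = (q * d_neg).sum(-1, keepdim=True)  # (B, 1): hard negative s(q_i, d_i^-)
    logits = torch.cat([pos, neg], dim=1)
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)   # softmax cross-entropy at the positive
```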

  2. Explicit Sparsity Regularization (FLOPS Loss):

For a batch of $N$ documents, with the average activation of term $j$ denoted $\bar{a}_j = \frac{1}{N} \sum_{i=1}^{N} w_j^{(d_i)}$, the regularization term is:

$\mathcal{L}_{\mathrm{reg}} = \sum_{j \in V} \bar{a}_j^2$

This penalizes heavily used terms and encourages balanced posting-list sizes.
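The regularizer is a few lines on top of a batch of sparse vectors; a sketch, where `w` is an (N, $|V|$) tensor of representations:

```python
def flops_loss(w):
    a_bar = w.mean(dim=0)        # average activation per vocabulary term, a-bar_j
    return (a_bar ** 2).sum()    # sum_j a-bar_j^2
```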

  3. Combined Objective:

$\mathcal{L} = \mathcal{L}_{\text{rank-IBN}} + \lambda_q \mathcal{L}_{\mathrm{reg}}^q + \lambda_d \mathcal{L}_{\mathrm{reg}}^d$

Separate regularization coefficients $\lambda_q$ and $\lambda_d$ control sparsity for queries and documents.
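Putting the pieces together, reusing the two helpers (and imports) sketched above; the $\lambda$ values here are placeholders, not the paper's tuned settings:

```python
def splade_max_loss(q, d_pos, d_neg, lam_q=1e-3, lam_d=1e-4):
    rank = rank_ibn_loss(q, d_pos, d_neg)
    reg = lam_q * flops_loss(q) + lam_d * flops_loss(torch.cat([d_pos, d_neg]))
    return rank + reg
```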

Inference and Indexing

  • Document Embeddings: Precomputed offline; only non-zeros stored in inverted index.
  • Query Embedding: Computed at query time via a single forward pass.
  • Retrieval: For each non-zero term in $q$, traverse its posting list across documents, accumulating $s(q, d)$; see the sketch below.
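A toy illustration of that traversal, with sparse vectors stored as `{term_id: weight}` dicts (the data layout and names are assumptions for exposition; production systems use optimized inverted-index engines):

```python
from collections import defaultdict

def build_index(doc_vectors):          # doc_vectors: {doc_id: {term_id: weight}}
    postings = defaultdict(list)
    for doc_id, terms in doc_vectors.items():
        for term, w in terms.items():
            postings[term].append((doc_id, w))
    return postings

def search(query_vector, postings, k=10):    # query_vector: {term_id: weight}
    scores = defaultdict(float)
    for term, wq in query_vector.items():    # only the query's non-zero terms
        for doc_id, wd in postings.get(term, []):
            scores[doc_id] += wq * wd        # accumulate s(q, d)
    return sorted(scores.items(), key=lambda x: -x[1])[:k]
```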

3. Hyperparameters and Sparsity–Effectiveness Trade-Offs

Key hyperparameters include the regularization weights $\lambda_q, \lambda_d$, typically swept from $10^{-1}$ (strong sparsity) down to $10^{-4}$ (higher density). The weights are quadratically ramped up over the first 50,000 training steps, then held constant.

| Setting | Effect on Sparsity | Effect on Retrieval Quality |
|---|---|---|
| Strong $\lambda$ ($10^{-1}$) | Fewer non-zeros | Lower MRR/NDCG |
| Weak $\lambda$ ($10^{-4}$) | Denser vectors | Higher effectiveness |
  • FLOPS Metric: Empirical average $\mathbb{E}\bigl[\sum_j p_j^q\, p_j^d\bigr]$ (with $p_j$ the activation probability of term $j$) used to estimate query cost.
  • Typical Non-zeros: For well-chosen $\lambda$, queries yield ∼6–20 non-zero terms, documents ∼10–50 non-zero terms.

Other training settings: learning rate $2 \times 10^{-5}$ (Adam), batch size 124, sequence length 256, 150k total iterations (50k for SPLADE-doc).
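The quadratic ramp-up described above amounts to a one-line scheduler; a minimal sketch:

```python
def lambda_at(step, lam_max, warmup_steps=50_000):
    # quadratic ramp from 0 to lam_max over the warm-up, then held constant
    return lam_max * min(1.0, (step / warmup_steps) ** 2)
```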

4. Empirical Performance and Benchmarks

SPLADE-max matches or exceeds dense retrievers in ranking effectiveness when evaluated on large-scale retrieval tasks:

MS MARCO Passage Dev Set and TREC DL 2019

| Model | MS MARCO dev MRR@10 | MS MARCO dev R@1000 | TREC DL 2019 NDCG@10 | TREC DL 2019 R@1000 |
|---|---|---|---|---|
| SPLADE (sum, v1) | 0.322 | 0.955 | 0.665 | 0.813 |
| SPLADE-max | 0.340 | 0.965 | 0.684 | 0.851 |
| DistilSPLADE-max | 0.368 | 0.979 | 0.729 | 0.865 |

Max-pooling yields ≈ 0.018 higher MRR on MS MARCO and ≈ 0.019 higher NDCG on TREC DL 2019 compared to sum-pooling, at equivalent FLOPS budgets.

Zero-shot BEIR (NDCG@10 average over 13 datasets)

| Model | Avg. zero-shot NDCG@10 |
|---|---|
| ColBERT | 0.457 |
| BM25 (tuned) | 0.456 |
| SPLADE (sum) | 0.451 |
| SPLADE-max | 0.464 |
| DistilSPLADE-max | 0.506 |

DistilSPLADE-max surpasses competitors on 11 out of 13 BEIR collections; plain SPLADE-max already exceeds ColBERT on average.

SPLADE-doc Variant

A document-only max-pool variant, SPLADE-doc, confines expansion and term weighting to the document side, so all embedding computation happens offline at indexing time; it achieves MRR@10 ≈ 0.296 with ∼20 non-zeros/document (no query-side expansion).
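Since the query side of SPLADE-doc is just the raw bag of query tokens, scoring needs no model at search time; a minimal sketch under that assumption:

```python
def splade_doc_score(query_terms, doc_vector):   # doc_vector: {term_id: weight}
    # s(q, d) reduces to summing document weights over the query's unique terms
    return sum(doc_vector.get(t, 0.0) for t in set(query_terms))
```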

5. Efficiency, Indexing, and Practical Implications

SPLADE-max's design permits efficient inverted-index-based retrieval. Sparse document vectors are indexed by non-zero term weights only, minimizing storage. Queries are encoded on demand, typically with millisecond-scale latency.

  • Index Size and Term Coverage: Max-pooling induces slightly sparser vectors than sum-pooling, reducing inverted index sizes for similar retrieval effectiveness.
  • Vocabulary Mismatch: Contextual expansion assigns non-zero weights to latent terms, mitigating the vocabulary mismatch inherent in bag-of-words approaches.
  • Retrieval Speed: FLOPS regularization leads to balanced posting lists. Prototype implementations (Python + Numba) yield rapid query response on MS MARCO-scale corpora.

A plausible implication is that SPLADE-max's sparse contextual embeddings make it practical for production-scale search systems requiring both effectiveness and manageable efficiency budgets.

6. Distillation and Advanced Training Strategies

SPLADE-max supports further improvement via distillation. The DistilSPLADE-max regime introduces a pipeline in which:

  1. An initial SPLADE-max and Cross-Encoder reranker are trained.
  2. SPLADE-max mines hard negatives, which are then rescored by the reranker.
  3. A new SPLADE-max is trained from scratch using a margin-MSE loss over the reranked labels, as sketched below.
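The margin-MSE objective in step 3 regresses the student's positive-negative score margin onto the cross-encoder teacher's margin; a minimal sketch (tensor arguments and names are illustrative):

```python
import torch.nn.functional as F

def margin_mse(s_pos, s_neg, t_pos, t_neg):
    # s_*: student (SPLADE-max) scores; t_*: teacher (cross-encoder) scores
    return F.mse_loss(s_pos - s_neg, t_pos - t_neg)
```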

Distillation produces more discriminative sparse embeddings, with DistilSPLADE-max setting state-of-the-art performance in both supervised and zero-shot retrieval on BEIR and MS MARCO.

7. Summary and Context

SPLADE-max is a principled, efficient modification of the SPLADE architecture, distinguished by its use of max-pooling in term score aggregation. This change enhances sparsity-aware retrieval effectiveness, maintains compatibility with explicit FLOPS regularization, and offers a continuous trade-off between retrieval quality and efficiency by tuning regularization strength. In its distilled variant, SPLADE-max achieves state-of-the-art results across major neural IR benchmarks and, owing to its sparse, interpretable term representations, remains practical for large-scale retrieval scenarios (Formal et al., 2021a, 2021b).

