SPLADE-max: Sparse Max-Pooling IR
- SPLADE-max is a variant of the SPLADE framework that employs max-pooling to derive sparse, context-expansive term representations for both queries and documents.
- It utilizes a dual-tower BERT-based architecture with token-level scoring and explicit FLOPS regularization to achieve a trade-off between efficiency and ranking effectiveness.
- Empirical results show that SPLADE-max and its distilled variant deliver state-of-the-art performance on large-scale benchmarks like MS MARCO and BEIR while supporting efficient inverted-index retrieval.
SPLADE-max is a variant of the SPLADE (Sparse Lexical and Expansion Model) framework for information retrieval, specifically designed to produce highly sparse, context-expansive term representations for queries and documents. SPLADE-max modifies SPLADE's term aggregation from sum-pooling to max-pooling, resulting in a sparse embedding derivation that leverages the strongest contextual evidence per term across input tokens. This approach enables efficient inverted-index retrieval while preserving dense-model-level retrieval effectiveness, supporting trade-offs between sparsity (efficiency) and retrieval quality. SPLADE-max has been empirically validated on large-scale retrieval benchmarks and has established state-of-the-art results for both supervised and zero-shot settings (Formal et al., 2021a; Formal et al., 2021b).
1. Mathematical Formulation of SPLADE-max
SPLADE-max operates on token-level contextual representations generated by a transformer backbone (typically DistilBERT or BERT-base). For each input token $i$ and each vocabulary term $j$, a token-level score is computed:
$w_{ij} = \mathrm{transform}(h_i)^\top E_j + b_j$
where $h_i$ is the BERT hidden state at position $i$, $E_j$ is the fixed input embedding of vocabulary term $j$, and $b_j$ is a learned bias. The "transform" layer consists of a linear projection, GeLU activation, and LayerNorm.
For each vocabulary term $j$, SPLADE-max replaces the original sum-pooling with max-pooling over the positions $i$ of the input sequence $t$, applying log-saturation and a non-negativity constraint:
$w_j = \max_{i \in t} \log\bigl(1 + \mathrm{ReLU}(w_{ij})\bigr)$
This mechanism assigns each term a score reflecting its strongest contextual support in the input sequence, producing a sparse, non-negative vector $w \in \mathbb{R}_{\ge 0}^{|V|}$, where $|V|$ is the vocabulary size (≈30k).
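As a toy numeric illustration (numbers assumed), suppose term $j$ receives raw scores $w_{ij} = (2.1,\, -0.5,\, 0.7)$ at three input positions. Log-saturation gives $\log(1 + \mathrm{ReLU}(w_{ij})) = (1.13,\, 0,\, 0.53)$; max-pooling keeps only the strongest evidence, $w_j = 1.13$, whereas sum-pooling would return $1.66$, letting many weak activations accumulate and inflate frequently occurring terms.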
2. Retrieval Pipeline and Model Architecture
The SPLADE-max pipeline is structured as a siamese dual-tower model with shared transformer parameters for queries and documents. Each branch processes tokenized input via the backbone, generates token-level scores, and produces a final sparse representation through max-pooling, log-saturation, and rectification. Architecture details include:
- Encoder Backbone: DistilBERT or BERT-base, initialized from standard HuggingFace checkpoints.
- Two-tower Structure: Separate towers for query and document inputs.
- Token-level Scoring Layer: Linear + GeLU + LayerNorm transformation.
- Sparse Aggregation: Max-pooling and log(1 + ReLU) per term.
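The following is a minimal sketch of this forward pass, assuming (as in public SPLADE implementations) that the MLM head's output logits serve as the token-level scores $w_{ij}$. The checkpoint name is a placeholder: an off-the-shelf MLM produces dense vectors, and sparsity only emerges after FLOPS-regularized training of a SPLADE checkpoint.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Placeholder checkpoint; substitute a trained SPLADE-max model for sparse output.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")
model.eval()

def splade_max_encode(text: str) -> torch.Tensor:
    """Encode text into a |V|-dimensional SPLADE-max vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        logits = model(**inputs).logits              # (1, seq_len, |V|) token-level scores w_ij
    scores = torch.log1p(torch.relu(logits))         # log(1 + ReLU(w_ij)): saturation + non-negativity
    mask = inputs["attention_mask"].unsqueeze(-1)    # zero out padding positions
    rep, _ = (scores * mask).max(dim=1)              # max-pool over the sequence dimension
    return rep.squeeze(0)                            # non-negative (|V|,) vector

rep = splade_max_encode("effect of caffeine on sleep")
print(f"{(rep > 0).sum().item()} non-zero terms out of {rep.numel()}")
```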
Training Objectives
SPLADE-max utilizes a composite training loss:
- Ranking Loss with In-Batch Negatives: For each query $q_i$, a positive document $d_i^+$, a hard negative $d_i^-$ (sampled with BM25), and the in-batch positives $d_j^+$ ($j \ne i$) serving as additional negatives:
$\mathcal{L}_{\mathrm{rank\mbox{-}IBN}} = -\log \frac{\exp\bigl(s(q_i, d_i^+)\bigr)}{\exp\bigl(s(q_i, d_i^+)\bigr) + \exp\bigl(s(q_i, d_i^-)\bigr) + \sum_{j \ne i} \exp\bigl(s(q_i, d_j^+)\bigr)}$
where $s(q, d) = w_q^\top w_d$ is the sparse dot-product between the query and document vectors.
- Explicit Sparsity Regularization (FLOPS Loss):
For a batch of $N$ documents, with the average activation of term $j$ denoted $\bar{a}_j = \frac{1}{N} \sum_{i=1}^{N} w_j^{(d_i)}$, the regularization term is:
$\mathcal{L}_{\mathrm{reg}} = \sum_{j \in V} \bar{a}_j^{\,2}$
Because the penalty grows quadratically with a term's average activation, it penalizes heavily used terms and encourages balanced posting-list sizes.
- Combined Objective:
$\mathcal{L} = \mathcal{L}_{\mathrm{rank\mbox{-}IBN}} + \lambda_q \mathcal{L}_{\mathrm{reg}}^q + \lambda_d \mathcal{L}_{\mathrm{reg}}^d$
Separate regularization coefficients $\lambda_q$ and $\lambda_d$ control sparsity for queries and documents; a sketch of the full objective follows this list.
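A minimal PyTorch sketch of the combined objective under the definitions above. The $\lambda$ values are illustrative placeholders, `q`, `d_pos`, and `d_neg` are assumed batches of sparse representations from the two towers, and only the positive documents are regularized here for brevity.

```python
import torch
import torch.nn.functional as F

def flops_loss(reps: torch.Tensor) -> torch.Tensor:
    """FLOPS regularizer: sum over terms of the squared mean activation."""
    return (reps.mean(dim=0) ** 2).sum()

def splade_loss(q, d_pos, d_neg, lambda_q=3e-4, lambda_d=1e-4):
    # q, d_pos, d_neg: (batch, |V|) non-negative sparse representations.
    s_pos = (q * d_pos).sum(-1)                    # s(q_i, d_i^+)
    s_neg = (q * d_neg).sum(-1)                    # s(q_i, d_i^-), BM25 hard negative
    s_ibn = q @ d_pos.T                            # s(q_i, d_j^+) for all j (in-batch)
    batch = q.shape[0]
    mask = ~torch.eye(batch, dtype=torch.bool)     # drop j == i from in-batch negatives
    ibn = s_ibn[mask].view(batch, batch - 1)
    # Softmax cross-entropy with the positive at index 0 reproduces L_rank-IBN.
    logits = torch.cat([s_pos.unsqueeze(1), s_neg.unsqueeze(1), ibn], dim=1)
    rank_loss = F.cross_entropy(logits, torch.zeros(batch, dtype=torch.long))
    return rank_loss + lambda_q * flops_loss(q) + lambda_d * flops_loss(d_pos)
```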
Inference and Indexing
- Document Embeddings: Precomputed offline; only non-zeros stored in inverted index.
- Query Embedding: Computed at query time via a single forward pass.
- Retrieval: For each non-zero term $j$ in the query vector $w_q$, traverse its posting list across documents, accumulating $w_{q,j} \cdot w_{d,j}$ into each candidate document's score, as sketched below.
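A toy illustration of posting-list traversal (all names hypothetical), with sparse vectors stored as dictionaries of non-zero term weights:

```python
from collections import defaultdict

def build_inverted_index(doc_reps):
    index = defaultdict(list)                 # term_id -> [(doc_id, weight), ...]
    for doc_id, terms in doc_reps.items():
        for term_id, weight in terms.items():
            index[term_id].append((doc_id, weight))
    return index

def retrieve(index, query_rep, k=10):
    scores = defaultdict(float)
    for term_id, q_w in query_rep.items():    # only the query's non-zero terms
        for doc_id, d_w in index.get(term_id, ()):
            scores[doc_id] += q_w * d_w       # accumulate the sparse dot-product
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

docs = {0: {7: 1.2, 42: 0.4}, 1: {42: 0.9, 99: 0.3}}
idx = build_inverted_index(docs)
print(retrieve(idx, {42: 1.5, 7: 0.2}))       # ranked (doc_id, score) pairs
```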
3. Hyperparameters and Sparsity–Effectiveness Trade-Offs
Key hyperparameters include the regularization weights $\lambda_q$ and $\lambda_d$, typically swept from $10^{-1}$ (strong sparsity) down to $10^{-4}$ (higher density). The weights are quadratically ramped up over the first 50,000 training steps, then held constant, for example as sketched below.
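A one-line scheduler reproduces this ramp; the function name is illustrative, and `ramp_steps` mirrors the 50,000-step warm-up described above.

```python
def lambda_at_step(step: int, lam_max: float, ramp_steps: int = 50_000) -> float:
    # Quadratic warm-up of the regularization weight, then held constant.
    return lam_max * min(1.0, (step / ramp_steps) ** 2)

assert lambda_at_step(25_000, 1e-1) == 1e-1 * 0.25   # a quarter of the target at the midpoint
```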
| $\lambda$ Setting | Effect on Sparsity | Effect on Retrieval Quality |
|---|---|---|
| Strong (1e-1) | Fewer non-zeros | Lower MRR/NDCG |
| Weak (1e-4) | Denser vectors | Higher effectiveness |
- FLOPS Metric: the empirical average number of floating-point operations per query–document score, estimated as $\sum_{j \in V} p_j^{(q)} p_j^{(d)}$ from the per-term activation probabilities $p_j$; used to estimate query cost (estimator sketched after this list).
- Typical Non-zeros: For well-chosen $\lambda$, queries yield ∼6–20 non-zero terms, documents ∼10–50 non-zero terms.
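A sketch of the empirical FLOPS estimator consistent with this definition (function name and tensor layout assumed), computed over held-out samples of encoded queries and documents:

```python
import torch

def flops_metric(q_reps: torch.Tensor, d_reps: torch.Tensor) -> float:
    # Expected multiplications per query-document pair: sum_j p_j^q * p_j^d,
    # with p_j the empirical probability that term j is activated (non-zero).
    p_q = (q_reps > 0).float().mean(dim=0)   # (|V|,) over a sample of queries
    p_d = (d_reps > 0).float().mean(dim=0)   # (|V|,) over a sample of documents
    return (p_q * p_d).sum().item()
```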
Other training settings: Adam optimizer with learning rate $2 \times 10^{-5}$, batch size 124, sequence length 256, and 150k total iterations (50k for SPLADE-doc).
4. Empirical Performance and Benchmarks
SPLADE-max matches or exceeds dense retrievers in ranking effectiveness when evaluated on large-scale retrieval tasks:
MS MARCO Passage Dev Set and TREC DL 2019
| Model | MS MARCO dev MRR@10 | MS MARCO dev R@1000 | TREC DL 19 NDCG@10 | TREC DL 19 R@1000 |
|---|---|---|---|---|
| SPLADE (sum v1) | 0.322 | 0.955 | 0.665 | 0.813 |
| SPLADE-max | 0.340 | 0.965 | 0.684 | 0.851 |
| DistilSPLADE-max | 0.368 | 0.979 | 0.729 | 0.865 |
Max-pooling yields ≈ 0.018 higher MRR on MS MARCO and ≈ 0.019 higher NDCG on TREC DL 2019 compared to sum-pooling, at equivalent FLOPS budgets.
Zero-shot BEIR (NDCG@10 average over 13 datasets)
| Model | Avg. zero-shot NDCG@10 |
|---|---|
| ColBERT | 0.457 |
| BM25 (tuned) | 0.456 |
| SPLADE (sum) | 0.451 |
| SPLADE-max | 0.464 |
| DistilSPLADE-max | 0.506 |
DistilSPLADE-max surpasses competitors on 11 out of 13 BEIR collections; plain SPLADE-max already exceeds ColBERT on average.
SPLADE-doc Variant
A document-only max-pool variant, SPLADE-doc, applies expansion and term weighting on the document side only, so all neural computation happens offline at indexing time; queries are represented by their raw tokens, with no query-side expansion. It achieves MRR@10 ≈ 0.296 with ∼20 non-zeros per document.
5. Efficiency, Indexing, and Practical Implications
SPLADE-max's design permits efficient inverted-index-based retrieval. Sparse document vectors are indexed by non-zero term weights only, minimizing storage. Queries are encoded on demand, typically with millisecond-scale latency.
- Index Size and Term Coverage: Max-pooling induces slightly sparser vectors than sum-pooling, reducing inverted index sizes for similar retrieval effectiveness.
- Vocabulary Mismatch: Contextual expansion assigns non-zero weights to latent terms, mitigating the vocabulary mismatch inherent in bag-of-words approaches.
- Retrieval Speed: FLOPS regularization leads to balanced posting lists. Prototype implementations (Python + Numba) yield rapid query responses on MS MARCO-scale corpora.
A plausible implication is that SPLADE-max's sparse contextual embeddings make it practical for production-scale search systems requiring both effectiveness and manageable efficiency budgets.
6. Distillation and Advanced Training Strategies
SPLADE-max supports further improvement via distillation. The DistilSPLADE-max regime introduces a pipeline in which:
- An initial SPLADE-max and Cross-Encoder reranker are trained.
- SPLADE-max mines hard negatives, which are then rescored by the reranker.
- A new SPLADE-max is trained from scratch using a margin-MSE loss over reranked labels.
Distillation produces more discriminative sparse embeddings, with DistilSPLADE-max setting state-of-the-art performance in both supervised and zero-shot retrieval on BEIR and MS MARCO.
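A minimal sketch of the margin-MSE objective used in this distillation step. Here `t_pos` and `t_neg` are assumed to be the cross-encoder teacher's scores for the positive and the re-scored hard negative, and the function name is illustrative.

```python
import torch.nn.functional as F

def margin_mse(q, d_pos, d_neg, t_pos, t_neg):
    # Match the student's score margin (sparse dot-products) to the teacher's margin.
    student_margin = (q * d_pos).sum(-1) - (q * d_neg).sum(-1)
    teacher_margin = t_pos - t_neg
    return F.mse_loss(student_margin, teacher_margin)
```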
7. Summary and Context
SPLADE-max is a principled, efficient modification of the SPLADE architecture, distinguished by its use of max-pooling in term score aggregation. This change enhances sparsity-aware retrieval effectiveness, maintains compatibility with explicit FLOPS regularization, and offers a continuous trade-off between retrieval quality and efficiency by tuning regularization strength. Its distilled variant achieves state-of-the-art results across major neural IR benchmarks, and its sparse, interpretable term representations keep it practical for large-scale retrieval scenarios (Formal et al., 2021a; Formal et al., 2021b).