SPLADE-max: Sparse Max-Pooling IR
- SPLADE-max is a variant of the SPLADE framework that employs max-pooling to derive sparse, context-expansive term representations for both queries and documents.
- It utilizes a dual-tower BERT-based architecture with token-level scoring and explicit FLOPS regularization to achieve a trade-off between efficiency and ranking effectiveness.
- Empirical results show that SPLADE-max and its distilled variant deliver state-of-the-art performance on large-scale benchmarks like MS MARCO and BEIR while supporting efficient inverted-index retrieval.
SPLADE-max is a variant of the SPLADE (Sparse Lexical and Expansion Model) framework for information retrieval, specifically designed to produce highly sparse, context-expansive term representations for queries and documents. SPLADE-max modifies SPLADE's term aggregation from sum-pooling to max-pooling, resulting in a sparse embedding derivation that leverages the strongest contextual evidence per term across input tokens. This approach enables efficient inverted-index retrieval while preserving dense-model-level retrieval effectiveness, supporting trade-offs between sparsity (efficiency) and retrieval quality. SPLADE-max has been empirically validated on large-scale retrieval benchmarks and has established state-of-the-art results for both supervised and zero-shot settings (Formal et al., 2021a; Formal et al., 2021b).
1. Mathematical Formulation of SPLADE-max
SPLADE-max operates on token-level contextual representations generated by a transformer backbone (typically DistilBERT or BERT-base). For each input token $i$ and each vocabulary term $j$, a token-level score is computed:
$w_{ij} = \mathrm{transform}(h_i)^\top E_j + b_j$
where $h_i$ is the BERT hidden state at position $i$, $E_j$ is the fixed input embedding of vocabulary term $j$, and $b_j$ is a learned bias. The "transform" layer consists of a linear projection, GeLU activation, and LayerNorm.
For each vocabulary term $j$, SPLADE-max replaces the original sum-pooling with max-pooling over the positions $i$ of the input sequence $t$, applying log-saturation and a non-negativity constraint:
$w_j = \max_{i \in t} \log\bigl(1 + \mathrm{ReLU}(w_{ij})\bigr)$
This mechanism assigns each term a score reflecting its strongest contextual support in the input sequence, producing a sparse, non-negative vector $w \in \mathbb{R}_{\ge 0}^{|V|}$, where $|V|$ is the vocabulary size (≈30k).
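As a toy numeric illustration (numbers assumed), suppose term $j$ receives raw scores $w_{ij} = (2.1,\, -0.5,\, 0.7)$ at three input positions. Log-saturation gives $\log(1 + \mathrm{ReLU}(w_{ij})) = (1.13,\, 0,\, 0.53)$; max-pooling keeps only the strongest evidence, $w_j = 1.13$, whereas sum-pooling would return $1.66$, letting many weak activations accumulate and inflate frequently occurring terms.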
2. Retrieval Pipeline and Model Architecture
The SPLADE-max pipeline is structured as a siamese dual-tower model with shared transformer parameters for queries and documents. Each branch processes tokenized input via the backbone, generates token-level scores, and produces a final sparse representation through max-pooling, log-saturation, and rectification. Architecture details include:
- Encoder Backbone: DistilBERT or BERT-base, initialized from standard HuggingFace checkpoints.
- Two-tower Structure: Separate towers for query and document inputs.
- Token-level Scoring Layer: Linear + GeLU + LayerNorm transformation.
- Sparse Aggregation: Max-pooling and log(1 + ReLU) per term.
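The following is a minimal sketch of this forward pass, assuming (as in public SPLADE implementations) that the MLM head's output logits serve as the token-level scores $w_{ij}$. The checkpoint name is a placeholder: an off-the-shelf MLM produces dense vectors, and sparsity only emerges after FLOPS-regularized training of a SPLADE checkpoint.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Placeholder checkpoint; substitute a trained SPLADE-max model for sparse output.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")
model.eval()

def splade_max_encode(text: str) -> torch.Tensor:
    """Encode text into a |V|-dimensional SPLADE-max vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        logits = model(**inputs).logits              # (1, seq_len, |V|) token-level scores w_ij
    scores = torch.log1p(torch.relu(logits))         # log(1 + ReLU(w_ij)): saturation + non-negativity
    mask = inputs["attention_mask"].unsqueeze(-1)    # zero out padding positions
    rep, _ = (scores * mask).max(dim=1)              # max-pool over the sequence dimension
    return rep.squeeze(0)                            # non-negative (|V|,) vector

rep = splade_max_encode("effect of caffeine on sleep")
print(f"{(rep > 0).sum().item()} non-zero terms out of {rep.numel()}")
```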
Training Objectives
SPLADE-max utilizes a composite training loss:
- Ranking Loss with In-Batch Negatives: For each query $q_i$, a positive document $d_i^+$, a hard negative $d_i^-$ (sampled with BM25), and the in-batch positives $d_j^+$ ($j \ne i$) serving as additional negatives:
$\mathcal{L}_{\mathrm{rank\mbox{-}IBN}} = -\log \frac{\exp\bigl(s(q_i, d_i^+)\bigr)}{\exp\bigl(s(q_i, d_i^+)\bigr) + \exp\bigl(s(q_i, d_i^-)\bigr) + \sum_{j \ne i} \exp\bigl(s(q_i, d_j^+)\bigr)}$
where $s(q, d) = w_q^\top w_d$ is the sparse dot-product between the query and document vectors.
- Explicit Sparsity Regularization (FLOPS Loss):
For a batch of $N$ documents, with the average activation of term $j$ denoted $\bar{a}_j = \frac{1}{N} \sum_{i=1}^{N} w_j^{(d_i)}$, the regularization term is:
$\mathcal{L}_{\mathrm{reg}} = \sum_{j \in V} \bar{a}_j^{\,2}$
Because the penalty grows quadratically with a term's average activation, it penalizes heavily used terms and encourages balanced posting-list sizes.
- Combined Objective:
$\mathcal{L} = \mathcal{L}_{\mathrm{rank\mbox{-}IBN}} + \lambda_q \mathcal{L}_{\mathrm{reg}}^q + \lambda_d \mathcal{L}_{\mathrm{reg}}^d$
Separate regularization coefficients $\lambda_q$ and $\lambda_d$ control sparsity for queries and documents; a sketch of the full objective follows this list.
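A minimal PyTorch sketch of the combined objective under the definitions above. The $\lambda$ values are illustrative placeholders, `q`, `d_pos`, and `d_neg` are assumed batches of sparse representations from the two towers, and only the positive documents are regularized here for brevity.

```python
import torch
import torch.nn.functional as F

def flops_loss(reps: torch.Tensor) -> torch.Tensor:
    """FLOPS regularizer: sum over terms of the squared mean activation."""
    return (reps.mean(dim=0) ** 2).sum()

def splade_loss(q, d_pos, d_neg, lambda_q=3e-4, lambda_d=1e-4):
    # q, d_pos, d_neg: (batch, |V|) non-negative sparse representations.
    s_pos = (q * d_pos).sum(-1)                    # s(q_i, d_i^+)
    s_neg = (q * d_neg).sum(-1)                    # s(q_i, d_i^-), BM25 hard negative
    s_ibn = q @ d_pos.T                            # s(q_i, d_j^+) for all j (in-batch)
    batch = q.shape[0]
    mask = ~torch.eye(batch, dtype=torch.bool)     # drop j == i from in-batch negatives
    ibn = s_ibn[mask].view(batch, batch - 1)
    # Softmax cross-entropy with the positive at index 0 reproduces L_rank-IBN.
    logits = torch.cat([s_pos.unsqueeze(1), s_neg.unsqueeze(1), ibn], dim=1)
    rank_loss = F.cross_entropy(logits, torch.zeros(batch, dtype=torch.long))
    return rank_loss + lambda_q * flops_loss(q) + lambda_d * flops_loss(d_pos)
```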
Inference and Indexing
- Document Embeddings: Precomputed offline; only non-zeros stored in inverted index.
- Query Embedding: Computed at query time via a single forward pass.
- Retrieval: For each non-zero term $j$ in the query vector $w_q$, traverse its posting list across documents, accumulating $w_{q,j} \cdot w_{d,j}$ into each candidate document's score, as sketched below.
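A toy illustration of posting-list traversal (all names hypothetical), with sparse vectors stored as dictionaries of non-zero term weights:

```python
from collections import defaultdict

def build_inverted_index(doc_reps):
    index = defaultdict(list)                 # term_id -> [(doc_id, weight), ...]
    for doc_id, terms in doc_reps.items():
        for term_id, weight in terms.items():
            index[term_id].append((doc_id, weight))
    return index

def retrieve(index, query_rep, k=10):
    scores = defaultdict(float)
    for term_id, q_w in query_rep.items():    # only the query's non-zero terms
        for doc_id, d_w in index.get(term_id, ()):
            scores[doc_id] += q_w * d_w       # accumulate the sparse dot-product
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

docs = {0: {7: 1.2, 42: 0.4}, 1: {42: 0.9, 99: 0.3}}
idx = build_inverted_index(docs)
print(retrieve(idx, {42: 1.5, 7: 0.2}))       # ranked (doc_id, score) pairs
```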
3. Hyperparameters and Sparsity–Effectiveness Trade-Offs
Key hyperparameters include the regularization weights $\lambda_q$ and $\lambda_d$, typically swept from $10^{-1}$ (strong sparsity) down to $10^{-4}$ (higher density). The weights are quadratically ramped up over the first 50,000 training steps, then held constant, for example as sketched below.
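A one-line scheduler reproduces this ramp; the function name is illustrative, and `ramp_steps` mirrors the 50,000-step warm-up described above.

```python
def lambda_at_step(step: int, lam_max: float, ramp_steps: int = 50_000) -> float:
    # Quadratic warm-up of the regularization weight, then held constant.
    return lam_max * min(1.0, (step / ramp_steps) ** 2)

assert lambda_at_step(25_000, 1e-1) == 1e-1 * 0.25   # a quarter of the target at the midpoint
```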
| $\lambda$ Setting | Effect on Sparsity | Effect on Retrieval Quality |
|---|---|---|
| Strong (1e-1) | Fewer non-zeros | Lower MRR/NDCG |
| Weak (1e-4) | Denser vectors | Higher effectiveness |
- FLOPS Metric: the empirical average number of floating-point operations per query–document score, estimated as $\sum_{j \in V} p_j^{(q)} p_j^{(d)}$ from the per-term activation probabilities $p_j$; used to estimate query cost (estimator sketched after this list).
- Typical Non-zeros: For well-chosen $\lambda$, queries yield ∼6–20 non-zero terms, documents ∼10–50 non-zero terms.
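A sketch of the empirical FLOPS estimator consistent with this definition (function name and tensor layout assumed), computed over held-out samples of encoded queries and documents:

```python
import torch

def flops_metric(q_reps: torch.Tensor, d_reps: torch.Tensor) -> float:
    # Expected multiplications per query-document pair: sum_j p_j^q * p_j^d,
    # with p_j the empirical probability that term j is activated (non-zero).
    p_q = (q_reps > 0).float().mean(dim=0)   # (|V|,) over a sample of queries
    p_d = (d_reps > 0).float().mean(dim=0)   # (|V|,) over a sample of documents
    return (p_q * p_d).sum().item()
```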
Other training settings: Adam optimizer with learning rate $2 \times 10^{-5}$, batch size 124, sequence length 256, and 150k total iterations (50k for SPLADE-doc).
4. Empirical Performance and Benchmarks
SPLADE-max matches or exceeds dense retrievers in ranking effectiveness when evaluated on large-scale retrieval tasks:
MS MARCO Passage Dev Set and TREC DL 2019
| Model | MS MARCO dev MRR@10 | MS MARCO dev R@1000 | TREC DL 19 NDCG@10 | TREC DL 19 R@1000 |
|---|---|---|---|---|
| SPLADE (sum v1) | 0.322 | 0.955 | 0.665 | 0.813 |
| SPLADE-max | 0.340 | 0.965 | 0.684 | 0.851 |
| DistilSPLADE-max | 0.368 | 0.979 | 0.729 | 0.865 |
Max-pooling yields ≈ 0.018 higher MRR on MS MARCO and ≈ 0.019 higher NDCG on TREC DL 2019 compared to sum-pooling, at equivalent FLOPS budgets.
Zero-shot BEIR (NDCG@10 average over 13 datasets)
| Model | Avg. zero-shot NDCG@10 |
|---|---|
| ColBERT | 0.457 |
| BM25 (tuned) | 0.456 |
| SPLADE (sum) | 0.451 |
| SPLADE-max | 0.464 |
| DistilSPLADE-max | 0.506 |
DistilSPLADE-max surpasses competitors on 11 out of 13 BEIR collections; plain SPLADE-max already exceeds ColBERT on average.
SPLADE-doc Variant
A document-only max-pool variant, SPLADE-doc, applies expansion and term weighting on the document side only, so all neural computation happens offline at indexing time; queries are represented by their raw tokens, with no query-side expansion. It achieves MRR@10 ≈ 0.296 with ∼20 non-zeros per document.
5. Efficiency, Indexing, and Practical Implications
SPLADE-max's design permits efficient inverted-index-based retrieval. Sparse document vectors are indexed by non-zero term weights only, minimizing storage. Queries are encoded on demand, typically with millisecond-scale latency.
- Index Size and Term Coverage: Max-pooling induces slightly sparser vectors than sum-pooling, reducing inverted index sizes for similar retrieval effectiveness.
- Vocabulary Mismatch: Contextual expansion assigns non-zero weights to latent terms, mitigating the vocabulary mismatch inherent in bag-of-words approaches.
- Retrieval Speed: FLOPS regularization leads to balanced posting lists. Prototype implementations (Python + Numba) yield rapid query responses on MS MARCO-scale corpora.
A plausible implication is that SPLADE-max's sparse contextual embeddings make it practical for production-scale search systems requiring both effectiveness and manageable efficiency budgets.
6. Distillation and Advanced Training Strategies
SPLADE-max supports further improvement via distillation. The DistilSPLADE-max regime introduces a pipeline in which:
- An initial SPLADE-max and Cross-Encoder reranker are trained.
- SPLADE-max mines hard negatives, which are then rescored by the reranker.
- A new SPLADE-max is trained from scratch using a margin-MSE loss over reranked labels.
Distillation produces more discriminative sparse embeddings, with DistilSPLADE-max setting state-of-the-art performance in both supervised and zero-shot retrieval on BEIR and MS MARCO.
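A minimal sketch of the margin-MSE objective used in this distillation step. Here `t_pos` and `t_neg` are assumed to be the cross-encoder teacher's scores for the positive and the re-scored hard negative, and the function name is illustrative.

```python
import torch.nn.functional as F

def margin_mse(q, d_pos, d_neg, t_pos, t_neg):
    # Match the student's score margin (sparse dot-products) to the teacher's margin.
    student_margin = (q * d_pos).sum(-1) - (q * d_neg).sum(-1)
    teacher_margin = t_pos - t_neg
    return F.mse_loss(student_margin, teacher_margin)
```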
7. Summary and Context
SPLADE-max is a principled, efficient modification of the SPLADE architecture, distinguished by its use of max-pooling in term score aggregation. This change enhances sparsity-aware retrieval effectiveness, maintains compatibility with explicit FLOPS regularization, and offers a continuous trade-off between retrieval quality and efficiency by tuning regularization strength. Its distilled variant achieves state-of-the-art results across major neural IR benchmarks, and its sparse, interpretable term representations keep it practical for large-scale retrieval scenarios (Formal et al., 2021a; Formal et al., 2021b).