Echo-Mistral-SPLADE: Advancing Sparse Retrieval
- Echo-Mistral-SPLADE combines echo embeddings with the decoder-only Mistral-7B model to substantially improve learned sparse retrieval effectiveness.
- It pairs SPLADE-style sparse projections with LoRA fine-tuning and FLOPS regularization to learn semantic keyword expansions of queries and documents.
- Zero-shot evaluation on the BEIR benchmark shows a 3–4 point nDCG@10 gain over previous SPLADE variants (and roughly 11 points over BM25), while retaining the scalability and interpretability of sparse retrieval.
Echo-Mistral-SPLADE is a state-of-the-art learned sparse retrieval model designed to bridge traditional keyword-based sparse retrievers and modern dense retrievers in neural information retrieval. Built on a decoder-only LLM, specifically Mistral-7B, Echo-Mistral-SPLADE employs echo embeddings and SPLADE-style sparse projections to learn semantic keyword expansions of queries and documents. This approach leverages the extensive pretraining of decoder-only LLMs to improve sparse retriever effectiveness, yielding the best BEIR performance among SPLADE variants to date and competitive results against dense LLM retrievers (Doshi et al., 20 Aug 2024).
1. Architectural Components
Echo-Mistral-SPLADE incorporates several core architectural elements:
- Backbone Model: Mistral-7B, a decoder-only causal LLM, serves as the foundation. Unlike encoder-only models typical in LSR (Learned Sparse Retrieval), Mistral-7B is not modified for bidirectional attention.
- Echo Embeddings: Each input sequence (query or document) is duplicated and concatenated; only the second copy's token embeddings are pooled. This ensures every token is contextualized over the full sequence, compensating for the inherent causal masking in decoder-only transformers (Springer et al., 2024).
- LoRA Fine-Tuning: Following Hu et al. (2022), low-rank adapters (rank 16, α=8, dropout 0.1) are inserted, freezing the main weights and maintaining interpretability constraints critical for learned sparse retrievers.
- SPLADE Projection Layer: Adapting the SPLADE paradigm (Formal et al., 2021), the language-modeling (LM) head plays the role of the MLM head for token-to-term expansion. For each token $i$, the LM head produces logits $w_{i,j}$ over the model's vocabulary $V$; the head's weights remain fixed apart from LoRA updates, preserving the direct mapping between each logit dimension and a vocabulary term (a minimal sketch of this wiring follows below).
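The interplay of these components can be illustrated with a short sketch using Hugging Face transformers and PEFT. The model identifier, LoRA target modules, and pooling details below are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch (assumptions noted above): echo-style input duplication and a
# SPLADE-style projection through the LM head of a decoder-only model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

MODEL_NAME = "mistralai/Mistral-7B-v0.1"  # assumed checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

# LoRA adapters (rank 16, alpha 8, dropout 0.1 as in the text); the base
# weights, including the LM head, stay frozen.
lora_cfg = LoraConfig(r=16, lora_alpha=8, lora_dropout=0.1,
                      target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
model = get_peft_model(model, lora_cfg)
model.eval()

def echo_splade_encode(text: str) -> torch.Tensor:
    """Return a |V|-dimensional sparse term-weight vector for `text`."""
    # Echo embeddings: feed the text twice so tokens in the second copy are
    # contextualized over the full sequence despite causal masking.
    ids = tokenizer(text, return_tensors="pt", add_special_tokens=False).input_ids
    echoed = torch.cat([ids, ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids=echoed).logits      # (1, 2L, |V|) LM-head logits
    second_copy = logits[:, ids.shape[1]:, :]        # pool only the echoed copy
    # SPLADE-style aggregation: log-saturated ReLU, max-pooled over tokens.
    weights = torch.log1p(torch.relu(second_copy.float())).max(dim=1).values
    return weights.squeeze(0)                        # sparse vector over the vocabulary
```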
2. Sparse Representation and Scoring Formulation
Sparse representations are constructed as follows:
- Term-level Activation: For an input text $t$ (query or document), the per-token LM-head logits $w_{i,j}$ (token position $i$, vocabulary term $j$) are aggregated into a single weight per vocabulary term:

$$w_j = \max_{i \in t} \log\bigl(1 + \mathrm{ReLU}(w_{i,j})\bigr)$$

The ReLU induces sparsity, and the log-saturation dampens large logits so that no single term dominates the representation.
- FLOPS Regularization: An explicit penalty on the average term activations over a batch of $N$ documents promotes compact inverted indices:

$$\ell_{\mathrm{FLOPS}} = \sum_{j \in V} \Bigl( \frac{1}{N} \sum_{i=1}^{N} w_j^{(d_i)} \Bigr)^{2}$$

The coefficient $\lambda$ controlling the FLOPS strength is ramped quadratically from 0 to its target value over the first 50k training steps.
- Retrieval Scoring: Query and document vectors are scored by a dot product over the vocabulary:

$$s(q, d) = \sum_{j \in V} w_j^{(q)} \, w_j^{(d)}$$

During retrieval, only the nonzero query terms are used to probe the inverted index; a minimal sketch of these formulas follows below.
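The formulas above translate directly into a few lines of tensor code. The following is a minimal sketch under assumed tensor shapes; the function names are ours, not the paper's.

```python
# Illustrative sketch of the scoring and regularization formulas (assumed shapes).
import torch

def flops_penalty(term_weights: torch.Tensor) -> torch.Tensor:
    """term_weights: (batch, |V|) sparse activations.
    FLOPS penalty: squared mean activation per term, summed over the vocabulary."""
    return term_weights.mean(dim=0).pow(2).sum()

def flops_coefficient(step: int, target: float, ramp_steps: int = 50_000) -> float:
    """Quadratic ramp of the FLOPS coefficient from 0 to `target` over the first 50k steps."""
    return target * min(step / ramp_steps, 1.0) ** 2

def score(query_vec: torch.Tensor, doc_vecs: torch.Tensor) -> torch.Tensor:
    """Vocabulary dot product: query_vec of shape (|V|,) against doc_vecs of shape (n_docs, |V|)."""
    return doc_vecs @ query_vec
```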
3. Training Objectives and Optimization
Echo-Mistral-SPLADE is trained end-to-end with a contrastive objective and sparsity regularization:
- Data: 15.5 million query-passage pairs from the Sentence-Transformer corpus (Reimers and Gurevych, 2019), sampled in proportion to dataset size.
- Contrastive InfoNCE Loss: Given a query $q$, a positive passage $d^{+}$, and in-batch negatives $\{d_{i}^{-}\}$:

$$\mathcal{L}_{\mathrm{rank}} = -\log \frac{\exp\bigl(s(q, d^{+})\bigr)}{\exp\bigl(s(q, d^{+})\bigr) + \sum_{i} \exp\bigl(s(q, d_{i}^{-})\bigr)}$$
- FLOPS Regularization: Applied to both query and document representations, with separate weights $\lambda_q$ and $\lambda_d$:

$$\mathcal{L}_{\mathrm{reg}} = \lambda_q \, \ell_{\mathrm{FLOPS}}^{(q)} + \lambda_d \, \ell_{\mathrm{FLOPS}}^{(d)}$$
- Full Objective:

$$\mathcal{L} = \mathcal{L}_{\mathrm{rank}} + \mathcal{L}_{\mathrm{reg}}$$
- Optimization: Adam with linear learning-rate decay and $6$k warmup steps, batch size $512$, sequence length $512$ (after echoing), $150$k training steps, and the quadratic FLOPS ramp described above (a compact sketch of the full objective follows below).
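A compact sketch of how these pieces combine into a single training objective is given below; the $\lambda$ values and function names are illustrative placeholders, not the paper's coefficients.

```python
# Sketch of the combined objective: in-batch InfoNCE plus ramped FLOPS penalties.
import torch
import torch.nn.functional as F

def info_nce(q_vecs: torch.Tensor, d_vecs: torch.Tensor) -> torch.Tensor:
    """In-batch negatives: document i is the positive for query i.
    q_vecs, d_vecs: (batch, |V|) sparse term-weight vectors."""
    scores = q_vecs @ d_vecs.T                        # (batch, batch) dot products
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)            # softmax over in-batch candidates

def flops_penalty(vecs: torch.Tensor) -> torch.Tensor:
    """Squared mean activation per vocabulary term, summed."""
    return vecs.mean(dim=0).pow(2).sum()

def total_loss(q_vecs, d_vecs, step, lambda_q=1e-4, lambda_d=1e-3, ramp_steps=50_000):
    scale = min(step / ramp_steps, 1.0) ** 2          # quadratic FLOPS ramp
    reg = scale * (lambda_q * flops_penalty(q_vecs) + lambda_d * flops_penalty(d_vecs))
    return info_nce(q_vecs, d_vecs) + reg
```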
4. Retrieval Pipeline and Indexing
Echo-Mistral-SPLADE utilizes an inverted index pipeline optimized for sparse vectors:
- Index Construction:
  - Encode each document $d$ with Echo-Mistral-SPLADE to obtain its sparse vector $w^{(d)} \in \mathbb{R}^{|V|}$.
  - Retain only the top-$k$ activation values per document.
  - Build a posting list for each term $j$: $\{(d, w_j^{(d)}) : w_j^{(d)} > 0\}$.
- Query Time:
  - Encode the query to obtain $w^{(q)}$ and extract its nonzero terms.
  - For each active term $j$, retrieve its posting list and accumulate $w_j^{(q)} \cdot w_j^{(d)}$ into the document scores.
  - Return the top-scoring documents.
- **No reranking stage** is included in the current approach; a toy sketch of this pipeline follows below.
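A toy version of this pipeline, using plain Python dictionaries as posting lists, might look as follows; the data structures and pruning value are illustrative, not the paper's production indexing stack.

```python
# Toy inverted index over sparse term-weight vectors.
from collections import defaultdict

def build_index(doc_vectors: dict[str, dict[int, float]], top_k: int = 128):
    """doc_vectors maps doc_id -> {term_id: weight}; keep the top_k terms per document."""
    index: dict[int, list[tuple[str, float]]] = defaultdict(list)
    for doc_id, vec in doc_vectors.items():
        pruned = sorted(vec.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        for term_id, weight in pruned:
            index[term_id].append((doc_id, weight))   # posting list per term
    return index

def search(index, query_vec: dict[int, float], n_results: int = 10):
    """Score documents by the sparse dot product over active query terms only."""
    scores: dict[str, float] = defaultdict(float)
    for term_id, q_weight in query_vec.items():
        for doc_id, d_weight in index.get(term_id, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n_results]
```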
5. Empirical Evaluation and Benchmarking
Echo-Mistral-SPLADE exhibits strong empirical performance on industry-standard benchmarks:
BEIR Zero-Shot nDCG@10 (13 datasets)
| Model | Avg. nDCG@10 |
|---|---|
| BM25 | 44.02 |
| ColBERTv2 (dense, late interaction) | 49.95 |
| SPLADE v2 | 50.72 |
| SPLADE v3 | 51.68 |
| Elser v2 | 52.07 |
| BERT-base + Sent-Trans (sparse) | 47.89 |
| Echo-Mistral-SPLADE | 55.07 |
Echo-Mistral-SPLADE surpasses all prior sparse retrievers in this comparison, by margins ranging from roughly 3 points over Elser v2 to about 11 points over BM25 (Doshi et al., 20 Aug 2024).
Dense LLM Comparison (BEIR nDCG@10)
| Model | Avg. nDCG@10 |
|---|---|
| LLM2Vec on Mistral-7B (dense) | 57.6 |
| Echo-Embeddings on Mistral-7B | 56.7 |
| Echo-Mistral-SPLADE (sparse) | 55.1 |
While remaining fully sparse, Echo-Mistral-SPLADE stays within about 1.5–2.5 nDCG@10 points of dense LLM retrievers that were trained partly in-domain.
6. Ablation Studies and Model Analysis
Reported analyses indicate several critical findings:
- Echo Embeddings: Doubling the input and pooling only the second sequence notably improves term expansion quality relative to naïve (single-pass) causal encoding.
- Decoder-Only LLM Transition: Migrating from BERT-base to Mistral-7B, and thereby leveraging far more extensive pretraining, enables the model to learn more sophisticated keyword expansions, yielding a gain of roughly 3.4 BEIR nDCG@10 points over SPLADE v3 (55.07 vs. 51.68).
- FLOPS Regularization Scheduling: Quadratic ramping of the FLOPS coefficient $\lambda$ is essential; excessive regularization early in training impedes convergence.
- Expansion Semantics: Sparse vectors activate both direct query terms and semantically related expansions; for example, the input "first person land moon" yields expanded terms such as "Armstrong," "NASA," and "Apollo." A small sketch of how such expansions can be inspected follows below.
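Such expansions can be inspected by mapping the nonzero dimensions of a sparse vector back to vocabulary strings, as in this sketch (the tokenizer and `echo_splade_encode` are assumed to be those from the earlier sketch).

```python
# Inspect the top expansion terms of a sparse term-weight vector.
import torch

def top_expansions(term_weights: torch.Tensor, tokenizer, n: int = 10):
    """Return the n highest-weighted vocabulary terms with their weights."""
    values, indices = torch.topk(term_weights, n)
    return [(tokenizer.convert_ids_to_tokens(int(i)), float(v))
            for i, v in zip(indices, values) if v > 0]

# Example (hypothetical): top_expansions(echo_splade_encode("first person land moon"),
# tokenizer) would be expected to surface related terms such as "Armstrong" or "Apollo".
```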
7. Relation to SPLADE and Other IR Frameworks
SPLADE (Formal et al., 2021b) is a first-stage ranker employing BERT-based Siamese encoders and explicit sparsity regularization with log-saturation pooling (Formal et al., 2021). SPLADE recasts bag-of-words retrieval by learning contextual term expansions, balancing interpretability and efficiency through controlled sparsity, and remains compatible with inverted-index engines and scalable IR pipelines.
Echo-Mistral-SPLADE extends the SPLADE paradigm to decoder-only LLMs, utilizing echo embeddings and LoRA to adapt the MLM head for sparse term expansion. This shift enables the exploitation of the broader pretraining and latent knowledge of large decoder-only models, facilitating improved sparse retrieval without sacrificing efficiency or transparency.
A plausible implication is that further integration of decoder-only LLMs with SPLADE-type objectives could continue to advance the interpretability-effectiveness frontier in sparse neural information retrieval.
Echo-Mistral-SPLADE exemplifies the synthesis of large-scale decoder-only LLMs and sparse-expansion retrieval, yielding state-of-the-art zero-shot sparse retrieval performance within interpretable, scalable IR architectures (Doshi et al., 20 Aug 2024, Formal et al., 2021).