
Plug-in Embedding Pruning (PEP)

Updated 6 March 2026
  • Plug-in Embedding Pruning (PEP) is a method that prunes embedding representations via adaptive plug-in modules in both dense retrieval and recommendation systems.
  • It employs salience criteria such as inverse collection frequency and learnable masking thresholds to discard redundant embeddings while retaining performance.
  • PEP achieves significant efficiency gains, including a 70% reduction in candidate set size (a 2.65× latency speedup) for retrieval and up to 99.998% parameter reduction for recommendation tasks.

Plug-in Embedding Pruning (PEP) refers to a class of methods that prune or sparsify embedding representations in deep learning models via an adaptive, easily integrated plug-in mechanism. Two major instantiations have been proposed with distinct problem settings: (1) query embedding pruning for efficient dense retrieval (Tonellotto et al., 2021), and (2) embedding parameter pruning for memory-efficient recommender systems (Liu et al., 2021). Both employ a plug-in design and embedding-wise salience or redundancy criteria to achieve dramatic efficiency gains with minimal or no reduction in effectiveness.

1. Notational Foundations and Problem Settings

Let $V$ denote the vocabulary or set of feature IDs, and let $d$ be the embedding dimensionality. For dense retrieval (ColBERT) (Tonellotto et al., 2021), a query $q = \langle t_1,\ldots,t_n\rangle \in V^n$ is mapped by an encoder $f_Q$ to $m$ query-token embeddings, $\{\phi_i\}_{i=1}^m = f_Q(t_1,\ldots,t_n) \in \mathbb{R}^{m\times d}$; a document $d = \langle u_1,\ldots,u_L\rangle$ is similarly mapped by $f_D$ to $\{\psi_j\}_{j=1}^L = f_D(u_1,\ldots,u_L) \in \mathbb{R}^{L\times d}$. In recommendation (Liu et al., 2021), each feature field $i$ has an embedding table $V_i\in\mathbb{R}^{n_i\times d}$; the training data are $\mathcal{D}=\{(x_j,y_j)\}$.

2. PEP for Dense Retrieval: Embedding Pruning at Query Time

In the context of ColBERT, PEP reduces the number of query-token embeddings used for Approximate Nearest Neighbour (ANN) candidate retrieval. The ColBERT scoring function is

$$s(q,d) = \sum_{i=1}^m \max_{1\le j\le L} \langle \phi_i, \psi_j \rangle,$$

where $\langle\cdot,\cdot\rangle$ denotes the dot product.
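As a concrete illustration, the MaxSim score can be computed in a few lines of NumPy (a minimal sketch, not the official ColBERT implementation):

```python
import numpy as np

def maxsim_score(query_embs: np.ndarray, doc_embs: np.ndarray) -> float:
    """ColBERT MaxSim: sum over query tokens of the maximum dot product
    with any document-token embedding.

    query_embs: (m, d) array of query-token embeddings phi_i
    doc_embs:   (L, d) array of document-token embeddings psi_j
    """
    # (m, L) matrix of all pairwise dot products <phi_i, psi_j>
    sims = query_embs @ doc_embs.T
    # for each query token, keep its best-matching document token, then sum
    return float(sims.max(axis=1).sum())
```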

Candidate retrieval is handled by FAISS-based ANN search: for each query embedding $\phi_i$, retrieve its $k'$ nearest document embeddings $\Psi(\phi_i, k')$, and form the candidate document sets

$$D_i(k') = \{d : f_D(d) \cap \Psi(\phi_i, k') \neq \varnothing \}, \qquad D(k') = \bigcup_{i=1}^m D_i(k').$$

All documents in $D(k')$ are then scored exactly with the full MaxSim operator $s(q,d)$.
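The candidate-generation step can be sketched with a brute-force nearest-neighbour stand-in for the FAISS index (illustrative only; the function name `candidate_set` and the `doc_ids` mapping are assumptions, and a real system would use an ANN index rather than exhaustive search):

```python
import numpy as np

def candidate_set(query_embs, doc_embs, doc_ids, k_prime):
    """Brute-force stand-in for the ANN stage: for each query embedding
    phi_i, find its k' nearest document embeddings (by dot product) and
    collect the IDs of the documents they belong to.

    doc_embs: (N, d) array of all document-token embeddings in the index
    doc_ids:  length-N sequence mapping each embedding to its document ID
    Returns the union D(k') over all query embeddings.
    """
    candidates = set()
    for phi in query_embs:
        scores = doc_embs @ phi               # similarity to every indexed embedding
        top = np.argsort(-scores)[:k_prime]   # indices of the k' nearest embeddings
        candidates.update(np.asarray(doc_ids)[top].tolist())
    return candidates
```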

Pruning Criterion and Algorithmic Procedure

PEP ranks query embeddings $\phi_i$ by the Inverse Collection Frequency (ICF) of their corresponding tokens: $S(q_i) = \mathrm{ICF}(q_i) = 1/\mathrm{CF}(q_i)$, where $\mathrm{CF}(q_i)$ is the collection frequency; the rank-equivalent score $S(q_i) = -\log \mathrm{CF}(q_i)$ may be used instead. The $p$ embeddings with highest ICF are selected for candidate retrieval. The remaining, less informative embeddings are omitted from the ANN search, but all $m$ are reinstated for the final exact scoring.

The plug-in nature lies in its full query-time modularity: PEP is inserted after token encoding and before ANN search, and requires no retraining or index rebuilding.
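The pruning step itself reduces to a top-$p$ selection by ICF; a minimal sketch follows (the function name `prune_query_embeddings` and its interface are assumptions):

```python
import numpy as np

def prune_query_embeddings(query_embs, token_cfs, p):
    """Keep the p query-token embeddings whose tokens have the highest
    ICF (i.e., the lowest collection frequency) for the ANN candidate
    stage; all m embeddings are still used for the final exact scoring.

    query_embs: (m, d) array of query-token embeddings
    token_cfs:  length-m sequence of collection frequencies CF(q_i)
    p:          pruning budget (number of embeddings to retain)
    """
    icf = 1.0 / np.asarray(token_cfs, dtype=float)  # S(q_i) = 1 / CF(q_i)
    keep = np.argsort(-icf)[:p]                     # indices of the top-p ICF tokens
    return query_embs[keep], keep
```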

3. PEP for Recommendation: Learnable Embedding Parameter Pruning

In recommendation models, the embedding tables for high-cardinality categorical features are a principal memory bottleneck (Liu et al., 2021). PEP formulates embedding pruning as $L_0$-constrained optimization:

$$\min_{V,\Theta} \mathcal{L}(V, \Theta; \mathcal{D}) \quad \text{s.t.} \quad \|V\|_0 \leq k,$$

where $\|V\|_0$ counts nonzero entries. To permit gradient-based optimization, a differentiable masking function $\mathcal{S}(V,s)$ is constructed:

$$\widehat V = \operatorname{sign}(V) \max(|V| - g(s), 0),$$

with learnable thresholds $s$ parameterized at the desired granularity (global, dimension-wise, feature-wise, or feature-dimension-wise).
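The soft-thresholding re-parameterization can be sketched in NumPy (a minimal sketch; $g(s)=\sigma(s)$ is one common positive parameterization and is an assumption here):

```python
import numpy as np

def soft_threshold(V, s):
    """V_hat = sign(V) * max(|V| - g(s), 0), with g(s) = sigmoid(s) as one
    positive threshold parameterization (an assumption; any positive,
    differentiable g works). s broadcasts against V: a scalar gives a
    global threshold, a (1, d) array a dimension-wise one, an (n, 1)
    array a feature-wise one, and an (n, d) array a per-entry one.
    """
    g = 1.0 / (1.0 + np.exp(-np.asarray(s, dtype=float)))  # sigmoid threshold
    return np.sign(V) * np.maximum(np.abs(V) - g, 0.0)     # shrink-and-prune
```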

During training, gradient updates naturally drive the threshold variables to prune redundant components; at convergence, hard pruning is applied using the final learned mask, and compact mixed-dimension tables are constructed with all-zero rows dropped. For post-pruning retraining, PEP leverages the “winning-ticket” strategy: the final mask is applied to the embedding table’s original random initialization, and the model is then fine-tuned with that mask held fixed.
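The winning-ticket retraining setup can be sketched as follows (illustrative; the helper name and interface are assumptions):

```python
import numpy as np

def winning_ticket_retrain_init(V_init, V_hat_final):
    """Build the retraining starting point: the binary mask of surviving
    entries in the converged soft-thresholded table V_hat_final is
    applied to the ORIGINAL random initialization V_init; fine-tuning
    then proceeds with this mask held fixed. Rows that are entirely zero
    can be dropped to form a packed mixed-dimension table.
    """
    mask = (V_hat_final != 0.0).astype(V_init.dtype)  # 1 where an entry survived
    V_ticket = V_init * mask                          # masked initial weights
    keep_rows = mask.any(axis=1)                      # features with >= 1 live dim
    return V_ticket, mask, keep_rows
```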

Plug-in deployment is accomplished by replacing standard embedding layers with masked (sparse) versions; no further model or data pipeline modifications are necessary.

4. Empirical Results and Efficiency-Effectiveness Trade-offs

On MS MARCO passage ranking and TREC 2019 Deep Learning, pruning from $m=32$ to $p=3$ query embeddings with PEP yields:

  • MRR@10: 0.323 (baseline 0.323, no significant drop)
  • $|R|$ (avg. candidates): 2100 (down from 7000, a 70% reduction)
  • End-to-end latency: 173.7 ms (baseline 461.4 ms, a 2.65× speedup)
  • TREC metrics nDCG@10 and MAP: unchanged

| $p$ | MRR@10 | $\lvert R \rvert$ (candidates) | Latency (ms) |
|-----|--------|--------------------------------|--------------|
| 32  | 0.323  | 7000                           | 461.4        |
| 8   | 0.322  | 4200                           | 290.1        |
| 4   | 0.322  | 2600                           | 200.5        |
| 3   | 0.323  | 2100                           | 173.7        |
| 2   | 0.318  | 1500                           | 140.2        |

This demonstrates that discarding low-ICF embeddings in the candidate retrieval stage preserves retrieval quality while drastically reducing candidate pool size and latency.

PEP achieves a 97–99% reduction in embedding parameters on Criteo, MovieLens-1M, and Avazu with negligible loss in AUC (sometimes even an improvement). For Criteo with an FM base model, an "extreme" profile yields:

| Variant | AUC | Embedding Params | Reduction |
|---------|-----|------------------|-----------|
| Uniform Embedding ($d=64$) | 0.7890 | $64 \times 10^6$  | 0%       |
| DartsEmb (best point)      | 0.7940 | $4.5 \times 10^6$ | 93%      |
| PEP (extreme point)        | 0.7941 | $1.1 \times 10^3$ | 99.998%  |
| PEP (balanced)             | 0.7950 | $1.2 \times 10^5$ | 98.1%    |

PEP also incurs only 20–30% time overhead relative to the base model, and all operations (lookup, masking, retraining) are compatible with standard dataflow and parameter optimization.

5. Theoretical and Practical Insights

In dense retrieval, pruned embeddings primarily correspond to frequent, non-discriminative tokens; rare-token embeddings, selected by ICF, capture the relevant retrieval signal and yield the full pool of relevant documents early. ColBERT’s MaxSim operator magnifies the contribution of salient query tokens, further supporting the efficacy of the pruning strategy (Tonellotto et al., 2021).

For recommender systems, moving from the discrete choice of per-feature embedding size to learnable pruning thresholds introduces a smooth, end-to-end differentiable solution to embedding-parameter redundancy. The approach flexibly supports various granularities for threshold sharing, and the final packed embedding tables enable substantial memory and compute savings (Liu et al., 2021).

6. Limitations, Modularity, and Extensions

Both applications of PEP exhibit plug-in modularity: at query time for retrieval, at model-construction time for recommender systems. A fixed pruning budget (the number of retained query embeddings $p$, or the $L_0$ budget $k$) may under- or over-prune in some cases. Potential extensions include:

  • Adaptive pruning budgets based on query or feature characteristics.
  • Alternative salience criteria beyond collection frequency or simple masking, such as contextual importance.
  • Pruning at index-time (document-side for retrieval).
  • Generalization of the threshold-learning framework to other structured parameter settings.

A plausible implication is that plug-in embedding pruning frameworks can be further extended to other domains requiring large embedding tables or multiple embedded inputs, yielding efficiency gains with minimal architectural re-engineering.

7. Summary

Plug-in Embedding Pruning provides principled, low-overhead approaches to reduce embedding computation in both retrieval and recommendation. It achieves this by selecting salient embeddings during query processing (Tonellotto et al., 2021) or by learning and applying feature-specific or parameter-wise thresholds (Liu et al., 2021), yielding dramatic efficiency gains with little or no loss in accuracy. The modularity of PEP methods enables easy integration into existing systems, scaling to real-world industrial settings without cumbersome re-design.
