
Plug-in Embedding Pruning (PEP)

Updated 6 March 2026
  • Plug-in Embedding Pruning (PEP) is a method that prunes embedding representations via adaptive plug-in modules in both dense retrieval and recommendation systems.
  • It employs salience criteria such as inverse collection frequency and learnable masking thresholds to discard redundant embeddings while retaining performance.
  • PEP achieves significant efficiency gains, including a 70% reduction in candidate set size (a 2.65× latency speedup) for retrieval and up to 99.998% parameter reduction for recommendation tasks.

Plug-in Embedding Pruning (PEP) refers to a class of methods that prune or sparsify embedding representations in deep learning models via an adaptive, easily integrated plug-in mechanism. Two major instantiations have been proposed with distinct problem settings: (1) query embedding pruning for efficient dense retrieval (Tonellotto et al., 2021), and (2) embedding parameter pruning for memory-efficient recommender systems (Liu et al., 2021). Both employ a plug-in design and embedding-wise salience or redundancy criteria to achieve dramatic efficiency gains with minimal or no reduction in effectiveness.

1. Notational Foundations and Problem Settings

Let $V$ denote the vocabulary or set of feature IDs, and let $d$ be the embedding dimensionality. For dense retrieval (ColBERT) (Tonellotto et al., 2021), a query $q = \langle t_1,\ldots,t_n\rangle \in V^n$ is mapped by an encoder $f_Q$ to $m$ query-token embeddings, $\{\phi_i\}_{i=1}^m = f_Q(t_1,\ldots,t_n) \in \mathbb{R}^{m\times d}$; a document $d = \langle u_1,\ldots,u_L\rangle$ is similarly mapped by $f_D$ to $\{\psi_j\}_{j=1}^L = f_D(u_1,\ldots,u_L) \in \mathbb{R}^{L\times d}$. In recommendation (Liu et al., 2021), each feature field $i$ has an embedding table $V_i\in\mathbb{R}^{n_i\times d}$; the training data are $\mathcal{D}=\{(x_j,y_j)\}$.

2. PEP for Dense Retrieval: Embedding Pruning at Query Time

In the context of ColBERT, PEP reduces the number of query-token embeddings used for Approximate Nearest Neighbour (ANN) candidate retrieval. The ColBERT scoring function is

$$s(q,d) = \sum_{i=1}^m \max_{1\le j\le L} \langle \phi_i, \psi_j \rangle,$$

where $\langle\cdot,\cdot\rangle$ denotes the dot product.
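As a concrete illustration, the MaxSim score can be computed in a few lines of NumPy (a minimal sketch, not the official ColBERT implementation):

```python
import numpy as np

def maxsim_score(query_embs: np.ndarray, doc_embs: np.ndarray) -> float:
    """ColBERT MaxSim: sum over query tokens of the maximum dot product
    with any document-token embedding.

    query_embs: (m, d) array of query-token embeddings phi_i
    doc_embs:   (L, d) array of document-token embeddings psi_j
    """
    # (m, L) matrix of all pairwise dot products <phi_i, psi_j>
    sims = query_embs @ doc_embs.T
    # for each query token, keep its best-matching document token, then sum
    return float(sims.max(axis=1).sum())
```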

Candidate retrieval is handled by FAISS-based ANN search: for each query embedding $\phi_i$, retrieve its $k'$ nearest document embeddings $\Psi(\phi_i, k')$, and form the candidate document sets

$$D_i(k') = \{d : f_D(d) \cap \Psi(\phi_i, k') \neq \varnothing \}, \qquad D(k') = \bigcup_{i=1}^m D_i(k').$$

All documents in $D(k')$ are then scored exactly with the full MaxSim operator $s(q,d)$.
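The candidate-generation step can be sketched with a brute-force nearest-neighbour stand-in for the FAISS index (illustrative only; the function name `candidate_set` and the `doc_ids` mapping are assumptions, and a real system would use an ANN index rather than exhaustive search):

```python
import numpy as np

def candidate_set(query_embs, doc_embs, doc_ids, k_prime):
    """Brute-force stand-in for the ANN stage: for each query embedding
    phi_i, find its k' nearest document embeddings (by dot product) and
    collect the IDs of the documents they belong to.

    doc_embs: (N, d) array of all document-token embeddings in the index
    doc_ids:  length-N sequence mapping each embedding to its document ID
    Returns the union D(k') over all query embeddings.
    """
    candidates = set()
    for phi in query_embs:
        scores = doc_embs @ phi               # similarity to every indexed embedding
        top = np.argsort(-scores)[:k_prime]   # indices of the k' nearest embeddings
        candidates.update(np.asarray(doc_ids)[top].tolist())
    return candidates
```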

Pruning Criterion and Algorithmic Procedure

PEP ranks query embeddings $\phi_i$ by the Inverse Collection Frequency (ICF) of their corresponding tokens: $S(q_i) = \mathrm{ICF}(q_i) = 1/\mathrm{CF}(q_i)$, where $\mathrm{CF}(q_i)$ is the collection frequency; the rank-equivalent score $S(q_i) = -\log \mathrm{CF}(q_i)$ may be used instead. The $p$ embeddings with highest ICF are selected for candidate retrieval. The remaining, less informative embeddings are omitted from the ANN search, but all $m$ are reinstated for the final exact scoring.

The plug-in nature lies in its full query-time modularity: PEP is inserted after token encoding and before ANN search, and requires no retraining or index rebuilding.
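The pruning step itself reduces to a top-$p$ selection by ICF; a minimal sketch follows (the function name `prune_query_embeddings` and its interface are assumptions):

```python
import numpy as np

def prune_query_embeddings(query_embs, token_cfs, p):
    """Keep the p query-token embeddings whose tokens have the highest
    ICF (i.e., the lowest collection frequency) for the ANN candidate
    stage; all m embeddings are still used for the final exact scoring.

    query_embs: (m, d) array of query-token embeddings
    token_cfs:  length-m sequence of collection frequencies CF(q_i)
    p:          pruning budget (number of embeddings to retain)
    """
    icf = 1.0 / np.asarray(token_cfs, dtype=float)  # S(q_i) = 1 / CF(q_i)
    keep = np.argsort(-icf)[:p]                     # indices of the top-p ICF tokens
    return query_embs[keep], keep
```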

3. PEP for Recommendation: Learnable Embedding Parameter Pruning

In recommendation models, the embedding tables for high-cardinality categorical features are a principal memory bottleneck (Liu et al., 2021). PEP formulates embedding pruning as $L_0$-constrained optimization:

$$\min_{V,\Theta} \mathcal{L}(V, \Theta; \mathcal{D}) \quad \text{s.t.} \quad \|V\|_0 \leq k,$$

where $\|V\|_0$ counts nonzero entries. To permit gradient-based optimization, a differentiable masking function $\mathcal{S}(V,s)$ is constructed:

$$\widehat V = \operatorname{sign}(V) \max(|V| - g(s), 0),$$

with learnable thresholds $s$ parameterized at the desired granularity (global, dimension-wise, feature-wise, or feature-dimension-wise).
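The soft-thresholding re-parameterization can be sketched in NumPy (a minimal sketch; $g(s)=\sigma(s)$ is one common positive parameterization and is an assumption here):

```python
import numpy as np

def soft_threshold(V, s):
    """V_hat = sign(V) * max(|V| - g(s), 0), with g(s) = sigmoid(s) as one
    positive threshold parameterization (an assumption; any positive,
    differentiable g works). s broadcasts against V: a scalar gives a
    global threshold, a (1, d) array a dimension-wise one, an (n, 1)
    array a feature-wise one, and an (n, d) array a per-entry one.
    """
    g = 1.0 / (1.0 + np.exp(-np.asarray(s, dtype=float)))  # sigmoid threshold
    return np.sign(V) * np.maximum(np.abs(V) - g, 0.0)     # shrink-and-prune
```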

During training, gradient updates naturally drive the threshold variables to prune redundant components; at convergence, hard pruning is applied using the final learned mask, and compact mixed-dimension tables are constructed with all-zero rows dropped. For post-pruning retraining, PEP leverages the “winning-ticket” strategy: the final mask is applied to the embedding table’s original random initialization, and the model is then fine-tuned with that mask held fixed.
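The winning-ticket retraining setup can be sketched as follows (illustrative; the helper name and interface are assumptions):

```python
import numpy as np

def winning_ticket_retrain_init(V_init, V_hat_final):
    """Build the retraining starting point: the binary mask of surviving
    entries in the converged soft-thresholded table V_hat_final is
    applied to the ORIGINAL random initialization V_init; fine-tuning
    then proceeds with this mask held fixed. Rows that are entirely zero
    can be dropped to form a packed mixed-dimension table.
    """
    mask = (V_hat_final != 0.0).astype(V_init.dtype)  # 1 where an entry survived
    V_ticket = V_init * mask                          # masked initial weights
    keep_rows = mask.any(axis=1)                      # features with >= 1 live dim
    return V_ticket, mask, keep_rows
```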

Plug-in deployment is accomplished by replacing standard embedding layers with masked (sparse) versions; no further model or data pipeline modifications are necessary.

4. Empirical Results and Efficiency-Effectiveness Trade-offs

On MS MARCO passage ranking and TREC 2019 Deep Learning, pruning from $m=32$ to $p=3$ query embeddings with PEP yields:

  • MRR@10: 0.323 (baseline 0.323, no significant drop)
  • $|R|$ (avg. candidates): 2100 (down from 7000, a 70% reduction)
  • End-to-end latency: 173.7 ms (baseline 461.4 ms, a 2.65× speedup)
  • TREC metrics nDCG@10 and MAP: unchanged

| $p$ | MRR@10 | $\lvert R \rvert$ (candidates) | Latency (ms) |
|-----|--------|--------------------------------|--------------|
| 32  | 0.323  | 7000                           | 461.4        |
| 8   | 0.322  | 4200                           | 290.1        |
| 4   | 0.322  | 2600                           | 200.5        |
| 3   | 0.323  | 2100                           | 173.7        |
| 2   | 0.318  | 1500                           | 140.2        |

This demonstrates that discarding low-ICF embeddings in the candidate retrieval stage preserves retrieval quality while drastically reducing candidate pool size and latency.

PEP achieves a 97–99% reduction in embedding parameters on Criteo, MovieLens-1M, and Avazu with negligible loss in AUC (sometimes even an improvement). For Criteo with an FM base model, an "extreme" profile yields:

| Variant | AUC | Embedding Params | Reduction |
|---------|-----|------------------|-----------|
| Uniform Embedding ($d=64$) | 0.7890 | $64 \times 10^6$  | 0%       |
| DartsEmb (best point)      | 0.7940 | $4.5 \times 10^6$ | 93%      |
| PEP (extreme point)        | 0.7941 | $1.1 \times 10^3$ | 99.998%  |
| PEP (balanced)             | 0.7950 | $1.2 \times 10^5$ | 98.1%    |

PEP also incurs only 20–30% time overhead relative to the base model, and all operations (lookup, masking, retraining) are compatible with standard dataflow and parameter optimization.

5. Theoretical and Practical Insights

In dense retrieval, pruned embeddings primarily correspond to frequent, non-discriminative tokens; rare-token embeddings, selected by ICF, capture the relevant retrieval signal and yield the full pool of relevant documents early. ColBERT’s MaxSim operator magnifies the contribution of salient query tokens, further supporting the efficacy of the pruning strategy (Tonellotto et al., 2021).

For recommender systems, moving from the discrete choice of per-feature embedding size to learnable pruning thresholds introduces a smooth, end-to-end differentiable solution to embedding-parameter redundancy. The approach flexibly supports various granularities for threshold sharing, and the final packed embedding tables enable substantial memory and compute savings (Liu et al., 2021).

6. Limitations, Modularity, and Extensions

Both applications of PEP exhibit plug-in modularity: at query time for retrieval, at model-construction time for recommender systems. A fixed pruning budget (the number of retained query embeddings $p$, or the $L_0$ budget $k$) may under- or over-prune in some cases. Potential extensions include:

  • Adaptive pruning budgets based on query or feature characteristics.
  • Alternative salience criteria beyond collection frequency or simple masking, such as contextual importance.
  • Pruning at index-time (document-side for retrieval).
  • Generalization of the threshold-learning framework to other structured parameter settings.

A plausible implication is that plug-in embedding pruning frameworks can be further extended to other domains requiring large embedding tables or multiple embedded inputs, yielding efficiency gains with minimal architectural re-engineering.

7. Summary

Plug-in Embedding Pruning provides principled, low-overhead approaches to reduce embedding computation in both retrieval and recommendation. It achieves this by selecting salient embeddings during query processing (Tonellotto et al., 2021) or by learning and applying feature-specific or parameter-wise thresholds (Liu et al., 2021), yielding dramatic efficiency gains with little or no loss in accuracy. The modularity of PEP methods enables easy integration into existing systems, scaling to real-world industrial settings without cumbersome re-design.
