BERT-PSIE: Progressive Self-Attention Importance Elimination
- The paper introduces a novel method that leverages self-attention for token significance scoring to dynamically eliminate less important tokens and reduce inference latency.
- It employs two extraction modes—soft for configuration search via continuous retention parameters and hard for final inference, ensuring an optimal speed-accuracy trade-off.
- Evaluations on the GLUE benchmark show that PSIE achieves up to 4.5× speedup with less than 1% accuracy drop compared to BERT-base, outperforming related acceleration techniques.
BERT-PSIE (Progressive Self-Attention Importance Elimination) refers to a structured compression and acceleration framework for BERT models, introduced within the PoWER-BERT method. PSIE systematically reduces inference latency by exploiting redundancy in BERT's layerwise token representations. It achieves this by dynamically identifying and removing token embeddings—termed "word-vectors"—of low relative importance during the forward pass, while tightly controlling the trade-off between speed and accuracy through an auxiliary optimization over both which tokens to drop and how many to retain at each layer (Goyal et al., 2020).
1. Self-Attention–Derived Significance Scoring
PSIE defines a measure of token significance directly from the self-attention mechanism. At each encoder layer, for head $h$, the attention matrix is constructed as

$A_h = \mathrm{softmax}\!\left(\frac{Q_h K_h^{\top}}{\sqrt{d_h}}\right)$

for input token-vectors projected to queries $Q_h$ and keys $K_h$. Interpreting $A_h[w', w]$ as the attention flowing from token $w'$ to token $w$, the significance of token $w$ under head $h$ is defined as the sum of attention it receives:

$\sigma_h(w) = \sum_{w'} A_h[w', w]$

Aggregating across all 12 heads in the BERT-base model,

$\sigma(w) = \sum_{h=1}^{12} \sigma_h(w) = \sum_{h=1}^{12}\sum_{w'} A_h[w', w]$

Tokens with high $\sigma(w)$ are interpreted as being highly influential, as their embeddings are referenced frequently by other tokens throughout the model.
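As a concrete illustration, the aggregation above can be sketched in NumPy. The `token_significance` helper is hypothetical (not from the paper); it assumes the per-head attention tensor has already been computed, with each row softmax-normalized:

```python
import numpy as np

def token_significance(attention):
    """Compute sigma(w) = sum_h sum_{w'} A_h[w', w].

    attention: array of shape (num_heads, N, N), where attention[h, i, j]
    is the attention flowing from token i to token j in head h.
    Returns a length-N vector of significance scores, one per token.
    """
    # Sum over heads (axis 0) and over sending tokens w' (axis 1),
    # leaving one aggregated score per receiving token w.
    return attention.sum(axis=(0, 1))
```

Because each of the N rows in each head sums to 1, the scores across all tokens always sum to `num_heads * N`; only their distribution over tokens varies.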
2. Progressive Token Elimination Architecture
PSIE modifies the canonical 12-layer BERT stack by introducing variable-width "extract" operations between the self-attention and feed-forward sublayers at each encoder layer. Given an input of $N$ token-vectors, only the top $k_l$ are retained for further processing at layer $l$. PSIE employs two extraction modes:
- Hard-Extract: In the final inference and retraining phases, $\sigma(w)$ is computed for all tokens. The top $k_l$ (by score, always preserving the [CLS] token) are selected, and the remaining rows are discarded.
- Soft-Extract: Used during configuration search, where learnable retention parameters $r_j \in [0, 1]$ act as gates on the sorted token scores. Each row of the attention output is multiplicatively scaled by its corresponding $r_j$ value, allowing continuous (non-binary) retention. The parameters are optimized such that mass concentrates on the most significant tokens.
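A minimal sketch of the two modes, assuming hypothetical `hard_extract` and `soft_extract` helpers that operate on a score vector $\sigma$ and, for soft-extract, a gate vector $r$ indexed in descending-significance order:

```python
import numpy as np

def hard_extract(hidden, sigma, k):
    """Keep the k most significant token-vectors, always retaining [CLS].

    hidden: (N, d) token embeddings after self-attention; sigma: (N,) scores.
    Returns the reduced (k, d) matrix and the kept indices in sequence order.
    """
    keep = np.argsort(-sigma)[:k]             # top-k by descending score
    if 0 not in keep:                         # force-include [CLS] at index 0
        keep = np.concatenate(([0], keep[:k - 1]))
    keep = np.sort(keep)                      # restore original token order
    return hidden[keep], keep

def soft_extract(hidden, sigma, r):
    """Scale each token-vector by a gate in [0, 1]: the most significant
    token gets r[0], the next r[1], and so on (no token is discarded)."""
    order = np.argsort(-sigma)                # tokens by descending score
    gated = hidden.copy()
    gated[order] = hidden[order] * r[:, None]
    return gated
```

Hard-extract physically shrinks the matrix (saving compute); soft-extract keeps the shape fixed so the gates remain differentiable during configuration search.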
3. Layerwise Optimization of Token Retention
The selection of how many tokens to keep at each layer is learned by augmenting the standard BERT loss with an L1 penalty on the total "retained mass" at each layer:

$\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda \sum_{l=1}^{12} \sum_{j} r_{l,j}$

The weighting factor $\lambda$ controls how aggressively tokens are pruned; because eliminated tokens cannot reappear in later layers, deeper layers (where token diffusion is strongest) end up pruned the most. After several epochs of joint optimization with Adam, a discrete retention configuration is extracted for each layer by setting

$k_l = \mathrm{round}\Big(\sum_{j} r_{l,j}\Big)$
Soft-extract layers are then replaced with these hard quotas, and the model is further retrained to recover any accuracy loss from the hard elimination.
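Under the assumption that the penalty is a plain L1 term weighted by $\lambda$ and that quotas are read off by rounding each layer's retained mass (both reconstructions, not verbatim from the paper), the objective and schedule extraction might look like:

```python
import numpy as np

def regularized_loss(task_loss, gates, lam):
    """Task loss plus an L1 penalty on the total retained mass.

    gates: list of per-layer retention vectors r_l, entries in [0, 1].
    lam: regularization weight lambda trading off speed vs. accuracy.
    """
    mass = sum(float(np.sum(r)) for r in gates)
    return task_loss + lam * mass

def hard_quotas(gates):
    """Read off the discrete schedule k_l by rounding each layer's
    retained mass, keeping at least one token (for [CLS])."""
    return [max(1, int(round(float(np.sum(r))))) for r in gates]
```

Larger `lam` drives more gate values toward zero, yielding smaller quotas and hence faster (but potentially less accurate) inference.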
4. Inference Pipeline
Once the retention schedule is fixed, inference proceeds as follows:
- Input is tokenized and embedded into an $N \times d$ matrix ($d = 768$ for BERT-base).
- For each layer $l = 1, \dots, 12$:
  - Self-attention is computed as usual.
  - $\sigma(w)$ scores are determined for all remaining tokens.
  - The top $k_l$ tokens (by $\sigma(w)$, always including [CLS]) are selected.
  - The reduced embedding matrix (of size $k_l \times d$) proceeds to the next layer.
- Final classification is performed using only the [CLS] embedding.
This iterative reduction decreases both the number of attention computations and the number of feed-forward operations per layer. In practice, aggregate computation is reduced by a factor of approximately $12N / \sum_{l} k_l$, with observed speedups up to 4.5× under standard task accuracy budgets.
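The pipeline above can be sketched end-to-end with a toy, unparameterized attention stand-in (softmax of $HH^\top/\sqrt{d}$ broadcast across heads, not BERT's learned multi-head attention, and with the feed-forward sublayer omitted) to show the progressive shrinkage:

```python
import numpy as np

def toy_self_attention(hidden, num_heads=12):
    """Stand-in for self-attention: softmax(H H^T / sqrt(d)) H, with the
    same attention pattern broadcast across heads for illustration."""
    n, d = hidden.shape
    scores = hidden @ hidden.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    attention = np.broadcast_to(attn, (num_heads, n, n))
    return attn @ hidden, attention

def psie_forward(embeddings, schedule):
    """Run the layer loop with a fixed retention schedule [k_1, ..., k_L],
    shrinking the token matrix each layer; return the [CLS] embedding."""
    hidden = embeddings
    for k in schedule:
        hidden, attention = toy_self_attention(hidden)
        sigma = attention.sum(axis=(0, 1))        # significance per token
        keep = np.argsort(-sigma)[:k]             # top-k tokens
        if 0 not in keep:                         # always retain [CLS]
            keep = np.concatenate(([0], keep[:k - 1]))
        hidden = hidden[np.sort(keep)]            # reduced k x d matrix
    return hidden[0]                              # [CLS] row for the classifier
```

With a schedule like `[6, 4, 2]` on an 8-token input, the matrix shrinks from 8 to 6 to 4 to 2 rows, and only the final [CLS] vector reaches the classifier head.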
5. Evaluation on GLUE and Comparative Performance
On the GLUE benchmark, PSIE was evaluated for BERT-base with $\lambda$ tuned to maintain accuracy within 1% of fine-tuned BERT-base. The following table summarizes results for a selection of representative tasks:
| Task | Metric | BERT-base | PSIE | Speedup |
|---|---|---|---|---|
| CoLA | Matt. Corr. | 52.5 | 52.3 | 4.5× |
| RTE | Acc. | 68.1 | 67.4 | 3.4× |
| QQP | F1 | 71.2 | 70.2 | 4.5× |
| MRPC | F1 | 88.7 | 88.1 | 2.7× |
| SST-2 | Acc. | 93.0 | 92.1 | 2.4× |
| MNLI-m | Acc. | 84.6 | 83.8 | 2.6× |
| QNLI | Acc. | 91.0 | 90.1 | 2.0× |
| STS-B | Spearman | 85.8 | 85.1 | 2.0× |
PSIE outperforms other acceleration methods such as DistilBERT, PKD-BERT, and head-pruning. At matched accuracy, PSIE achieves 1.5–3× lower latency; at fixed latency, it surpasses these approaches by as much as 10 GLUE points.
6. Training and Deployment Workflow
The PSIE workflow consists of three training stages followed by inference:
- Task fine-tuning: Standard BERT fine-tuning on the downstream task.
- Configuration search with Soft-Extract: Introduction of the retention parameters $r_{l,j}$, joint optimization over BERT weights and retention gates with the sparsity-regularized loss, and reading off hard retention schedules.
- Retraining with Hard-Extract: Substitution of soft gates with discrete token selections, retraining to recover any accuracy loss.
- Inference: Progressive elimination and embedding reduction as described above.
The number of new parameters is negligible (one per token per layer during configuration search), and the total training overhead is minimal, adding at most three epochs beyond standard fine-tuning. At inference, PSIE typically processes 3–6× fewer token-vectors while maintaining tight accuracy bounds (Goyal et al., 2020).
7. Context, Practical Impact, and Comparison
The PSIE framework formalizes and automates a token-level sparsity mechanism well aligned with empirical observations of token “diffusion” in BERT’s encoder stack. By integrating self-attention scores into the elimination logic and learning both token ranking and retention counts as optimizable parameters, PSIE attains substantial computational savings with negligible loss in task performance. In extensive evaluation on GLUE, it consistently achieves superior latency-accuracy trade-offs relative to prior methods involving model distillation, partial knowledge distillation, or head pruning. A plausible implication is that progressive, data-driven pruning at the token level represents a more fine-grained and effective compression axis for modern transformer architectures than coarse architectural modifications or post-hoc distillation.