BERT-PSIE: Progressive Self-Attention Importance Elimination
- The paper introduces a novel method that leverages self-attention for token significance scoring to dynamically eliminate less important tokens and reduce inference latency.
- It employs two extraction modes—soft for configuration search via continuous retention parameters and hard for final inference, ensuring an optimal speed-accuracy trade-off.
- Evaluations on the GLUE benchmark show that PSIE achieves up to 4.5× speedup with less than 1% accuracy drop compared to BERT-base, outperforming related acceleration techniques.
BERT-PSIE (Progressive Self-Attention Importance Elimination) refers to a structured compression and acceleration framework for BERT models, introduced within the PoWER-BERT method. PSIE systematically reduces inference latency by exploiting redundancy in BERT's layerwise token representations. It achieves this by dynamically identifying and removing token embeddings—termed "word-vectors"—of low relative importance during the forward pass, while tightly controlling the trade-off between speed and accuracy through an auxiliary optimization over both which tokens to drop and how many to retain at each layer (Goyal et al., 2020).
1. Self-Attention–Derived Significance Scoring
PSIE defines a measure of token significance directly from the self-attention mechanism. At each encoder layer, for head $h$, the attention matrix is constructed as

$A_h = \mathrm{softmax}\!\left(\frac{Q_h K_h^{\top}}{\sqrt{d_h}}\right)$

for input token-vectors projected to queries $Q_h$ and keys $K_h$. Interpreting $A_h[w', w]$ as the attention flowing from token $w'$ to token $w$, the significance of token $w$ under head $h$ is defined as the sum of attention it receives:

$\sigma_h(w) = \sum_{w'} A_h[w', w]$

Aggregating across all 12 heads in the BERT-base model,

$\sigma(w) = \sum_{h=1}^{12} \sigma_h(w) = \sum_{h=1}^{12}\sum_{w'} A_h[w', w]$

Tokens with high $\sigma(w)$ are interpreted as being highly influential, as their embeddings are referenced frequently by other tokens throughout the model.
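As a concrete illustration, the aggregation above can be sketched in NumPy. The `token_significance` helper is hypothetical (not from the paper); it assumes the per-head attention tensor has already been computed, with each row softmax-normalized:

```python
import numpy as np

def token_significance(attention):
    """Compute sigma(w) = sum_h sum_{w'} A_h[w', w].

    attention: array of shape (num_heads, N, N), where attention[h, i, j]
    is the attention flowing from token i to token j in head h.
    Returns a length-N vector of significance scores, one per token.
    """
    # Sum over heads (axis 0) and over sending tokens w' (axis 1),
    # leaving one aggregated score per receiving token w.
    return attention.sum(axis=(0, 1))
```

Because each of the N rows in each head sums to 1, the scores across all tokens always sum to `num_heads * N`; only their distribution over tokens varies.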
2. Progressive Token Elimination Architecture
PSIE modifies the canonical 12-layer BERT stack by introducing variable-width "extract" operations between the self-attention and feed-forward sublayers at each encoder layer. Given an input of $N$ token-vectors, only the top $k_l$ are retained for further processing at layer $l$. PSIE employs two extraction modes:
- Hard-Extract: In the final inference and retraining phases, $\sigma(w)$ is computed for all tokens. The top $k_l$ (by score, always preserving the [CLS] token) are selected, and the remaining rows are discarded.
- Soft-Extract: Used during configuration search, where learnable retention parameters $r_j \in [0, 1]$ act as gates on the sorted token scores. Each row of the attention output is multiplicatively scaled by its corresponding $r_j$ value, allowing continuous (non-binary) retention. The parameters are optimized such that mass concentrates on the most significant tokens.
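A minimal sketch of the two modes, assuming hypothetical `hard_extract` and `soft_extract` helpers that operate on a score vector $\sigma$ and, for soft-extract, a gate vector $r$ indexed in descending-significance order:

```python
import numpy as np

def hard_extract(hidden, sigma, k):
    """Keep the k most significant token-vectors, always retaining [CLS].

    hidden: (N, d) token embeddings after self-attention; sigma: (N,) scores.
    Returns the reduced (k, d) matrix and the kept indices in sequence order.
    """
    keep = np.argsort(-sigma)[:k]             # top-k by descending score
    if 0 not in keep:                         # force-include [CLS] at index 0
        keep = np.concatenate(([0], keep[:k - 1]))
    keep = np.sort(keep)                      # restore original token order
    return hidden[keep], keep

def soft_extract(hidden, sigma, r):
    """Scale each token-vector by a gate in [0, 1]: the most significant
    token gets r[0], the next r[1], and so on (no token is discarded)."""
    order = np.argsort(-sigma)                # tokens by descending score
    gated = hidden.copy()
    gated[order] = hidden[order] * r[:, None]
    return gated
```

Hard-extract physically shrinks the matrix (saving compute); soft-extract keeps the shape fixed so the gates remain differentiable during configuration search.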
3. Layerwise Optimization of Token Retention
The selection of how many tokens to keep at each layer is learned by augmenting the standard BERT loss with an L1 penalty on the total "retained mass" at each layer:

$\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda \sum_{l=1}^{12} \sum_{j} r_{l,j}$

The weighting factor $\lambda$ controls how aggressively tokens are pruned; because eliminated tokens cannot reappear in later layers, deeper layers (where token diffusion is strongest) end up pruned the most. After several epochs of joint optimization with Adam, a discrete retention configuration is extracted for each layer by setting

$k_l = \mathrm{round}\Big(\sum_{j} r_{l,j}\Big)$
Soft-extract layers are then replaced with these hard quotas, and the model is further retrained to recover any accuracy loss from the hard elimination.
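Under the assumption that the penalty is a plain L1 term weighted by $\lambda$ and that quotas are read off by rounding each layer's retained mass (both reconstructions, not verbatim from the paper), the objective and schedule extraction might look like:

```python
import numpy as np

def regularized_loss(task_loss, gates, lam):
    """Task loss plus an L1 penalty on the total retained mass.

    gates: list of per-layer retention vectors r_l, entries in [0, 1].
    lam: regularization weight lambda trading off speed vs. accuracy.
    """
    mass = sum(float(np.sum(r)) for r in gates)
    return task_loss + lam * mass

def hard_quotas(gates):
    """Read off the discrete schedule k_l by rounding each layer's
    retained mass, keeping at least one token (for [CLS])."""
    return [max(1, int(round(float(np.sum(r))))) for r in gates]
```

Larger `lam` drives more gate values toward zero, yielding smaller quotas and hence faster (but potentially less accurate) inference.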
4. Inference Pipeline
Once the retention schedule is fixed, inference proceeds as follows:
- Input is tokenized and embedded into an $N \times d$ matrix ($d = 768$ for BERT-base).
- For each layer $l = 1, \dots, 12$:
  - Self-attention is computed as usual.
  - $\sigma(w)$ scores are determined for all remaining tokens.
  - The top $k_l$ tokens (by $\sigma(w)$, always including [CLS]) are selected.
  - The reduced embedding matrix (of size $k_l \times d$) proceeds to the next layer.
- Final classification is performed using only the [CLS] embedding.
This iterative reduction decreases both the number of attention computations and the number of feed-forward operations per layer. In practice, aggregate computation is reduced by a factor of approximately $12N / \sum_{l} k_l$, with observed speedups up to 4.5× under standard task accuracy budgets.
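The pipeline above can be sketched end-to-end with a toy, unparameterized attention stand-in (softmax of $HH^\top/\sqrt{d}$ broadcast across heads, not BERT's learned multi-head attention, and with the feed-forward sublayer omitted) to show the progressive shrinkage:

```python
import numpy as np

def toy_self_attention(hidden, num_heads=12):
    """Stand-in for self-attention: softmax(H H^T / sqrt(d)) H, with the
    same attention pattern broadcast across heads for illustration."""
    n, d = hidden.shape
    scores = hidden @ hidden.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    attention = np.broadcast_to(attn, (num_heads, n, n))
    return attn @ hidden, attention

def psie_forward(embeddings, schedule):
    """Run the layer loop with a fixed retention schedule [k_1, ..., k_L],
    shrinking the token matrix each layer; return the [CLS] embedding."""
    hidden = embeddings
    for k in schedule:
        hidden, attention = toy_self_attention(hidden)
        sigma = attention.sum(axis=(0, 1))        # significance per token
        keep = np.argsort(-sigma)[:k]             # top-k tokens
        if 0 not in keep:                         # always retain [CLS]
            keep = np.concatenate(([0], keep[:k - 1]))
        hidden = hidden[np.sort(keep)]            # reduced k x d matrix
    return hidden[0]                              # [CLS] row for the classifier
```

With a schedule like `[6, 4, 2]` on an 8-token input, the matrix shrinks from 8 to 6 to 4 to 2 rows, and only the final [CLS] vector reaches the classifier head.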
5. Evaluation on GLUE and Comparative Performance
On the GLUE benchmark, PSIE was evaluated for BERT-base with $\lambda$ tuned to maintain accuracy within 1% of fine-tuned BERT-base. The following table summarizes results for a selection of representative tasks:
| Task | Metric | BERT-base | PSIE | Speedup |
|---|---|---|---|---|
| CoLA | Matt. Corr. | 52.5 | 52.3 | 4.5× |
| RTE | Acc. | 68.1 | 67.4 | 3.4× |
| QQP | F1 | 71.2 | 70.2 | 4.5× |
| MRPC | F1 | 88.7 | 88.1 | 2.7× |
| SST-2 | Acc. | 93.0 | 92.1 | 2.4× |
| MNLI-m | Acc. | 84.6 | 83.8 | 2.6× |
| QNLI | Acc. | 91.0 | 90.1 | 2.0× |
| STS-B | Spearman | 85.8 | 85.1 | 2.0× |
PSIE outperforms other acceleration methods such as DistilBERT, PKD-BERT, and head-pruning. At matched accuracy, PSIE achieves 1.5–3× lower latency; at fixed latency, it surpasses these approaches by as much as 10 GLUE points.
6. Training and Deployment Workflow
The PSIE workflow consists of three training stages followed by inference:
- Task fine-tuning: Standard BERT fine-tuning on the downstream task.
- Configuration search with Soft-Extract: Introduction of the retention parameters $r_{l,j}$, joint optimization over BERT weights and retention gates with the sparsity-regularized loss, and reading off hard retention schedules.
- Retraining with Hard-Extract: Substitution of soft gates with discrete token selections, retraining to recover any accuracy loss.
- Inference: Progressive elimination and embedding reduction as described above.
The number of new parameters is negligible (one per token per layer during configuration search), and the total training overhead is minimal, adding at most three epochs beyond standard fine-tuning. At inference, PSIE typically processes 3–6× fewer token-vectors while maintaining tight accuracy bounds (Goyal et al., 2020).
7. Context, Practical Impact, and Comparison
The PSIE framework formalizes and automates a token-level sparsity mechanism well aligned with empirical observations of token “diffusion” in BERT’s encoder stack. By integrating self-attention scores into the elimination logic and learning both token ranking and retention counts as optimizable parameters, PSIE attains substantial computational savings with negligible loss in task performance. In extensive evaluation on GLUE, it consistently achieves superior latency-accuracy trade-offs relative to prior methods involving model distillation, partial knowledge distillation, or head pruning. A plausible implication is that progressive, data-driven pruning at the token level represents a more fine-grained and effective compression axis for modern transformer architectures than coarse architectural modifications or post-hoc distillation.