Mix-Precision Multi-Round Filtering (MP-MRF)
- Mix-Precision Multi-Round Filtering (MP-MRF) is a dynamic algorithm that filters transformer query–key pairs in multiple low-bit precision rounds to achieve efficient attention computation.
- It employs sequential candidate pruning with adaptive thresholding and progressively higher bitwidths, maintaining high accuracy while drastically reducing compute and energy costs.
- Integrated into the Energon system, MP-MRF enables substantial performance, memory, and energy savings on resource-constrained devices without the need for model retraining.
Mix-Precision Multi-Round Filtering (MP-MRF) is a runtime algorithm for dynamically identifying high-importance query–key pairs in transformer attention, with the objective of drastically reducing both computational and memory requirements. MP-MRF achieves this by performing attention filtering in multiple precision-adaptive rounds, each using low-bitwidth arithmetic, before computing the final attention on a pruned and sparsified set of keys at full precision. This approach retains the accuracy of dense attention while facilitating significant energy savings and performance gains, especially on resource-constrained or edge platforms. MP-MRF is the cornerstone of the Energon system, a full-stack co-design hardware architecture for efficient transformer acceleration (Zhou et al., 2021).
1. Motivation and Context
The computational bottleneck in transformer models arises from the quadratic complexity of standard dense attention: for sequence length n and head dimension d, the attention operation requires O(n²·d) multiply–accumulate operations (MACs) and O(n²) data movement. This cost is prohibitive for edge and mobile inference. Prior approaches such as top-k search or cascaded token pruning often provide limited computational relief or result in severe accuracy loss unless extensive retraining is performed.
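To make the quadratic cost concrete, here is a back-of-envelope MAC count for the n × n score matrix QKᵀ alone; the sequence length and head dimension below are illustrative values, not figures from the paper:

```python
n, d = 4096, 64               # example sequence length and head dimension
macs_scores = n * n * d       # multiply-accumulates for the n x n score matrix
print(f"{macs_scores / 1e9:.2f} GMACs per head")  # -> 1.07 GMACs per head
```

Multiplying by the number of heads and layers makes clear why dense attention dominates inference cost at long sequence lengths.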
MP-MRF addresses these shortcomings by providing a dynamic, runtime pruning mechanism that is hardware-friendly and requires no retraining. It leverages the empirical observation that the softmax attention output is dominated by a small and variable subset of query–key pairs, suggesting that aggressive runtime pruning is possible if high-importance pairs can be identified accurately and efficiently.
2. Algorithmic Description
The MP-MRF algorithm operates in sequential rounds (typically two), each at progressively higher bitwidth, followed by a final high-precision sparse attention computation. The process is conducted independently for each query and attention head.
Workflow Steps:
- Quantized Storage: The Q, K, and V tensors (shape n × d per head) are stored as INT16 arrays.
- Bitwidth Truncation: At filtering round r, only the top b_r bits of each value in Q and K are retained, yielding truncated tensors Q_r and K_r (e.g., INT2 for round 0, INT4 for round 1).
- Candidate Maintenance: A per-query candidate key index list C (initialized to all n key indices) tracks surviving key indices through rounds.
- Scoring: For each surviving candidate i in C, compute an approximate score s_i as the dot product of the truncated query and key vectors, using b_r-bit arithmetic.
- Thresholding: Compute a round-dependent threshold θ_r per query (see §3 below) and update C to retain only those candidates i for which s_i ≥ θ_r.
- Final Sparse Attention: Using the full-precision (INT16) tensors, compute final attention on the pruned candidate set: out = softmax(q · K_Cᵀ / √d) · V_C, where K_C and V_C contain only the rows indexed by C.
Default Parameters: two filtering rounds; bitwidths b_0 = 2 and b_1 = 4; threshold parameters tuned on development data.
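The workflow above can be sketched in NumPy as follows. This is a minimal software model, not the hardware implementation: the function names (`truncate_top_bits`, `mp_mrf_attention`), the arithmetic-shift truncation, and the mean/max threshold rule are simplifications assumed for illustration.

```python
import numpy as np

def truncate_top_bits(x_int16, bits):
    # Keep only the top `bits` bits of each INT16 value via a
    # sign-preserving arithmetic shift, emulating low-bitwidth operands.
    return (x_int16 >> (16 - bits)).astype(np.int32)

def mp_mrf_attention(q, K, V, bitwidths=(2, 4), alpha=0.5):
    # q: (d,) INT16 query; K, V: (n, d) INT16 key/value matrices.
    n, d = K.shape
    cand = np.arange(n)                      # candidate key index list C
    for b in bitwidths:                      # filtering rounds at 2, then 4 bits
        qb = truncate_top_bits(q, b)
        Kb = truncate_top_bits(K[cand], b)
        scores = Kb @ qb                     # approximate b-bit dot products
        theta = alpha * scores.max() + (1 - alpha) * scores.mean()
        cand = cand[scores >= theta]         # retain survivors only
    # Final sparse attention at full precision over the surviving keys.
    s = (K[cand].astype(np.float64) @ q.astype(np.float64)) / np.sqrt(d)
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V[cand].astype(np.float64), cand
```

Because the threshold lies between the round's mean and maximum score, the top-scoring candidate always survives, so the candidate list never empties.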
3. Mathematical Formulation
MP-MRF's filtering is implemented via approximate dot products and a tunable thresholding rule in each round.
- Score Calculation: In round r, each surviving key i receives the approximate score s_i = q_r · k_i, the dot product of the b_r-bit truncated query and key vectors.
- Thresholding Scheme: For each query and round r, let m_r denote the mean and M_r the maximum of the scores s_i over the surviving candidates. The threshold is given by:

  θ_r = α · M_r + (1 − α) · m_r,  α ∈ [0, 1]

Here, the tunable parameter α controls pruning tightness, with the threshold sweeping between the mean and the extremal value as α goes from 0 to 1.
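A small numeric illustration of this rule, using made-up per-query scores (m and M are the mean and maximum as defined above):

```python
# Hypothetical per-query scores for one filtering round.
scores = [1.0, 2.0, 3.0, 4.0, 10.0]
m = sum(scores) / len(scores)              # mean    m_r = 4.0
M = max(scores)                            # maximum M_r = 10.0
for alpha in (0.0, 0.5, 1.0):
    theta = alpha * M + (1 - alpha) * m    # threshold sweeps mean -> max
    kept = [s for s in scores if s >= theta]
    print(alpha, theta, kept)
# alpha = 0.0 keeps everything at or above the mean ([4.0, 10.0]);
# alpha = 1.0 keeps only the maximal score ([10.0]).
```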
- Filtering Complexity: Denoting ρ_0 as the candidate fraction surviving after round 0 and ρ_1 after round 1, the per-query MAC cost is roughly n·d at b_0 bits, plus ρ_0·n·d at b_1 bits, plus ρ_1·n·d at full precision.
For ρ_1 = 1/4 to 1/8 (4–8× pruning), with b_0 = 2 and b_1 = 4, the total MAC energy is reduced by roughly 4–8×, with an extra energy cut from low-bitwidth filtering (Zhou et al., 2021).
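The savings implied by this cost breakdown can be sketched under the simplifying assumption that per-MAC energy scales quadratically with operand bitwidth (a common first-order model; the paper's exact energy model may differ). The survivor fractions below are illustrative, not measured values:

```python
def relative_mac_energy(b0, b1, rho0, rho1, b_full=16):
    # MP-MRF MAC energy relative to dense INT16 attention, assuming
    # per-MAC energy ~ bitwidth^2 (a first-order model).
    e = lambda b: (b / b_full) ** 2
    # Round 0 scores all keys at b0 bits, round 1 scores the rho0
    # survivors at b1 bits, and final attention runs on the rho1
    # survivors at full precision.
    return e(b0) + rho0 * e(b1) + rho1 * 1.0

# Illustrative survivor fractions (not the paper's figures).
r = relative_mac_energy(b0=2, b1=4, rho0=0.5, rho1=0.125)
print(f"{1 / r:.1f}x MAC-energy reduction")  # -> 5.8x
```

With 8× final pruning (ρ_1 = 1/8), this toy model lands in the 4–8× reduction range cited above; the cheap INT2/INT4 rounds add little on top of the sparse full-precision pass.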
4. Empirical Accuracy, Efficiency, and Error Behavior
Empirical evaluations demonstrate that MP-MRF, with two filtering rounds, bitwidths (2, 4), and tuned threshold values, achieves up to 8–16× overall pruning with negligible accuracy loss: under 0.5% drop in F1 (SQuAD), under 0.2 increase in perplexity (WikiText), and under 0.5% top-1 accuracy loss on ImageNet.
MP-MRF achieves 91–97% "top-k coverage" compared to exhaustive sort, even at aggressive reduction ratios. The multi-round design minimizes the risk of losing important pairs: initial coarse rounds at very low bitwidth may temporarily discard some candidates, but higher-precision later rounds re-evaluate survivors, preventing irreversible loss of key information. The method is fully compatible with pretrained models and requires no retraining.
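Top-k coverage can be computed as the fraction of the exact top-k keys (ranked by full-precision score) that survive pruning; a sketch under that assumed definition, with made-up scores:

```python
import numpy as np

def topk_coverage(exact_scores, kept_indices, k):
    # Fraction of the k highest-scoring keys (by exact full-precision
    # score) that survive the multi-round pruning.
    topk = set(np.argsort(exact_scores)[-k:].tolist())
    return len(topk & set(kept_indices)) / k

# Keys 1, 3, 5 are the true top-3; the pruned set retains all of them.
scores = np.array([0.1, 0.9, 0.3, 0.8, 0.2, 0.7])
print(topk_coverage(scores, kept_indices=[0, 1, 3, 5], k=3))  # -> 1.0
```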
5. Hardware Implementation in the Energon Co-Processor
MP-MRF maps efficiently onto specialized hardware. In the Energon co-processor architecture, operations are distributed across Filtering Units (FUs) and Attention Units (AUs):
- Filtering Unit (FU): On-chip buffers manage quantized bits; multi-precision processing elements (PEs) can reuse the same 4-bit multipliers for both filtering rounds. The selector module computes statistics (min, max, mean) and performs parallel candidate pruning. Full pipelining is used, typically employing 8–64 PEs per query.
- Attention Unit (AU): Fetches only the pruned vectors ("on-demand fetch") to small local buffers. MAC arrays perform final computations and softmax is implemented using parallel exp–Taylor approximation. Double-buffering overlaps computation and data movement.
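The AU's exp–Taylor softmax can be mimicked in software with a truncated Maclaurin series; the scaling-and-squaring step, the series order, and the function names below are illustrative choices for numerical robustness, not the hardware's actual configuration:

```python
import numpy as np

def taylor_exp(x, order=4, squarings=4):
    # Scaling and squaring: exp(x) = exp(x / 2^m)^(2^m), with the small
    # argument approximated by a truncated Maclaurin series
    # 1 + y + y^2/2! + ... + y^order/order!.
    y = x / 2.0 ** squarings
    acc = np.ones_like(y)
    term = np.ones_like(y)
    for k in range(1, order + 1):
        term = term * y / k
        acc = acc + term
    for _ in range(squarings):
        acc = acc * acc          # repeated squaring recovers exp(x)
    return acc

def taylor_softmax(s, order=4):
    # Max-subtraction keeps all arguments non-positive, bounding the
    # series error before normalization.
    e = taylor_exp(s - s.max(), order)
    return e / e.sum()
```

In hardware the series terms for all surviving keys are evaluated in parallel, which is what makes this approximation attractive relative to an exact exponential unit.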
- Energy Optimizations: Early-round INT2/INT4 filtering and DRAM access reduction (by up to 70–90%) lead to 1.3–1.5× DRAM energy savings.
A summary of the MP-MRF pipeline in hardware:
| Stage | Bitwidth | Primary Function |
|---|---|---|
| Filtering Round 0 | 2 | Coarse candidate pruning |
| Filtering Round 1 | 4 | Finer pruning of survivors |
| Final Sparse Attention | 16 | Accurate attention over pruned set |
6. Comparative Results and Performance Benchmarks
On ARM A72, Energon-edge with MP-MRF reduces BERT-base attention latency by 73–3057× (depending on sequence length), with MP-MRF alone accounting for 8.3× of the gain relative to dense INT16 attention. On the embedded NVIDIA Jetson TX2, speedups of 3.4–764× are observed. The Energon-server system achieves a 168× geometric-mean speedup over a Xeon 5220 and 8.7× over an NVIDIA V100 for dense attention.
Energon-edge draws 2.7 W total (vs. 4 W for CPU + DRAM, or 0.9 W for DRAM alone, on the Jetson), yielding 10³–10⁴× energy reductions relative to conventional CPU/GPU attention. Within the accelerator, MP-MRF brings a direct energy saving of about 3.7×, with on-demand fetch providing an additional 1.3×.
When compared to prior sparse attention methods:
- SpAtten (token pruning, no retraining): At 16× pruning, MP-MRF delivers more than twice the accuracy; for equal accuracy, 5.3× higher pruning (1.7× throughput and 1.6× energy efficiency).
- A³ (8-bit approximate attention): Energon's approach saves an additional 35% of DRAM reads at fixed pruning ratios (1.35× DRAM and 1.5× core energy reduction), with a modest accuracy gain (Zhou et al., 2021).
7. Significance and Applicability
Mix-Precision Multi-Round Filtering provides a practical approximation to top- attention, combining computational efficiency, hardware-compatibility, and accuracy retention. Its incremental low-bitwidth filtering paradigm enables aggressive pruning without information loss, eliminating the need for retraining or elaborate masking schemes. MP-MRF is especially significant for edge deployments and real-time applications where memory bandwidth and compute power are limited. Its integration in Energon demonstrates that hardware–algorithm co-design can yield order-of-magnitude improvements in both speed and energy, bridging the gap between state-of-the-art AI models and deployable, cost-effective edge inference (Zhou et al., 2021).