Mix-Precision Multi-Round Filtering (MP-MRF)

Updated 8 February 2026
  • Mix-Precision Multi-Round Filtering (MP-MRF) is a dynamic algorithm that filters transformer query–key pairs in multiple low-bit precision rounds to achieve efficient attention computation.
  • It employs sequential candidate pruning with adaptive thresholding and progressively higher bitwidths, maintaining high accuracy while drastically reducing compute and energy costs.
  • Integrated into the Energon system, MP-MRF enables substantial performance, memory, and energy savings on resource-constrained devices without the need for model retraining.

Mix-Precision Multi-Round Filtering (MP-MRF) is a runtime algorithm for dynamically identifying high-importance query–key pairs in transformer attention, with the objective of drastically reducing both computational and memory requirements. MP-MRF achieves this by performing attention filtering in multiple precision-adaptive rounds, each using low-bitwidth arithmetic, before computing the final attention on a pruned and sparsified set of keys at full precision. This approach retains the accuracy of dense attention while facilitating significant energy savings and performance gains, especially on resource-constrained or edge platforms. MP-MRF is the cornerstone of the Energon system, a full-stack co-design hardware architecture for efficient transformer acceleration (Zhou et al., 2021).

1. Motivation and Context

The computational bottleneck in transformer models arises from the quadratic complexity of standard dense attention: for sequence length n and head dimension d, the attention operation requires O(n^2 d) multiply–accumulate operations (MACs) and O(n^2) data movement. This cost is prohibitive for edge and mobile inference. Prior approaches such as top-k search or cascaded token pruning often provide limited computational relief or suffer severe accuracy loss unless extensive retraining is performed.

MP-MRF addresses these shortcomings by providing a dynamic, runtime pruning mechanism that is hardware-friendly and requires no retraining. It leverages the empirical observation that the softmax attention output is dominated by a small and variable subset of query–key pairs, suggesting that aggressive runtime pruning is possible if high-importance pairs can be identified accurately and efficiently.

2. Algorithmic Description

The MP-MRF algorithm operates in R sequential rounds (typically R = 2), each at a progressively higher bitwidth, followed by a final high-precision sparse attention computation. The process is conducted independently for each query and attention head.

Workflow Steps:

  1. Quantized Storage: The Q, K, V tensors (shape n × d) are stored as INT16 arrays.
  2. Bitwidth Truncation: At filtering round r, only the top l_r bits of each value in Q and K are retained, yielding Q^(r) and K^(r) (e.g., INT2 for round 0, INT4 for round 1).
  3. Candidate Maintenance: A per-query candidate key index list, K_idx (initialized to {0, …, n−1}), tracks the surviving key indices through rounds.
  4. Scoring: For each surviving candidate j ∈ K_idx, compute an approximate score S_i^(r)[j] = Q_i^(r) · K_j^(r) using l_r-bit arithmetic.
  5. Thresholding: Compute a round-dependent threshold τ_i^(r) per query (see §3 below) and update K_idx to retain only those j for which S_i^(r)[j] > τ_i^(r).
  6. Final Sparse Attention: Using the full-precision (INT16) Q_full, K_sel, V_sel, compute the final attention on the pruned candidate set:

A_{\text{probs}} = \text{Softmax}\left(\frac{Q_{\text{full}} \cdot K_{\text{sel}}^\top}{\sqrt{d}}\right)

Default Parameters: R = 2 filtering rounds; (l_0, l_1) = (2, 4); threshold parameters α_0, α_1 tuned on development data.
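The workflow above can be sketched in NumPy. This is an illustrative software simulation of the filtering pipeline for a single query, not the Energon hardware implementation: the truncation-by-arithmetic-shift encoding, the default α values, and the empty-set guard are assumptions for the sketch.

```python
import numpy as np

def truncate_to_bits(x_int16, bits):
    """Keep only the top `bits` bits of each INT16 value via arithmetic shift.

    Illustrative stand-in for the hardware truncation step; the paper's
    exact encoding may differ.
    """
    return x_int16.astype(np.int32) >> (16 - bits)

def round_threshold(scores, alpha):
    """Round-dependent threshold from the piecewise mean/max/min rule (see §3)."""
    mu, M, m = scores.mean(), scores.max(), scores.min()
    if alpha >= 0:
        return alpha * M + (1 - alpha) * mu
    return -alpha * m + (1 + alpha) * mu

def mp_mrf_single_query(q, K, V, bit_rounds=(2, 4), alphas=(0.3, 0.3)):
    """MP-MRF for one query q (INT16, shape (d,)) against keys/values K, V.

    Runs the low-bit filtering rounds, then full-precision sparse attention
    over the surviving candidates. Returns (output, surviving_indices).
    """
    n, d = K.shape
    cand = np.arange(n)                    # K_idx: all keys survive initially
    for bits, alpha in zip(bit_rounds, alphas):
        q_r = truncate_to_bits(q, bits)
        K_r = truncate_to_bits(K[cand], bits)
        scores = K_r @ q_r                 # approximate low-bit dot products
        keep = scores > round_threshold(scores, alpha)
        if not keep.any():                 # guard: never empty the set
            keep[np.argmax(scores)] = True
        cand = cand[keep]
    # Final sparse attention at full precision on the pruned candidates.
    s = (K[cand].astype(np.float64) @ q.astype(np.float64)) / np.sqrt(d)
    p = np.exp(s - s.max())
    p /= p.sum()
    return p @ V[cand].astype(np.float64), cand
```

In a full model this loop would run per query and per head; the candidate list shrinks monotonically across rounds, so later, higher-precision rounds only rescore survivors.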

3. Mathematical Formulation

MP-MRF's filtering is implemented via approximate dot products and a tunable thresholding rule in each round.

  • Score Calculation:

S^{(r)}_i[j] = Q^{(r)}_i \cdot K^{(r)}_j

  • Thresholding Scheme: For each query i and round r, define:

\begin{align*} \mu_r &\equiv \operatorname{mean}_j S^{(r)}_i[j] \\ M_r &\equiv \max_j S^{(r)}_i[j] \\ m_r &\equiv \min_j S^{(r)}_i[j] \end{align*}

The threshold τ_i^(r) is given by:

\begin{align*} \text{if } \alpha_r \ge 0: &\quad \tau^{(r)}_i = \alpha_r M_r + (1-\alpha_r)\mu_r \\ \text{if } \alpha_r < 0: &\quad \tau^{(r)}_i = -\alpha_r m_r + (1+\alpha_r)\mu_r \end{align*}

Here, the tunable parameter α_r ∈ (−1, 1) controls pruning tightness, with the threshold sweeping between the mean and the extremal values as |α_r| ranges over [0, 1).
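A small numeric check makes the sweep concrete; the score values here are made up for illustration:

```python
import numpy as np

def tau(scores, alpha):
    # Piecewise rule from above: alpha >= 0 interpolates mean -> max,
    # alpha < 0 interpolates mean -> min.
    mu, M, m = scores.mean(), scores.max(), scores.min()
    return alpha * M + (1 - alpha) * mu if alpha >= 0 else -alpha * m + (1 + alpha) * mu

s = np.array([1.0, 2.0, 3.0, 10.0])   # mu = 4.0, M = 10.0, m = 1.0
print(tau(s, 0.0))    # 4.0 (the mean: moderate pruning)
print(tau(s, 0.5))    # 7.0 (halfway toward max: aggressive pruning)
print(tau(s, -0.5))   # 2.5 (halfway toward min: permissive)
```

Larger positive α_r discards more candidates per round; negative α_r keeps nearly all of them.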

  • Filtering Complexity: Denoting by γ the candidate fraction surviving round 0 and by β the fraction surviving round 1:

\begin{align*} \text{Round 0:} &\quad O(n^2 d\, l_0)\ \text{bit-ops} \\ \text{Round 1:} &\quad O(n^2 d\, \gamma\, l_1)\ \text{bit-ops} \\ \text{Final:} &\quad O(n^2 d\, \beta)\ \text{(full precision)} \end{align*}

For β ≈ 0.125–0.25 (4–8× pruning), with l_0 = 2 and l_1 = 4, the total MAC energy is reduced by 4–8×, with an extra ~2× energy cut from low-bitwidth filtering (Zhou et al., 2021).
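A back-of-envelope version of this accounting can be computed directly. The convention below (cost proportional to operand bitwidth, normalized against dense INT16 attention) is an assumption for illustration, not the paper's energy model, so the resulting factor is only indicative:

```python
def relative_cost(l0, l1, gamma, beta, full_bits=16):
    """Toy cost of MP-MRF relative to dense INT16 attention, per n^2*d term.

    Assumes cost scales linearly with operand bitwidth (an illustrative
    convention, not the paper's measured energy model).
    """
    filtering = l0 + gamma * l1      # round 0 on all pairs, round 1 on survivors
    final = beta * full_bits         # full-precision attention on survivors
    return (filtering + final) / full_bits

r = relative_cost(l0=2, l1=4, gamma=0.25, beta=0.125)
print(f"relative cost {r:.4f}, i.e. {1/r:.1f}x reduction")
```

Under these toy numbers the model gives roughly a 3× overall reduction; the paper's reported 4–8× pruning gains plus ~2× filtering savings reflect the measured hardware, not this simplified convention.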

4. Empirical Accuracy, Efficiency, and Error Behavior

Empirical evaluations demonstrate that MP-MRF, with R = 2, (l_0, l_1) = (2, 4), and tuned α values, achieves up to 8–16× overall pruning with negligible accuracy loss: under 0.5% drop in F1 (SQuAD), at most a 0.2-point increase in perplexity (WikiText), and under 0.5% top-1 accuracy loss on ImageNet.

MP-MRF achieves 91–97% "top-k coverage" compared to an exhaustive sort, even at aggressive reduction ratios. The multi-round design minimizes the risk of losing important pairs: initial coarse rounds at very low bitwidth may temporarily discard some candidates, but higher-precision later rounds re-evaluate survivors, preventing irreversible loss of key information. The method is fully compatible with pretrained models and requires no retraining.
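The coverage metric described above can be computed as follows; this is a straightforward reading of "fraction of the exact top-k keys that survive filtering", and the paper's precise definition may differ in detail:

```python
import numpy as np

def topk_coverage(exact_scores, kept_idx, k):
    """Fraction of the true top-k keys (by exact score) present in kept_idx."""
    topk = set(np.argsort(exact_scores)[-k:].tolist())
    return len(topk & {int(i) for i in kept_idx}) / k

# Hypothetical example: exact scores for 4 keys, filtering kept keys 1, 2, 3.
scores = np.array([0.1, 0.9, 0.5, 0.7])      # exact top-2 is {1, 3}
print(topk_coverage(scores, [1, 2, 3], k=2))  # 1.0: both top-2 keys survived
```

A coverage of 1.0 means filtering lost none of the keys an exact top-k sort would have selected.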

5. Hardware Implementation in the Energon Co-Processor

MP-MRF maps efficiently onto specialized hardware. In the Energon co-processor architecture, operations are distributed across Filtering Units (FUs) and Attention Units (AUs):

  • Filtering Unit (FU): On-chip buffers manage the quantized K bits; multi-precision processing elements (PEs) reuse the same 4-bit multipliers for both filtering rounds. A selector module computes the per-query statistics (min, max, mean) and performs parallel candidate pruning. The unit is fully pipelined, typically employing 8–64 PEs per query.
  • Attention Unit (AU): Fetches only the pruned K, V vectors ("on-demand fetch") into small local buffers. MAC arrays perform the final computations, and softmax is implemented using a parallel exp–Taylor approximation. Double-buffering overlaps computation and data movement.
  • Energy Optimizations: Early-round INT2/INT4 filtering and DRAM access reduction (by up to 70–90%) yield 1.3–1.5× DRAM energy savings.
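The AU's exp–Taylor softmax can be illustrated with a truncated series; the series order and range-reduction scheme below are assumptions for the sketch, since the text does not specify the Energon unit's exact configuration:

```python
import numpy as np

def exp_taylor(x, terms=12):
    """Truncated Taylor series for e^x; a hardware-friendly stand-in for
    the AU's exp approximation (order chosen here for illustration)."""
    acc, term = 1.0, 1.0
    for k in range(1, terms):
        term *= x / k      # next term x^k / k!
        acc += term
    return acc

def softmax_taylor(s, terms=12):
    """Softmax using the Taylor exp; subtracting the max keeps inputs <= 0,
    which limits the magnitude the truncated series must handle."""
    s = s - np.max(s)
    e = np.array([exp_taylor(v, terms) for v in s])
    return e / e.sum()
```

For well-separated attention scores the shifted inputs stay in a narrow range, where a low-order series is accurate; hardware versions typically add further range reduction before the polynomial.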

A summary of the MP-MRF pipeline in hardware:

Stage                    Bitwidth   Primary Function
Filtering Round 0        INT2       Coarse candidate pruning
Filtering Round 1        INT4       Finer pruning of survivors
Final Sparse Attention   INT16      Accurate attention over the pruned set

6. Comparative Results and Performance Benchmarks

On an ARM A72, Energon-edge with MP-MRF reduces BERT-base attention latency by 73×–3057× (sequence length n = 512), with MP-MRF alone accounting for an 8.3× gain relative to dense INT16 attention. On the embedded NVIDIA Jetson TX2, speedups of 3.4×–764× are observed. The Energon-server system achieves a 168× geometric-mean speedup over a Xeon 5220 and 8.7× over an NVIDIA V100 for dense attention.

Energon-edge draws 2.7 W total (vs. 4 W CPU + DRAM, or 0.9 W DRAM alone, on the Jetson), yielding 10³–10⁴× energy reductions relative to conventional CPU/GPU attention. Within the accelerator, MP-MRF brings a direct energy saving of about 3.7×, with on-demand fetch providing an additional 1.3×.

When compared to prior sparse attention methods:

  • SpAtten (token pruning, no retraining): At 16× pruning, MP-MRF delivers more than twice the accuracy; at equal accuracy, it sustains 5.3× higher pruning (1.7× throughput and 1.6× energy efficiency).
  • A³ (8-bit approximate attention): Energon's approach saves an additional 35% of DRAM reads at fixed pruning ratios (1.35× DRAM and 1.5× core energy reduction), with a modest accuracy gain (Zhou et al., 2021).

7. Significance and Applicability

Mix-Precision Multi-Round Filtering provides a practical approximation to top-k attention, combining computational efficiency, hardware compatibility, and accuracy retention. Its incremental low-bitwidth filtering paradigm enables aggressive pruning with minimal information loss, eliminating the need for retraining or elaborate masking schemes. MP-MRF is especially significant for edge deployments and real-time applications where memory bandwidth and compute power are limited. Its integration in Energon demonstrates that hardware–algorithm co-design can yield order-of-magnitude improvements in both speed and energy, bridging the gap between state-of-the-art AI models and deployable, cost-effective edge inference (Zhou et al., 2021).

References

  1. Zhou, Z., Liu, J., Gu, Z., & Sun, G. (2021). Energon: Towards Efficient Acceleration of Transformers Using Dynamic Sparse Attention. arXiv:2110.09310.
