Mix-Precision Multi-Round Filtering (MP-MRF)
- Mix-Precision Multi-Round Filtering (MP-MRF) is a dynamic algorithm that filters transformer query–key pairs in multiple low-bit precision rounds to achieve efficient attention computation.
- It employs sequential candidate pruning with adaptive thresholding and progressively higher bitwidths, maintaining high accuracy while drastically reducing compute and energy costs.
- Integrated into the Energon system, MP-MRF enables substantial performance, memory, and energy savings on resource-constrained devices without the need for model retraining.
Mix-Precision Multi-Round Filtering (MP-MRF) is a runtime algorithm for dynamically identifying high-importance query–key pairs in transformer attention, with the objective of drastically reducing both computational and memory requirements. MP-MRF achieves this by performing attention filtering in multiple precision-adaptive rounds, each using low-bitwidth arithmetic, before computing the final attention on a pruned and sparsified set of keys at full precision. This approach retains the accuracy of dense attention while facilitating significant energy savings and performance gains, especially on resource-constrained or edge platforms. MP-MRF is the cornerstone of the Energon system, a full-stack co-design hardware architecture for efficient transformer acceleration (Zhou et al., 2021).
1. Motivation and Context
The computational bottleneck in transformer models arises from the quadratic complexity of standard dense attention: for sequence length n and head dimension d, the attention operation requires O(n²·d) multiply–accumulate operations (MACs) and O(n²) data movement. This cost is prohibitive for edge and mobile inference. Prior approaches such as top-k search or cascaded token pruning often provide limited computational relief or result in severe accuracy loss unless extensive retraining is performed.
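To make the quadratic cost concrete, here is a back-of-envelope MAC count for the n × n score matrix QKᵀ alone; the sequence length and head dimension below are illustrative values, not figures from the paper:

```python
n, d = 4096, 64               # example sequence length and head dimension
macs_scores = n * n * d       # multiply-accumulates for the n x n score matrix
print(f"{macs_scores / 1e9:.2f} GMACs per head")  # -> 1.07 GMACs per head
```

Multiplying by the number of heads and layers makes clear why dense attention dominates inference cost at long sequence lengths.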
MP-MRF addresses these shortcomings by providing a dynamic, runtime pruning mechanism that is hardware-friendly and requires no retraining. It leverages the empirical observation that the softmax attention output is dominated by a small and variable subset of query–key pairs, suggesting that aggressive runtime pruning is possible if high-importance pairs can be identified accurately and efficiently.
2. Algorithmic Description
The MP-MRF algorithm operates in sequential rounds (typically two), each at progressively higher bitwidth, followed by a final high-precision sparse attention computation. The process is conducted independently for each query and attention head.
Workflow Steps:
- Quantized Storage: The Q, K, and V tensors (shape n × d per head) are stored as INT16 arrays.
- Bitwidth Truncation: At filtering round r, only the top b_r bits of each value in Q and K are retained, yielding truncated tensors Q_r and K_r (e.g., INT2 for round 0, INT4 for round 1).
- Candidate Maintenance: A per-query candidate key index list C (initialized to all n key indices) tracks surviving key indices through rounds.
- Scoring: For each surviving candidate i in C, compute an approximate score s_i as the dot product of the truncated query and key vectors, using b_r-bit arithmetic.
- Thresholding: Compute a round-dependent threshold θ_r per query (see §3 below) and update C to retain only those candidates i for which s_i ≥ θ_r.
- Final Sparse Attention: Using the full-precision (INT16) tensors, compute final attention on the pruned candidate set: out = softmax(q · K_Cᵀ / √d) · V_C, where K_C and V_C contain only the rows indexed by C.
Default Parameters: two filtering rounds; bitwidths b_0 = 2 and b_1 = 4; threshold parameters tuned on development data.
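The workflow above can be sketched in NumPy as follows. This is a minimal software model, not the hardware implementation: the function names (`truncate_top_bits`, `mp_mrf_attention`), the arithmetic-shift truncation, and the mean/max threshold rule are simplifications assumed for illustration.

```python
import numpy as np

def truncate_top_bits(x_int16, bits):
    # Keep only the top `bits` bits of each INT16 value via a
    # sign-preserving arithmetic shift, emulating low-bitwidth operands.
    return (x_int16 >> (16 - bits)).astype(np.int32)

def mp_mrf_attention(q, K, V, bitwidths=(2, 4), alpha=0.5):
    # q: (d,) INT16 query; K, V: (n, d) INT16 key/value matrices.
    n, d = K.shape
    cand = np.arange(n)                      # candidate key index list C
    for b in bitwidths:                      # filtering rounds at 2, then 4 bits
        qb = truncate_top_bits(q, b)
        Kb = truncate_top_bits(K[cand], b)
        scores = Kb @ qb                     # approximate b-bit dot products
        theta = alpha * scores.max() + (1 - alpha) * scores.mean()
        cand = cand[scores >= theta]         # retain survivors only
    # Final sparse attention at full precision over the surviving keys.
    s = (K[cand].astype(np.float64) @ q.astype(np.float64)) / np.sqrt(d)
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V[cand].astype(np.float64), cand
```

Because the threshold lies between the round's mean and maximum score, the top-scoring candidate always survives, so the candidate list never empties.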
3. Mathematical Formulation
MP-MRF's filtering is implemented via approximate dot products and a tunable thresholding rule in each round.
- Score Calculation: In round r, each surviving key i receives the approximate score s_i = q_r · k_i, the dot product of the b_r-bit truncated query and key vectors.
- Thresholding Scheme: For each query and round r, let m_r denote the mean and M_r the maximum of the scores s_i over the surviving candidates. The threshold is given by:

  θ_r = α · M_r + (1 − α) · m_r,  α ∈ [0, 1]

Here, the tunable parameter α controls pruning tightness, with the threshold sweeping between the mean and the extremal value as α goes from 0 to 1.
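A small numeric illustration of this rule, using made-up per-query scores (m and M are the mean and maximum as defined above):

```python
# Hypothetical per-query scores for one filtering round.
scores = [1.0, 2.0, 3.0, 4.0, 10.0]
m = sum(scores) / len(scores)              # mean    m_r = 4.0
M = max(scores)                            # maximum M_r = 10.0
for alpha in (0.0, 0.5, 1.0):
    theta = alpha * M + (1 - alpha) * m    # threshold sweeps mean -> max
    kept = [s for s in scores if s >= theta]
    print(alpha, theta, kept)
# alpha = 0.0 keeps everything at or above the mean ([4.0, 10.0]);
# alpha = 1.0 keeps only the maximal score ([10.0]).
```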
- Filtering Complexity: Denoting ρ_0 as the candidate fraction surviving after round 0 and ρ_1 after round 1, the per-query MAC cost is roughly n·d at b_0 bits, plus ρ_0·n·d at b_1 bits, plus ρ_1·n·d at full precision.
For ρ_1 = 1/4 to 1/8 (4–8× pruning), with b_0 = 2 and b_1 = 4, the total MAC energy is reduced by roughly 4–8×, with an extra energy cut from low-bitwidth filtering (Zhou et al., 2021).
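The savings implied by this cost breakdown can be sketched under the simplifying assumption that per-MAC energy scales quadratically with operand bitwidth (a common first-order model; the paper's exact energy model may differ). The survivor fractions below are illustrative, not measured values:

```python
def relative_mac_energy(b0, b1, rho0, rho1, b_full=16):
    # MP-MRF MAC energy relative to dense INT16 attention, assuming
    # per-MAC energy ~ bitwidth^2 (a first-order model).
    e = lambda b: (b / b_full) ** 2
    # Round 0 scores all keys at b0 bits, round 1 scores the rho0
    # survivors at b1 bits, and final attention runs on the rho1
    # survivors at full precision.
    return e(b0) + rho0 * e(b1) + rho1 * 1.0

# Illustrative survivor fractions (not the paper's figures).
r = relative_mac_energy(b0=2, b1=4, rho0=0.5, rho1=0.125)
print(f"{1 / r:.1f}x MAC-energy reduction")  # -> 5.8x
```

With 8× final pruning (ρ_1 = 1/8), this toy model lands in the 4–8× reduction range cited above; the cheap INT2/INT4 rounds add little on top of the sparse full-precision pass.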
4. Empirical Accuracy, Efficiency, and Error Behavior
Empirical evaluations demonstrate that MP-MRF, with two filtering rounds, bitwidths (2, 4), and tuned threshold values, achieves up to 8–16× overall pruning with negligible accuracy loss: under 0.5% drop in F1 (SQuAD), under 0.2 increase in perplexity (WikiText), and under 0.5% top-1 accuracy loss on ImageNet.
MP-MRF achieves 91–97% "top-k coverage" compared to exhaustive sort, even at aggressive reduction ratios. The multi-round design minimizes the risk of losing important pairs: initial coarse rounds at very low bitwidth may temporarily discard some candidates, but higher-precision later rounds re-evaluate survivors, preventing irreversible loss of key information. The method is fully compatible with pretrained models and requires no retraining.
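Top-k coverage can be computed as the fraction of the exact top-k keys (ranked by full-precision score) that survive pruning; a sketch under that assumed definition, with made-up scores:

```python
import numpy as np

def topk_coverage(exact_scores, kept_indices, k):
    # Fraction of the k highest-scoring keys (by exact full-precision
    # score) that survive the multi-round pruning.
    topk = set(np.argsort(exact_scores)[-k:].tolist())
    return len(topk & set(kept_indices)) / k

# Keys 1, 3, 5 are the true top-3; the pruned set retains all of them.
scores = np.array([0.1, 0.9, 0.3, 0.8, 0.2, 0.7])
print(topk_coverage(scores, kept_indices=[0, 1, 3, 5], k=3))  # -> 1.0
```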
5. Hardware Implementation in the Energon Co-Processor
MP-MRF maps efficiently onto specialized hardware. In the Energon co-processor architecture, operations are distributed across Filtering Units (FUs) and Attention Units (AUs):
- Filtering Unit (FU): On-chip buffers manage quantized bits; multi-precision processing elements (PEs) can reuse the same 4-bit multipliers for both filtering rounds. The selector module computes statistics (min, max, mean) and performs parallel candidate pruning. Full pipelining is used, typically employing 8–64 PEs per query.
- Attention Unit (AU): Fetches only the pruned vectors ("on-demand fetch") to small local buffers. MAC arrays perform final computations and softmax is implemented using parallel exp–Taylor approximation. Double-buffering overlaps computation and data movement.
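The AU's exp–Taylor softmax can be mimicked in software with a truncated Maclaurin series; the scaling-and-squaring step, the series order, and the function names below are illustrative choices for numerical robustness, not the hardware's actual configuration:

```python
import numpy as np

def taylor_exp(x, order=4, squarings=4):
    # Scaling and squaring: exp(x) = exp(x / 2^m)^(2^m), with the small
    # argument approximated by a truncated Maclaurin series
    # 1 + y + y^2/2! + ... + y^order/order!.
    y = x / 2.0 ** squarings
    acc = np.ones_like(y)
    term = np.ones_like(y)
    for k in range(1, order + 1):
        term = term * y / k
        acc = acc + term
    for _ in range(squarings):
        acc = acc * acc          # repeated squaring recovers exp(x)
    return acc

def taylor_softmax(s, order=4):
    # Max-subtraction keeps all arguments non-positive, bounding the
    # series error before normalization.
    e = taylor_exp(s - s.max(), order)
    return e / e.sum()
```

In hardware the series terms for all surviving keys are evaluated in parallel, which is what makes this approximation attractive relative to an exact exponential unit.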
- Energy Optimizations: Early-round INT2/INT4 filtering and DRAM access reduction (by up to 70–90%) lead to 1.3–1.5× DRAM energy savings.
A summary of the MP-MRF pipeline in hardware:
| Stage | Bitwidth | Primary Function |
|---|---|---|
| Filtering Round 0 | 2 | Coarse candidate pruning |
| Filtering Round 1 | 4 | Finer pruning of survivors |
| Final Sparse Attention | 16 | Accurate attention over pruned set |
6. Comparative Results and Performance Benchmarks
On ARM A72, Energon-edge with MP-MRF reduces BERT-base attention latency by 73–3057× (depending on sequence length), with MP-MRF alone accounting for 8.3× of the gain relative to dense INT16 attention. On the embedded NVIDIA Jetson TX2, speedups of 3.4–764× are observed. The Energon-server system achieves a 168× geometric-mean speedup over a Xeon 5220 and 8.7× over an NVIDIA V100 for dense attention.
Energon-edge draws 2.7 W total (vs. 4 W for CPU + DRAM, or 0.9 W for DRAM alone, on the Jetson), yielding 10³–10⁴× energy reductions relative to conventional CPU/GPU attention. Within the accelerator, MP-MRF brings a direct energy saving of about 3.7×, with on-demand fetch providing an additional 1.3×.
When compared to prior sparse attention methods:
- SpAtten (token pruning, no retraining): At 16× pruning, MP-MRF delivers more than twice the accuracy; for equal accuracy, 5.3× higher pruning (1.7× throughput and 1.6× energy efficiency).
- A³ (8-bit approximate attention): Energon's approach saves an additional 35% of DRAM reads at fixed pruning ratios (1.35× DRAM and 1.5× core energy reduction), with a modest accuracy gain (Zhou et al., 2021).
7. Significance and Applicability
Mix-Precision Multi-Round Filtering provides a practical approximation to top- attention, combining computational efficiency, hardware-compatibility, and accuracy retention. Its incremental low-bitwidth filtering paradigm enables aggressive pruning without information loss, eliminating the need for retraining or elaborate masking schemes. MP-MRF is especially significant for edge deployments and real-time applications where memory bandwidth and compute power are limited. Its integration in Energon demonstrates that hardware–algorithm co-design can yield order-of-magnitude improvements in both speed and energy, bridging the gap between state-of-the-art AI models and deployable, cost-effective edge inference (Zhou et al., 2021).