
HeatViT: Efficient ViT Token Pruning

Updated 4 December 2025
  • HeatViT is a hardware-efficient token pruning framework that dynamically retains important tokens, enabling swift Vision Transformer inference on resource-constrained platforms.
  • It employs a multi-head attention token selector combined with 8-bit fixed-point quantization and polynomial approximations to significantly reduce computation while maintaining accuracy.
  • Empirical evaluations show up to 65% reduction in computation and nearly 5× FPS speedup on FPGAs with minimal resource overhead and <1.2% accuracy drop.

HeatViT is a hardware-efficient image-adaptive token pruning framework designed to accelerate Vision Transformers (ViTs) for deployment on resource-constrained embedded platforms. By co-designing algorithmic and microarchitectural innovations—including a multi-head attention-based token selector, fixed-point quantization, custom nonlinear function approximations, and a latency-aware optimization strategy—HeatViT achieves significant reductions in computational load and power consumption while maintaining accuracy comparable to unpruned and state-of-the-art pruned ViT models. This approach has been validated primarily on FPGA implementations and evaluated against competitive edge inference platforms and pruning methods (Dong et al., 2022).

1. Attention-Based Multi-Head Token Selector

HeatViT introduces a progressive, image-adaptive token pruning mechanism at the core of its acceleration strategy. For a set of input tokens $X \in \mathbb{R}^{N \times D}$ with $N$ tokens of embedding dimension $D$, the input is divided into $h$ attention heads ($d = D/h$ per head):

$$X = [\,x_{1};\,x_{2};\,\dots;\,x_{h}\,], \quad x_{i}\in\mathbb{R}^{N\times d}$$

Each head ii computes:

  • Local embedding:

$$E_i^{\mathrm{local}} = \mathrm{MLP}_1(x_i) \in \mathbb{R}^{N \times (d/2)}$$

  • Global embedding:

$$E_i^{\mathrm{global}} = \mathrm{Average}\bigl(\mathrm{MLP}_1(x_i)\bigr) \in \mathbb{R}^{1 \times (d/2)}$$

  • The two embeddings are concatenated:

$$E_i = [\,E_i^{\mathrm{local}};\,E_i^{\mathrm{global}}\,] \in \mathbb{R}^{N \times d}$$

  • Keep/prune scores:

$$s_i = \mathrm{Softmax}\bigl(\mathrm{MLP}_2(E_i)\bigr) \in \mathbb{R}^{N \times 2}$$

To enable per-head adaptive importance, a head-wise summary vector is computed and scored: $\bar X = \mathrm{Concat}\bigl\{\tfrac{1}{d} \sum_{j=1}^d x_{i,j}\bigr\}_{i=1}^h \in \mathbb{R}^{N \times h}$

$$A = \mathrm{Sigmoid}\bigl(\mathrm{MLP}_3(\bar X)\bigr) \in \mathbb{R}^{N \times h}$$

Weighted mean scores per token are obtained as $\tilde S = \dfrac{\sum_{i=1}^h A_{\ast,i} \circ s_i}{\sum_{i=1}^h A_{\ast,i}} \in \mathbb{R}^{N \times 2}$.

A discrete keep/prune mask is sampled via Gumbel-Softmax, $M = \mathrm{GumbelSoftmax}(\tilde S) \in \{0,1\}^N$, and updated progressively at each block: $M \leftarrow M \odot M'$.

Token Packaging: Instead of outright discarding pruned tokens, HeatViT consolidates their information into a "package token" $P = \dfrac{\sum_{t=1}^T \tilde s_t[0]\,\hat x_t}{\sum_{t=1}^T \tilde s_t[0]}$, where $\{\hat x_t\}_{t=1}^T$ are the tokens marked "prune". The kept tokens plus $P$ are then fed forward, allowing later blocks to recover information from mistakenly pruned tokens.
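
The selector can be summarized in the following PyTorch-style sketch. It is a minimal illustration under stated assumptions: single linear layers stand in for the paper's MLPs, the hidden widths follow the dimensions above, and the off-the-shelf gumbel_softmax call replaces whatever sampling kernel the authors actually use.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenSelector(nn.Module):
    # Illustrative multi-head token selector (not the authors' reference code).
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.h, self.d = num_heads, dim // num_heads
        self.mlp1 = nn.Linear(self.d, self.d // 2)    # local/global embedding
        self.mlp2 = nn.Linear(self.d, 2)              # per-head keep/prune scores
        self.mlp3 = nn.Linear(num_heads, num_heads)   # head-importance weights

    def forward(self, x, prev_mask):
        # x: (B, N, D) tokens; prev_mask: (B, N) keep mask accumulated so far
        B, N, D = x.shape
        xh = x.view(B, N, self.h, self.d)                       # split into heads
        e_local = self.mlp1(xh)                                 # (B, N, h, d/2)
        e_global = e_local.mean(dim=1, keepdim=True).expand_as(e_local)
        e = torch.cat([e_local, e_global], dim=-1)              # (B, N, h, d)
        s = F.softmax(self.mlp2(e), dim=-1)                     # (B, N, h, 2)
        a = torch.sigmoid(self.mlp3(xh.mean(dim=-1)))           # (B, N, h)
        s_tilde = (a.unsqueeze(-1) * s).sum(dim=2) / a.sum(dim=2, keepdim=True).clamp_min(1e-6)
        # Differentiable hard keep/prune decision; column 0 = "keep"
        m = F.gumbel_softmax(s_tilde.clamp_min(1e-6).log(), hard=True)[..., 0]
        return prev_mask * m, s_tilde                           # progressive mask update

The package token described above can then be formed outside the selector as a weighted average of the pruned tokens, using the corresponding entries of s_tilde as weights.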

2. FPGA Microarchitecture and Hardware Mapping

HeatViT is implemented on embedded FPGAs, notably the Xilinx ZCU102. The architecture aggressively reuses a single GEMM engine for both backbone ViT computations (Multi-Head Self Attention & Feed-Forward Networks) and token-selector FC layers. Data flows between off-chip DRAM, dedicated weight/token buffers, the GEMM engine, and activation units using double buffering to optimize I/O and compute overlap.

Matmul Loop-Tiling: Matrix multiplications are tiled over block sizes $T_i \times T_o$ and head groups $T_h$ with a deeply nested loop:

for o_o in 0 .. D_o/T_o:            # loop over output-dimension tiles
  for i_h in 0 .. h/T_h:            # loop over head groups
    for i_i in 0 .. D_i/T_i:        # loop over input-dimension tiles
      load A[T_i×T_h], B[T_h×T_o]   # stage operand tiles in on-chip buffers
      C[T_i×T_o] += A·B             # parallel MACs over a T_i×T_h×T_o tile
      # conditional head splitting/accumulation

Parallel MACs scale as $T_i \times T_o \times T_h$, bounded by DSP availability and buffer sizes according to BRAM allocation. With $f_{\mathrm{clk}} = 150$ MHz, throughput in frames per second is:

$$\mathrm{FPS} = \frac{f_{\mathrm{clk}}}{\sum_{l=1}^L \mathrm{cycles}(l)}, \qquad \mathrm{cycles}(l) \approx \frac{N_l\, D_{i,l}\, D_{o,l}}{T_i\, T_h\, T_o}$$
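
This cycle model can be evaluated directly; the Python sketch below does so for a toy layer list (the per-layer shapes and tile sizes are illustrative placeholders, not the paper's measured configuration):

# Rough throughput estimate from the cycle model above.
F_CLK = 150e6  # accelerator clock in Hz

def layer_cycles(n_tokens, d_in, d_out, t_i=16, t_h=4, t_o=16):
    # cycles(l) ≈ N_l * D_i,l * D_o,l / (T_i * T_h * T_o)
    return n_tokens * d_in * d_out / (t_i * t_h * t_o)

def estimate_fps(layers):
    total_cycles = sum(layer_cycles(*shape) for shape in layers)
    return F_CLK / total_cycles

# Toy 12-block model: (N, D_i, D_o) for the QKV, projection, and FFN GEMMs
toy_layers = [(197, 192, 576), (197, 192, 192),
              (197, 192, 768), (197, 768, 192)] * 12
print(f"estimated FPS: {estimate_fps(toy_layers):.1f}")

Token pruning enters this model by shrinking $N_l$ in the later blocks, which is what drives the measured FPS gains.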

Resource Overhead: The additional logic for token selectors introduces only +5–8% LUT and +8–11% DSP overhead compared to baseline ViT-only accelerators.

3. 8-Bit Fixed-Point Quantization and Polynomial Approximations

All weights and activations are quantized to 8-bit fixed-point precision to minimize resource use. Common nonlinear functions are replaced by custom polynomial approximations with regularization terms ($\delta_1$, $\delta_2 < 1$) to shrink quantization error:

  • erf(x) approximation:

$$L_{\mathrm{erf}}(x) = \mathrm{sign}(x)\,\delta_1 \bigl[a\bigl(\mathrm{clip}(|x|, -b) + b\bigr)^2 + 1\bigr], \qquad a = -0.2888,\ b = 1.769$$

  • GELU approximation:

$$\mathrm{GELU}_{\mathrm{aprx}}(x) = \frac{x}{2}\Bigl[1 + L_{\mathrm{erf}}\bigl(x/\sqrt{2}\bigr)\Bigr]$$

  • Softmax approximation:

$$\tilde x_i = x_i - x_{\max}$$

$$\mathrm{Softmax}_{\mathrm{aprx}}(\tilde x_i) = \frac{\delta_2 \exp(\tilde x_i)}{\sum_j \exp(\tilde x_j)}$$

with $\exp(p) \approx 0.3585\,(p+1.353)^2 + 0.344$ and the exponent decomposed for efficient computation.

  • Sigmoid: Approximated with a small piecewise-linear PLAN kernel.

These custom implementations yield $1.5\times$–$572\times$ lower LUT/DSP use than standard math libraries.

Error Bound Analysis: The regularization factors ensure, for GELU, $\mathrm{Error}_{\mathrm{gelu}} = \bigl|\frac{\partial A}{\partial x}\bigr|\,\Delta e < \Delta e$, and similarly for Softmax, $\mathrm{Error}_{\mathrm{softmax}} = 2\delta_2\,|\Delta e|\,A_0(1-A_0) < \Delta e$; thus quantization error is compressed rather than amplified.
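
For concreteness, the following NumPy sketch evaluates these approximations in floating point. It is an illustration only: the clip in the erf polynomial is written out as min(|x|, 1.769), the regularization factors default to 1.0 as placeholders (the paper uses values below 1), and real deployments run this arithmetic in 8-bit fixed point.

import numpy as np

def erf_approx(x, delta1=1.0):
    # Second-order polynomial approximation of erf from the formula above.
    a, b = -0.2888, 1.769
    t = np.minimum(np.abs(x), b)                 # clip |x| to at most 1.769
    return np.sign(x) * delta1 * (a * (t - b) ** 2 + 1.0)

def gelu_approx(x):
    return 0.5 * x * (1.0 + erf_approx(x / np.sqrt(2.0)))

def exp_approx(p):
    # Valid for p <= 0: decompose p = -z*ln2 + r with r in (-ln2, 0],
    # approximate exp(r) by the quadratic, and apply the power of two
    # (realized as a bit shift in fixed-point hardware).
    ln2 = np.log(2.0)
    z = np.floor(-p / ln2)
    r = p + z * ln2
    return (0.3585 * (r + 1.353) ** 2 + 0.344) * 2.0 ** (-z)

def softmax_approx(x, delta2=1.0, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # shift so all inputs are <= 0
    e = exp_approx(x)
    return delta2 * e / e.sum(axis=axis, keepdims=True)

Comparing gelu_approx and softmax_approx against their exact counterparts on random inputs is a quick way to sanity-check the polynomial constants.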

4. Latency-Aware Multi-Stage Training and Optimization

HeatViT selects both the transformer blocks at which to place token selectors and their average pruning rates $\rho_i$, subject to end-to-end latency ($L_{\max}$) and accuracy drop ($a_{\mathrm{drop}}$) constraints.

A hardware latency lookup table quantifies blockwise latency against token keep ratios: $\mathrm{BlockLatency}(\rho)$ is the FPGA-measured block latency at keep-ratio $\rho$, with $\sum_{i=1}^L \mathrm{BlockLatency}(\rho_i) \leq L_{\max}$.

A latency-sparsity loss is added to the standard distillation and classification loss terms, $\xi_{\mathrm{ratio}} = \sum_{i=1}^L \bigl(1 - \rho_i - \frac{1}{B}\sum_{b=1}^B\sum_{n=1}^{N} M_n^{i,b}\bigr)^2$, giving the full objective $\xi = \xi_{\mathrm{cls}} + \lambda_{\mathrm{distill}}\,\xi_{\mathrm{distill}} + \lambda_{\mathrm{ratio}}\,\xi_{\mathrm{ratio}}$, where $\lambda_{\mathrm{distill}} = 0.5$ and $\lambda_{\mathrm{ratio}} = 2$.
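
A minimal PyTorch-style sketch of this objective is shown below. It interprets the ratio term as matching each block's mean kept-token fraction to the target $1 - \rho_i$; the exact normalization in the paper may differ, and the classification/distillation losses are assumed to be passed in as precomputed tensors.

import torch

def ratio_loss(masks, prune_rates):
    # masks: list of L tensors of shape (B, N), 1 = token kept after block i
    # prune_rates: list of L target average pruning rates rho_i
    loss = torch.zeros(())
    for m, rho in zip(masks, prune_rates):
        kept_fraction = m.float().mean(dim=1)                  # (B,) kept-token fraction
        loss = loss + ((1.0 - rho) - kept_fraction).pow(2).mean()
    return loss

def total_loss(loss_cls, loss_distill, masks, prune_rates,
               lam_distill=0.5, lam_ratio=2.0):
    # xi = xi_cls + lambda_distill * xi_distill + lambda_ratio * xi_ratio
    return loss_cls + lam_distill * loss_distill + lam_ratio * ratio_loss(masks, prune_rates)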

The algorithm proceeds progressively:

  1. Insert a selector at the last block ($L$) and initialize $\rho_i$.
  2. Fine-tune and incrementally raise $\rho_i$ until the accuracy or latency constraint is reached.
  3. Repeat for preceding blocks, stopping before block 4.
  4. Merge selectors into stages if neighboring $\rho_i$ values are sufficiently close ($<8.5\%$ difference).
  5. Retrain as needed, requiring ≈90% of the standard ViT training effort (a schematic driver loop for steps 1–3 is sketched below).
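
As referenced in step 5, a schematic driver loop for the progressive search might look as follows; the helper callables finetune_step and measure, as well as the step size, are hypothetical stand-ins for the actual training and FPGA-profiling machinery rather than the authors' implementation.

def place_selectors(num_blocks, rho_init, rho_step, l_max, acc_drop_budget,
                    finetune_step, measure):
    """finetune_step(rhos): fine-tune the model with the current pruning rates.
    measure(rhos) -> (acc_drop, latency), using the blockwise latency LUT."""
    rhos = {}
    for blk in range(num_blocks - 1, 3, -1):       # from the last block backward; skip the earliest blocks
        rhos[blk] = rho_init
        while True:
            finetune_step(rhos)
            acc_drop, latency = measure(rhos)
            if acc_drop > acc_drop_budget or latency <= l_max:
                break                              # accuracy or latency constraint reached
            rhos[blk] += rho_step                  # prune this block more aggressively
    return rhos

Selectors whose final $\rho_i$ values differ by less than 8.5% would then be merged into shared stages before the final retraining pass.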

5. Empirical Performance and Comparative Evaluation

Key Results on ImageNet

| Model | Orig. Top-1 Acc (%) | Orig. GMACs | HeatViT GMACs | GMAC Reduction | HeatViT Top-1 Acc (%) |
|---|---|---|---|---|---|
| DeiT-T | 72.2 | 1.30 | 1.00 / 0.90 / 0.75 | 23.1–42.3% | Baseline +0.7–8.9 |
| DeiT-S | 79.8 | 4.60 | 3.86 / 2.64 / 2.02 | 16–56% | Baseline +0.7–8.9 |
| LV-ViT-S | 80.5 | 6.55 | 5.49 / 3.77 / 2.88 | 16–56% | Baseline +0.7–8.9 |
| DeiT-B / LV-ViT-M | – | – | – | Similar | Similar |

  • Accuracy vs. Computation: For constant compute, HeatViT outperforms the leading pruning methods by +0.7%–8.9% accuracy. For equal accuracy, it reduces computation by 28.4%–65.3% across the tested ViT backbones.

Hardware Implementation (Xilinx ZCU102, 8-bit Pruning)

| Backbone | Baseline FPS | Baseline Power (W) | HeatViT FPS | Speedup | HeatViT Power (W) | FPS/W |
|---|---|---|---|---|---|---|
| DeiT-T | 78.3 | 8.012 | 271.2 | 3.46× | 9.453 | 28.7 |
| DeiT-S | 25.9 | 10.095 | 109.2 | 4.22× | 10.697 | 10.2 |
| LV-ViT-S | 19.4 | – | 89.1 | 4.59× | – | – |
| DeiT-B | 11.2 | 11.041 | 54.8 | 4.89× | 11.352 | – |

  • FPGA resource overhead: +5–8% LUTs, +8–11% DSPs.
  • GPU/CPU Comparison (Jetson TX2): HeatViT shows 1,827×–3,013× speedup over the TX2 CPU and 2.68×–3.79× over the TX2 GPU; energy efficiency is 242×–719× higher than the TX2 CPU and 3.0×–4.7× higher than the TX2 GPU.
  • Accuracy Drop: 0.0%–1.2% at maximal pruning rates.

6. Context, Limitations, and Implications

HeatViT offers significant advances in hardware-aware structured pruning for transformer architectures. The dynamic token selector, progressive mask updating, and retention via package tokens serve both algorithmic efficiency and error correction. The use of resource-optimized fixed-point arithmetic and polynomial nonlinearities allows deployment on low-cost FPGAs, with empirical results indicating negligible accuracy loss and substantial improvement in throughput and efficiency.

These results suggest that HeatViT establishes a new baseline for deploying high-capacity ViTs on power-limited, device-scale hardware platforms, with pruning rates tailored to latency constraints and resource limits.

Potential limitations include the need for blockwise latency profiling per hardware instantiation, sensitivity of early-stage token selection, and retraining overhead for merging selectors into stages. A plausible implication is that similar techniques could be generalized to other non-ViT architectures requiring dynamic adaptive sparsity in hardware deployment scenarios.

Research on HeatViT was published by authors investigating hardware-software co-design for transformer acceleration (Dong et al., 2022).

References

Dong et al. (2022). HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers.