
HeatViT: Efficient ViT Token Pruning

Updated 4 December 2025
  • HeatViT is a hardware-efficient token pruning framework that dynamically retains important tokens, enabling swift Vision Transformer inference on resource-constrained platforms.
  • It employs a multi-head attention token selector combined with 8-bit fixed-point quantization and polynomial approximations to significantly reduce computation while maintaining accuracy.
  • Empirical evaluations show up to 65% reduction in computation and nearly 5× FPS speedup on FPGAs with minimal resource overhead and <1.2% accuracy drop.

HeatViT is a hardware-efficient image-adaptive token pruning framework designed to accelerate Vision Transformers (ViTs) for deployment on resource-constrained embedded platforms. By co-designing algorithmic and microarchitectural innovations—including a multi-head attention-based token selector, fixed-point quantization, custom nonlinear function approximations, and a latency-aware optimization strategy—HeatViT achieves significant reductions in computational load and power consumption while maintaining accuracy comparable to unpruned and state-of-the-art pruned ViT models. This approach has been validated primarily on FPGA implementations and evaluated against competitive edge inference platforms and pruning methods (Dong et al., 2022).

1. Attention-Based Multi-Head Token Selector

HeatViT introduces a progressive, image-adaptive token pruning mechanism at the core of its acceleration strategy. For a set of input tokens $X \in \mathbb{R}^{N \times D}$ with $N$ tokens of embedding dimension $D$, the input is divided into $h$ attention heads ($d = D/h$ per head):

$$X = [\,x_{1};\,x_{2};\,\dots;\,x_{h}\,], \quad x_{i}\in\mathbb{R}^{N\times d}$$

Each head ii computes:

  • Local embedding:

$$E_i^{\mathrm{local}} = \mathrm{MLP}_1(x_i) \in \mathbb{R}^{N \times (d/2)}$$

  • Global embedding:

$$E_i^{\mathrm{global}} = \mathrm{Average}\bigl(\mathrm{MLP}_1(x_i)\bigr) \in \mathbb{R}^{1 \times (d/2)}$$

  • The two embeddings are concatenated:

$$E_i = [\,E_i^{\mathrm{local}};\,E_i^{\mathrm{global}}\,] \in \mathbb{R}^{N \times d}$$

  • Keep/prune scores:

$$s_i = \mathrm{Softmax}\bigl(\mathrm{MLP}_2(E_i)\bigr) \in \mathbb{R}^{N \times 2}$$

To enable per-head adaptive importance, a head-wise summary vector is computed and scored: $\bar X = \mathrm{Concat}\bigl\{\tfrac{1}{d} \sum_{j=1}^d x_{i,j}\bigr\}_{i=1}^h \in \mathbb{R}^{N \times h}$

$$A = \mathrm{Sigmoid}\bigl(\mathrm{MLP}_3(\bar X)\bigr) \in \mathbb{R}^{N \times h}$$

Weighted mean scores per token are obtained as $\tilde S = \dfrac{\sum_{i=1}^h A_{\ast,i} \circ s_i}{\sum_{i=1}^h A_{\ast,i}} \in \mathbb{R}^{N \times 2}$.

A discrete keep/prune mask is sampled via Gumbel-Softmax, $M = \mathrm{GumbelSoftmax}(\tilde S) \in \{0,1\}^N$, and updated progressively at each block: $M \leftarrow M \odot M'$.

Token Packaging: Instead of outright discarding pruned tokens, HeatViT consolidates their information into a "package token" $P = \dfrac{\sum_{t=1}^T \tilde s_t[0]\,\hat x_t}{\sum_{t=1}^T \tilde s_t[0]}$, where $\{\hat x_t\}_{t=1}^T$ are the tokens marked "prune". The kept tokens plus $P$ are then fed forward, allowing later blocks to recover information from mistakenly pruned tokens.
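
The selector can be summarized in the following PyTorch-style sketch. It is a minimal illustration under stated assumptions: single linear layers stand in for the paper's MLPs, the hidden widths follow the dimensions above, and the off-the-shelf gumbel_softmax call replaces whatever sampling kernel the authors actually use.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenSelector(nn.Module):
    # Illustrative multi-head token selector (not the authors' reference code).
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.h, self.d = num_heads, dim // num_heads
        self.mlp1 = nn.Linear(self.d, self.d // 2)    # local/global embedding
        self.mlp2 = nn.Linear(self.d, 2)              # per-head keep/prune scores
        self.mlp3 = nn.Linear(num_heads, num_heads)   # head-importance weights

    def forward(self, x, prev_mask):
        # x: (B, N, D) tokens; prev_mask: (B, N) keep mask accumulated so far
        B, N, D = x.shape
        xh = x.view(B, N, self.h, self.d)                       # split into heads
        e_local = self.mlp1(xh)                                 # (B, N, h, d/2)
        e_global = e_local.mean(dim=1, keepdim=True).expand_as(e_local)
        e = torch.cat([e_local, e_global], dim=-1)              # (B, N, h, d)
        s = F.softmax(self.mlp2(e), dim=-1)                     # (B, N, h, 2)
        a = torch.sigmoid(self.mlp3(xh.mean(dim=-1)))           # (B, N, h)
        s_tilde = (a.unsqueeze(-1) * s).sum(dim=2) / a.sum(dim=2, keepdim=True).clamp_min(1e-6)
        # Differentiable hard keep/prune decision; column 0 = "keep"
        m = F.gumbel_softmax(s_tilde.clamp_min(1e-6).log(), hard=True)[..., 0]
        return prev_mask * m, s_tilde                           # progressive mask update

The package token described above can then be formed outside the selector as a weighted average of the pruned tokens, using the corresponding entries of s_tilde as weights.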

2. FPGA Microarchitecture and Hardware Mapping

HeatViT is implemented on embedded FPGAs, notably the Xilinx ZCU102. The architecture aggressively reuses a single GEMM engine for both backbone ViT computations (Multi-Head Self Attention & Feed-Forward Networks) and token-selector FC layers. Data flows between off-chip DRAM, dedicated weight/token buffers, the GEMM engine, and activation units using double buffering to optimize I/O and compute overlap.

Matmul Loop-Tiling: Matrix multiplications are tiled over block sizes $T_i \times T_o$ and head groups $T_h$ with a deeply nested loop:

for o_o in 0 .. D_o/T_o:            # loop over output-dimension tiles
  for i_h in 0 .. h/T_h:            # loop over head groups
    for i_i in 0 .. D_i/T_i:        # loop over input-dimension tiles
      load A[T_i×T_h], B[T_h×T_o]   # stage operand tiles in on-chip buffers
      C[T_i×T_o] += A·B             # parallel MACs over a T_i×T_h×T_o tile
      # conditional head splitting/accumulation

Parallel MACs scale as $T_i \times T_o \times T_h$, bounded by DSP availability and buffer sizes according to BRAM allocation. With $f_{\mathrm{clk}} = 150$ MHz, throughput in frames per second is:

$$\mathrm{FPS} = \frac{f_{\mathrm{clk}}}{\sum_{l=1}^L \mathrm{cycles}(l)}, \qquad \mathrm{cycles}(l) \approx \frac{N_l\, D_{i,l}\, D_{o,l}}{T_i\, T_h\, T_o}$$
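
This cycle model can be evaluated directly; the Python sketch below does so for a toy layer list (the per-layer shapes and tile sizes are illustrative placeholders, not the paper's measured configuration):

# Rough throughput estimate from the cycle model above.
F_CLK = 150e6  # accelerator clock in Hz

def layer_cycles(n_tokens, d_in, d_out, t_i=16, t_h=4, t_o=16):
    # cycles(l) ≈ N_l * D_i,l * D_o,l / (T_i * T_h * T_o)
    return n_tokens * d_in * d_out / (t_i * t_h * t_o)

def estimate_fps(layers):
    total_cycles = sum(layer_cycles(*shape) for shape in layers)
    return F_CLK / total_cycles

# Toy 12-block model: (N, D_i, D_o) for the QKV, projection, and FFN GEMMs
toy_layers = [(197, 192, 576), (197, 192, 192),
              (197, 192, 768), (197, 768, 192)] * 12
print(f"estimated FPS: {estimate_fps(toy_layers):.1f}")

Token pruning enters this model by shrinking $N_l$ in the later blocks, which is what drives the measured FPS gains.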

Resource Overhead: The additional logic for token selectors introduces only +5–8% LUT and +8–11% DSP overhead compared to baseline ViT-only accelerators.

3. 8-Bit Fixed-Point Quantization and Polynomial Approximations

All weights and activations are quantized to 8-bit fixed-point precision to minimize resource use. Common nonlinear functions are replaced by custom polynomial approximations with regularization terms ($\delta_1$, $\delta_2 < 1$) to shrink quantization error:

  • erf(x) approximation:

$$L_{\mathrm{erf}}(x) = \mathrm{sign}(x)\,\delta_1 \bigl[a\bigl(\mathrm{clip}(|x|, -b) + b\bigr)^2 + 1\bigr], \qquad a = -0.2888,\ b = 1.769$$

  • GELU approximation:

$$\mathrm{GELU}_{\mathrm{aprx}}(x) = \frac{x}{2}\Bigl[1 + L_{\mathrm{erf}}\bigl(x/\sqrt{2}\bigr)\Bigr]$$

  • Softmax approximation:

$$\tilde x_i = x_i - x_{\max}$$

$$\mathrm{Softmax}_{\mathrm{aprx}}(\tilde x_i) = \frac{\delta_2 \exp(\tilde x_i)}{\sum_j \exp(\tilde x_j)}$$

with $\exp(p) \approx 0.3585\,(p+1.353)^2 + 0.344$ and the exponent decomposed for efficient computation.

  • Sigmoid: Approximated with a small piecewise-linear PLAN kernel.

These custom implementations yield $1.5\times$–$572\times$ lower LUT/DSP use than standard math libraries.

Error Bound Analysis: The regularization factors ensure, for GELU, $\mathrm{Error}_{\mathrm{gelu}} = \bigl|\frac{\partial A}{\partial x}\bigr|\,\Delta e < \Delta e$, and similarly for Softmax, $\mathrm{Error}_{\mathrm{softmax}} = 2\delta_2\,|\Delta e|\,A_0(1-A_0) < \Delta e$; thus quantization error is compressed rather than amplified.
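
For concreteness, the following NumPy sketch evaluates these approximations in floating point. It is an illustration only: the clip in the erf polynomial is written out as min(|x|, 1.769), the regularization factors default to 1.0 as placeholders (the paper uses values below 1), and real deployments run this arithmetic in 8-bit fixed point.

import numpy as np

def erf_approx(x, delta1=1.0):
    # Second-order polynomial approximation of erf from the formula above.
    a, b = -0.2888, 1.769
    t = np.minimum(np.abs(x), b)                 # clip |x| to at most 1.769
    return np.sign(x) * delta1 * (a * (t - b) ** 2 + 1.0)

def gelu_approx(x):
    return 0.5 * x * (1.0 + erf_approx(x / np.sqrt(2.0)))

def exp_approx(p):
    # Valid for p <= 0: decompose p = -z*ln2 + r with r in (-ln2, 0],
    # approximate exp(r) by the quadratic, and apply the power of two
    # (realized as a bit shift in fixed-point hardware).
    ln2 = np.log(2.0)
    z = np.floor(-p / ln2)
    r = p + z * ln2
    return (0.3585 * (r + 1.353) ** 2 + 0.344) * 2.0 ** (-z)

def softmax_approx(x, delta2=1.0, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # shift so all inputs are <= 0
    e = exp_approx(x)
    return delta2 * e / e.sum(axis=axis, keepdims=True)

Comparing gelu_approx and softmax_approx against their exact counterparts on random inputs is a quick way to sanity-check the polynomial constants.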

4. Latency-Aware Multi-Stage Training and Optimization

HeatViT selects both the transformer blocks at which to place token selectors and their average pruning rates $\rho_i$, subject to end-to-end latency ($L_{\max}$) and accuracy drop ($a_{\mathrm{drop}}$) constraints.

A hardware latency lookup table quantifies blockwise latency against token keep ratios: $\mathrm{BlockLatency}(\rho)$ is the FPGA-measured block latency at keep-ratio $\rho$, with $\sum_{i=1}^L \mathrm{BlockLatency}(\rho_i) \leq L_{\max}$.

A latency-sparsity loss is added to the standard distillation and classification loss terms, $\xi_{\mathrm{ratio}} = \sum_{i=1}^L \bigl(1 - \rho_i - \frac{1}{B}\sum_{b=1}^B\sum_{n=1}^{N} M_n^{i,b}\bigr)^2$, giving the full objective $\xi = \xi_{\mathrm{cls}} + \lambda_{\mathrm{distill}}\,\xi_{\mathrm{distill}} + \lambda_{\mathrm{ratio}}\,\xi_{\mathrm{ratio}}$, where $\lambda_{\mathrm{distill}} = 0.5$ and $\lambda_{\mathrm{ratio}} = 2$.
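
A minimal PyTorch-style sketch of this objective is shown below. It interprets the ratio term as matching each block's mean kept-token fraction to the target $1 - \rho_i$; the exact normalization in the paper may differ, and the classification/distillation losses are assumed to be passed in as precomputed tensors.

import torch

def ratio_loss(masks, prune_rates):
    # masks: list of L tensors of shape (B, N), 1 = token kept after block i
    # prune_rates: list of L target average pruning rates rho_i
    loss = torch.zeros(())
    for m, rho in zip(masks, prune_rates):
        kept_fraction = m.float().mean(dim=1)                  # (B,) kept-token fraction
        loss = loss + ((1.0 - rho) - kept_fraction).pow(2).mean()
    return loss

def total_loss(loss_cls, loss_distill, masks, prune_rates,
               lam_distill=0.5, lam_ratio=2.0):
    # xi = xi_cls + lambda_distill * xi_distill + lambda_ratio * xi_ratio
    return loss_cls + lam_distill * loss_distill + lam_ratio * ratio_loss(masks, prune_rates)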

The algorithm proceeds progressively:

  1. Insert a selector at the last block ($L$) and initialize $\rho_i$.
  2. Fine-tune and incrementally raise $\rho_i$ until the accuracy or latency constraint is reached.
  3. Repeat for preceding blocks, stopping before block 4.
  4. Merge selectors into stages if neighboring $\rho_i$ values are sufficiently close ($<8.5\%$ difference).
  5. Retrain as needed, requiring ≈90% of the standard ViT training effort (a schematic driver loop for steps 1–3 is sketched below).
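
As referenced in step 5, a schematic driver loop for the progressive search might look as follows; the helper callables finetune_step and measure, as well as the step size, are hypothetical stand-ins for the actual training and FPGA-profiling machinery rather than the authors' implementation.

def place_selectors(num_blocks, rho_init, rho_step, l_max, acc_drop_budget,
                    finetune_step, measure):
    """finetune_step(rhos): fine-tune the model with the current pruning rates.
    measure(rhos) -> (acc_drop, latency), using the blockwise latency LUT."""
    rhos = {}
    for blk in range(num_blocks - 1, 3, -1):       # from the last block backward; skip the earliest blocks
        rhos[blk] = rho_init
        while True:
            finetune_step(rhos)
            acc_drop, latency = measure(rhos)
            if acc_drop > acc_drop_budget or latency <= l_max:
                break                              # accuracy or latency constraint reached
            rhos[blk] += rho_step                  # prune this block more aggressively
    return rhos

Selectors whose final $\rho_i$ values differ by less than 8.5% would then be merged into shared stages before the final retraining pass.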

5. Empirical Performance and Comparative Evaluation

Key Results on ImageNet

| Model | Orig. Top-1 Acc (%) | Orig. GMACs | HeatViT GMACs | GMAC Reduction | HeatViT Top-1 Acc (%) |
|---|---|---|---|---|---|
| DeiT-T | 72.2 | 1.30 | 1.00 / 0.90 / 0.75 | 23.1–42.3% | Baseline +0.7–8.9 |
| DeiT-S | 79.8 | 4.60 | 3.86 / 2.64 / 2.02 | 16–56% | Baseline +0.7–8.9 |
| LV-ViT-S | 80.5 | 6.55 | 5.49 / 3.77 / 2.88 | 16–56% | Baseline +0.7–8.9 |
| DeiT-B / LV-ViT-M | – | – | – | Similar | Similar |

  • Accuracy vs. Computation: For constant compute, HeatViT outperforms the leading pruning methods by +0.7%–8.9% accuracy. For equal accuracy, it reduces computation by 28.4%–65.3% across the tested ViT backbones.

Hardware Implementation (Xilinx ZCU102, 8-bit Pruning)

| Backbone | Baseline FPS | Baseline Power (W) | HeatViT FPS | Speedup | HeatViT Power (W) | FPS/W |
|---|---|---|---|---|---|---|
| DeiT-T | 78.3 | 8.012 | 271.2 | 3.46× | 9.453 | 28.7 |
| DeiT-S | 25.9 | 10.095 | 109.2 | 4.22× | 10.697 | 10.2 |
| LV-ViT-S | 19.4 | – | 89.1 | 4.59× | – | – |
| DeiT-B | 11.2 | 11.041 | 54.8 | 4.89× | 11.352 | – |

  • FPGA resource overhead: +5–8% LUTs, +8–11% DSPs.
  • GPU/CPU Comparison (Jetson TX2): HeatViT shows 1,827×–3,013× speedup over the TX2 CPU and 2.68×–3.79× over the TX2 GPU; energy efficiency is 242×–719× higher than the TX2 CPU and 3.0×–4.7× higher than the TX2 GPU.
  • Accuracy Drop: 0.0%–1.2% at maximal pruning rates.

6. Context, Limitations, and Implications

HeatViT offers significant advances in hardware-aware structured pruning for transformer architectures. The dynamic token selector, progressive mask updating, and retention via package tokens serve both algorithmic efficiency and error correction. The use of resource-optimized fixed-point arithmetic and polynomial nonlinearities allows deployment on low-cost FPGAs, with empirical results indicating negligible accuracy loss and substantial improvement in throughput and efficiency.

These results suggest that HeatViT establishes a new baseline for deploying high-capacity ViTs on power-limited, device-scale hardware platforms, with pruning rates tailored to latency constraints and resource limits.

Potential limitations include the need for blockwise latency profiling per hardware instantiation, sensitivity of early-stage token selection, and retraining overhead for merging selectors into stages. A plausible implication is that similar techniques could be generalized to other non-ViT architectures requiring dynamic adaptive sparsity in hardware deployment scenarios.

Research on HeatViT was published by authors investigating hardware-software co-design for transformer acceleration (Dong et al., 2022).

References

Dong et al. (2022). HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers.