HeatViT: Efficient ViT Token Pruning

Updated 4 December 2025
  • HeatViT is a hardware-efficient token pruning framework that dynamically retains important tokens, enabling swift Vision Transformer inference on resource-constrained platforms.
  • It employs a multi-head attention token selector combined with 8-bit fixed-point quantization and polynomial approximations to significantly reduce computation while maintaining accuracy.
  • Empirical evaluations show up to 65% reduction in computation and nearly 5× FPS speedup on FPGAs with minimal resource overhead and <1.2% accuracy drop.

HeatViT is a hardware-efficient image-adaptive token pruning framework designed to accelerate Vision Transformers (ViTs) for deployment on resource-constrained embedded platforms. By co-designing algorithmic and microarchitectural innovations—including a multi-head attention-based token selector, fixed-point quantization, custom nonlinear function approximations, and a latency-aware optimization strategy—HeatViT achieves significant reductions in computational load and power consumption while maintaining accuracy comparable to unpruned and state-of-the-art pruned ViT models. This approach has been validated primarily on FPGA implementations and evaluated against competitive edge inference platforms and pruning methods (Dong et al., 2022).

1. Attention-Based Multi-Head Token Selector

HeatViT introduces a progressive, image-adaptive token pruning mechanism at the core of its acceleration strategy. For input tokens $X \in \mathbb{R}^{N \times D}$, with $N$ tokens of embedding dimension $D$, the input is divided into $h$ attention heads ($d = D/h$ per head):

$X = [\,x_{1};\,x_{2};\,\dots;\,x_{h}\,], \quad x_{i} \in \mathbb{R}^{N \times d}$

Each head ii computes:

  • Local embedding:

$E_i^{\rm local} = \mathrm{MLP}_1(x_i) \in \mathbb{R}^{N \times (d/2)}$

  • Global embedding:

$E_i^{\rm global} = \mathrm{Average}(\mathrm{MLP}_1(x_i)) \in \mathbb{R}^{1 \times (d/2)}$

  • The two embeddings are concatenated:

$E_i = [\,E_i^{\rm local};\,E_i^{\rm global}\,] \in \mathbb{R}^{N \times d}$

  • Keep/prune scores: each head's embedding $E_i$ is passed through a second MLP and a softmax to produce per-token keep/prune probabilities.

To enable per-head adaptive importance, a head-wise summary vector is computed and scored, and the per-head scores are combined into a weighted mean keep score for each token.

A discrete keep/prune mask is sampled via Gumbel-Softmax, which keeps the sampling step differentiable during training, and the mask is updated progressively at each block by intersecting it with the previous block's mask, so a token pruned once stays pruned downstream.

Token Packaging: Instead of outright discarding pruned tokens, HeatViT consolidates their information into a single "package token", a weighted aggregate of the tokens marked "prune". The kept tokens plus the package token are then fed forward, allowing later blocks to recover information from tokens pruned too aggressively.
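The selector pipeline above can be sketched in NumPy. The weight shapes, the head-averaging of scores, and the uniform package-token weighting are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp1(x, w_a, w_b):
    # two-layer embedding MLP (ReLU stands in for the paper's activation)
    return np.maximum(x @ w_a, 0.0) @ w_b

def token_selector(x, h, w_a, w_b, w_score, prev_mask, tau=1.0):
    """Hypothetical sketch of the multi-head token selector (Sec. 1).

    x: (N, D) tokens; h: heads; prev_mask: (N,) keep mask from the previous
    pruning stage. Per head, a local embedding (N, d/2) is concatenated with
    its broadcast global mean (1, d/2) to form E_i in R^{N x d}; head scores
    are averaged, a hard mask is drawn via Gumbel noise, and pruned tokens
    are aggregated into a single package token.
    """
    N, D = x.shape
    d = D // h
    logits = np.zeros((N, 2))
    for i in range(h):
        xi = x[:, i * d:(i + 1) * d]                      # (N, d)
        local = mlp1(xi, w_a[i], w_b[i])                  # (N, d/2)
        glob = np.repeat(local.mean(axis=0, keepdims=True), N, axis=0)
        E = np.concatenate([local, glob], axis=1)         # (N, d)
        logits += E @ w_score[i] / h                      # average over heads
    # Gumbel-Softmax style hard sampling: column 0 = keep, column 1 = prune
    g = -np.log(-np.log(rng.uniform(1e-9, 1.0, size=logits.shape)))
    y = (logits + g) / tau
    mask = (y[:, 0] > y[:, 1]).astype(float) * prev_mask  # progressive update
    pruned = 1.0 - mask
    # package token: aggregate of pruned tokens (uniform weights here)
    pkg = (pruned[:, None] * x).sum(axis=0) / max(pruned.sum(), 1.0)
    return mask, pkg
```

In training, the hard threshold would be replaced by the differentiable Gumbel-Softmax relaxation; the sketch shows only the inference-time decision.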

2. FPGA Microarchitecture and Hardware Mapping

HeatViT is implemented on embedded FPGAs, notably the Xilinx ZCU102. The architecture aggressively reuses a single GEMM engine for both backbone ViT computations (Multi-Head Self Attention & Feed-Forward Networks) and token-selector FC layers. Data flows between off-chip DRAM, dedicated weight/token buffers, the GEMM engine, and activation units using double buffering to optimize I/O and compute overlap.

Matmul Loop-Tiling: Matrix multiplications are tiled over block sizes along the token and embedding dimensions and over head groups, using a deeply nested loop so that one tile of each operand resides on-chip while the next is prefetched. The number of parallel MACs scales with the product of the inner tile sizes, bounded by DSP availability and buffer sizes according to BRAM allocation. At clock frequency $f$, throughput in frames per second is

$\mathrm{FPS} = f / C_{\rm frame}$

where $C_{\rm frame}$ is the cycle count per frame, itself a function of the tiling parameters and the number of retained tokens.
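The nested tiling loop can be sketched in NumPy; the tile sizes `Tm`, `Tn`, `Tk` are stand-ins for the paper's tiling parameters, not its reported values:

```python
import numpy as np

def tiled_matmul(A, B, Tm=4, Tn=4, Tk=8):
    """Loop-tiled GEMM mirroring the accelerator's dataflow: one (Tm, Tk)
    and one (Tk, Tn) tile are resident on-chip per inner iteration, and the
    inner update corresponds to Tm*Tn parallel MACs on the FPGA."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    for m0 in range(0, M, Tm):            # stream row tiles from DRAM
        for n0 in range(0, N, Tn):        # stream column tiles
            for k0 in range(0, K, Tk):    # accumulate over the shared dim
                C[m0:m0 + Tm, n0:n0 + Tn] += (
                    A[m0:m0 + Tm, k0:k0 + Tk] @ B[k0:k0 + Tk, n0:n0 + Tn]
                )
    return C

def fps(freq_hz, cycles_per_frame):
    # throughput at clock frequency f: FPS = f / cycles-per-frame
    return freq_hz / cycles_per_frame
```

Token pruning reduces the number of row tiles (fewer tokens), which is where the FPS gains come from.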

Resource Overhead: Additional logic for the token selectors introduces only a small LUT and DSP overhead compared to baseline ViT-only accelerators.

3. 8-Bit Fixed-Point Quantization and Polynomial Approximations

All weights and activations are quantized to 8-bit fixed-point precision to minimize resource use. Common nonlinear functions are replaced by custom polynomial approximations, with regularization terms chosen to shrink quantization error:

  • erf(x): approximated by a low-order, sign-symmetric polynomial, enabling an integer-only evaluation.

  • GELU: computed from the erf approximation via $\mathrm{GELU}(x) = \tfrac{x}{2}\,\bigl(1 + \mathrm{erf}(x/\sqrt{2})\bigr)$.

  • Softmax: the exponential is decomposed into a power-of-two shift plus a short polynomial on a small residual, so it reduces to cheap integer operations.

  • Sigmoid: approximated with a small piecewise-linear PLAN kernel.

These custom implementations use substantially fewer LUTs and DSPs than standard math-library cores.

Error Bound Analysis: The regularization factors bound the approximation error of the GELU and Softmax kernels, so quantization error is compressed rather than amplified through the network.
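As a concrete illustration of the approach, the snippet below pairs symmetric int8 quantization with an I-BERT-style second-order erf polynomial. The coefficients and the quantization scheme are illustrative assumptions; HeatViT's exact polynomials and regularization terms differ:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric 8-bit fixed-point quantization (illustrative scheme):
    scale to [-127, 127] and round to the nearest integer."""
    amax = np.max(np.abs(x))
    scale = amax / 127.0 if amax > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

def poly_erf(x, a=-0.2888, b=-1.769):
    """Second-order polynomial erf approximation (I-BERT-style
    coefficients, used here only as an example of the technique)."""
    return np.sign(x) * (a * (np.minimum(np.abs(x), -b) + b) ** 2 + 1.0)

def poly_gelu(x):
    # GELU(x) = x/2 * (1 + erf(x / sqrt(2)))
    return 0.5 * x * (1.0 + poly_erf(x / np.sqrt(2.0)))
```

The clipping at $|x| = -b$ makes the polynomial saturate exactly where erf does, which keeps the approximation error bounded over the whole input range.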

4. Latency-Aware Multi-Stage Training and Optimization

HeatViT selects both the transformer blocks at which to insert token selectors and their average pruning rates, subject to an end-to-end latency budget and a maximum allowed accuracy drop.

A hardware latency lookup table, measured on the target device, maps each block's token keep ratio to its latency; summing the per-block entries yields the end-to-end latency of a candidate pruning configuration.

A latency-sparsity loss is added to the standard distillation and classification loss terms, with weighting coefficients balancing accuracy against the latency budget in the full training objective.
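A minimal sketch of how such a lookup table and loss could be combined. The function names, LUT layout, hinge-squared penalty, and weighting coefficients are assumptions for illustration, not the paper's exact formulation:

```python
def block_latency(latency_lut, keep_ratios):
    """Sum measured per-block latencies for a candidate configuration.
    latency_lut[b] maps block b's token keep ratio to its measured latency."""
    return sum(latency_lut[b][r] for b, r in enumerate(keep_ratios))

def latency_sparsity_loss(latency_lut, keep_ratios, budget_ms):
    # penalize only when the modeled end-to-end latency exceeds the budget
    excess = block_latency(latency_lut, keep_ratios) - budget_ms
    return max(0.0, excess) ** 2

def total_loss(cls_loss, distill_loss, lat_loss, lam_distill=0.5, lam_lat=0.1):
    # full objective: classification + distillation + latency-sparsity terms
    return cls_loss + lam_distill * distill_loss + lam_lat * lat_loss
```

Because the LUT is measured rather than modeled, the same training recipe transfers to a new FPGA instantiation by re-profiling the table.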

The algorithm proceeds progressively:

  1. Insert a selector at the last transformer block and initialize its pruning rate.
  2. Fine-tune and incrementally raise the pruning rate until the accuracy or latency constraint is reached.
  3. Repeat for preceding blocks, stopping before block 4.
  4. Merge selectors into stages when neighboring pruning rates are sufficiently close.
  5. Retrain as needed, requiring ≈90% of standard ViT training effort.
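Steps 1–3 of the schedule can be expressed as a small driver loop; the helper names, the fixed rate step, and the constraint predicate are hypothetical:

```python
def progressive_schedule(num_blocks, meets_constraints, rate_step=0.1, max_rate=0.9):
    """Sketch of the progressive selector-insertion schedule: walk from the
    last block toward block 4, raising each selector's pruning rate while
    the accuracy/latency predicate `meets_constraints` still holds."""
    rates = {}
    for block in range(num_blocks - 1, 3, -1):   # stop before block 4
        rate = 0.0
        while (rate + rate_step <= max_rate
               and meets_constraints(block, rate + rate_step)):
            rate += rate_step                    # incrementally raise pruning
        rates[block] = round(rate, 2)
    return rates
```

The stage-merging of step 4 would then group consecutive blocks whose resulting rates fall within a small tolerance of each other.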

5. Empirical Performance and Comparative Evaluation

Key Results on ImageNet

Model      Orig. Acc (%)   Orig. GMACs   HeatViT GMACs    GMAC Reduction
DeiT-T     72.2            1.30          1.00/0.90/0.75   23.1–42.3%
DeiT-S     79.8            4.60          3.86/2.64/2.02   16–56%
LV-ViT-S   80.5            6.55          5.49/3.77/2.88   16–56%

DeiT-B and LV-ViT-M show similar trends; across these settings, HeatViT's Top-1 accuracy is 0.7–8.9 points above comparable pruning baselines at matched compute.
  • Accuracy vs. Computation: For constant compute, HeatViT outperforms the leading pruning methods by 0.7%–8.9% accuracy. For equal accuracy, it reduces computation by 28.4%–65.3% across tested ViT backbones.

Hardware Implementation (Xilinx ZCU102, 8-bit Pruning)

Backbone   Baseline FPS   Baseline Power (W)   HeatViT FPS   Speedup   HeatViT Power (W)   FPS/W
DeiT-T     78.3           8.012                271.2         3.46×     9.453               28.7
DeiT-S     25.9           10.095               109.2         4.22×     10.697              10.2
LV-ViT-S   19.4           -                    89.1          4.59×     -                   -
DeiT-B     11.2           11.041               54.8          4.89×     11.352              -
  • FPGA resource overhead: the token selectors add only a small fraction of additional LUTs and DSPs over the baseline accelerator.
  • GPU/CPU Comparison (Jetson TX2): HeatViT achieves multi-fold speedups over both the TX2 CPU and GPU, with correspondingly larger gains in energy efficiency (FPS/W).
  • Accuracy Drop: under 1.2% at maximal pruning rates.
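The speedup figures follow directly from the FPS columns above; a quick arithmetic check:

```python
# FPS figures from the ZCU102 table (Dong et al., 2022)
baseline_fps = {"DeiT-T": 78.3, "DeiT-S": 25.9, "LV-ViT-S": 19.4, "DeiT-B": 11.2}
heatvit_fps = {"DeiT-T": 271.2, "DeiT-S": 109.2, "LV-ViT-S": 89.1, "DeiT-B": 54.8}

for name, base in baseline_fps.items():
    print(f"{name}: {heatvit_fps[name] / base:.2f}x speedup")
```

This reproduces the 3.46×–4.89× range across the four backbones.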

6. Context, Limitations, and Implications

HeatViT offers significant advances in hardware-aware structured pruning for transformer architectures. The dynamic token selector, progressive mask updating, and retention via package tokens serve both algorithmic efficiency and error correction. The use of resource-optimized fixed-point arithmetic and polynomial nonlinearities allows deployment on low-cost FPGAs, with empirical results indicating negligible accuracy loss and substantial improvement in throughput and efficiency.

This suggests HeatViT establishes a new baseline for deploying high-capacity ViTs on device-scale, power-limited hardware platforms with tailored adaptation to latency constraints and resource limits.

Potential limitations include the need for blockwise latency profiling per hardware instantiation, sensitivity of early-stage token selection, and retraining overhead for merging selectors into stages. A plausible implication is that similar techniques could be generalized to other non-ViT architectures requiring dynamic adaptive sparsity in hardware deployment scenarios.

Research on HeatViT was published by authors investigating hardware-software co-design for transformer acceleration (Dong et al., 2022).
