HeatViT: Efficient ViT Token Pruning
- HeatViT is a hardware-efficient token pruning framework that dynamically retains important tokens, enabling swift Vision Transformer inference on resource-constrained platforms.
- It employs a multi-head attention token selector combined with 8-bit fixed-point quantization and polynomial approximations to significantly reduce computation while maintaining accuracy.
- Empirical evaluations show up to 65% reduction in computation and nearly 5× FPS speedup on FPGAs with minimal resource overhead and <1.2% accuracy drop.
HeatViT is a hardware-efficient image-adaptive token pruning framework designed to accelerate Vision Transformers (ViTs) for deployment on resource-constrained embedded platforms. By co-designing algorithmic and microarchitectural innovations—including a multi-head attention-based token selector, fixed-point quantization, custom nonlinear function approximations, and a latency-aware optimization strategy—HeatViT achieves significant reductions in computational load and power consumption while maintaining accuracy comparable to unpruned and state-of-the-art pruned ViT models. This approach has been validated primarily on FPGA implementations and evaluated against competitive edge inference platforms and pruning methods (Dong et al., 2022).
1. Attention-Based Multi-Head Token Selector
HeatViT introduces a progressive, image-adaptive token pruning mechanism at the core of its acceleration strategy. For an input of $N$ tokens with embedding dimension $D$, the token selector splits the embedding into $H$ attention heads ($d = D/H$ dimensions per head), with per-head token slices $x_h \in \mathbb{R}^{N \times d}$.
Each head $h$ computes (with $d'$ denoting the selector's hidden dimension):
- Local embedding: $e_h^{\mathrm{local}} = \mathrm{MLP}(x_h) \in \mathbb{R}^{N \times d'}$, a per-token feature.
- Global embedding: $e_h^{\mathrm{global}} = \mathrm{Agg}\!\left(\mathrm{MLP}(x_h), \hat{m}\right) \in \mathbb{R}^{d'}$, a mask-weighted aggregation over the currently kept tokens.
- The two embeddings are concatenated: $e_h = \left[e_h^{\mathrm{local}},\, e_h^{\mathrm{global}}\right] \in \mathbb{R}^{N \times 2d'}$.
- Keep/prune scores: $s_h = \mathrm{Softmax}\!\left(\mathrm{MLP}(e_h)\right) \in \mathbb{R}^{N \times 2}$.
To enable per-head adaptive importance, a head-wise summary vector (e.g., the token-wise mean of $e_h$) is scored by a small FC layer to produce a head weight $w_h$.
The final per-token scores are the importance-weighted mean over heads: $s = \sum_{h=1}^{H} w_h\, s_h \big/ \sum_{h=1}^{H} w_h$.
A discrete keep mask $m \in \{0,1\}^{N}$ is sampled via Gumbel-Softmax, $m = \mathrm{GumbelSoftmax}(s)$, and the running mask is updated progressively at each block, $\hat{m} \leftarrow \hat{m} \odot m$, so a token pruned once remains pruned in all later blocks.
Token Packaging: Instead of outright discarding pruned tokens, HeatViT consolidates their information into a single "package token" $x_{\mathrm{pkg}} = \sum_{i \in \mathcal{P}} \pi_i x_i \big/ \sum_{i \in \mathcal{P}} \pi_i$, a score-weighted combination of the tokens $\{x_i\}_{i \in \mathcal{P}}$ marked "prune". The kept tokens plus $x_{\mathrm{pkg}}$ are then fed forward, allowing later blocks to correct pruning errors through the pipeline.
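The selector can be summarized in a short PyTorch sketch. This is an illustrative reconstruction under the notation above; the module name `MultiHeadTokenSelector`, the layer sizes, and the mask-weighted pooling details are assumptions rather than the authors' released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadTokenSelector(nn.Module):
    """Illustrative multi-head token selector (not the authors' code)."""
    def __init__(self, embed_dim=192, num_heads=3, hidden_dim=32):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.local_mlp = nn.Linear(self.head_dim, hidden_dim)   # per-token features
        self.global_mlp = nn.Linear(self.head_dim, hidden_dim)  # pooled features
        self.score_fc = nn.Linear(2 * hidden_dim, 2)            # keep / prune scores
        self.head_fc = nn.Linear(2 * hidden_dim, 1)             # head importance

    def forward(self, x, prev_mask):
        # x: (B, N, D) tokens; prev_mask: (B, N) running keep mask in {0, 1}.
        B, N, _ = x.shape
        xh = x.view(B, N, self.num_heads, self.head_dim)          # split into heads
        local = self.local_mlp(xh)                                # (B, N, H, d')
        # Global embedding: mask-weighted average over currently kept tokens.
        g = self.global_mlp(xh) * prev_mask[..., None, None]
        denom = prev_mask.sum(dim=1).clamp(min=1)[:, None, None, None]
        global_ = g.sum(dim=1, keepdim=True) / denom              # (B, 1, H, d')
        e = torch.cat([local, global_.expand_as(local)], dim=-1)  # (B, N, H, 2d')
        scores = F.softmax(self.score_fc(e), dim=-1)              # per-head keep/prune
        # Head importance from a per-head summary (mean over tokens).
        w = F.softmax(self.head_fc(e.mean(dim=1)), dim=1)         # (B, H, 1)
        s = (scores * w[:, None]).sum(dim=2)                      # weighted mean scores
        # Differentiable discrete keep decision via Gumbel-Softmax.
        keep = F.gumbel_softmax(torch.log(s + 1e-6), hard=True, dim=-1)[..., 0]
        return prev_mask * keep                                   # progressive mask update
```

In a full pipeline, tokens whose returned mask is zero would then be consolidated into the package token described above before the next transformer block.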
2. FPGA Microarchitecture and Hardware Mapping
HeatViT is implemented on embedded FPGAs, notably the Xilinx ZCU102. The architecture aggressively reuses a single GEMM engine for both backbone ViT computations (Multi-Head Self Attention & Feed-Forward Networks) and token-selector FC layers. Data flows between off-chip DRAM, dedicated weight/token buffers, the GEMM engine, and activation units using double buffering to optimize I/O and compute overlap.
Matmul Loop-Tiling: Matrix multiplications are tiled with tile sizes $T_i$, $T_h$, $T_o$ over the input dimension, the attention heads, and the output dimension, using a deeply nested loop:
```
# Loop-tiled GEMM schedule (pseudocode)
for o_o in 0 .. D_o/T_o:            # output-dimension tiles
    for i_h in 0 .. h/T_h:          # head-group tiles
        for i_i in 0 .. D_i/T_i:    # input-dimension tiles
            load A[T_i × T_h], B[T_h × T_o]
            C[T_i × T_o] += A · B   # conditional head splitting / accumulation
```
The number of parallel MAC units scales with the tile product $T_i \cdot T_h \cdot T_o$, bounded by DSP availability and by the buffer sizes permitted by the BRAM allocation. At a clock frequency of $f$ MHz, throughput in frames per second is approximately $\mathrm{FPS} \approx \frac{f \times 10^{6}}{C_{\mathrm{frame}}}$, where $C_{\mathrm{frame}}$ is the number of cycles needed to process one frame.
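As a rough back-of-the-envelope illustration of this relationship (not the authors' performance model), assuming one MAC wave per cycle from $T_i \cdot T_h \cdot T_o$ parallel units:

```python
def estimate_fps(total_macs_per_frame, T_i, T_h, T_o, f_mhz):
    """Rough FPS estimate for the tiled GEMM engine.

    Assumes T_i * T_h * T_o MACs issue per cycle and ignores memory
    stalls (double buffering is assumed to hide most DRAM traffic).
    """
    parallel_macs = T_i * T_h * T_o
    cycles_per_frame = total_macs_per_frame / parallel_macs
    return f_mhz * 1e6 / cycles_per_frame

# Example: a hypothetical 1.0 GMAC workload, 16x8x16 tiles, 200 MHz clock.
print(round(estimate_fps(1.0e9, 16, 8, 16, 200), 1))
```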
Resource Overhead: The additional token-selector logic introduces only a small fraction of extra LUTs and DSPs compared to baseline ViT-only accelerators.
3. 8-Bit Fixed-Point Quantization and Polynomial Approximations
All weights and activations are quantized to 8-bit fixed-point precision to minimize resource use. Common nonlinear functions are replaced by custom polynomial approximations, with regularization factors introduced to shrink quantization error:
- erf(x) approximation: a sign-symmetric, low-order polynomial evaluated on a clipped input range.
- GELU approximation: $\mathrm{GELU}(x) = \tfrac{x}{2}\left(1 + \mathrm{erf}\!\left(x/\sqrt{2}\right)\right)$, evaluated with the polynomial erf above.
- Softmax approximation: $\mathrm{Softmax}(x_i) = \exp(\tilde{x}_i) \big/ \sum_j \exp(\tilde{x}_j)$ with $\tilde{x}_i = x_i - \max_j x_j$, and the exponent decomposed (e.g., into a power-of-two shift plus a low-order polynomial for the fractional part) for efficient fixed-point computation.
- Sigmoid: Approximated with a small piecewise-linear PLAN kernel.
These custom implementations require substantially fewer LUTs and DSPs than standard math-library implementations.
Error Bound Analysis: The regularization factors bound the deviation of the approximated, quantized GELU and Softmax from their floating-point counterparts, so quantization error is compressed rather than amplified through the nonlinearities.
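For illustration, a fixed-point-friendly GELU in this family can be built from a second-order erf polynomial. The constants below follow the well-known integer-GELU approximation from I-BERT; HeatViT's additional regularization factors are omitted, so this is a sketch rather than the paper's exact kernel:

```python
import numpy as np

# Second-order erf polynomial (I-BERT-style constants); HeatViT's extra
# regularization factors are not reproduced here.
A_COEF, B_COEF = -0.2888, -1.769

def approx_erf(x):
    """erf(x) ~= sign(x) * (A * (clip(|x|, max=-B) + B)^2 + 1)."""
    xc = np.clip(np.abs(x), None, -B_COEF)
    return np.sign(x) * (A_COEF * (xc + B_COEF) ** 2 + 1.0)

def approx_gelu(x):
    """GELU(x) = 0.5 * x * (1 + erf(x / sqrt(2))), with erf approximated."""
    return 0.5 * x * (1.0 + approx_erf(x / np.sqrt(2.0)))

def quantize_int8(x, scale):
    """Symmetric 8-bit fixed-point quantization used for weights/activations."""
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

# Example: quantize the approximated GELU of a small activation tensor.
acts = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(quantize_int8(approx_gelu(acts), scale=0.02))
```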
4. Latency-Aware Multi-Stage Training and Optimization
HeatViT selects both the transformer blocks on which to place token selectors and their average pruning rates, subject to an end-to-end latency target $T_{\mathrm{tgt}}$ and an accuracy-drop budget $\epsilon$.
A hardware latency lookup table $\mathrm{LUT}_b(\rho)$ records the profiled latency of block $b$ as a function of its token keep ratio $\rho$, so the end-to-end latency is estimated as $\hat{T} = \sum_b \mathrm{LUT}_b(\rho_b)$.
A latency-sparsity loss $\mathcal{L}_{\mathrm{lat}}$, which penalizes deviation of the estimated latency $\hat{T}$ from the target $T_{\mathrm{tgt}}$, is added to the standard classification and distillation loss terms, giving the full objective $\mathcal{L} = \mathcal{L}_{\mathrm{cls}} + \lambda_{\mathrm{distill}}\,\mathcal{L}_{\mathrm{distill}} + \lambda_{\mathrm{lat}}\,\mathcal{L}_{\mathrm{lat}}$, where $\lambda_{\mathrm{distill}}$ and $\lambda_{\mathrm{lat}}$ weight the auxiliary terms.
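A minimal sketch of this objective, assuming the per-block latency lookup is exposed as a differentiable (e.g., piecewise-linear) function of the keep ratio and using placeholder coefficient values:

```python
def latency_loss(keep_ratios, latency_lut, target_latency):
    """Penalize deviation of the estimated end-to-end latency from the target.

    `latency_lut[b](rho)` is assumed to be a differentiable interpolation of
    the profiled latency of block b at keep ratio rho; `keep_ratios` holds
    the expected keep ratios, one per selector-equipped block.
    """
    est = sum(latency_lut[b](r) for b, r in enumerate(keep_ratios))
    return (est / target_latency - 1.0) ** 2

def heatvit_objective(l_cls, l_distill, l_lat, lam_distill=1.0, lam_lat=1.0):
    """Full objective: classification + distillation + latency-sparsity terms.

    The coefficient values here are placeholders.
    """
    return l_cls + lam_distill * l_distill + lam_lat * l_lat
```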
The algorithm proceeds progressively (a schematic driver is sketched after this list):
- Insert a selector at the last transformer block and initialize its average pruning rate to a small value.
- Fine-tune, incrementally raising the pruning rate until the accuracy or latency constraint is reached.
- Repeat for preceding blocks, stopping before block 4.
- Merge selectors into stages if neighboring pruning rates are sufficiently close (within a small difference threshold).
- Retrain as needed, requiring ≈90% of standard ViT training effort.
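The sketch below outlines this loop. It is schematic: `insert_selector`, `fine_tune`, and `measure` are placeholder callables standing in for the real training pipeline, and the step size, earliest block index, and merge tolerance are illustrative defaults:

```python
def progressive_insertion(blocks, insert_selector, fine_tune, measure,
                          target_latency, max_acc_drop,
                          rate_step=0.05, earliest_block=4, merge_tol=0.05):
    """Insert selectors back-to-front, raise pruning rates, then merge stages.

    `measure()` is assumed to return (accuracy_drop, latency) for the current
    model; all helpers are placeholders for the actual training pipeline.
    """
    rates = {}
    # Walk from the last transformer block toward the earliest allowed block.
    for b in sorted((blk for blk in blocks if blk >= earliest_block), reverse=True):
        insert_selector(b)
        rate = rate_step                       # start from a mild pruning rate
        while True:
            fine_tune(b, rate)
            acc_drop, latency = measure()
            # Stop raising the rate once the accuracy budget is exhausted
            # or the latency target is met.
            if acc_drop > max_acc_drop or latency <= target_latency:
                break
            rate += rate_step
        rates[b] = rate
    if not rates:
        return rates, []
    # Merge neighbouring selectors with similar rates into shared stages.
    stages, group = [], [min(rates)]
    for b in sorted(rates)[1:]:
        if abs(rates[b] - rates[group[-1]]) < merge_tol:
            group.append(b)
        else:
            stages.append(group)
            group = [b]
    stages.append(group)
    return rates, stages
```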
5. Empirical Performance and Comparative Evaluation
Key Results on ImageNet
| Model | Orig. Acc (%) | Orig. GMACs | HeatViT GMACs | GMAC Reduction | Acc. vs. Existing Pruning (similar GMACs) |
|---|---|---|---|---|---|
| DeiT-T | 72.2 | 1.30 | 1.00 / 0.90 / 0.75 | 23.1–42.3% | +0.7–8.9% |
| DeiT-S | 79.8 | 4.60 | 3.86 / 2.64 / 2.02 | 16–56% | +0.7–8.9% |
| LV-ViT-S | 80.5 | 6.55 | 5.49 / 3.77 / 2.88 | 16–56% | +0.7–8.9% |
| DeiT-B / LV-ViT-M | - | - | - | similar | similar |
- Accuracy vs. Computation: At a similar computation cost, HeatViT achieves 0.7–8.9% higher top-1 accuracy than existing ViT pruning methods; at a similar accuracy, it reduces computation by up to roughly 65% across the tested ViT backbones.
Hardware Implementation (Xilinx ZCU102, 8-bit Pruning)
| Backbone | Baseline FPS | Baseline Power (W) | HeatViT FPS | Speedup | HeatViT Power (W) | FPS/W |
|---|---|---|---|---|---|---|
| DeiT-T | 78.3 | 8.012 | 271.2 | 3.46× | 9.453 | 28.7 |
| DeiT-S | 25.9 | 10.095 | 109.2 | 4.22× | 10.697 | 10.2 |
| LV-ViT-S | 19.4 | - | 89.1 | 4.59× | - | - |
| DeiT-B | 11.2 | 11.041 | 54.8 | 4.89× | 11.352 | - |
- FPGA resource overhead: only a modest number of additional LUTs and DSPs relative to the ViT-only baseline accelerator.
- GPU/CPU Comparison (NVIDIA Jetson TX2): the FPGA implementation of HeatViT delivers higher throughput and higher energy efficiency (FPS/W) than both the TX2 CPU and the TX2 GPU.
- Accuracy Drop: from $0.0\%$ to under $1.2\%$ even at the maximal pruning rates.
6. Context, Limitations, and Implications
HeatViT offers significant advances in hardware-aware structured pruning for transformer architectures. The dynamic token selector, progressive mask updating, and retention via package tokens serve both algorithmic efficiency and error correction. The use of resource-optimized fixed-point arithmetic and polynomial nonlinearities allows deployment on low-cost FPGAs, with empirical results indicating negligible accuracy loss and substantial improvement in throughput and efficiency.
These results suggest that HeatViT establishes a strong baseline for deploying high-capacity ViTs on device-scale, power-limited hardware platforms, with tailored adaptation to latency constraints and resource limits.
Potential limitations include the need for blockwise latency profiling per hardware instantiation, sensitivity of early-stage token selection, and retraining overhead for merging selectors into stages. A plausible implication is that similar techniques could be generalized to other non-ViT architectures requiring dynamic adaptive sparsity in hardware deployment scenarios.
Research on HeatViT was published by authors investigating hardware-software co-design for transformer acceleration (Dong et al., 2022).