The paper "HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers" addresses the challenge of deploying vision transformers (ViTs) on resource-constrained edge devices, such as embedded FPGAs, by introducing a framework called HeatViT.
Vision transformers have achieved strong results on computer vision tasks; however, their heavy computational and memory demands limit their practical use on less capable hardware. HeatViT tackles this issue with an adaptive token pruning strategy, co-designed with the hardware, that reduces computation while preserving accuracy.
Key Contributions:
- Adaptive Token Pruning:
  - The paper presents an attention-based multi-head token selector that is inserted before selected transformer blocks to dynamically identify and prune non-essential tokens on a per-image basis. This adaptive pruning cuts computational load without significantly sacrificing accuracy (a PyTorch sketch of such a selector follows after this list).
- Hardware Implementation:
  - The token selector is implemented in hardware by adding lightweight control logic that heavily reuses the existing compute engines of the backbone ViT accelerator, keeping the extra resource cost small (see the token-packaging sketch after this list).
- Quantization Techniques:
  - To further improve hardware efficiency, the authors apply 8-bit fixed-point quantization and propose polynomial approximations for ViTs' nonlinear functions (e.g., GELU and Softmax) so that quantization introduces little accuracy loss (see the quantization sketch after this list).
- Latency-Aware Training Strategy:
  - Training proceeds in multiple latency-aware stages that optimize where token selectors are inserted among the transformer blocks and fine-tune the per-selector pruning rates, balancing model accuracy against inference latency (see the loss sketch after this list).
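
To make the token selector concrete, here is a minimal PyTorch sketch of an attention-style multi-head selector, assuming a DeiT-style layout (class token plus patch tokens). The module structure, head count, scoring MLPs, and Gumbel-Softmax relaxation are assumptions in the spirit of common token-pruning designs, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadTokenSelector(nn.Module):
    """Hypothetical multi-head token selector: each head scores its slice
    of the embedding, and the averaged logits decide keep vs. prune."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        head_dim = dim // num_heads
        # One lightweight MLP per head producing [prune, keep] logits.
        self.heads = nn.ModuleList(
            nn.Sequential(
                nn.Linear(head_dim, head_dim // 2),
                nn.GELU(),
                nn.Linear(head_dim // 2, 2),
            )
            for _ in range(num_heads)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1 + num_patches, dim); the class token is never pruned.
        patches = x[:, 1:, :]
        b, n, d = patches.shape
        chunks = patches.view(b, n, self.num_heads, d // self.num_heads)
        logits = torch.stack(
            [head(chunks[:, :, i, :]) for i, head in enumerate(self.heads)]
        ).mean(dim=0)
        # Gumbel-Softmax yields hard keep/prune decisions that remain
        # differentiable for end-to-end training.
        return F.gumbel_softmax(logits, hard=True, dim=-1)[..., 1]
```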
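
On the software side, the hardware control path amounts to index bookkeeping: selecting which tokens survive and consolidating the rest, while the expensive matrix multiplications on the shortened sequence run on the accelerator's existing engines. The sketch below is a software analogy under that reading; the function name, the static top-k selection, and the single averaged "package" token are illustrative assumptions.

```python
import torch

def gather_and_package(x: torch.Tensor, keep_scores: torch.Tensor, k: int):
    """Keep the k highest-scoring patch tokens and fuse the rest into one
    'package' token, so downstream blocks see a shorter sequence.
    A fixed k keeps tensor shapes static, which is hardware-friendly."""
    cls_tok, patches = x[:, :1, :], x[:, 1:, :]
    b, n, d = patches.shape
    # Index selection stands in for the accelerator's control logic.
    idx = keep_scores.topk(k, dim=1).indices  # (batch, k)
    kept = torch.gather(patches, 1, idx.unsqueeze(-1).expand(-1, -1, d))
    # A score-weighted average (weights near zero for kept tokens) preserves
    # a summary of the pruned information instead of discarding it outright.
    w = (1.0 - keep_scores).unsqueeze(-1)
    package = (patches * w).sum(1, keepdim=True) / (w.sum(1, keepdim=True) + 1e-6)
    return torch.cat([cls_tok, kept, package], dim=1)  # (batch, k + 2, dim)
```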
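
To illustrate the quantization step, here is a minimal sketch of simulated 8-bit fixed-point quantization together with a polynomial GELU approximation. The fraction-bit split and the second-order erf polynomial (coefficients borrowed from I-BERT-style integer GELU approximations) are assumptions; the paper's exact formats and coefficients may differ.

```python
import torch

def fake_quant_s8(x: torch.Tensor, frac_bits: int = 4) -> torch.Tensor:
    """Simulated signed 8-bit fixed-point quantization: round to a grid
    of 2**-frac_bits and clamp to the int8 range [-128, 127]."""
    scale = 2.0 ** frac_bits
    return torch.clamp(torch.round(x * scale), -128, 127) / scale

def gelu_poly(x: torch.Tensor) -> torch.Tensor:
    """GELU built from a clipped second-order polynomial approximation of
    erf(x / sqrt(2)); coefficients follow I-BERT's i-GELU and may differ
    from HeatViT's exact choice."""
    a, b = -0.2888, -1.769
    erf_approx = torch.sign(x) * (
        a * (torch.clamp(torch.abs(x), max=-b) + b) ** 2 + 1.0
    )
    return 0.5 * x * (1.0 + erf_approx)
```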
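
Finally, a sketch of what a latency-aware objective could look like: the task loss plus a penalty steering each selector's average keep ratio toward a target derived from a latency budget. The quadratic penalty form, the weight `lam`, and the function name are illustrative; the paper's multi-stage schedule and exact objective are not reproduced here.

```python
import torch

def latency_aware_loss(
    ce_loss: torch.Tensor,      # task (cross-entropy) loss
    keep_scores: list,          # per-selector (batch, num_patches) keep scores
    target_ratios: list,        # per-selector keep ratios from a latency budget
    lam: float = 2.0,           # penalty weight (illustrative value)
) -> torch.Tensor:
    """Task loss plus a quadratic penalty pulling each selector's mean
    keep ratio toward its latency-derived target."""
    penalty = sum(
        (scores.mean() - t) ** 2
        for scores, t in zip(keep_scores, target_ratios)
    )
    return ce_loss + lam * penalty
```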
Performance Benefits:
- Compared to previous ViT pruning approaches, HeatViT achieves 0.7% to 8.9% higher accuracy at the same computation cost, or reduces computation by 28.4% to 65.3% at similar accuracy, on popular ViT models (DeiT-T, DeiT-S, DeiT-B, LV-ViT-S, and LV-ViT-M) evaluated on ImageNet.
- The hardware accelerator implemented on a Xilinx ZCU102 FPGA achieves speedups of 3.46x to 4.89x over the baseline hardware implementation.
Overall, HeatViT makes vision transformers substantially more viable for edge deployment, delivering these gains without requiring additional expensive hardware resources.