The paper "HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers" addresses the challenge of deploying vision transformers (ViTs) on resource-constrained edge devices, such as embedded FPGAs, by introducing a framework called HeatViT.
Vision transformers have achieved strong results on computer vision tasks; however, their heavy computational and memory demands limit their practical use on less capable hardware. HeatViT tackles this issue with an adaptive token pruning strategy, co-designed with the hardware, that reduces computation while preserving accuracy.
Key Contributions:
- Adaptive Token Pruning:
  - The paper presents an attention-based multi-head token selector that is inserted before selected transformer blocks to dynamically identify and prune non-essential tokens on a per-image basis. This adaptive pruning cuts computational load without significantly sacrificing accuracy (a PyTorch sketch of such a selector follows after this list).
- Hardware Implementation:
  - The token selector is implemented in hardware by adding lightweight control logic that heavily reuses the existing compute engines of the backbone ViT accelerator, keeping the extra resource cost small (see the token-packaging sketch after this list).
- Quantization Techniques:
  - To further improve hardware efficiency, the authors apply 8-bit fixed-point quantization and propose polynomial approximations for ViTs' nonlinear functions (e.g., GELU and Softmax) so that quantization introduces little accuracy loss (see the quantization sketch after this list).
- Latency-Aware Training Strategy:
  - Training proceeds in multiple latency-aware stages that optimize where token selectors are inserted among the transformer blocks and fine-tune the per-selector pruning rates, balancing model accuracy against inference latency (see the loss sketch after this list).
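
To make the token selector concrete, here is a minimal PyTorch sketch of an attention-style multi-head selector, assuming a DeiT-style layout (class token plus patch tokens). The module structure, head count, scoring MLPs, and Gumbel-Softmax relaxation are assumptions in the spirit of common token-pruning designs, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadTokenSelector(nn.Module):
    """Hypothetical multi-head token selector: each head scores its slice
    of the embedding, and the averaged logits decide keep vs. prune."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        head_dim = dim // num_heads
        # One lightweight MLP per head producing [prune, keep] logits.
        self.heads = nn.ModuleList(
            nn.Sequential(
                nn.Linear(head_dim, head_dim // 2),
                nn.GELU(),
                nn.Linear(head_dim // 2, 2),
            )
            for _ in range(num_heads)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1 + num_patches, dim); the class token is never pruned.
        patches = x[:, 1:, :]
        b, n, d = patches.shape
        chunks = patches.view(b, n, self.num_heads, d // self.num_heads)
        logits = torch.stack(
            [head(chunks[:, :, i, :]) for i, head in enumerate(self.heads)]
        ).mean(dim=0)
        # Gumbel-Softmax yields hard keep/prune decisions that remain
        # differentiable for end-to-end training.
        return F.gumbel_softmax(logits, hard=True, dim=-1)[..., 1]
```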
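
On the software side, the hardware control path amounts to index bookkeeping: selecting which tokens survive and consolidating the rest, while the expensive matrix multiplications on the shortened sequence run on the accelerator's existing engines. The sketch below is a software analogy under that reading; the function name, the static top-k selection, and the single averaged "package" token are illustrative assumptions.

```python
import torch

def gather_and_package(x: torch.Tensor, keep_scores: torch.Tensor, k: int):
    """Keep the k highest-scoring patch tokens and fuse the rest into one
    'package' token, so downstream blocks see a shorter sequence.
    A fixed k keeps tensor shapes static, which is hardware-friendly."""
    cls_tok, patches = x[:, :1, :], x[:, 1:, :]
    b, n, d = patches.shape
    # Index selection stands in for the accelerator's control logic.
    idx = keep_scores.topk(k, dim=1).indices  # (batch, k)
    kept = torch.gather(patches, 1, idx.unsqueeze(-1).expand(-1, -1, d))
    # A score-weighted average (weights near zero for kept tokens) preserves
    # a summary of the pruned information instead of discarding it outright.
    w = (1.0 - keep_scores).unsqueeze(-1)
    package = (patches * w).sum(1, keepdim=True) / (w.sum(1, keepdim=True) + 1e-6)
    return torch.cat([cls_tok, kept, package], dim=1)  # (batch, k + 2, dim)
```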
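
To illustrate the quantization step, here is a minimal sketch of simulated 8-bit fixed-point quantization together with a polynomial GELU approximation. The fraction-bit split and the second-order erf polynomial (coefficients borrowed from I-BERT-style integer GELU approximations) are assumptions; the paper's exact formats and coefficients may differ.

```python
import torch

def fake_quant_s8(x: torch.Tensor, frac_bits: int = 4) -> torch.Tensor:
    """Simulated signed 8-bit fixed-point quantization: round to a grid
    of 2**-frac_bits and clamp to the int8 range [-128, 127]."""
    scale = 2.0 ** frac_bits
    return torch.clamp(torch.round(x * scale), -128, 127) / scale

def gelu_poly(x: torch.Tensor) -> torch.Tensor:
    """GELU built from a clipped second-order polynomial approximation of
    erf(x / sqrt(2)); coefficients follow I-BERT's i-GELU and may differ
    from HeatViT's exact choice."""
    a, b = -0.2888, -1.769
    erf_approx = torch.sign(x) * (
        a * (torch.clamp(torch.abs(x), max=-b) + b) ** 2 + 1.0
    )
    return 0.5 * x * (1.0 + erf_approx)
```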
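
Finally, a sketch of what a latency-aware objective could look like: the task loss plus a penalty steering each selector's average keep ratio toward a target derived from a latency budget. The quadratic penalty form, the weight `lam`, and the function name are illustrative; the paper's multi-stage schedule and exact objective are not reproduced here.

```python
import torch

def latency_aware_loss(
    ce_loss: torch.Tensor,      # task (cross-entropy) loss
    keep_scores: list,          # per-selector (batch, num_patches) keep scores
    target_ratios: list,        # per-selector keep ratios from a latency budget
    lam: float = 2.0,           # penalty weight (illustrative value)
) -> torch.Tensor:
    """Task loss plus a quadratic penalty pulling each selector's mean
    keep ratio toward its latency-derived target."""
    penalty = sum(
        (scores.mean() - t) ** 2
        for scores, t in zip(keep_scores, target_ratios)
    )
    return ce_loss + lam * penalty
```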
Performance Benefits:
- Compared to previous ViT pruning approaches, HeatViT achieves 0.7% to 8.9% higher accuracy at the same computation cost, or reduces computation by 28.4% to 65.3% at similar accuracy, on popular ViT models (DeiT-T, DeiT-S, DeiT-B, LV-ViT-S, and LV-ViT-M) evaluated on ImageNet.
- The hardware accelerator implemented on a Xilinx ZCU102 FPGA achieves speedups of 3.46x to 4.89x over the baseline hardware implementation.
Overall, HeatViT makes vision transformers substantially more viable for edge deployment, delivering these gains without requiring additional expensive hardware resources.