- The paper introduces adaptive token computation that discards redundant tokens to reduce inference cost without additional parameters.
- The methodology reformulates Adaptive Computation Time (ACT), originally proposed for recurrent sequence models, yielding a 62% throughput improvement on DeiT-Tiny and 38% on DeiT-Small with minimal accuracy loss.
- The approach offers practical benefits for real-world applications such as autonomous driving and real-time video analytics in resource-constrained environments.
A-ViT: Adaptive Tokens for Efficient Vision Transformer
The paper presents A-ViT, a method designed to enhance the efficiency of Vision Transformers (ViTs) by introducing adaptive token computation. Vision Transformers, like their NLP counterparts, have gained prominence in visual tasks thanks to their ability to capture long-range dependencies through self-attention. However, ViTs inherit the computational intensity typical of transformers, primarily because attention operations scale quadratically with the number of tokens. A-ViT addresses this challenge by adapting the inference cost to the complexity of the input image, spending computation only where the content demands it.
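As a back-of-the-envelope illustration of this quadratic scaling (a sketch of standard attention arithmetic, not a figure from the paper), the snippet below counts the multiply-accumulates in a single self-attention layer; the 196 patch tokens and 192-dimensional embedding correspond to DeiT-Tiny at 224x224 resolution, ignoring the class token.

```python
def attention_flops(num_tokens: int, dim: int) -> int:
    """Rough multiply-accumulate count for one self-attention layer."""
    qkv_proj = 3 * num_tokens * dim * dim        # Q, K, V projections
    attn_scores = num_tokens * num_tokens * dim  # Q @ K^T
    attn_values = num_tokens * num_tokens * dim  # softmax(scores) @ V
    out_proj = num_tokens * dim * dim            # output projection
    return qkv_proj + attn_scores + attn_values + out_proj

# Halving the number of tokens roughly quarters the quadratic terms:
print(attention_flops(196, 192))  # all 14x14 patch tokens (DeiT-Tiny width)
print(attention_flops(98, 192))   # after discarding half the tokens
```

The quadratic terms shrink fastest when tokens are removed, which is precisely the lever A-ViT pulls.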
Methodology
A-ViT reformulates the Adaptive Computation Time (ACT) approach, originally proposed for recurrent sequence models, and brings it to the vision domain by discarding redundant spatial tokens. A halting mechanism decides the computation depth of each token based on a learned halting probability. Notably, the design operates without additional parameters: a single neuron of the last dense layer in each existing transformer block is repurposed to compute the halting probability. This adaptive halting integrates seamlessly with the standard transformer block, aligning computational effort with informative tokens and progressively reducing the number of tokens processed in deeper layers, which translates directly into inference speedups.
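A minimal sketch of how such per-token halting could be implemented in PyTorch is given below; the sigmoid scale and shift constants, the choice of embedding dimension 0 as the "borrowed" neuron, and the hard zeroing of halted tokens are illustrative assumptions rather than the paper's exact formulation.

```python
import torch

def adaptive_halting_step(tokens, cum_halt, active_mask,
                          eps=0.01, gamma=5.0, beta=-10.0):
    """One layer's halting update in the spirit of A-ViT's ACT variant.

    tokens:      (B, N, D) token embeddings after the current block
    cum_halt:    (B, N) running sum of halting probabilities per token
    active_mask: (B, N) bool, True where a token is still being computed
    """
    # Halting probability read off a designated embedding dimension,
    # mimicking the "borrowed neuron" idea (no extra parameters).
    halt_prob = torch.sigmoid(gamma * tokens[..., 0] + beta)

    # Accumulate only for tokens that are still active.
    cum_halt = cum_halt + halt_prob * active_mask

    # A token halts once its cumulative probability exceeds 1 - eps.
    active_mask = active_mask & (cum_halt < 1.0 - eps)

    # Zero out halted tokens so deeper blocks contribute nothing for them.
    tokens = tokens * active_mask.unsqueeze(-1)
    return tokens, cum_halt, active_mask
```

Because the halting score is read from an existing embedding dimension rather than a dedicated head, the mechanism adds no parameters, which is the key to A-ViT's cost-neutrality.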
Strong Numerical Results
Empirical validation is conducted on ImageNet1K image classification. The results indicate that A-ViT preserves the informative spatial tokens while discarding redundant ones, markedly reducing computational demands. Specifically, the approach improves the throughput of DeiT-Tiny by 62% and DeiT-Small by 38%, with a negligible accuracy drop of only 0.3%. These results outperform prior methods by a substantial margin, underscoring A-ViT's balance between computational efficiency and predictive performance.
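For context, throughput comparisons like these are usually obtained with a simple timing loop; the harness below is a hypothetical sketch (batch size, input resolution, warm-up counts, and the CUDA synchronization choices are assumptions, not the paper's measurement protocol).

```python
import time
import torch

def measure_throughput(model, batch_size=256, image_size=224,
                       warmup=10, iters=50, device="cuda"):
    """Crude images-per-second estimate for a vision model."""
    model = model.eval().to(device)
    x = torch.randn(batch_size, 3, image_size, image_size, device=device)
    with torch.no_grad():
        for _ in range(warmup):              # warm up kernels and caches
            model(x)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()
    return batch_size * iters / (time.time() - start)
```

Comparing such a measurement for a baseline DeiT and its adaptive counterpart is how a relative throughput gain (e.g. the reported 62%) would be computed.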
Theoretical and Practical Implications
From a theoretical perspective, the innovation lies in adapting token computation dynamically to image complexity, a notion adapted and refined from ACT. This adaptive computation could pave the way for further extensions of ViTs, hinting at applications beyond classification to more complex vision tasks such as object detection and segmentation.
Practically, A-ViT's design, which maximizes computational efficiency on off-the-shelf hardware, holds implications for the deployment of vision transformers in resource-constrained environments, such as edge devices. By leveraging the existing architecture of ViTs without the need for added parameters, it provides an immediately deployable solution that enhances processing speed without relying on specialized hardware. This adaptability makes it attractive for real-world applications where inference time is crucial, such as autonomous driving and real-time video analytics.
Future Directions
Looking ahead, incorporating A-ViT into more sophisticated ViT architectures could yield further performance improvements. Additionally, adapting the halting mechanism to video processing, where temporal redundancy can also be exploited, presents a promising avenue for future research. It would be particularly interesting to study how the approach fares on more diverse datasets beyond ImageNet1K, which could accelerate its adoption across varied visual tasks.
In conclusion, A-ViT exemplifies a methodical advance in vision transformer efficiency by intelligently modulating computational effort, ensuring that resources are conserved while maintaining a high level of predictive accuracy. This research is a prime example of taking theoretical concepts from one domain and effectively adapting them to benefit another, providing a tangible impact on both the understanding and application of vision transformers.