LeViT: A Vision Transformer in ConvNet's Clothing for Faster Inference
The paper "LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference" tackles the accuracy/efficiency trade-off in image classification. It combines the strengths of convolutional neural networks (CNNs) and vision transformers in a hybrid architecture, LeViT, which aims for a substantial inference-speed advantage over existing models while remaining competitive in accuracy.
Architectural Insights
The authors introduce a family of image classification architectures that bring CNN design principles into the transformer framework, in particular activation maps of decreasing resolution across stages and an efficient way to inject positional information. These elements are combined into a hybrid structure that retains the highly parallel computation that makes transformers fast on modern hardware.
Key Contributions
The paper underscores several novel components integral to the LeViT architecture:
- Multi-Stage Transformer Architecture: attention doubles as the downsampling mechanism, reducing spatial resolution between stages while preserving salient features (see the shrinking-attention sketch after this list).
- Efficient Patch Descriptor: a small convolutional stem replaces ViT's single large patchify layer, shrinking the input cheaply and reducing the work done in the first layers (sketched below).
- Attention Bias Mechanism: a learned, translation-invariant attention bias replaces explicit positional embeddings, injecting spatial information directly into the attention logits (sketched below).
- Optimized Attention-MLP Block: the attention and MLP blocks are redesigned to raise network capacity per unit of compute time, balancing where the computational budget is spent.
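To make the attention-based downsampling concrete, here is a minimal PyTorch sketch of a "shrinking" attention layer: queries are computed on a stride-2 subsampled grid while keys and values see the full grid, so the output has a quarter of the input tokens. Names and shapes are illustrative assumptions, not the authors' code, and the channel expansion LeViT applies at each shrink step is omitted for brevity.

```python
import torch
import torch.nn as nn

class ShrinkAttention(nn.Module):
    """Attention-based downsampling: queries live on a stride-2
    subsampled grid, keys/values on the full grid, so the output
    has (resolution // 2) ** 2 tokens instead of resolution ** 2."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, resolution: int) -> torch.Tensor:
        # x: (batch, N, dim) with N = resolution ** 2, resolution even
        B, N, C = x.shape
        grid = x.view(B, resolution, resolution, C)
        q_in = grid[:, ::2, ::2, :].reshape(B, -1, C)  # subsampled queries
        q = self.q(q_in).view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
        kv = self.kv(x).view(B, N, 2, self.num_heads, self.head_dim)
        k, v = kv.permute(2, 0, 3, 1, 4)               # each (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, heads, N/4, N)
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, -1, C)
        return self.proj(out)                          # (B, (resolution // 2) ** 2, dim)
```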
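The patch descriptor can be pictured as a small convolutional pyramid in front of the transformer stages. A minimal sketch, assuming 3x3 stride-2 convolutions with Hardswish activations (the channel widths below are illustrative):

```python
import torch.nn as nn

def conv_stem(in_ch: int = 3, dims=(32, 64, 128, 256)) -> nn.Sequential:
    """Four stride-2 convolutions shrink a 224x224 image to a 14x14
    grid of embeddings, replacing ViT's single large patchify layer."""
    layers, prev = [], in_ch
    for dim in dims:
        layers += [
            nn.Conv2d(prev, dim, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(dim),
            nn.Hardswish(),
        ]
        prev = dim
    return nn.Sequential(*layers)
```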
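The translation-invariant attention bias amounts to a lookup table of learned scalars, one per head and per relative offset between query and key positions, added directly to the attention logits. A minimal sketch (module and parameter names are illustrative):

```python
import torch
import torch.nn as nn

class AttentionBias(nn.Module):
    """Learned per-head bias indexed by relative (|dx|, |dy|) offsets,
    added to attention logits in place of positional embeddings."""

    def __init__(self, num_heads: int, resolution: int):
        super().__init__()
        points = [(x, y) for x in range(resolution) for y in range(resolution)]
        offsets, idxs = {}, []
        for p1 in points:
            for p2 in points:
                off = (abs(p1[0] - p2[0]), abs(p1[1] - p2[1]))
                idxs.append(offsets.setdefault(off, len(offsets)))
        self.biases = nn.Parameter(torch.zeros(num_heads, len(offsets)))
        self.register_buffer(
            "bias_idxs", torch.tensor(idxs).view(len(points), len(points))
        )

    def forward(self, attn_logits: torch.Tensor) -> torch.Tensor:
        # attn_logits: (batch, heads, N, N) with N = resolution ** 2
        return attn_logits + self.biases[:, self.bias_idxs]
```

Because the bias depends only on the offset between two positions, not their absolute coordinates, it is invariant to translation, matching the convolutional inductive bias the paper borrows.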
Performance Evaluation
Empirical evaluations show that LeViT markedly improves the speed/accuracy trade-off relative to both EfficientNet and vision transformers such as ViT/DeiT. Notably, at 80% ImageNet top-1 accuracy, LeViT is up to five times faster than EfficientNet on CPU. These results support the design choices behind the architecture as a path to high-accuracy, resource-efficient image classification.
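As a rough illustration of how such a comparison can be reproduced, one can time single-image CPU inference, for example with models from the timm library (a simplified probe that assumes timm ships these model names; it is not the paper's benchmark protocol):

```python
import time
import torch
import timm  # assumes a timm version that includes LeViT and EfficientNet

torch.set_num_threads(1)  # single-threaded CPU timing for comparability

for name in ("levit_256", "efficientnet_b2"):
    model = timm.create_model(name, pretrained=False).eval()
    x = torch.randn(1, 3, 224, 224)
    with torch.inference_mode():
        for _ in range(5):                    # warm-up runs
            model(x)
        start = time.perf_counter()
        for _ in range(20):
            model(x)
        latency = (time.perf_counter() - start) / 20
    print(f"{name}: {latency * 1000:.1f} ms / image")
```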
Implications and Future Work
The implications of LeViT extend beyond raw speed: it offers a template for hybrid architectures that deliberately combine the strengths of CNNs and transformers. This opens avenues for models that operate in resource-constrained environments without sacrificing inference quality.
The work also points to optimization paths for future systems, notably mobile and edge deployment, where inference latency is a critical constraint. Follow-up work may extend the principles demonstrated in LeViT to other domains or combine them with newer architectures.
In summary, LeViT represents a noteworthy contribution to the advancement of image classification architectures, particularly in contexts requiring rapid and efficient inference. The proposed architectural innovations showcase substantial potential for application in diverse computational settings, offering a robust foundation for future research.