Vision Transformers (ViTs) have demonstrated strong performance in computer vision tasks, but their significant computational cost often makes them impractical for real-time applications on resource-constrained hardware. The paper "EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention" (Liu et al., 2023) addresses this challenge by proposing a new family of ViTs designed for high inference speed and efficiency.
The authors systematically analyze the speed bottlenecks in existing ViTs, identifying three main factors:
- Memory Access: Operations like tensor reshaping and element-wise functions within Multi-Head Self-Attention (MHSA) are memory-bound, meaning their speed is limited by data movement to and from memory, not computation.
- Computation Redundancy: Attention heads in standard MHSA often learn similar projections and attention maps, leading to redundant calculations.
- Parameter Usage: Parameter allocation strategies inherited from NLP transformers may not be optimal for visual tasks and efficient models.
Based on this analysis, EfficientViT introduces several key architectural and design principles:
1. EfficientViT Building Block:
The core of EfficientViT is a new building block that incorporates a memory-efficient sandwich layout, cascaded group attention, and parameter reallocation.
- Sandwich Layout: Instead of alternating MHSA and Feed-Forward Network (FFN) layers equally, the block uses a single, potentially memory-bound self-attention layer sandwiched between multiple memory-efficient FFN layers. This design reduces the proportion of time spent on memory-bound operations while allowing sufficient channel communication through FFNs. The block structure is $X_{i+1} = \prod^{\mathcal{N}} \Phi_i^F \big( \Phi_i^A \big( \prod^{\mathcal{N}} \Phi_i^F (X_i) \big) \big)$, where $\Phi_i^F$ denotes an FFN layer (applied $\mathcal{N}$ times on each side) and $\Phi_i^A$ is the single self-attention layer. An extra depthwise convolution (DWConv) is applied before each FFN layer to incorporate local structural information (a simplified PyTorch sketch of the full block follows this list).
- Cascaded Group Attention (CGA): To combat attention head redundancy and improve computation efficiency, CGA feeds each attention head with a different split of the input feature channels. The outputs of successive heads are then cascaded, meaning the output of the $(j-1)$-th head is added to the input split of the $j$-th head before its attention computation. This is formulated as:
$X'_{ij} = X_{ij} + \tilde{X}_{i(j-1)}, \quad \text{for } 1 < j \le h$
where $X_{ij}$ is the $j$-th split of the input feature $X_i$ (i.e., $X_i = [X_{i1}, X_{i2}, \ldots, X_{ih}]$ for $h$ heads) and $\tilde{X}_{i(j-1)}$ is the output of the $(j-1)$-th head. The final output is the concatenation of all head outputs followed by a projection layer. This approach reduces FLOPs and parameters similarly to group convolutions and increases the effective network depth.
- Parameter Reallocation: Based on empirical analysis using structured pruning, the authors found that certain components are more critical than others. EfficientViT reallocates parameters by giving fewer channels to Q and K projections in attention heads and the hidden dimensions of FFNs (reducing the expansion ratio from 4 to 2). Conversely, the V projection channels are kept relatively large, close to the input embedding dimension. This makes parameter usage more efficient for speed and accuracy.
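To make these ideas concrete, below is a minimal PyTorch sketch of cascaded group attention inside the sandwich layout, assuming token-shaped inputs of shape (B, N, dim). The module names (`CascadedGroupAttention`, `SandwichBlock`), the `key_dim` value, and the omission of the depthwise convolutions and BatchNorm are simplifications for illustration, not the authors' reference implementation.

```python
import torch
import torch.nn as nn


class CascadedGroupAttention(nn.Module):
    """Each head attends over its own channel split; the output of head j-1
    is added to the input split of head j before attention (the cascade)."""

    def __init__(self, dim, num_heads=4, key_dim=16):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.split_dim = dim // num_heads
        self.key_dim = key_dim
        self.scale = key_dim ** -0.5
        # Q/K use a small key_dim while V keeps the full split width,
        # mirroring the parameter reallocation described above.
        self.qkvs = nn.ModuleList(
            nn.Linear(self.split_dim, 2 * key_dim + self.split_dim)
            for _ in range(num_heads)
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (B, N, dim) token features
        splits = x.chunk(self.num_heads, dim=-1)
        outs = []
        feat = splits[0]
        for j in range(self.num_heads):
            if j > 0:
                feat = splits[j] + outs[-1]  # cascade previous head's output
            q, k, v = self.qkvs[j](feat).split(
                [self.key_dim, self.key_dim, self.split_dim], dim=-1)
            attn = (q @ k.transpose(-2, -1)) * self.scale
            outs.append(attn.softmax(dim=-1) @ v)
        return self.proj(torch.cat(outs, dim=-1))


def ffn(dim, ratio=2):
    # Expansion ratio 2 rather than the usual 4 (parameter reallocation).
    return nn.Sequential(nn.Linear(dim, dim * ratio), nn.ReLU(),
                         nn.Linear(dim * ratio, dim))


class SandwichBlock(nn.Module):
    """N FFNs, one attention layer, then N FFNs, all with residual connections.
    The paper's depthwise conv before each FFN is omitted here for brevity."""

    def __init__(self, dim, num_ffn=2, num_heads=4):
        super().__init__()
        self.pre_ffns = nn.ModuleList(ffn(dim) for _ in range(num_ffn))
        self.attn = CascadedGroupAttention(dim, num_heads)
        self.post_ffns = nn.ModuleList(ffn(dim) for _ in range(num_ffn))

    def forward(self, x):
        for f in self.pre_ffns:
            x = x + f(x)
        x = x + self.attn(x)
        for f in self.post_ffns:
            x = x + f(x)
        return x


block = SandwichBlock(dim=128)
tokens = torch.randn(2, 196, 128)  # e.g. a 14x14 grid of 128-d tokens
print(block(tokens).shape)         # torch.Size([2, 196, 128])
```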
2. Network Architecture:
EfficientViT models have a hierarchical structure with three stages. An overlapping patch embedding layer is used initially. Downsampling between stages is performed by an EfficientViT subsample block, which uses an inverted residual block (similar to MobileNetV2) instead of self-attention within the sandwich layout for efficiency during resolution reduction.
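As a rough illustration of the downsampling path, the sketch below builds a MobileNetV2-style inverted residual with stride 2 on 2D feature maps; the channel widths, expansion ratio, and function name are illustrative assumptions rather than the exact EfficientViT subsample block.

```python
import torch
import torch.nn as nn


def inverted_residual_downsample(in_ch, out_ch, expand=2):
    """Pointwise expand -> stride-2 depthwise conv -> pointwise project,
    with BatchNorm after each conv and ReLU activations."""
    hidden = in_ch * expand
    return nn.Sequential(
        nn.Conv2d(in_ch, hidden, kernel_size=1, bias=False),
        nn.BatchNorm2d(hidden),
        nn.ReLU(),
        nn.Conv2d(hidden, hidden, kernel_size=3, stride=2, padding=1,
                  groups=hidden, bias=False),   # depthwise, halves resolution
        nn.BatchNorm2d(hidden),
        nn.ReLU(),
        nn.Conv2d(hidden, out_ch, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_ch),
    )


down = inverted_residual_downsample(128, 192)
x = torch.randn(1, 128, 28, 28)
print(down(x).shape)  # torch.Size([1, 192, 14, 14])
```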
Implementation Considerations:
- Normalization and Activation: The authors opt for BatchNorm (BN) instead of LayerNorm (LN) because BN can be fused into the preceding linear or convolutional layer during inference, improving runtime speed (see the fusion sketch after this list). They also use ReLU activation, which is generally faster than GELU or HardSwish and better supported across inference frameworks and deployment targets such as ONNX and CoreML on mobile chipsets.
- Training: Models are trained from scratch on ImageNet-1K using standard practices: AdamW optimizer, cosine learning rate scheduler, and data augmentations like Mixup, AutoAugment, and Random Erasing. Distillation with a strong teacher model is shown to further improve performance.
- Throughput Evaluation: A key aspect of the paper is evaluating actual inference speed (throughput) on different hardware (Nvidia V100 GPU, Intel Xeon CPU) and deployment formats (ONNX, CoreML), providing practical benchmarks for real-world deployment. Batch size is also considered for GPU testing.
- Framework: The models are built using PyTorch and Timm, making implementation relatively straightforward for developers familiar with these libraries. The code is publicly available, providing a direct reference for implementation.
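To illustrate why BN, unlike LN, helps at inference time, here is a minimal sketch of folding a BatchNorm layer into the convolution that precedes it. The folding algebra is standard; the layer sizes and the random running statistics are arbitrary examples for the check.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(16, 32, kernel_size=3, padding=1, bias=True).eval()
bn = nn.BatchNorm2d(32).eval()
bn.running_mean.normal_()          # give BN non-trivial running statistics
bn.running_var.uniform_(0.5, 2.0)

# BN(conv(x)) = gamma * (W*x + b - mean) / sqrt(var + eps) + beta
#             = (scale * W) * x + scale * (b - mean) + beta
scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)

fused = nn.Conv2d(16, 32, kernel_size=3, padding=1, bias=True).eval()
with torch.no_grad():
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    fused.bias.copy_((conv.bias - bn.running_mean) * scale + bn.bias)

x = torch.randn(1, 16, 8, 8)
print(torch.allclose(bn(conv(x)), fused(x), atol=1e-5))  # True: one conv replaces two ops
```

LayerNorm, by contrast, computes its statistics from the input at runtime, so it cannot be folded into the preceding layer this way.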
Practical Applications:
EfficientViT targets applications requiring high-speed inference, such as:
- Mobile and Edge Devices: The strong performance on CPU, ONNX, and mobile chipsets (like Apple A13 Bionic) makes EfficientViT suitable for deployment on smartphones, embedded systems, and other resource-constrained environments.
- Real-time Computer Vision: Tasks like real-time object detection, image classification in video streams, or augmented reality require models that can process data quickly. EfficientViT's high throughput makes it a viable option for such scenarios.
- Applications with Strict Latency Requirements: Any application where prediction speed is critical (e.g., autonomous driving perception, industrial automation) can benefit from a faster model architecture.
Performance:
The paper shows that EfficientViT models achieve a superior trade-off between speed and accuracy compared to existing efficient CNNs and ViTs. For example:
- EfficientViT-M5 outperforms MobileNetV3-Large in accuracy and is significantly faster on both GPU and CPU.
- EfficientViT-M2 surpasses MobileViT-XXS in accuracy and is multiple times faster across GPU, CPU, and ONNX.
- EfficientViT-M4 shows competitive performance with state-of-the-art efficient models on object detection while having fewer FLOPs than some competitors.
Trade-offs:
While emphasizing speed and efficiency, EfficientViT models may have slightly more parameters than some highly optimized CNNs (e.g., MobileNetV3) for comparable accuracy. However, the design choices prioritize reducing memory-bound operations and improving actual throughput, which is often a more critical factor for real-time performance than just parameter count or FLOPs. The ablation studies demonstrate the empirical trade-offs between different design choices (e.g., number of FFNs, attention variants, activation functions) and their impact on accuracy and speed.
In summary, EfficientViT provides a well-analyzed and practically oriented approach to designing fast vision transformers. Its focus on memory efficiency through the sandwich layout, computation efficiency via cascaded group attention, and optimized parameter allocation, combined with hardware-friendly layer choices, makes it a strong candidate for real-world applications demanding high inference speed on various hardware platforms. The provided architecture details and experimental results offer clear guidance for implementing and deploying EfficientViT models.