EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention (2305.07027v1)

Published 11 May 2023 in cs.CV

Abstract: Vision transformers have shown great success due to their high model capabilities. However, their remarkable performance is accompanied by heavy computation costs, which makes them unsuitable for real-time applications. In this paper, we propose a family of high-speed vision transformers named EfficientViT. We find that the speed of existing transformer models is commonly bounded by memory inefficient operations, especially the tensor reshaping and element-wise functions in MHSA. Therefore, we design a new building block with a sandwich layout, i.e., using a single memory-bound MHSA between efficient FFN layers, which improves memory efficiency while enhancing channel communication. Moreover, we discover that the attention maps share high similarities across heads, leading to computational redundancy. To address this, we present a cascaded group attention module feeding attention heads with different splits of the full feature, which not only saves computation cost but also improves attention diversity. Comprehensive experiments demonstrate EfficientViT outperforms existing efficient models, striking a good trade-off between speed and accuracy. For instance, our EfficientViT-M5 surpasses MobileNetV3-Large by 1.9% in accuracy, while getting 40.4% and 45.2% higher throughput on Nvidia V100 GPU and Intel Xeon CPU, respectively. Compared to the recent efficient model MobileViT-XXS, EfficientViT-M2 achieves 1.8% superior accuracy, while running 5.8x/3.7x faster on the GPU/CPU, and 7.4x faster when converted to ONNX format. Code and models are available at https://github.com/microsoft/Cream/tree/main/EfficientViT.

Vision Transformers (ViTs) have demonstrated strong performance in computer vision tasks, but their significant computational cost often makes them impractical for real-time applications on resource-constrained hardware. The paper "EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention" (Liu et al., 2023) addresses this challenge by proposing a new family of ViTs designed for high inference speed and efficiency.

The authors systematically analyze the speed bottlenecks in existing ViTs, identifying three main factors:

  1. Memory Access: Operations like tensor reshaping and element-wise functions within Multi-Head Self-Attention (MHSA) are memory-bound, meaning their speed is limited by data movement to and from memory rather than by computation (a rough timing sketch follows this list).
  2. Computation Redundancy: Attention heads in standard MHSA often learn similar projections and attention maps, leading to redundant calculations.
  3. Parameter Usage: Parameter allocation strategies inherited from NLP transformers may not be optimal for visual tasks and efficient models.
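
To make "memory-bound" concrete, the rough PyTorch timing sketch below compares a compute-heavy projection against the reshape and element-wise operations that dominate MHSA overhead. The tensor shapes are illustrative choices, not the paper's benchmark setup; absolute numbers vary by device, but the reshape and activation kernels perform almost no arithmetic, so whatever time they take is essentially memory traffic.

```python
import time
import torch
import torch.nn.functional as F

def bench(fn, iters=100):
    # Simple wall-clock timer; wrap the loop with torch.cuda.synchronize()
    # if benchmarking on a GPU.
    for _ in range(10):          # warm-up
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

x = torch.randn(64, 196, 384)    # (batch, tokens, channels) -- illustrative sizes
w = torch.randn(384, 384)

t_matmul  = bench(lambda: x @ w)                               # compute-bound projection
t_reshape = bench(lambda: x.reshape(64, 14, 14, 384)
                           .permute(0, 3, 1, 2).contiguous())  # memory-bound reshape
t_gelu    = bench(lambda: F.gelu(x))                           # memory-bound element-wise op

print(f"matmul {t_matmul*1e3:.2f} ms | reshape {t_reshape*1e3:.2f} ms | gelu {t_gelu*1e3:.2f} ms")
```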

Based on this analysis, EfficientViT introduces several key architectural and design principles:

1. EfficientViT Building Block:

The core of EfficientViT is a new building block that incorporates a memory-efficient sandwich layout, cascaded group attention, and parameter reallocation.

  • Sandwich Layout: Instead of alternating MHSA and Feed-Forward Network (FFN) layers equally, the block uses a single, potentially memory-bound self-attention layer sandwiched between multiple memory-efficient FFN layers. This design reduces the proportion of time spent on memory-bound operations while allowing sufficient channel communication through FFNs. The block structure is $X_{i+1} = \Phi^{\rm F}_i(\Phi^{\rm A}_i(\Phi^{\rm F}_i(X_i)))$, where $\Phi^{\rm F}$ denotes multiple FFN layers and $\Phi^{\rm A}$ is the self-attention layer. An extra depthwise convolution (DWConv) is applied before each FFN layer to incorporate local structural information.
  • Cascaded Group Attention (CGA): To combat attention head redundancy and improve computation efficiency, CGA feeds each attention head a different split of the input feature channels. The outputs of successive heads are then cascaded: the output of head $j-1$ is added to the input of head $j$ before its attention computation. This is formulated as:

    $\widetilde{X}_{ij} = \mathrm{Attn}(X'_{ij}W^{\rm Q}_{ij},\, X'_{ij}W^{\rm K}_{ij},\, X'_{ij}W^{\rm V}_{ij})$

    $X'_{ij} = X_{ij} + \widetilde{X}_{i(j-1)}$ for $j > 1$, and $X'_{i1} = X_{i1}$,

    where $X_{ij}$ is the $j$-th split of the input $X_i$ and $\widetilde{X}_{i(j-1)}$ is the output of the previous head. The final output is the concatenation of all head outputs followed by a projection layer. This approach reduces FLOPs and parameters similarly to group convolutions and increases effective network depth. A minimal PyTorch sketch of the sandwich block and CGA follows this list.

  • Parameter Reallocation: Based on an empirical analysis using structured pruning, the authors found that some components are more critical than others. EfficientViT therefore allocates fewer channels to the Q and K projections in each attention head and shrinks the FFN hidden dimension (reducing the expansion ratio from 4 to 2), while keeping the V projection channels relatively large, close to the input embedding dimension. This makes parameter usage more efficient for both speed and accuracy.
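
The sketch below illustrates the two ideas in plain PyTorch. It is not the repository's implementation: the `CascadedGroupAttention` and `SandwichBlock` names, the dimensions, and the defaults are illustrative, the paper stacks several FFNs on each side of the attention, and the DWConv token mixers, normalization, and per-resolution attention details are omitted.

```python
import torch
import torch.nn as nn

class CascadedGroupAttention(nn.Module):
    """Minimal sketch of cascaded group attention (CGA).

    Each head attends over a different channel split of the input, and the
    output of head j-1 is added to the split fed to head j.  Q/K get a small
    key_dim while V keeps the full split width, mirroring the paper's
    parameter reallocation.  Dimensions are illustrative.
    """
    def __init__(self, dim: int, num_heads: int = 4, key_dim: int = 16):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.split_dim = dim // num_heads
        self.key_dim = key_dim
        self.scale = key_dim ** -0.5
        self.qkvs = nn.ModuleList(
            nn.Linear(self.split_dim, 2 * key_dim + self.split_dim)
            for _ in range(num_heads)
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (B, N, C)
        splits = x.chunk(self.num_heads, dim=-1)            # h splits of C // h channels
        outs = []
        feat = splits[0]
        for j in range(self.num_heads):
            if j > 0:                                       # cascade previous head's output
                feat = splits[j] + outs[-1]
            q, k, v = self.qkvs[j](feat).split(
                [self.key_dim, self.key_dim, self.split_dim], dim=-1)
            attn = (q @ k.transpose(-2, -1)) * self.scale   # (B, N, N)
            outs.append(attn.softmax(dim=-1) @ v)           # (B, N, split_dim)
        return self.proj(torch.cat(outs, dim=-1))           # concat heads, then project


class SandwichBlock(nn.Module):
    """Sandwich layout sketch: FFN -> CGA -> FFN with residual connections."""
    def __init__(self, dim: int, num_heads: int = 4, ffn_ratio: int = 2):
        super().__init__()
        def ffn() -> nn.Sequential:
            return nn.Sequential(nn.Linear(dim, dim * ffn_ratio), nn.ReLU(),
                                 nn.Linear(dim * ffn_ratio, dim))
        self.ffn1, self.ffn2 = ffn(), ffn()
        self.attn = CascadedGroupAttention(dim, num_heads)

    def forward(self, x: torch.Tensor) -> torch.Tensor:     # x: (B, N, C)
        x = x + self.ffn1(x)
        x = x + self.attn(x)
        return x + self.ffn2(x)


# Usage: a token sequence of 14x14 patches with 128 channels keeps its shape.
block = SandwichBlock(dim=128, num_heads=4)
out = block(torch.randn(2, 196, 128))                        # -> (2, 196, 128)
```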

2. Network Architecture:

EfficientViT models have a hierarchical structure with three stages. An overlapping patch embedding layer is used initially. Downsampling between stages is performed by an EfficientViT subsample block, which uses an inverted residual block (similar to MobileNetV2) instead of self-attention within the sandwich layout for efficiency during resolution reduction.
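
As a sketch of the downsampling step (the `SubsampleBlock` name, channel counts, expansion ratio, and norm/activation placement are assumptions, not the paper's exact configuration), a MobileNetV2-style inverted residual block with a stride-2 depthwise convolution might look like this:

```python
import torch
import torch.nn as nn

class SubsampleBlock(nn.Module):
    """Sketch of the downsampling step between stages: inside the sandwich
    layout, the attention layer is swapped for an inverted residual block
    whose depthwise convolution has stride 2."""
    def __init__(self, in_ch: int, out_ch: int, expand: int = 4):
        super().__init__()
        hidden = in_ch * expand
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),    # pointwise expand
            nn.BatchNorm2d(hidden), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, stride=2, padding=1,
                      groups=hidden, bias=False),        # depthwise, stride 2 (downsample)
            nn.BatchNorm2d(hidden), nn.ReLU(),
            nn.Conv2d(hidden, out_ch, 1, bias=False),    # pointwise project
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        return self.body(x)                               # -> (B, out_ch, H/2, W/2)

# Usage: halve a 28x28 feature map while widening channels.
print(SubsampleBlock(128, 192)(torch.randn(1, 128, 28, 28)).shape)  # [1, 192, 14, 14]
```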

Implementation Considerations:

  • Normalization and Activation: The authors opt for BatchNorm (BN) instead of LayerNorm (LN) because BN can be fused into preceding linear or convolutional layers during inference, improving runtime speed (a minimal fusion sketch follows this list). They also use ReLU activation, which is generally faster than GELU or HardSwish and better supported on inference platforms such as ONNX and mobile chipsets (e.g., CoreML).
  • Training: Models are trained from scratch on ImageNet-1K using standard practices: AdamW optimizer, cosine learning rate scheduler, and data augmentations like Mixup, AutoAugment, and Random Erasing. Distillation with a strong teacher model is shown to further improve performance.
  • Throughput Evaluation: A key aspect of the paper is evaluating actual inference speed (throughput) on different hardware (Nvidia V100 GPU, Intel Xeon CPU) and deployment formats (ONNX, CoreML), providing practical benchmarks for real-world deployment. Batch size is also considered for GPU testing.
  • Framework: The models are built using PyTorch and Timm, making implementation relatively straightforward for developers familiar with these libraries. The code is publicly available, providing a direct reference for implementation.
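
The BatchNorm choice pays off because BN folds into the preceding convolution (or linear layer) at inference time, so the normalization costs nothing at run time. Below is a generic conv+BN fusion sketch, not the repository's own utility; the same algebra applies to Linear+BN.

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold an eval-mode BatchNorm2d into the preceding Conv2d.

    y = gamma * (Wx + b - mean) / sqrt(var + eps) + beta
      = (gamma / sqrt(var + eps)) * W x + (gamma / sqrt(var + eps)) * (b - mean) + beta
    """
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, conv.dilation, conv.groups, bias=True)
    scale = bn.weight.data / torch.sqrt(bn.running_var + bn.eps)        # per output channel
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
    conv_bias = conv.bias.data if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.data = (conv_bias - bn.running_mean) * scale + bn.bias.data
    return fused

# Quick check that the fused layer matches conv followed by BN in eval mode.
conv, bn = nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32)
with torch.no_grad():                      # give BN non-trivial statistics
    bn.running_mean.uniform_(-1, 1); bn.running_var.uniform_(0.5, 2.0)
    bn.weight.uniform_(0.5, 2.0);    bn.bias.uniform_(-1, 1)
conv.eval(); bn.eval()
x = torch.randn(1, 16, 8, 8)
assert torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-5)
```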

Practical Applications:

EfficientViT targets applications requiring high-speed inference, such as:

  • Mobile and Edge Devices: The strong performance on CPU, ONNX, and mobile chipsets (like Apple A13 Bionic) makes EfficientViT suitable for deployment on smartphones, embedded systems, and other resource-constrained environments.
  • Real-time Computer Vision: Tasks like real-time object detection, image classification in video streams, or augmented reality require models that can process data quickly. EfficientViT's high throughput makes it a viable option for such scenarios.
  • Applications with Strict Latency Requirements: Any application where prediction speed is critical (e.g., autonomous driving perception, industrial automation) can benefit from a faster model architecture.

Performance:

The paper shows that EfficientViT models achieve a superior trade-off between speed and accuracy compared to existing efficient CNNs and ViTs. For example:

  • EfficientViT-M5 outperforms MobileNetV3-Large in accuracy and is significantly faster on both GPU and CPU.
  • EfficientViT-M2 surpasses MobileViT-XXS in accuracy and is multiple times faster across GPU, CPU, and ONNX.
  • EfficientViT-M4 shows competitive performance with state-of-the-art efficient models on object detection while having fewer FLOPs than some competitors.

Trade-offs:

While emphasizing speed and efficiency, EfficientViT models may have slightly more parameters than some highly optimized CNNs (e.g., MobileNetV3) for comparable accuracy. However, the design choices prioritize reducing memory-bound operations and improving actual throughput, which is often a more critical factor for real-time performance than just parameter count or FLOPs. The ablation studies demonstrate the empirical trade-offs between different design choices (e.g., number of FFNs, attention variants, activation functions) and their impact on accuracy and speed.

In summary, EfficientViT provides a well-analyzed and practically-oriented approach to designing fast vision transformers. Its focus on memory efficiency through the sandwich layout, computation efficiency via cascaded group attention, and optimized parameter allocation, combined with hardware-friendly layer choices, makes it a strong candidate for real-world applications demanding high inference speed on various hardware platforms. The provided architecture details and experimental results offer clear guidance for implementing and deploying EfficientViT models.

Authors: Xinyu Liu, Houwen Peng, Ningxin Zheng, Yuqing Yang, Han Hu, Yixuan Yuan