EfficientViT-L1: Efficient Vision Transformer
- EfficientViT-L1 is a hierarchical vision transformer backbone that uses multi-scale linear attention to balance global context and local detail for dense prediction tasks.
- It combines a convolutional stem with progressive transformer stages featuring structured downsampling, ensuring both computational efficiency and high throughput.
- The architecture delivers significant latency reductions and performance gains in semantic segmentation and super-resolution, ideal for resource-constrained deployments.
EfficientViT-L1 is a hierarchical high-resolution vision transformer backbone optimized for dense prediction tasks under stringent computational budgets. It introduces multi-scale linear attention to achieve the global receptive field and multi-scale processing—key for semantic segmentation, super-resolution, and similar vision workloads—while utilizing only lightweight, hardware-efficient operations. The EfficientViT-L1 configuration emphasizes structured downsampling, localized convolutional aggregation, and global linear kernel-based attention to provide high throughput and competitive model accuracy in real-world deployment scenarios (Cai et al., 2022).
1. Architectural Design
EfficientViT-L1 employs a five-part architecture: an initial convolutional “stem” followed by a four-stage transformer backbone. Stages 3 and 4 integrate the model’s defining multi-scale linear attention (MSLA) blocks, while earlier stages focus on feature extraction via convolutional and feed-forward projections. Downsampling occurs at each stage boundary, resulting in a progressive reduction of resolution and increased abstraction. A fixed input resolution is assumed for benchmarking purposes. The overall architectural flow is as follows:
- Stem: Two 3×3 convolutions with stride 2 successively lower the resolution to 1/4 of the input.
- Stage 1: Processes features at the stem’s 1/4 resolution; primarily feed-forward, no attention.
- Stage 2: Downsamples to 1/8 resolution; again, no attention.
- Stage 3: Operates at 1/16 resolution; MSLA introduced.
- Stage 4: Operates at 1/32 resolution; MSLA present.
Both MSLA stages use a combination of identity and depth-wise separable (DWS) convolution branches, aggregating both global and local context (Cai et al., 2022).
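A minimal sketch of the downsampling schedule implied by this flow, assuming an illustrative 512×512 input (the benchmark resolution is not restated here) and a stride-2 downsample at each stage boundary after Stage 1:

```python
def resolution_schedule(h, w):
    """Feature-map size after the stem and after each backbone stage."""
    sizes = {}
    h, w = h // 4, w // 4              # stem: two 3x3, stride-2 convolutions
    sizes["stem"] = (h, w)
    for stage in range(1, 5):
        if stage > 1:                  # downsample at each stage boundary
            h, w = h // 2, w // 2
        sizes[f"stage{stage}"] = (h, w)
    return sizes

print(resolution_schedule(512, 512))
# {'stem': (128, 128), 'stage1': (128, 128), 'stage2': (64, 64),
#  'stage3': (32, 32), 'stage4': (16, 16)}
```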
2. Per-Stage Configuration
EfficientViT-L1’s stage-wise block depth, channel width, and expansion ratios can be summarized as follows (exact values are representative estimates based on the GitHub configuration, as the original paper does not explicitly list all hyperparameters for L1):

| Stage | Output Resolution | # Blocks | Hidden Channels | FFN Expansion | MS-Linear-Attn |
|---|---|---|---|---|---|
| 1 | 1/4 | 2 | 64 | 4 | None |
| 2 | 1/8 | 2 | 128 | 4 | None |
| 3 | 1/16 | 6 | 256 | 4 | Yes (two-branch) |
| 4 | 1/32 | 2 | 512 | 4 | Yes (same as Stage 3) |
A plausible implication is that the per-head dimension in each MSLA stage follows from dividing the stage’s channel width by its number of attention heads; the exact head counts are specified in the released configurations (Cai et al., 2022).
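The table can be mirrored as a plain data structure to derive per-stage quantities such as the FFN hidden width (channels × expansion ratio); the values below are the representative estimates from the table, not authoritative hyperparameters:

```python
# Representative per-stage estimates from the table above (not authoritative).
STAGES = {
    1: {"blocks": 2, "channels": 64,  "ffn_expansion": 4, "msla": False},
    2: {"blocks": 2, "channels": 128, "ffn_expansion": 4, "msla": False},
    3: {"blocks": 6, "channels": 256, "ffn_expansion": 4, "msla": True},
    4: {"blocks": 2, "channels": 512, "ffn_expansion": 4, "msla": True},
}

# FFN hidden width per stage: channels x expansion ratio.
ffn_hidden = {s: cfg["channels"] * cfg["ffn_expansion"] for s, cfg in STAGES.items()}
print(ffn_hidden)  # {1: 256, 2: 512, 3: 1024, 4: 2048}
```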
3. Multi-Scale Linear Attention Mechanism
The core contribution of EfficientViT-L1 is the MSLA mechanism, first applied in Stages 3 and 4. This reformulates standard softmax self-attention with a linear ReLU kernel $\phi(x) = \mathrm{ReLU}(x)$, reducing memory and compute complexity from quadratic to linear in the number of tokens. Attention is computed as follows:
- Input projections: Queries $Q$, keys $K$, and values $V$ are generated from the input tokens via learned linear projections, each of size $N \times d$ for $N$ tokens and head dimension $d$.
- Multi-scale aggregation: $Q$, $K$, and $V$ are each processed by (a) an identity branch and (b) a DWS convolution branch, creating a set $\{Q^{(s)}, K^{(s)}, V^{(s)}\}$ for each scale $s$.
- Kernel-feature computation: For each scale, compute the summary statistics

$$S^{(s)} = \sum_j \mathrm{ReLU}\big(K^{(s)}_j\big)^{\top} V^{(s)}_j, \qquad z^{(s)} = \sum_j \mathrm{ReLU}\big(K^{(s)}_j\big)^{\top}.$$

Aggregate across all scales via learned weights $w_s$,

$$S = \sum_s w_s S^{(s)}, \qquad z = \sum_s w_s z^{(s)},$$

then output for each token $i$:

$$O_i = \frac{\mathrm{ReLU}(Q_i)\, S}{\mathrm{ReLU}(Q_i)\, z}.$$
This architectural innovation enables both global and local contextualization without expensive quadratic softmax attention, providing the model with the requisite expressivity for dense tasks while maintaining efficiency (Cai et al., 2022).
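A runnable single-scale sketch of ReLU linear attention in NumPy, checking that the linear-time form (via shared key/value summaries $\sum_j \phi(k_j)^\top v_j$ and $\sum_j \phi(k_j)$) matches the direct quadratic computation; all names here are illustrative, not the reference implementation:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def linear_attention(Q, K, V, eps=1e-6):
    """O(N) form: precompute shared key/value summaries, then apply per query."""
    phi_q, phi_k = relu(Q), relu(K)
    S = phi_k.T @ V                    # [d, d] summary: sum_j phi(k_j)^T v_j
    z = phi_k.sum(axis=0)              # [d]   normalizer: sum_j phi(k_j)
    return (phi_q @ S) / (phi_q @ z + eps)[:, None]

def quadratic_attention(Q, K, V, eps=1e-6):
    """Direct O(N^2) reference: explicit token-to-token weights."""
    w = relu(Q) @ relu(K).T            # [N, N] unnormalized attention weights
    return (w @ V) / (w.sum(axis=1, keepdims=True) + eps)

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((32, 16)) for _ in range(3))

# Both forms compute the same attention output.
assert np.allclose(linear_attention(Q, K, V), quadratic_attention(Q, K, V))
print(linear_attention(Q, K, V).shape)  # (32, 16)
```

The equivalence holds because matrix multiplication is associative: summarizing keys and values first avoids ever materializing the $N \times N$ weight matrix.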
4. Computational Efficiency and Hyperparameters
EfficientViT-L1 is explicitly designed for high-throughput inference and deployment on resource-constrained hardware. Reported statistics:
- Parameter count: Approximately 53 million
- Multiply-Accumulates (MACs): 5.3 G at the benchmark input resolution
- Edge GPU (Jetson AGX Orin, TensorRT fp16, batch size 1): 2.6 ms per image
- Cloud GPU (A100, TensorRT fp16): 6,207 images/sec (≈0.16 ms/img)
- Mobile CPU (Snapdragon 8 Gen 1, TFLite fp32): not reported directly for L1; comparably sized B-series models run in 30–50 ms
A plausible implication is that EfficientViT-L1’s latency and compute characteristics enable near real-time performance in edge and embedded vision applications, with throughput far exceeding contemporary softmax-attention-based and large-kernel convolution backbones at similar accuracy (Cai et al., 2022).
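As a quick sanity check on the cloud-GPU figure, throughput and per-image latency are reciprocals (ignoring batching and pipeline overlap):

```python
throughput_img_per_s = 6207            # A100, TensorRT fp16 (reported)
latency_ms = 1000.0 / throughput_img_per_s
print(f"{latency_ms:.2f} ms/img")      # 0.16 ms/img, matching the quoted figure
```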
5. Functional Workflow and Pseudocode
The EfficientViTBlock, as applied in MSLA stages, is captured by the following workflow:
```
def EfficientViTBlock(x):                 # x: [N, C] token features
    Q = x @ Wq
    K = x @ Wk
    V = x @ Wv

    # Multi-scale token aggregation (identity + depth-wise 5x5 conv branches)
    Qs = [Q, ConvDW5x5(Q)]
    Ks = [K, ConvDW5x5(K)]
    Vs = [V, ConvDW5x5(V)]

    # Linear attention summaries per scale
    S_s = [sum_j(ReLU(Ks[s][j]).T * Vs[s][j]) for s in (0, 1)]
    z_s = [sum_j(ReLU(Ks[s][j]).T) for s in (0, 1)]

    # Aggregate scales with learned weights
    S = W0 @ S_s[0] + W1 @ S_s[1]
    z = W0 @ z_s[0] + W1 @ z_s[1]

    # Per-token output: A_i = ReLU(Q[i]) @ S / (ReLU(Q[i]) @ z)
    O = stack([ReLU(Q[i]) @ S / (ReLU(Q[i]) @ z) for i in tokens])

    # Final linear fuse, residual connection, normalization
    O = O @ Wo
    x = LayerNorm(x + O)

    # FFN with interleaved depth-wise conv
    y = x @ W1_ffn
    y = DepthwiseConv3x3(y)
    y = Activation(y)
    y = y @ W2_ffn
    x = LayerNorm(x + y)
    return x
```
In this workflow, linear attention and feed-forward processing are interleaved, separated by local DWS convolution to increase inductive bias toward spatial priors while retaining the efficiency of linear attention (Cai et al., 2022).
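A runnable two-scale NumPy sketch of this workflow; the 5×5 depth-wise convolution branch is stood in for by a simple neighbor average over the token axis (a hypothetical substitute, since the real model convolves over the 2-D feature map), and the queries use only the identity branch, matching the per-token output formula in the pseudocode:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def neighbor_avg(x):
    """Local token aggregation (hypothetical stand-in for ConvDW5x5)."""
    padded = np.pad(x, ((1, 1), (0, 0)), mode="edge")
    return (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0

def msla(Q, K, V, w0=0.5, w1=0.5, eps=1e-6):
    """Two-scale linear attention: identity branch plus local-average branch."""
    S, z = 0.0, 0.0
    for w, agg in ((w0, lambda t: t), (w1, neighbor_avg)):
        Ks, Vs = agg(K), agg(V)
        S = S + w * (relu(Ks).T @ Vs)       # aggregated key/value summary
        z = z + w * relu(Ks).sum(axis=0)    # aggregated normalizer
    phi_q = relu(Q)
    return (phi_q @ S) / (phi_q @ z + eps)[:, None]

rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((64, 32)) for _ in range(3))
out = msla(Q, K, V)
print(out.shape)  # (64, 32)
```

Note that both scales share a single normalization, so the learned weights trade off global against locally smoothed context rather than mixing two independently normalized attention maps.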
6. Application Domains and Comparative Performance
EfficientViT-L1 and its family are targeted primarily at high-resolution dense prediction tasks such as semantic segmentation (Cityscapes), image super-resolution, and general vision backbones for downstream transfer. Notable observations from benchmarking:
- Substantially lower GPU latency than SegFormer and SegNeXt, with no loss in segmentation accuracy (Cityscapes).
- For super-resolution, a significant speedup over Restormer together with a measured PSNR gain.
- In zero-shot instance segmentation (COCO, Segment Anything), a marked throughput improvement on A100 GPUs.
- Across tasks and hardware, EfficientViT-L1 and its configuration deliver substantial gains in throughput versus softmax-attention and large-kernel CNN backbones at comparable or superior accuracy levels (Cai et al., 2022).
7. Model Scope and Configurability
EfficientViT-L1 represents the smallest "L-series" configuration among EfficientViT variants described in (Cai et al., 2022). Deeper and wider configurations, such as L2 and L3, increase both block counts and hidden channel dimensions for heightened accuracy. All EfficientViT-L models retain the same core attention and stage design, emphasizing modular configurability. Exact per-model hyperparameters are maintained in the official GitHub repository referenced by the authors (Cai et al., 2022). This design principle facilitates model selection across deployment scenarios and computational environments.
EfficientViT-L1 exemplifies the trend toward vision transformers optimized for memory and computational efficiency, balancing global receptive field, local spatial aggregation, and practical deployability for state-of-the-art dense prediction tasks. All technical details are traceable to (Cai et al., 2022).