EfficientViT-L1: Efficient Vision Transformer
- EfficientViT-L1 is a hierarchical vision transformer backbone that uses multi-scale linear attention to balance global context and local detail for dense prediction tasks.
- It combines a convolutional stem with progressive transformer stages featuring structured downsampling, ensuring both computational efficiency and high throughput.
- The architecture delivers significant latency reductions and performance gains in semantic segmentation and super-resolution, ideal for resource-constrained deployments.
EfficientViT-L1 is a hierarchical high-resolution vision transformer backbone optimized for dense prediction tasks under stringent computational budgets. It introduces multi-scale linear attention to achieve the global receptive field and multi-scale processing—key for semantic segmentation, super-resolution, and similar vision workloads—while utilizing only lightweight, hardware-efficient operations. The EfficientViT-L1 configuration emphasizes structured downsampling, localized convolutional aggregation, and global linear kernel-based attention to provide high throughput and competitive model accuracy in real-world deployment scenarios (Cai et al., 2022).
1. Architectural Design
EfficientViT-L1 employs a five-part architecture: an initial convolutional “stem” followed by a four-stage transformer backbone. Stages 3 and 4 integrate the model’s defining multi-scale linear attention (MSLA) blocks, while earlier stages focus on feature extraction via convolutional and feed-forward projections. Downsampling occurs at each stage boundary, resulting in a progressive reduction of resolution and increased abstraction. A fixed input resolution is assumed for benchmarking purposes. The overall architectural flow is as follows:
- Stem: Two 3×3 convolutions with stride 2 successively lower the resolution to 1/4 of the input.
- Stage 1: Processes features at the stem’s 1/4 resolution; primarily feed-forward, no attention.
- Stage 2: Downsamples to 1/8 resolution; again, no attention.
- Stage 3: Operates at 1/16 resolution; MSLA introduced.
- Stage 4: Operates at 1/32 resolution; MSLA present.
Both MSLA stages use a combination of identity and depth-wise separable (DWS) convolution branches, aggregating both global and local context (Cai et al., 2022).
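A minimal sketch of the downsampling schedule implied by this flow, assuming an illustrative 512×512 input (the benchmark resolution is not restated here) and a stride-2 downsample at each stage boundary after Stage 1:

```python
def resolution_schedule(h, w):
    """Feature-map size after the stem and after each backbone stage."""
    sizes = {}
    h, w = h // 4, w // 4              # stem: two 3x3, stride-2 convolutions
    sizes["stem"] = (h, w)
    for stage in range(1, 5):
        if stage > 1:                  # downsample at each stage boundary
            h, w = h // 2, w // 2
        sizes[f"stage{stage}"] = (h, w)
    return sizes

print(resolution_schedule(512, 512))
# {'stem': (128, 128), 'stage1': (128, 128), 'stage2': (64, 64),
#  'stage3': (32, 32), 'stage4': (16, 16)}
```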
2. Per-Stage Configuration
EfficientViT-L1’s stage-wise block depth, channel width, and expansion ratios can be summarized as follows (exact values are representative estimates based on the GitHub configuration, as the original paper does not explicitly list all hyperparameters for L1):

| Stage | Output Resolution | # Blocks | Hidden Channels | FFN Expansion | MS-Linear-Attn |
|---|---|---|---|---|---|
| 1 | 1/4 | 2 | 64 | 4 | None |
| 2 | 1/8 | 2 | 128 | 4 | None |
| 3 | 1/16 | 6 | 256 | 4 | Yes (two-branch) |
| 4 | 1/32 | 2 | 512 | 4 | Yes (same as Stage 3) |
A plausible implication is that the per-head dimension in each MSLA stage follows from dividing the stage’s channel width by its number of attention heads; the exact head counts are specified in the released configurations (Cai et al., 2022).
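The table can be mirrored as a plain data structure to derive per-stage quantities such as the FFN hidden width (channels × expansion ratio); the values below are the representative estimates from the table, not authoritative hyperparameters:

```python
# Representative per-stage estimates from the table above (not authoritative).
STAGES = {
    1: {"blocks": 2, "channels": 64,  "ffn_expansion": 4, "msla": False},
    2: {"blocks": 2, "channels": 128, "ffn_expansion": 4, "msla": False},
    3: {"blocks": 6, "channels": 256, "ffn_expansion": 4, "msla": True},
    4: {"blocks": 2, "channels": 512, "ffn_expansion": 4, "msla": True},
}

# FFN hidden width per stage: channels x expansion ratio.
ffn_hidden = {s: cfg["channels"] * cfg["ffn_expansion"] for s, cfg in STAGES.items()}
print(ffn_hidden)  # {1: 256, 2: 512, 3: 1024, 4: 2048}
```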
3. Multi-Scale Linear Attention Mechanism
The core contribution of EfficientViT-L1 is the MSLA mechanism, first applied in Stages 3 and 4. This reformulates standard softmax self-attention with a linear ReLU kernel $\phi(x) = \mathrm{ReLU}(x)$, reducing memory and compute complexity from quadratic to linear in the number of tokens. Attention is computed as follows:
- Input projections: Queries $Q$, keys $K$, and values $V$ are generated from the input tokens via learned linear projections, each of size $N \times d$ for $N$ tokens and head dimension $d$.
- Multi-scale aggregation: $Q$, $K$, and $V$ are each processed by (a) an identity branch and (b) a DWS convolution branch, creating a set $\{Q^{(s)}, K^{(s)}, V^{(s)}\}$ for each scale $s$.
- Kernel-feature computation: For each scale, compute the summary statistics

$$S^{(s)} = \sum_j \mathrm{ReLU}\big(K^{(s)}_j\big)^{\top} V^{(s)}_j, \qquad z^{(s)} = \sum_j \mathrm{ReLU}\big(K^{(s)}_j\big)^{\top}.$$

Aggregate across all scales via learned weights $w_s$,

$$S = \sum_s w_s S^{(s)}, \qquad z = \sum_s w_s z^{(s)},$$

then output for each token $i$:

$$O_i = \frac{\mathrm{ReLU}(Q_i)\, S}{\mathrm{ReLU}(Q_i)\, z}.$$
This architectural innovation enables both global and local contextualization without expensive quadratic softmax attention, providing the model with the requisite expressivity for dense tasks while maintaining efficiency (Cai et al., 2022).
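A runnable single-scale sketch of ReLU linear attention in NumPy, checking that the linear-time form (via shared key/value summaries $\sum_j \phi(k_j)^\top v_j$ and $\sum_j \phi(k_j)$) matches the direct quadratic computation; all names here are illustrative, not the reference implementation:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def linear_attention(Q, K, V, eps=1e-6):
    """O(N) form: precompute shared key/value summaries, then apply per query."""
    phi_q, phi_k = relu(Q), relu(K)
    S = phi_k.T @ V                    # [d, d] summary: sum_j phi(k_j)^T v_j
    z = phi_k.sum(axis=0)              # [d]   normalizer: sum_j phi(k_j)
    return (phi_q @ S) / (phi_q @ z + eps)[:, None]

def quadratic_attention(Q, K, V, eps=1e-6):
    """Direct O(N^2) reference: explicit token-to-token weights."""
    w = relu(Q) @ relu(K).T            # [N, N] unnormalized attention weights
    return (w @ V) / (w.sum(axis=1, keepdims=True) + eps)

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((32, 16)) for _ in range(3))

# Both forms compute the same attention output.
assert np.allclose(linear_attention(Q, K, V), quadratic_attention(Q, K, V))
print(linear_attention(Q, K, V).shape)  # (32, 16)
```

The equivalence holds because matrix multiplication is associative: summarizing keys and values first avoids ever materializing the $N \times N$ weight matrix.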
4. Computational Efficiency and Hyperparameters
EfficientViT-L1 is explicitly designed for high-throughput inference and deployment on resource-constrained hardware. Reported statistics:
- Parameter count: Approximately 53 million
- Multiply-Accumulates (MACs): 5.3 G at the benchmark input resolution
- Edge GPU (Jetson AGX Orin, TensorRT fp16, batch size 1): 2.6 ms per image
- Cloud GPU (A100, TensorRT fp16): 6,207 images/sec (≈0.16 ms/img)
- Mobile CPU (Snapdragon 8 Gen 1, TFLite fp32): not reported directly for L1; comparably sized B-series models run in 30–50 ms
A plausible implication is that EfficientViT-L1’s latency and compute characteristics enable near real-time performance in edge and embedded vision applications, with throughput far exceeding contemporary softmax-attention-based and large-kernel convolution backbones at similar accuracy (Cai et al., 2022).
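As a quick sanity check on the cloud-GPU figure, throughput and per-image latency are reciprocals (ignoring batching and pipeline overlap):

```python
throughput_img_per_s = 6207            # A100, TensorRT fp16 (reported)
latency_ms = 1000.0 / throughput_img_per_s
print(f"{latency_ms:.2f} ms/img")      # 0.16 ms/img, matching the quoted figure
```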
5. Functional Workflow and Pseudocode
The EfficientViTBlock, as applied in MSLA stages, is captured by the following workflow:
```
def EfficientViTBlock(x):                 # x: [N, C] token features
    Q = x @ Wq
    K = x @ Wk
    V = x @ Wv

    # Multi-scale token aggregation (identity + depth-wise 5x5 conv branches)
    Qs = [Q, ConvDW5x5(Q)]
    Ks = [K, ConvDW5x5(K)]
    Vs = [V, ConvDW5x5(V)]

    # Linear attention summaries per scale
    S_s = [sum_j(ReLU(Ks[s][j]).T * Vs[s][j]) for s in (0, 1)]
    z_s = [sum_j(ReLU(Ks[s][j]).T) for s in (0, 1)]

    # Aggregate scales with learned weights
    S = W0 @ S_s[0] + W1 @ S_s[1]
    z = W0 @ z_s[0] + W1 @ z_s[1]

    # Per-token output: A_i = ReLU(Q[i]) @ S / (ReLU(Q[i]) @ z)
    O = stack([ReLU(Q[i]) @ S / (ReLU(Q[i]) @ z) for i in tokens])

    # Final linear fuse, residual connection, normalization
    O = O @ Wo
    x = LayerNorm(x + O)

    # FFN with interleaved depth-wise conv
    y = x @ W1_ffn
    y = DepthwiseConv3x3(y)
    y = Activation(y)
    y = y @ W2_ffn
    x = LayerNorm(x + y)
    return x
```
In this workflow, linear attention and feed-forward processing are interleaved, separated by local DWS convolution to increase inductive bias toward spatial priors while retaining the efficiency of linear attention (Cai et al., 2022).
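A runnable two-scale NumPy sketch of this workflow; the 5×5 depth-wise convolution branch is stood in for by a simple neighbor average over the token axis (a hypothetical substitute, since the real model convolves over the 2-D feature map), and the queries use only the identity branch, matching the per-token output formula in the pseudocode:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def neighbor_avg(x):
    """Local token aggregation (hypothetical stand-in for ConvDW5x5)."""
    padded = np.pad(x, ((1, 1), (0, 0)), mode="edge")
    return (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0

def msla(Q, K, V, w0=0.5, w1=0.5, eps=1e-6):
    """Two-scale linear attention: identity branch plus local-average branch."""
    S, z = 0.0, 0.0
    for w, agg in ((w0, lambda t: t), (w1, neighbor_avg)):
        Ks, Vs = agg(K), agg(V)
        S = S + w * (relu(Ks).T @ Vs)       # aggregated key/value summary
        z = z + w * relu(Ks).sum(axis=0)    # aggregated normalizer
    phi_q = relu(Q)
    return (phi_q @ S) / (phi_q @ z + eps)[:, None]

rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((64, 32)) for _ in range(3))
out = msla(Q, K, V)
print(out.shape)  # (64, 32)
```

Note that both scales share a single normalization, so the learned weights trade off global against locally smoothed context rather than mixing two independently normalized attention maps.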
6. Application Domains and Comparative Performance
EfficientViT-L1 and its family are targeted primarily at high-resolution dense prediction tasks such as semantic segmentation (Cityscapes), image super-resolution, and general vision backbones for downstream transfer. Notable observations from benchmarking:
- Substantially lower GPU latency than SegFormer and SegNeXt, with no loss in segmentation accuracy (Cityscapes).
- For super-resolution, a significant speedup over Restormer together with a measured PSNR gain.
- In zero-shot instance segmentation (COCO, Segment Anything), a marked throughput improvement on A100 GPUs.
- Across tasks and hardware, EfficientViT-L1 and its configuration deliver substantial gains in throughput versus softmax-attention and large-kernel CNN backbones at comparable or superior accuracy levels (Cai et al., 2022).
7. Model Scope and Configurability
EfficientViT-L1 represents the smallest "L-series" configuration among EfficientViT variants described in (Cai et al., 2022). Deeper and wider configurations, such as L2 and L3, increase both block counts and hidden channel dimensions for heightened accuracy. All EfficientViT-L models retain the same core attention and stage design, emphasizing modular configurability. Exact per-model hyperparameters are maintained in the official GitHub repository referenced by the authors (Cai et al., 2022). This design principle facilitates model selection across deployment scenarios and computational environments.
EfficientViT-L1 exemplifies the trend toward vision transformers optimized for memory and computational efficiency, balancing global receptive field, local spatial aggregation, and practical deployability for state-of-the-art dense prediction tasks. All technical details are traceable to (Cai et al., 2022).