EdgeViT: Efficient Vision Transformers for Edge
- EdgeViT is a family of vision transformer architectures optimized for edge devices by integrating hierarchical pyramidal designs with local-global attention.
- Its innovative LGL bottleneck and distributed inference frameworks significantly reduce latency, energy consumption, and model size while maintaining high accuracy.
- Practical deployments on smartphones, Raspberry Pis, and FPGAs demonstrate EdgeViT's superior efficiency compared to traditional ViTs and lightweight CNNs.
EdgeViT refers to a family of architectural approaches and frameworks designed to make Vision Transformers (ViTs) efficient, accurate, and deployable on resource-constrained edge devices, such as smartphones, IoT endpoints, and embedded systems. The central challenge addressed by EdgeViT is the prohibitive computational and memory cost of self-attention and dense MLP layers in conventional ViTs, especially under real-world constraints of latency, energy, and operator support on non-server-class hardware. Solutions under the “EdgeViT” umbrella span architectural innovations merging convolutional and transformer operations, distributed inference via model partitioning, and hardware-software co-design for efficient deployment (Pan et al., 2022; Liu et al., 2024; Xu et al., 2023; Nag et al., 2025).
1. Motivation and Key Challenges
Vision Transformers provide state-of-the-art accuracy for image classification, detection, and segmentation, but the canonical ViT design has quadratic computational cost for Multi-Head Self-Attention (MHSA): O(N^2 d) for N = HW tokens with spatial resolution H × W and channel dimension d. While lightweight CNNs like MobileNet-V2/V3 and EfficientNet excel on mobile devices, previous attempts to create efficient ViTs—such as PVT, Swin, Twins, and MobileViT—either retain per-token MHSA or use nonstandard operations, leading to high latency and energy draw on edge hardware. FLOPs and parameter count are inadequate as real-world proxies; cache and memory-access behavior, operator efficiency, and framework support strongly affect observed latency and energy. The EdgeViT approaches aim to realize Pareto-optimal efficiency on edge hardware by targeting these system-level bottlenecks directly, rather than merely reducing FLOPs or total parameters (Pan et al., 2022).
2. Core EdgeViT Architectures and Algorithms
2.1 EdgeViT (Pyramid LGL Bottleneck Design)
EdgeViT adopts a hierarchical four-stage feature pyramid with successively coarser resolutions and higher channel widths. Three practical model sizes—XXS, XS, S—span roughly 0.56 to 1.9 GFLOPs and 4.1M to 11.1M parameters (see table below).
| Model | Channels per Stage | Blocks | MHSA Heads | GFLOPs | Params (M) |
|---|---|---|---|---|---|
| XXS | [36, 72, 144, 288] | [1, 1, 3, 2] | [1, 2, 4, 8] | 0.56 | 4.1 |
| XS | [48, 96, 240, 384] | [1, 1, 2, 2] | [1, 2, 4, 8] | 1.1 | 6.7 |
| S | [48, 96, 240, 384] | [1, 2, 3, 2] | [1, 2, 4, 8] | 1.9 | 11.1 |
Within each stage, the Local–Global–Local (LGL) block is the main innovation. Each LGL bottleneck:
- Begins with a small, convolution-based positional encoding.
- Performs local aggregation with a depthwise-separable convolution.
- Applies sparse global self-attention only to delegate tokens (one per r × r window, i.e., subsampled by rate r), reducing the quadratic attention cost by a factor of r^4.
- Propagates the updated global information locally via transposed depthwise convolutions with kernel size and stride r.
- Finishes with a channel-wise feed-forward network (FFN).
This design leverages widely supported convolutional primitives, ensuring hardware efficiency, while preserving the representational power of global attention (Pan et al., 2022).
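The delegate-token flow can be sketched in numpy under simplifying assumptions: a single attention head, no learned projections, and the simplest possible stand-ins for local aggregation (window subsampling) and local propagation (nearest-neighbour upsampling):

```python
import numpy as np

def sparse_global_attn(x, r):
    """Attend only among delegate tokens (one per r x r window),
    then broadcast each delegate's update back to its window.
    x: (H, W, d) feature map; r: subsampling rate."""
    H, W, d = x.shape
    # Local-aggregation stand-in: keep the top-left token of each window
    delegates = x[::r, ::r, :].reshape(-1, d)          # (H*W / r^2, d)
    # Plain scaled dot-product attention among delegates only
    scores = delegates @ delegates.T / np.sqrt(d)      # (n, n), n = HW/r^2
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    updated = (attn @ delegates).reshape(H // r, W // r, d)
    # Local-propagation stand-in: nearest-neighbour upsampling
    return np.repeat(np.repeat(updated, r, axis=0), r, axis=1)

x = np.random.default_rng(0).normal(size=(8, 8, 16))
y = sparse_global_attn(x, r=2)
assert y.shape == x.shape  # full resolution restored after propagation
```

The real block replaces the subsampling and upsampling stand-ins with learned depthwise and transposed depthwise convolutions, so local detail participates in (and benefits from) the global exchange.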
2.2 EdgeViT for Distributed Inference (ED-ViT)
EdgeViT also refers to distributed inference frameworks (e.g., “ED-ViT”) that partition a full ViT into pruned sub-models, each mapped to an edge device. The main steps are:
- Model splitting: Partition the ViT into sub-models, each responsible for a disjoint subset of the label space.
- Class-wise pruning: For each sub-model and associated class subset, remove least important channels, attention heads, and FFN neurons based on KL-divergence measured importance; this drastically shrinks model size and compute.
- Device assignment: Allocate pruned sub-models to devices by a greedy knapsack method balancing memory and compute budgets.
- Result fusion: Concatenate low-dimensional outputs from all devices and fuse with a lightweight MLP to recover the final prediction (Liu et al., 2024).
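The device-assignment step can be illustrated with a minimal greedy packer; the sub-model sizes, device names, and tie-breaking below are hypothetical stand-ins for the paper's knapsack heuristic:

```python
def assign_submodels(submodels, devices):
    """Greedy packing: place each sub-model (largest first) on the
    device with the most remaining memory that can still hold it.
    submodels: dict name -> memory cost; devices: dict name -> capacity."""
    remaining = dict(devices)
    placement = {}
    for name, cost in sorted(submodels.items(), key=lambda kv: -kv[1]):
        # Pick the feasible device with the most free memory
        fits = [(free, dev) for dev, free in remaining.items() if free >= cost]
        if not fits:
            raise ValueError(f"no device can hold {name}")
        _, dev = max(fits)
        placement[name] = dev
        remaining[dev] -= cost
    return placement

subs = {"s0": 9, "s1": 7, "s2": 5, "s3": 3}   # sub-model sizes in MB, made up
devs = {"pi_a": 12, "pi_b": 12}               # device memory budgets, made up
print(assign_submodels(subs, devs))
```

A production assignment would also weigh per-device compute throughput and link bandwidth, not memory alone.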
This approach yields up to a 34× model-size reduction and a 29× inference-latency speedup while maintaining near-original test accuracy, and it outperforms comparable split-CNN and split-SNN baselines in efficiency.
2.3 Collaborative ViT Decomposition (DeViT)
DeViT explores decomposition of a large ViT (e.g., ViT-L/16) into small ViT sub-models, each running in parallel on homogeneous edge devices. Key features:
- Each submodel is obtained by shrinking depth, heads, embedding dimension, and MLP hidden size.
- Data partitions are assigned to the student sub-models, and knowledge distillation using both prediction-level and intermediate-feature losses aligns the students with the large teacher.
- During inference, each device sends a [CLS] vector to a central aggregator, which fuses via a two-layer MLP.
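A toy numpy sketch of the aggregation step, with random weights standing in for the trained fusion MLP and random vectors standing in for the devices' [CLS] outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
n_devices, d_cls, n_classes = 4, 32, 10  # illustrative sizes

# Each device contributes its [CLS] embedding (random stand-ins here)
cls_vectors = [rng.normal(size=d_cls) for _ in range(n_devices)]

# Aggregator: concatenate, then a two-layer MLP head (random weights
# stand in for the trained fusion parameters)
W1 = rng.normal(size=(n_devices * d_cls, 64)) * 0.1
W2 = rng.normal(size=(64, n_classes)) * 0.1
h = np.concatenate(cls_vectors)            # (n_devices * d_cls,)
logits = np.maximum(h @ W1, 0.0) @ W2      # ReLU between the two layers
pred = int(np.argmax(logits))
assert 0 <= pred < n_classes
```

Note the communication payload per device is a single d_cls-dimensional vector, which is what keeps the aggregation step cheap relative to shipping feature maps.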
Empirical results demonstrate that DeViT achieves 2–3× accelerated inference with <2 percentage points accuracy loss relative to full-sized models, and strictly dominates MobileViT-S in both latency and energy consumption on edge testbeds (Xu et al., 2023).
2.4 LUT-Based Channel Mixers for FPGA (LL-ViT)
LL-ViT targets edge FPGA deployment by replacing the dense MLP channel mixer of each encoder block with a differentiable look-up-table (LUT) network. The LUT block employs thermometer encoding and learned tables, eliminating all multiplications in the channel-mixing step. LL-ViT preserves the sequence-modeling capability of the original ViT while achieving up to 62% model-size reduction, 50% fewer arithmetic operations, 1.3× lower latency, and 1.9× higher energy efficiency than int8 ViT accelerators—all without measurable accuracy loss (Nag et al., 2025).
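The multiplication-free lookup idea can be sketched as follows; the wiring, table sizes, and threshold encoder are illustrative assumptions, not LL-ViT's trained configuration:

```python
import numpy as np

def thermometer(x, thresholds):
    """Encode each scalar as unary bits: bit i = (x > thresholds[i])."""
    return (x[..., None] > thresholds).astype(np.uint8)  # (..., L)

rng = np.random.default_rng(0)
d_in, levels, k, d_out = 8, 4, 6, 8     # illustrative sizes
thr = np.linspace(-1, 1, levels)        # fixed thermometer thresholds

x = rng.normal(size=d_in)
bits = thermometer(x, thr).reshape(-1)  # d_in * levels binary features

# Each output unit reads k fixed input bits and looks up a learned value:
# no multiplications at inference, just indexing a 2^k-entry table per unit.
wires = rng.integers(0, bits.size, size=(d_out, k))    # random wiring
tables = rng.normal(size=(d_out, 2 ** k))              # "learned" tables
idx = (bits[wires] * (2 ** np.arange(k))).sum(axis=1)  # bit pattern -> index
y = tables[np.arange(d_out), idx]
assert y.shape == (d_out,)
```

On an FPGA each 2^k-entry table maps directly onto the fabric's native LUT primitives, which is why this substitution trades so favorably against int8 multiply-accumulate arrays.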
3. Efficiency, Evaluation Methodologies, and Empirical Results
3.1 On-Device Performance Measurement
EdgeViTs are evaluated directly on commodity mobile hardware (e.g., Samsung Galaxy S21, Snapdragon 888 CPU), using mean inference latency and energy measured over 50 samples. Energy draw is assessed with a Monsoon power monitor, and all models run full-precision TorchScript for realism (Pan et al., 2022). Distributed EdgeViT evaluations use Raspberry Pi 4B, quantifying total inference time, energy, and accuracy against strong model baselines (Liu et al., 2024, Xu et al., 2023).
3.2 ImageNet-1K and Downstream Vision Tasks
EdgeViT-XXS obtains 74.4% top-1 ImageNet-1K accuracy at 32.8 ms (127.4 mJ). EdgeViT-S achieves 81.0% top-1 at 85.3 ms (386.7 mJ), forming a Pareto frontier for accuracy vs. latency/energy. EdgeViTs also dominate competing light-ViTs and classical CNNs (e.g., MobileViT, PVTv2-B0/B1), with near-identical or superior trade-off metrics.
Dense prediction tasks show similar backbone advantages: on COCO, EdgeViT-XXS yields 38.7 AP versus PVTv2-B0 at 37.2 AP, with 2× lower inference time. Mask R-CNN and semantic segmentation results replicate these gains (Pan et al., 2022).
3.3 Distributed/Partitioned ViT Results
ED-ViT partitions a ViT-Base (86.6M parameters) across 10 edge devices, cutting latency from 36.94 s to 1.28 s (a 28.9× speedup) with only 1% accuracy loss on CIFAR-10; model storage falls from 327 MB to under 10 MB (Liu et al., 2024). DeViT achieves a 2.89× reduction in end-to-end latency versus ViT-L/16 on CIFAR-100 with <1% accuracy drop, and delivers 1.72× faster inference and 55% less energy than MobileViT-S on a Jetson Nano, at higher accuracy (Xu et al., 2023).
4. Theoretical Principles and Components
4.1 Local–Global–Local Bottleneck
Define N = HW tokens with channel dimension d and a token subsampling rate r; the LGL block operates as:
- LocalAgg: pointwise → depthwise (k=3) → pointwise conv, cost O(N d^2).
- SparseGlobalAttn: self-attention on the N/r^2 delegate tokens (one per r × r window), cost O((N/r^2)^2 d) = O(N^2 d / r^4).
- LocalProp: depthwise separable transposed conv with kernel size and stride r, cost O(N d^2).
- FFN: channel-mixing two-layer MLP, cost O(N d^2).
The total block cost is O(N d^2 + N^2 d / r^4), strongly sub-quadratic in N for r > 1. This factorization makes EdgeViT practical for edge deployments (Pan et al., 2022).
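The saving from delegate-token attention over full attention can be checked with a line of arithmetic (N, d, and r chosen for illustration):

```python
# Attention cost (up to constant factors) for full vs delegate-token attention
N, d, r = 196, 256, 2            # e.g. a 14x14 token grid, 256 channels
full = N**2 * d                  # O(N^2 d)
sparse = (N // r**2) ** 2 * d    # O((N/r^2)^2 d) = O(N^2 d / r^4)
assert full / sparse == r**4     # delegate attention is r^4 cheaper
```

Even the modest rate r = 2 cuts the attention term by 16×, which is why the O(N d^2) convolutional terms dominate in practice.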
4.2 Structured Pruning and Model Assignment (ED-ViT)
ED-ViT’s class-wise pruning computes the KL-divergence between the full and pruned models’ outputs, removing the least important residual channels, attention heads, and FFN neurons for each class subset; retaining a width fraction α of each layer scales FLOPs and parameter count roughly quadratically, to about α^2 of the original model. Assignment uses greedy heuristics to pack sub-models within device-specific compute/memory limits. Result fusion gathers the low-dimensional class scores from all devices and applies a fusion MLP to produce the final prediction over the full label space (Liu et al., 2024).
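The quadratic width-to-cost relationship behind this analysis is easy to verify; the layer sizes and the width fraction α below are illustrative, not ED-ViT's actual pruning ratios:

```python
# Parameter count of a width-alpha sub-model relative to the full model:
# a linear layer shrunk to (alpha*d_in) x (alpha*d_out) costs alpha^2 as much.
d_in, d_out, alpha = 768, 3072, 0.5   # e.g. a ViT-Base FFN layer, half width
full_params = d_in * d_out
pruned_params = int(alpha * d_in) * int(alpha * d_out)
assert pruned_params / full_params == alpha ** 2
```

The same α^2 scaling applies to the matrix multiplies at inference time, which is why width pruning compounds so quickly across devices.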
5. Training Protocols and Hyperparameters
EdgeViTs are typically trained from scratch (ImageNet-1K: 1.28M train / 50K val images), using the AdamW optimizer (base learning rate 1e-3, weight decay 5e-2, momentum 0.9), batch size 1024, over 300 epochs, with a 5-epoch warm-up and cosine LR decay. Data augmentations include random crop, horizontal flip, MixUp, CutMix, RandAug, label smoothing, and random erasing. Neither knowledge distillation nor external datasets are used (Pan et al., 2022). For pruned/partitioned settings, each sub-model is fine-tuned on its sub-task dataset and, if required, the fusion MLP is fine-tuned to recover overall accuracy (Liu et al., 2024; Xu et al., 2023).
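The warm-up plus cosine schedule can be written as a small helper (epoch-granular here for simplicity; step-granular variants are common in practice):

```python
import math

def lr_at(epoch, base_lr=1e-3, warmup=5, total=300):
    """Linear warm-up for `warmup` epochs, then cosine decay toward 0."""
    if epoch < warmup:
        return base_lr * (epoch + 1) / warmup
    t = (epoch - warmup) / (total - warmup)   # progress through decay phase
    return base_lr * 0.5 * (1 + math.cos(math.pi * t))

assert abs(lr_at(4) - 1e-3) < 1e-12           # end of warm-up hits base LR
assert lr_at(299) < lr_at(150) < lr_at(5)     # monotone decay afterwards
```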
6. Practical Considerations and Deployment Implications
EdgeViT architectures favor standard, highly optimized primitives—pointwise/depthwise-convolutions, small MHSA, strided subsampling—to maximize support across diverse edge frameworks without reliance on exotic kernels or quantization. They can be directly integrated as backbones in edge-vision pipelines (classification, detection, segmentation) with minimal adjustment. Practical edge deployment should prioritize measured on-device latency and energy over proxy statistics like FLOPs or parameter count.
Distributed and collaborative inference variants require network broadcast and gather, and benefit from partitioning granularity aligned with device memory/compute constraints and communication bandwidth. Notable trade-offs arise in the number of devices (partition granularity), risk of reduced accuracy when class-level context is heavily distributed, and additional coordination complexity for output fusion or model retraining. Nevertheless, EdgeViT methods comprehensively outperform previous split-CNN/SNN baselines and classical quantization approaches in combined accuracy, latency, and energy metrics (Liu et al., 2024; Xu et al., 2023; Nag et al., 2025).
7. Summary and Impact
EdgeViT represents a set of model designs and algorithmic frameworks enabling ViT architectures to match or exceed lightweight CNNs in edge scenarios, both for standalone and distributed inference. Key innovations include the LGL bottleneck for sub-quadratic full-information exchange, class-wise partitioning and pruning for distributed inference, and hardware-friendly, LUT-based channel mixers for FPGA acceleration. Across canonical vision and audio datasets, EdgeViT and its algorithmic analogues provide substantial reductions in latency, energy, and model size, while preserving, and in some cases improving, predictive accuracy. They directly address the central barriers for scalable edge transformer adoption and provide a blueprint for the next generation of edge-suitable attention-based models (Pan et al., 2022; Liu et al., 2024; Xu et al., 2023; Nag et al., 2025).