EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction (2205.14756v6)
Abstract: High-resolution dense prediction enables many appealing real-world applications, such as computational photography and autonomous driving. However, the vast computational cost makes it difficult to deploy state-of-the-art high-resolution dense prediction models on hardware devices. This work presents EfficientViT, a new family of high-resolution vision models with a novel multi-scale linear attention. Unlike prior high-resolution dense prediction models that rely on heavy softmax attention, hardware-inefficient large-kernel convolution, or complicated topological structures to achieve good performance, our multi-scale linear attention attains a global receptive field and multi-scale learning (two features desirable for high-resolution dense prediction) with only lightweight and hardware-efficient operations. As such, EfficientViT delivers remarkable performance gains over previous state-of-the-art models together with significant speedups on diverse hardware platforms, including mobile CPUs, edge GPUs, and cloud GPUs. Without performance loss on Cityscapes, EfficientViT provides up to 13.9$\times$ and 6.2$\times$ GPU latency reduction over SegFormer and SegNeXt, respectively. For super-resolution, EfficientViT delivers up to 6.4$\times$ speedup over Restormer while providing a 0.11 dB gain in PSNR. For Segment Anything, EfficientViT delivers 48.9$\times$ higher throughput on an A100 GPU while achieving slightly better zero-shot instance segmentation performance on COCO.
- SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence, 39(12):2481–2495, 2017.
- U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
- DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2017.
- Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2881–2890, 2017.
- Object-contextual representations for semantic segmentation. In European conference on computer vision, pages 173–190. Springer, 2020.
- Deep high-resolution representation learning for visual recognition. IEEE transactions on pattern analysis and machine intelligence, 43(10):3349–3364, 2020.
- SegFormer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems, 34, 2021.
- Attention is all you need. In NeurIPS, 2017.
- SegNeXt: Rethinking convolutional attention design for semantic segmentation. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.
- Scaling up your kernels to 31x31: Revisiting large kernel design in CNNs. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2022.
- Lite Pose: Efficient architecture design for 2D human pose estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2022.
- Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165. PMLR, 2020.
- Segment anything. arXiv preprint arXiv:2304.02643, 2023.
- Hydra attention: Efficient attention with many heads. In Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part VII, pages 35–49. Springer, 2023.
- Rethinking attention with Performers. arXiv preprint arXiv:2009.14794, 2020.
- Efficient attention: Attention with linear complexities. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 3531–3539, 2021.
- Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.
- MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
- CoAtNet: Marrying convolution and attention for all data sizes. Advances in Neural Information Processing Systems, 34:3965–3977, 2021.
- A ConvNet for the 2020s. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2022.
- Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
- EfficientNetV2: Smaller models and faster training. In International Conference on Machine Learning, pages 10096–10106. PMLR, 2021.
- FasterViT: Fast vision transformers with hierarchical attention. arXiv preprint arXiv:2306.06189, 2023.
- The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.
- Scene parsing through ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 633–641, 2017.
- NTIRE 2017 challenge on single image super-resolution: Dataset and study. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 126–135, 2017.
- A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, volume 2, pages 416–423. IEEE, 2001.
- A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019.
- ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
- PyTorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
- SwinIR: Image restoration using Swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1833–1844, 2021.
- Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pages 801–818, 2018.
- Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1290–1299, 2022.
- Per-pixel classification is not all you need for semantic segmentation. Advances in Neural Information Processing Systems, 34, 2021.
- Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5728–5739, 2022.
- Efficient image super-resolution using vast-receptive-field attention. arXiv preprint arXiv:2210.05960, 2022.
- Blueprint separable residual network for efficient image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 833–843, 2022.
- Microsoft COCO: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
- LVIS: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5356–5364, 2019.
- Exploring plain vision transformer backbones for object detection. arXiv preprint arXiv:2203.16527, 2022.
- ICNet for real-time semantic segmentation on high-resolution images. In Proceedings of the European conference on computer vision (ECCV), pages 405–420, 2018.
- Fast-SCNN: Fast semantic segmentation network. arXiv preprint arXiv:1902.04502, 2019.
- DFANet: Deep feature aggregation for real-time semantic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9522–9531, 2019.
- BiSeNet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European conference on computer vision (ECCV), pages 325–341, 2018.
- EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML, 2019.
- Searching for MobileNetV3. In ICCV, 2019.
- Once for all: Train one network and specialize it for efficient deployment. In ICLR, 2020.
- GhostNet: More features from cheap operations. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1580–1589, 2020.
- MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer. In International Conference on Learning Representations, 2022.
- Mobile-Former: Bridging MobileNet and transformer. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2022.
- NASViT: Neural architecture search for efficient vision transformers with gradient conflict aware supernet training. In International Conference on Learning Representations, 2022.
- Learning both weights and connections for efficient neural network. In NeurIPS, 2015.
- Channel pruning for accelerating very deep neural networks. In ICCV, 2017.
- Learning efficient convolutional networks through network slimming. In ICCV, 2017.
- Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR, 2016.
- ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In ECCV, 2018.
- Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- Network augmentation for tiny deep learning. arXiv preprint arXiv:2110.08890, 2021.
- TinyTL: Reduce memory, not parameters for efficient on-device learning. Advances in Neural Information Processing Systems, 33:11285–11297, 2020.
- Neural architecture search with reinforcement learning. In ICLR, 2017.
- Efficient architecture search by network transformation. In AAAI, 2018.
- ProxylessNAS: Direct neural architecture search on target task and hardware. In ICLR, 2019.
- AMC: AutoML for model compression and acceleration on mobile devices. In ECCV, 2018.
- APQ: Joint search for network architecture, pruning and quantization policy. In CVPR, 2020.