EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction (2205.14756v6)

Published 29 May 2022 in cs.CV

Abstract: High-resolution dense prediction enables many appealing real-world applications, such as computational photography, autonomous driving, etc. However, the vast computational cost makes deploying state-of-the-art high-resolution dense prediction models on hardware devices difficult. This work presents EfficientViT, a new family of high-resolution vision models with novel multi-scale linear attention. Unlike prior high-resolution dense prediction models that rely on heavy softmax attention, hardware-inefficient large-kernel convolution, or complicated topology structure to obtain good performances, our multi-scale linear attention achieves the global receptive field and multi-scale learning (two desirable features for high-resolution dense prediction) with only lightweight and hardware-efficient operations. As such, EfficientViT delivers remarkable performance gains over previous state-of-the-art models with significant speedup on diverse hardware platforms, including mobile CPU, edge GPU, and cloud GPU. Without performance loss on Cityscapes, our EfficientViT provides up to 13.9$\times$ and 6.2$\times$ GPU latency reduction over SegFormer and SegNeXt, respectively. For super-resolution, EfficientViT delivers up to 6.4$\times$ speedup over Restormer while providing 0.11 dB gain in PSNR. For Segment Anything, EfficientViT delivers 48.9$\times$ higher throughput on A100 GPU while achieving slightly better zero-shot instance segmentation performance on COCO.


Summary

  • The paper introduces a ReLU linear attention mechanism that reduces computational complexity from quadratic to linear for high-resolution tasks.
  • It integrates lightweight convolution with multi-scale learning to enhance local feature extraction and overall model efficiency.
  • Experimental results demonstrate up to 13.9× GPU latency reduction on Cityscapes and notable speedups in semantic segmentation and super-resolution.

EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction

The paper "EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction" addresses the challenge of deploying high-resolution dense prediction models in resource-constrained environments, such as mobile devices and edge GPUs. EfficientViT introduces a novel family of vision transformer models that employ a unique multi-scale linear attention mechanism, which enhances performance while significantly improving computational efficiency.

Key Contributions

EfficientViT's principal innovation is substituting conventional softmax attention with ReLU linear attention, which preserves a global receptive field while reducing the computational complexity of attention from quadratic to linear in the number of tokens. Avoiding the softmax operation, which is hardware-inefficient, also makes the attention block better suited to deployment on accelerators.
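To see why this reduction holds: with $N$ tokens of dimension $d$, softmax attention materializes an $N \times N$ score matrix at $O(N^2 d)$ cost, whereas a ReLU-kernelized similarity can be reassociated as $\phi(Q)\,(\phi(K)^{\top}V)$ with $\phi = \mathrm{ReLU}$, costing $O(N d^2)$. Below is a minimal PyTorch sketch of this idea, not the authors' released implementation; the tensor layout, epsilon stabilizer, and function name are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def relu_linear_attention(q, k, v, eps=1e-6):
    """ReLU linear attention sketch.

    Softmax attention computes softmax(q @ k.T) @ v, materializing an
    N x N score matrix: O(N^2 * d). With phi = ReLU, the products can be
    reassociated as phi(q) @ (phi(k).T @ v), so the N x N matrix is never
    formed and the cost is O(N * d^2): linear in the token count N.

    q, k, v: (batch, heads, N, dim) tensors.
    """
    q, k = F.relu(q), F.relu(k)
    # Shared (dim x dim) context matrix phi(K)^T V, computed once.
    kv = torch.einsum("bhnd,bhne->bhde", k, v)
    # Row-wise normalizer phi(Q) (phi(K)^T 1), playing softmax's role.
    z = torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2))
    return torch.einsum("bhnd,bhde->bhne", q, kv) / (z.unsqueeze(-1) + eps)

# 4096 tokens, e.g. a 64x64 feature map flattened for attention.
q = k = v = torch.randn(1, 4, 4096, 32)
print(relu_linear_attention(q, k, v).shape)  # torch.Size([1, 4, 4096, 32])
```

Because the $N \times N$ matrix is never formed, cost and memory scale linearly as input resolution grows, which is exactly the regime high-resolution dense prediction operates in.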

On its own, ReLU linear attention is weaker than softmax attention at capturing local information. EfficientViT compensates by integrating convolution: the proposed multi-scale linear attention module aggregates nearby Q/K/V tokens using lightweight small-kernel convolutions before attention is applied, strengthening local feature extraction and adding multi-scale learning. This design balances performance and efficiency; a sketch of the aggregation step appears below.
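As a rough illustration of that aggregation step, the sketch below pools Q/K/V feature maps with depthwise small-kernel convolutions and concatenates the results. The kernel sizes and channel layout are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultiScaleAggregation(nn.Module):
    """Sketch of multi-scale Q/K/V token aggregation.

    Nearby tokens are pooled with lightweight depthwise small-kernel
    convolutions, giving the subsequent linear attention access to
    features at several scales. Kernel sizes and the channel-concat
    layout here are illustrative assumptions.
    """
    def __init__(self, dim, kernel_sizes=(3, 5)):
        super().__init__()
        self.aggregations = nn.ModuleList(
            nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim, bias=False)
            for k in kernel_sizes
        )

    def forward(self, x):
        # x: (batch, dim, H, W) feature map holding Q, K, or V tokens.
        # Keep the original tokens and append their aggregated (wider
        # receptive field) counterparts along the channel axis.
        return torch.cat([x] + [agg(x) for agg in self.aggregations], dim=1)

x = torch.randn(1, 64, 32, 32)
print(MultiScaleAggregation(64)(x).shape)  # torch.Size([1, 192, 32, 32])
```

Depthwise convolutions are used here because they add local, multi-scale context at a small fraction of the FLOPs of dense convolutions, consistent with the paper's emphasis on hardware-efficient operations.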

Experimental Validation

EfficientViT demonstrates its efficacy across high-resolution prediction tasks such as semantic segmentation and super-resolution. Experiments on datasets including Cityscapes and ADE20K show substantial speedups over state-of-the-art models such as SegFormer and SegNeXt without sacrificing accuracy. For instance, EfficientViT achieves up to 13.9× lower GPU latency than SegFormer on Cityscapes while matching or exceeding its accuracy.

In the domain of super-resolution, EfficientViT delivers up to 6.4× speedup over Restormer on high-resolution benchmarks with a 0.11 dB PSNR gain. On the Segment Anything task, EfficientViT provides 48.9× higher throughput than SAM-ViT-Huge on an A100 GPU while achieving slightly better zero-shot instance segmentation performance on COCO.

Theoretical and Practical Implications

The advancements proposed by EfficientViT have significant implications for both theoretical exploration and practical applications. The introduction of linear attention in high-resolution dense prediction highlights a shift from traditional attention mechanisms, suggesting new pathways in Transformer-based model design.

Practically, EfficientViT's efficiency across diverse hardware, including mobile CPUs, edge GPUs, and cloud GPUs, affirms its potential for real-world deployment. The compatibility of linear attention with existing hardware accelerators could drive broader adoption of Transformer architectures in edge computing scenarios.

Future Prospects

Given the modular nature of EfficientViT's design, future research could explore its applicability to other domains requiring dense prediction, such as medical imaging and computational photography. Moreover, advancements in hardware support for efficient convolution operations could further amplify the performance gains offered by this architecture.

The paper opens up intriguing possibilities at the intersection of resource efficiency and high performance in neural networks, a crucial consideration as AI applications continue to expand across domains.
