Linear-Complexity Visual Sequence Learning with Gated Linear Attention: Insights and Implications
The advent of Vision Transformers (ViTs) ushered in a new paradigm for visual representation learning by leveraging the transformer architecture from NLP. However, the quadratic complexity of the transformer's softmax attention poses substantial challenges, particularly when processing high-resolution images. This paper presents ViG, a novel vision backbone network that employs Gated Linear Attention (GLA) to achieve linear complexity in visual sequence learning while maintaining a global receptive field akin to that of traditional transformers.
Core Contributions
This work introduces several key advancements:
- Gated Linear Attention (GLA): By adapting GLA to vision, the paper exploits its hardware efficiency and introduces mechanisms to inject both 1D and 2D context. This is accomplished through direction-wise gating, which captures global context bidirectionally, and 2D gating locality injection, which integrates local image details into the global context (a minimal sketch of the underlying recurrence follows this list).
- Efficient Bidirectional Modeling: The proposed Bidirectional Gated Linear Attention (BiGLA) shares most parameters between the forward and backward passes, keeping only the gates direction-specific, which optimizes both memory usage and computational efficiency. Merging the two scanning directions into a single kernel yields a more parameter-efficient and hardware-friendly implementation (see the bidirectional sketch after this list).
- Performance Evaluation: ViG's results on ImageNet demonstrate an advantageous trade-off among accuracy, parameter count, and computational cost. Notably, ViG-S matches the accuracy of DeiT-B while requiring only 27% of the parameters and 20% of the FLOPs, highlighting its efficiency. Extensive benchmarks on downstream tasks such as object detection and semantic segmentation further verify ViG's robustness and adaptability across different resolutions.
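As a concrete reference for the direction-wise gating above, the following is a minimal PyTorch sketch of the gated linear attention recurrence in its naive sequential form. The function name gla_scan and the tensor shapes are illustrative assumptions, not the paper's API; the actual implementation uses a hardware-efficient chunk-parallel kernel rather than a Python loop.

```python
import torch

def gla_scan(q, k, v, alpha):
    """Naive sequential form of gated linear attention (illustrative only).

    q, k:   (T, d_k) queries and keys
    v:      (T, d_v) values
    alpha:  (T, d_k) data-dependent decay gates in (0, 1)
    """
    d_k, d_v = q.shape[-1], v.shape[-1]
    state = torch.zeros(d_k, d_v, dtype=q.dtype, device=q.device)
    outputs = []
    for t in range(q.shape[0]):
        # Decay the running d_k x d_v state per key channel, then add the
        # new key-value association as a rank-1 (outer-product) update.
        state = alpha[t].unsqueeze(-1) * state + torch.outer(k[t], v[t])
        # Query readout costs O(d_k * d_v) per step, so the full sequence
        # is linear in T, unlike the O(T^2) pairwise scores of softmax.
        outputs.append(q[t] @ state)
    return torch.stack(outputs)
```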
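Building on that sketch, the module below illustrates the parameter-sharing scheme described for BiGLA: one set of query/key/value projections shared by both directions, with only the gate projections duplicated. This is a reconstruction from the paper's description, not its code; the class and parameter names (BiGLASketch, gate_fwd, gate_bwd) are hypothetical, and the fused single-kernel bidirectional scan is replaced by two explicit passes for clarity.

```python
import torch
import torch.nn as nn

class BiGLASketch(nn.Module):
    """Illustrative bidirectional GLA block, reusing gla_scan from above."""

    def __init__(self, dim, d_k=64, d_v=64):
        super().__init__()
        # Projections shared between the forward and backward scans.
        self.q = nn.Linear(dim, d_k)
        self.k = nn.Linear(dim, d_k)
        self.v = nn.Linear(dim, d_v)
        # Direction-specific decay gates: the only unshared parameters.
        self.gate_fwd = nn.Linear(dim, d_k)
        self.gate_bwd = nn.Linear(dim, d_k)
        self.out = nn.Linear(2 * d_v, dim)

    def forward(self, x):                      # x: (T, dim) visual tokens
        q, k, v = self.q(x), self.k(x), self.v(x)
        a_fwd = torch.sigmoid(self.gate_fwd(x))
        a_bwd = torch.sigmoid(self.gate_bwd(x))
        fwd = gla_scan(q, k, v, a_fwd)
        # Backward direction: scan the reversed sequence with its own
        # gates, then flip the outputs back into token order.
        bwd = gla_scan(q.flip(0), k.flip(0), v.flip(0), a_bwd.flip(0)).flip(0)
        return self.out(torch.cat([fwd, bwd], dim=-1))
```

Sharing the q/k/v projections across directions roughly halves the projection parameters relative to two fully independent scans, which is the parameter and memory saving the paper emphasizes.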
Theoretical and Practical Implications
Theoretically, the introduction of GLA and its adaptations for vision underscores a shift toward more computationally efficient architectures that do not sacrifice the model's ability to capture global context. GLA's linear complexity offers a decisive advantage when scaling to higher resolutions, addressing one of the primary limitations of traditional softmax-based transformers.
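To make the scaling argument concrete, the back-of-the-envelope comparison below contrasts the O(N²d) cost of softmax attention with the O(Nd²) cost of a linear-attention scan. The channel dimension d = 384 and the 16×16 patch size are illustrative assumptions, not figures from the paper.

```python
# Rough attention cost in multiply-accumulates (illustrative, not measured).
d = 384                                       # assumed channel dimension
for side in (224, 512, 1024):                 # input resolutions
    n = (side // 16) ** 2                     # token count with 16x16 patches
    softmax_cost = 2 * n * n * d              # QK^T scores + weighted sum of V
    linear_cost = 2 * n * d * d               # per-token state update + readout
    print(f"{side:>4}px: N={n:>5}  softmax~{softmax_cost:.1e}  linear~{linear_cost:.1e}")
```

Note the crossover: the linear scan only pays off once the token count N exceeds the channel dimension, which is precisely the high-resolution regime where softmax attention becomes prohibitive.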
Practically, these developments hold significant promise for applications requiring real-time processing or deployment in constrained environments, such as mobile devices and other systems with limited computational resources. The demonstrated efficiency in both compute and memory suggests broad applicability in scenarios previously dominated by CNNs or by transformers restricted to low resolutions.
Future Directions
Looking forward, ViG paves the way for further exploration of linear-complexity mechanisms in computer vision. Continued refinement of GLA, particularly in balancing local and global information across varied contexts, will be crucial. Integrating such innovations with advances in neural hardware could also unlock new capabilities in on-device intelligence.
In summary, the ViG architecture offers a compelling blend of efficiency and scalability, expanding the toolkit available to practitioners and researchers seeking to deploy vision backbones in increasingly diverse and demanding environments. As AI continues to permeate varied domains, the importance of such efficient models will only grow, driving further innovation at the intersection of machine learning, hardware design, and practical application.