
HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions (2207.14284v3)

Published 28 Jul 2022 in cs.CV

Abstract: Recent progress in vision Transformers exhibits great success in various tasks driven by the new spatial modeling mechanism based on dot-product self-attention. In this paper, we show that the key ingredients behind the vision Transformers, namely input-adaptive, long-range and high-order spatial interactions, can also be efficiently implemented with a convolution-based framework. We present the Recursive Gated Convolution ($g^n$Conv) that performs high-order spatial interactions with gated convolutions and recursive designs. The new operation is highly flexible and customizable, which is compatible with various variants of convolution and extends the two-order interactions in self-attention to arbitrary orders without introducing significant extra computation. $g^n$Conv can serve as a plug-and-play module to improve various vision Transformers and convolution-based models. Based on the operation, we construct a new family of generic vision backbones named HorNet. Extensive experiments on ImageNet classification, COCO object detection and ADE20K semantic segmentation show that HorNet outperforms Swin Transformers and ConvNeXt by a significant margin with similar overall architecture and training configurations. HorNet also shows favorable scalability to more training data and larger model sizes. Apart from the effectiveness in visual encoders, we also show $g^n$Conv can be applied to task-specific decoders and consistently improve dense prediction performance with less computation. Our results demonstrate that $g^n$Conv can be a new basic module for visual modeling that effectively combines the merits of both vision Transformers and CNNs. Code is available at https://github.com/raoyongming/HorNet

Authors (6)
  1. Yongming Rao (50 papers)
  2. Wenliang Zhao (22 papers)
  3. Yansong Tang (81 papers)
  4. Jie Zhou (687 papers)
  5. Ser-Nam Lim (116 papers)
  6. Jiwen Lu (192 papers)
Citations (208)

Summary

Analysis of "HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions"

The paper introduces HorNet, a family of vision backbones built around efficient high-order spatial interactions via the Recursive Gated Convolution ($g^n$Conv). The core objective is to combine the strengths of convolutional neural networks (CNNs) and vision Transformers to achieve superior performance in visual tasks such as image classification, object detection, and semantic segmentation.

Overview and Methodology

At the heart of the research lies the $g^n$Conv operation, which enables high-order spatial interactions. This is significant because it extends the two-order interactions of the dot-product self-attention in vision Transformers to arbitrary orders. The proposed approach shows that these interactions, usually achieved through comparatively expensive attention mechanisms, can be implemented effectively within a convolution-based framework. The Recursive Gated Convolution aims to provide a mechanism that is computationally simpler than self-attention while retaining key advantages such as input adaptivity, translation equivariance, and long-range interactions.
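To make the recursion concrete, the following is a minimal PyTorch sketch of the recursive gating idea, written from the paper's description rather than copied from the official repository (linked above); the class and variable names, the 7x7 depthwise kernel, and the channel-halving schedule are illustrative choices. The key point is that each recursion step multiplies the running features by a depthwise-convolved gating branch and lifts them to the next order's channel width, so the interaction order grows while the early, narrow widths keep the extra cost small.

```python
import torch
import torch.nn as nn


class RecursiveGatedConv(nn.Module):
    """Sketch of g^nConv: project in, then recursively gate a running
    feature map with depthwise-convolved branches of growing width."""

    def __init__(self, dim, order=3):
        super().__init__()
        self.order = order
        # Channel widths per order, smallest first (e.g. dim/4, dim/2, dim);
        # shrinking the early widths keeps higher orders cheap.
        self.dims = [dim // 2 ** i for i in range(order)][::-1]
        self.proj_in = nn.Conv2d(dim, 2 * dim, kernel_size=1)
        # One shared 7x7 depthwise conv produces all gating branches at once.
        self.dwconv = nn.Conv2d(sum(self.dims), sum(self.dims), kernel_size=7,
                                padding=3, groups=sum(self.dims))
        # 1x1 convs lift the running features to the next order's width.
        self.pws = nn.ModuleList([
            nn.Conv2d(self.dims[i], self.dims[i + 1], kernel_size=1)
            for i in range(order - 1)
        ])
        self.proj_out = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x):
        x = self.proj_in(x)
        p, q = torch.split(x, [self.dims[0], sum(self.dims)], dim=1)
        qs = torch.split(self.dwconv(q), self.dims, dim=1)
        p = p * qs[0]  # first-order spatial gating
        for i in range(self.order - 1):
            p = self.pws[i](p) * qs[i + 1]  # each step raises the order by one
        return self.proj_out(p)


# Shape check: 3rd-order gating over a 64-channel feature map.
out = RecursiveGatedConv(dim=64, order=3)(torch.randn(2, 64, 56, 56))
print(out.shape)  # torch.Size([2, 64, 56, 56])
```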

The research delineates the architecture methodically, starting from the construction of high-order interactions through recursive gating. Because the recursion performs the early interactions in progressively narrower channel subspaces, the higher-order interactions do not substantially increase computational cost. The HorNet architecture is built around $g^n$Conv, maintaining the balance between performance and resource usage, so the model scales efficiently to larger, more complex tasks.
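As an illustration of how $g^n$Conv can serve as a drop-in replacement for self-attention, here is a hedged sketch that places it in the token-mixing slot of a standard pre-norm Transformer-style block, reusing the RecursiveGatedConv class from the previous snippet. The normalization layers and MLP ratio below are common-convention assumptions, not the official HorNet configuration.

```python
class HorNetStyleBlock(nn.Module):
    """Sketch only: g^nConv in the spatial-mixing slot of a Transformer-style
    block. Norm choice and MLP ratio are assumptions for illustration."""

    def __init__(self, dim, order=3, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.BatchNorm2d(dim)  # stand-in normalization
        self.mixer = RecursiveGatedConv(dim, order)
        self.norm2 = nn.BatchNorm2d(dim)
        self.mlp = nn.Sequential(  # pointwise feed-forward, as in Transformers
            nn.Conv2d(dim, mlp_ratio * dim, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(mlp_ratio * dim, dim, kernel_size=1),
        )

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))  # spatial mixing via high-order gating
        x = x + self.mlp(self.norm2(x))    # channel mixing
        return x
```

Stacking such blocks across downsampling stages, in the manner of Swin or ConvNeXt, yields a backbone with the overall shape the paper describes.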

Experimental Results

Extensive experiments validate the advantage of the HorNet architecture over both Swin Transformers and ConvNeXt across multiple visual tasks. The results are notable: HorNet achieves a substantial performance uplift at similar computational cost. For instance, on ImageNet classification, HorNet demonstrates top-tier accuracy, surpassing comparable models by significant margins even as it scales to higher resolutions and larger model sizes. The paper also showcases the model's strength in dense prediction tasks, including COCO object detection and ADE20K semantic segmentation, further demonstrating its general applicability.

Technical Achievements

The technical achievements in this paper are underlined by strong numerical results, illustrating the efficacy of the high-order interactions facilitated by $g^n$Conv. The finding that CNN-like operations can approximate the complex interdependencies modeled by Transformers opens a new direction in neural architecture development. In particular, the scalability of HorNet in terms of data volume and model size underscores its practical utility across various AI applications.

Theoretical and Practical Implications

Theoretically, this paper contributes to the understanding of spatial interactions in neural networks by reconceptualizing how high-order interactions can be efficiently integrated into traditional convolution frameworks. It challenges the dominant narrative that only Transformer-based architectures can capture the intricate dependencies required for sophisticated visual understanding.

Practically, the flexibility and efficiency of HorNet hold promising implications for resource-constrained environments, where maintaining high accuracy without incurring large computational or memory overheads is crucial. By providing a scalable solution that exceeds the current capabilities of established models, HorNet is poised to be widely adopted in real-world applications where computational efficiency is paramount.

Future Directions

The promising results open avenues for simplifying high-order interactions even further and for optimizing hardware implementations. There is potential to integrate such frameworks into broader AI systems, for example by equipping lower layers of neural networks with the recursive interaction mechanism proposed in HorNet. Future work could also explore hybrid architectures that further bridge the gap between CNNs and Transformers, crafting models that excel in both efficiency and performance.

In summary, the introduction of HorNet and the $g^n$Conv operation marks a significant contribution to the field of visual recognition, challenging the status quo and prompting a reevaluation of how spatial interactions are fundamentally conceptualized and realized within neural network architectures.