
MogaNet: Multi-order Gated Aggregation Network (2211.03295v4)

Published 7 Nov 2022 in cs.CV and cs.AI

Abstract: By contextualizing the kernel as global as possible, Modern ConvNets have shown great potential in computer vision tasks. However, recent progress on multi-order game-theoretic interaction within deep neural networks (DNNs) reveals the representation bottleneck of modern ConvNets, where the expressive interactions have not been effectively encoded with the increased kernel size. To tackle this challenge, we propose a new family of modern ConvNets, dubbed MogaNet, for discriminative visual representation learning in pure ConvNet-based models with favorable complexity-performance trade-offs. MogaNet encapsulates conceptually simple yet effective convolutions and gated aggregation into a compact module, where discriminative features are efficiently gathered and contextualized adaptively. MogaNet exhibits great scalability, impressive efficiency of parameters, and competitive performance compared to state-of-the-art ViTs and ConvNets on ImageNet and various downstream vision benchmarks, including COCO object detection, ADE20K semantic segmentation, 2D&3D human pose estimation, and video prediction. Notably, MogaNet hits 80.0% and 87.8% accuracy with 5.2M and 181M parameters on ImageNet-1K, outperforming ParC-Net and ConvNeXt-L, while saving 59% FLOPs and 17M parameters, respectively. The source code is available at https://github.com/Westlake-AI/MogaNet.

Citations (38)

Summary

  • The paper proposes a novel ConvNet architecture that integrates multi-order interactions using gated aggregations.
  • It introduces a Moga Block with spatial and channel aggregation to effectively balance low-, middle-, and high-order feature interactions.
  • Extensive experiments show superior performance and efficiency on ImageNet-1K and COCO benchmarks compared to state-of-the-art ViTs and ConvNets.

Overview of MogaNet: Multi-order Gated Aggregation Network

MogaNet introduces a novel family of ConvNets tailored for discriminative visual representation learning. The paper highlights a key limitation of modern ConvNets: expressive multi-order interactions are not effectively encoded even as kernel sizes grow. The proposed architecture combines conceptually simple convolutions with gated aggregation into a compact module, yielding an efficient network that balances complexity and performance across a range of computer vision benchmarks.

Methodology and Design

Multi-order Interaction: A foundational principle of MogaNet is the multi-order game-theoretic interaction analysis of DNNs, which examines how expressivity is distributed across low-, middle-, and high-order interactions. Unlike preceding ConvNets, MogaNet aims to harness the untapped potential of middle-order interactions, which often carry the most discriminative power.
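
The interaction strength referenced in the experiments follows the game-theoretic multi-order interaction framework that the paper builds on. For context, a sketch of the standard definitions from that literature is given below; the notation is reconstructed from that framework rather than quoted verbatim from the paper.

```latex
% m-th order interaction between input variables i and j on a sample x,
% averaged over context sets S of size m (N denotes all input variables,
% f(.) the network output restricted to a subset of variables):
\[
I^{(m)}(i,j \mid x) \;=\; \mathbb{E}_{S \subseteq N \setminus \{i,j\},\ |S| = m}
\big[\, f(S \cup \{i,j\}) - f(S \cup \{i\}) - f(S \cup \{j\}) + f(S) \,\big]
\]

% Relative interaction strength of order m: the average magnitude of the
% m-th order interaction, normalized over all orders so that values of
% J^{(m)} are comparable across orders and across models:
\[
J^{(m)} \;=\;
\frac{\mathbb{E}_{x}\,\mathbb{E}_{i,j}\,\big|I^{(m)}(i,j \mid x)\big|}
     {\mathbb{E}_{m'}\,\mathbb{E}_{x}\,\mathbb{E}_{i,j}\,\big|I^{(m')}(i,j \mid x)\big|}
\]
```

The representation-bottleneck observation is that standard DNNs concentrate $J^{(m)}$ at very low and very high orders, leaving middle-order interactions underrepresented.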

Moga Block: The core unit of MogaNet comprises a spatial aggregation block and a channel aggregation block. The spatial aggregation block combines feature decomposition with multi-order depth-wise convolutions (DWConv) guided by a gating mechanism. This configuration processes low-, middle-, and high-order contexts in parallel, enriching the representation with controlled computational overhead.
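
A minimal PyTorch sketch of this spatial-aggregation idea is shown below: a gating branch modulates a context branch in which channels are split across depth-wise convolutions of increasing receptive field. The module names, the 1:3:4 channel split, and the specific kernel sizes and dilations are illustrative assumptions approximating the paper's design (the feature-decomposition step is omitted for brevity); the reference implementation is in the official repository linked above.

```python
import torch
import torch.nn as nn


class MultiOrderDWConv(nn.Module):
    """Parallel depth-wise convolutions over channel groups, mixing local
    (low-order), regional (middle-order), and wider (high-order) context.
    Channel split and kernel/dilation choices are assumptions for illustration."""

    def __init__(self, dim):
        super().__init__()
        # assumes dim is divisible by 8 so every channel group is non-empty
        self.c_low = dim // 8
        self.c_mid = dim * 3 // 8
        self.c_high = dim - self.c_low - self.c_mid
        self.dw_low = nn.Conv2d(self.c_low, self.c_low, 5, padding=2,
                                groups=self.c_low)
        self.dw_mid = nn.Conv2d(self.c_mid, self.c_mid, 5, padding=4,
                                dilation=2, groups=self.c_mid)
        self.dw_high = nn.Conv2d(self.c_high, self.c_high, 7, padding=9,
                                 dilation=3, groups=self.c_high)
        self.proj = nn.Conv2d(dim, dim, 1)  # point-wise mixing after concat

    def forward(self, x):
        xl, xm, xh = torch.split(x, [self.c_low, self.c_mid, self.c_high], dim=1)
        out = torch.cat([self.dw_low(xl), self.dw_mid(xm), self.dw_high(xh)], dim=1)
        return self.proj(out)


class GatedSpatialAggregation(nn.Module):
    """Gating branch (1x1 conv + SiLU) modulates the multi-order context branch."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Conv2d(dim, dim, 1)
        self.context = nn.Sequential(nn.Conv2d(dim, dim, 1), MultiOrderDWConv(dim))
        self.act = nn.SiLU()
        self.proj_out = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        g = self.act(self.gate(x))      # gating weights
        v = self.act(self.context(x))   # aggregated multi-order context
        return self.proj_out(g * v)     # element-wise gated aggregation
```

For example, `GatedSpatialAggregation(64)(torch.randn(1, 64, 56, 56))` returns a tensor of the same shape, so the module can be dropped into a residual block without changing spatial resolution.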

Channel Aggregation: The emphasis here is on reducing channel-wise redundancy. The channel aggregation module adaptively reallocates features across channels, prioritizing discriminative middle-order interactions over redundant ones.
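
The sketch below illustrates one plausible form of such a channel-reallocation step: the C-channel feature is compressed to a single summary map and a learnably scaled complement is added back, nudging channels away from shared, redundant content. The exact formulation (single-channel compression, GELU, near-zero initial scale) is an assumption based on the paper's description rather than a verbatim transcription of its code.

```python
import torch
import torch.nn as nn


class ChannelAggregation(nn.Module):
    """Channel reallocation sketch: compress C channels to one summary map,
    then re-inject the scaled complement so channels are pushed away from
    redundant, shared content. The formulation is an illustrative assumption."""

    def __init__(self, dim):
        super().__init__()
        self.compress = nn.Conv2d(dim, 1, kernel_size=1)  # C -> 1 summary map
        self.act = nn.GELU()
        # per-channel scale, initialized near zero so the module starts close
        # to an identity mapping and learns how much to reallocate
        self.scale = nn.Parameter(1e-5 * torch.ones(1, dim, 1, 1))

    def forward(self, x):
        return x + self.scale * (x - self.act(self.compress(x)))
```

In the full block, a step of this kind would sit in the channel-MLP (feed-forward) path of each stage, complementing the spatial aggregation sketched above.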

Experimental Results and Analysis

MogaNet demonstrates strong performance across a variety of tasks, including image classification, object detection, semantic segmentation, and pose estimation. It achieves notable gains on ImageNet-1K classification, with MogaNet-T reaching 80.0% top-1 accuracy while using fewer parameters and FLOPs than several contemporary models. The largest variant, MogaNet-XL, reaches 87.8% accuracy, confirming its scalability and its competitiveness with extensively pre-trained networks such as ConvNeXt-XL.

For object detection and segmentation on the COCO dataset, MogaNet variants consistently outperform many leading architectures, such as Swin-T and ConvNeXt-L, in both accuracy and parameter efficiency. The network excels at capturing a balanced distribution of interactions, as reflected in the distributions of the interaction strength $J^{(m)}$, which show that it emphasizes middle-order interactions effectively.

Implications and Future Directions

MogaNet's architecture contributes significantly to the ongoing discussion on optimal network design in deep learning, particularly on how networks can better balance low-, middle-, and high-order interactions. From a theoretical standpoint, its focus on enforcing middle-order interaction learning points to new methodologies for designing robust and scalable networks. Practically, its efficient complexity-to-performance ratio opens doors for broader application in both high- and low-resource settings.

Future work may focus on broader implementation of multi-order aggregation principles across different neural architectures or application domains. Investigations into further enhancing the adaptive reallocation mechanisms within network layers could also promote improved integration and utility of large-scale global interactions.

In summary, MogaNet emerges as a significant advancement in ConvNet architecture, elegantly integrating multi-order interaction capabilities with streamlined computational efficiency to advance visual representation learning.
