MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models (2210.01820v2)

Published 4 Oct 2022 in cs.CV

Abstract: This paper presents MOAT, a family of neural networks that build on top of MObile convolution (i.e., inverted residual blocks) and ATtention. Unlike the current works that stack separate mobile convolution and transformer blocks, we effectively merge them into a MOAT block. Starting with a standard Transformer block, we replace its multi-layer perceptron with a mobile convolution block, and further reorder it before the self-attention operation. The mobile convolution block not only enhances the network representation capacity, but also produces better downsampled features. Our conceptually simple MOAT networks are surprisingly effective, achieving 89.1% / 81.5% top-1 accuracy on ImageNet-1K / ImageNet-1K-V2 with ImageNet22K pretraining. Additionally, MOAT can be seamlessly applied to downstream tasks that require large resolution inputs by simply converting the global attention to window attention. Thanks to the mobile convolution that effectively exchanges local information between pixels (and thus cross-windows), MOAT does not need the extra window-shifting mechanism. As a result, on COCO object detection, MOAT achieves 59.2% box AP with 227M model parameters (single-scale inference, and hard NMS), and on ADE20K semantic segmentation, MOAT attains 57.6% mIoU with 496M model parameters (single-scale inference). Finally, the tiny-MOAT family, obtained by simply reducing the channel sizes, also surprisingly outperforms several mobile-specific transformer-based models on ImageNet. The tiny-MOAT family is also benchmarked on downstream tasks, serving as a baseline for the community. We hope our simple yet effective MOAT will inspire more seamless integration of convolution and self-attention. Code is publicly available.

Citations (56)

Summary

  • The paper introduces MOAT, a novel vision model architecture that integrates Mobile Convolution (MBConv) and Attention mechanisms into a single block to enhance model performance.
  • Empirical results show MOAT achieves state-of-the-art accuracy on ImageNet-1K and strong performance on downstream tasks like COCO object detection and ADE20K semantic segmentation.
  • The MOAT architecture suggests a paradigm shift towards hybrid designs, offering computational efficiency and potential for deployment on resource-constrained devices.

Overview of MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models

This paper introduces MOAT, a novel architecture family in deep learning that aims to enhance the efficacy of vision models by fusing Mobile Convolution (MBConv) with Attention mechanisms in a unified block. The principal innovation lies in integrating these two components—typically handled as distinct blocks—into a single, more efficient MOAT block. The approach demonstrates significant advancements in image classification and various downstream tasks without additional complex operations.

The researchers present a detailed comparison between MBConv and Transformer blocks, noting their structural similarities as well as key differences, such as MBConv's use of depthwise convolutions for local interactions. By replacing the Transformer's MLP module with an MBConv block and moving that block ahead of the self-attention operation, the MOAT block combines the strengths of both components, enriching the network's representational capacity and producing better downsampled features.
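The block structure described above can be summarized with a short PyTorch sketch. This is a minimal illustration under several assumptions (expansion ratio 4, BatchNorm inside the MBConv, LayerNorm before attention, and PyTorch's stock nn.MultiheadAttention with an embedding dimension divisible by the head count); the authors' exact normalization placement, activation choices, and downsampling details may differ.

```python
import torch
import torch.nn as nn


class MBConv(nn.Module):
    """Inverted residual block: 1x1 expand -> 3x3 depthwise -> 1x1 project."""

    def __init__(self, dim, expansion=4, stride=1):
        super().__init__()
        hidden = dim * expansion
        self.use_residual = stride == 1
        self.block = nn.Sequential(
            nn.BatchNorm2d(dim),
            nn.Conv2d(dim, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.GELU(),
            # The depthwise conv handles local pixel interaction; with stride=2 it
            # would also produce the downsampled features mentioned in the paper.
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.GELU(),
            nn.Conv2d(hidden, dim, 1, bias=False),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out


class MOATBlock(nn.Module):
    """MBConv placed before self-attention; the Transformer MLP is dropped."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.mbconv = MBConv(dim)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.mbconv(x)                     # local interaction first
        B, C, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C)
        q = self.norm(tokens)
        attn_out, _ = self.attn(q, q, q)       # global self-attention
        tokens = tokens + attn_out             # residual around attention
        return tokens.transpose(1, 2).reshape(B, C, H, W)
```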

Empirically, MOAT models exhibit notable performance improvements across various computer vision benchmarks. With substantial pretraining on ImageNet-22K, the models achieve state-of-the-art top-1 accuracies on ImageNet-1K and ImageNet-1K-V2, validating the potency of the architectural adjustments. The models’ adaptability to large-resolution input tasks like COCO object detection and ADE20K semantic segmentation further highlights their versatility, outperforming several contemporary methods in terms of Average Precision (AP) and Mean Intersection Over Union (mIoU).

Numerical Results and Claims

The MOAT architecture achieves impressive numerical benchmarks:

  • On ImageNet-1K, MOAT achieves 89.1% top-1 accuracy with ImageNet-22K pretraining, surpassing prior state-of-the-art results.
  • In COCO object detection, MOAT reaches 59.2% box AP with 227M model parameters (single-scale inference).
  • For ADE20K semantic segmentation, it attains 57.6% mIoU with 496M parameters.

These results underscore MOAT's effectiveness across classification, detection, and segmentation. Moreover, because the depthwise convolution in each MBConv block already exchanges local information between pixels (and thus across windows), MOAT can switch from global to window attention for large-resolution inputs without the extra window-shifting mechanism typical of other attention architectures.
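Concretely, the conversion from global to window attention amounts to partitioning the feature map into non-overlapping windows, applying the same attention within each window, and merging the windows back. The helper names below (window_partition, window_unpartition) are illustrative rather than from the paper's code, and the sketch assumes a (B, H, W, C) layout with a window size that evenly divides the spatial dimensions; since the preceding depthwise convolution already mixes information across window borders, no shifted-window step is added.

```python
import torch


def window_partition(x, window_size):
    """(B, H, W, C) -> (B * num_windows, window_size * window_size, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)


def window_unpartition(windows, window_size, H, W):
    """Inverse of window_partition: merge windows back to (B, H, W, C)."""
    num_windows = (H // window_size) * (W // window_size)
    B = windows.shape[0] // num_windows
    x = windows.view(B, H // window_size, W // window_size,
                     window_size, window_size, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)


# Usage inside a MOAT-style block at high resolution (attention is any standard
# multi-head attention module operating on (batch, tokens, channels) inputs):
#   tokens = window_partition(feature_map, window_size=16)
#   tokens = attention(tokens)
#   feature_map = window_unpartition(tokens, window_size=16, H=H, W=W)
```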

Implications and Future Developments

The fusion of mobile convolution and attention in MOAT blocks suggests a shift in neural network design toward hybrid architectures that balance efficiency and representational power. Merging convolution and self-attention into a single block also simplifies the architecture and reduces computational overhead, making MOAT appealing for deployment on resource-constrained mobile devices.

From a theoretical standpoint, the work paves the way for further research into hybrid architectures that blend convolutional and transformer-like operations. Future studies could adapt the MOAT architecture to other modalities and further explore the tiny-MOAT variants, potentially yielding optimized designs for mobile applications and real-time processing tasks.

In conclusion, the MOAT family represents a step forward in vision model design, combining robust theoretical foundations with empirical success across diverse tasks. This paper serves as an exemplary instance of the benefits derived from incorporating convolutional principles into transformer frameworks, encouraging a rethinking of established practices in neural network architecture design.
