- The paper introduces a novel architecture that replaces self-attention with DFT-based global filtering to achieve log-linear computational complexity.
- The model achieves a better accuracy-efficiency trade-off than comparable architectures such as DeiT and ResMLP on ImageNet under similar computational budgets.
- The GFNet's flexible design adapts to higher resolutions and transfer learning tasks, highlighting its potential for real-time and resource-constrained applications.
Global Filter Networks for Image Classification: An Academic Overview
The paper introduces the Global Filter Network (GFNet), a novel architecture for image classification that leverages the discrete Fourier transform (DFT) to efficiently capture long-range spatial dependencies in vision tasks. The work addresses the computational limitations of traditional vision transformers and MLP-based models by proposing a token-mixing method with log-linear complexity in the number of tokens.
Model Architecture and Methodology
The GFNet architecture strategically replaces the computationally intensive self-attention layers commonly found in vision transformers with a more efficient frequency domain processing approach. The core operations involve:
- A 2D discrete Fourier transform that maps spatial features into the frequency domain.
- Element-wise multiplication of frequency-domain features with learnable global filters.
- A 2D inverse Fourier transform that maps the features back to the spatial domain.
This method performs efficient token mixing with a computational complexity of O(L log L), where L is the number of tokens, so the cost grows gracefully as input resolution increases.
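The three core operations above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: in GFNet the filter is a learned complex-valued parameter trained with the rest of the network, whereas here a fixed identity filter simply demonstrates the frequency-domain round trip.

```python
import numpy as np

def global_filter(x, K):
    """Mix tokens by filtering in the frequency domain.

    x: (H, W, C) real-valued feature map (tokens on a 2D grid).
    K: (H, W // 2 + 1, C) complex filter (learnable in the real model).
    """
    # Step 1: 2D real FFT over the spatial dimensions
    X = np.fft.rfft2(x, axes=(0, 1))
    # Step 2: element-wise multiplication with the global filter
    X = X * K
    # Step 3: 2D inverse FFT back to the spatial domain
    return np.fft.irfft2(X, s=x.shape[:2], axes=(0, 1))

# Demo: an all-ones (identity) filter leaves the features unchanged.
H, W, C = 14, 14, 8
x = np.random.randn(H, W, C)
K = np.ones((H, W // 2 + 1, C), dtype=complex)
y = global_filter(x, K)
```

Because the input features are real-valued, the real FFT (`rfft2`) stores only the non-redundant half of the spectrum, which is why the filter's second dimension is `W // 2 + 1`.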
Experimental Results
The authors report favorable accuracy-efficiency trade-offs on ImageNet, outperforming several existing models, including DeiT and ResMLP, under comparable computational budgets. GFNet also adapts well to hierarchical architectures and scales to larger image resolutions without substantial modifications.
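One reason the design transfers across resolutions is that the global filters live on a frequency grid, so a filter trained at one resolution can be resized for another. The sketch below uses a simple nearest-neighbor resize; the `resize_filter` name and this particular interpolation scheme are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def resize_filter(K, new_h, new_w):
    """Resize a complex frequency-domain filter to a new grid size
    via nearest-neighbor sampling (a crude stand-in for interpolation)."""
    h, w, c = K.shape
    rows = np.arange(new_h) * h // new_h  # map each target row to a source row
    cols = np.arange(new_w) * w // new_w  # map each target col to a source col
    return K[rows][:, cols]

# Demo: grow a filter from a 14x14 grid to a 28x28 grid.
K = np.random.randn(14, 8, 4) + 1j * np.random.randn(14, 8, 4)
K_big = resize_filter(K, 28, 15)
```

In practice a smoother scheme (e.g. bilinear interpolation of the real and imaginary parts) would be preferable, but the structural point is the same: no weights need retraining to change the input resolution.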
Additionally, the GFNet exhibits competitive performance in transfer learning tasks across datasets such as CIFAR-10, CIFAR-100, and Stanford Cars, supporting its generalization capabilities beyond the ImageNet dataset.
Implications and Future Directions
GFNet's approach to processing image data in the frequency domain represents a promising alternative to current vision transformer architectures. This shift not only alleviates computational load but also retains high accuracy, making it suitable for real-time and resource-constrained environments. The model architecture's inherent flexibility to adapt to varying resolutions reinforces its potential applicability across a wider range of vision tasks, including semantic segmentation and object detection.
Looking forward, the integration of GFNet in AI systems could substantially enhance the efficiency of large-scale image processing tasks. Further research could explore combining global filter networks with other neural architectures, such as convolutional approaches, to capitalize on their respective strengths.
Conclusion
The Global Filter Network presents an efficient and adaptable solution to the inherent challenges of scaling vision transformers. By using the DFT for efficient spatial-dependency modeling, GFNet advances the landscape of image classification models and offers a viable path toward computationally efficient AI vision systems.