Inception Transformer (2205.12956v2)

Published 25 May 2022 in cs.CV, cs.AI, and cs.LG

Abstract: Recent studies show that Transformer has strong capability of building long-range dependencies, yet is incompetent in capturing high frequencies that predominantly convey local information. To tackle this issue, we present a novel and general-purpose Inception Transformer, or iFormer for short, that effectively learns comprehensive features with both high- and low-frequency information in visual data. Specifically, we design an Inception mixer to explicitly graft the advantages of convolution and max-pooling for capturing the high-frequency information to Transformers. Different from recent hybrid frameworks, the Inception mixer brings greater efficiency through a channel splitting mechanism to adopt parallel convolution/max-pooling path and self-attention path as high- and low-frequency mixers, while having the flexibility to model discriminative information scattered within a wide frequency range. Considering that bottom layers play more roles in capturing high-frequency details while top layers more in modeling low-frequency global information, we further introduce a frequency ramp structure, i.e. gradually decreasing the dimensions fed to the high-frequency mixer and increasing those to the low-frequency mixer, which can effectively trade-off high- and low-frequency components across different layers. We benchmark the iFormer on a series of vision tasks, and showcase that it achieves impressive performance on image classification, COCO detection and ADE20K segmentation. For example, our iFormer-S hits the top-1 accuracy of 83.4% on ImageNet-1K, much higher than DeiT-S by 3.6%, and even slightly better than much bigger model Swin-B (83.3%) with only 1/4 parameters and 1/3 FLOPs. Code and models will be released at https://github.com/sail-sg/iFormer.

Authors (6)
  1. Chenyang Si (36 papers)
  2. Weihao Yu (36 papers)
  3. Pan Zhou (220 papers)
  4. Yichen Zhou (21 papers)
  5. Xinchao Wang (203 papers)
  6. Shuicheng Yan (275 papers)
Citations (157)

Summary

Inception Transformer: A Novel Approach for Comprehensive Feature Learning in Visual Data

The paper "Inception Transformer" introduces a novel Transformer architecture, named iFormer, designed to enhance the model's capability in capturing both high- and low-frequency information within visual data. Transformer models have achieved remarkable success in modeling long-range dependencies, particularly in NLP and recent adaptations to vision tasks. However, a notable limitation of these models is their reduced sensitivity to high-frequency components that encapsulate most local information, such as edges and textures, which are pivotal for certain visual tasks.

Innovation and Technical Contributions

The iFormer architecture addresses this limitation through an innovative design called the Inception mixer, which is inspired by the concept of Inception modules commonly used in CNNs. The primary contribution of the iFormer is the merging of convolutional and Transformer-based architectures to simultaneously leverage high- and low-frequency information:

  • Inception Mixer: This component extends the Inception module concept by splitting the input channels into parallel paths. A high-frequency path uses convolution and max-pooling operations to emphasize local signals, while a low-frequency path uses the self-attention mechanism to capture long-range dependencies. This design enhances the Transformer's ability to integrate rich visual representations across frequency levels.
  • Frequency Ramp Structure: To balance high- and low-frequency learning across the architecture's layers, the authors propose a frequency ramp structure. It modulates the number of channels allocated to the high- and low-frequency mixers from bottom to top layers, echoing how lower layers capture detail-rich high-frequency features while higher layers focus on broader, low-frequency patterns. A simplified code sketch of both components follows this list.
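To make the channel-splitting idea concrete, the PyTorch-style sketch below divides the channel dimension into a max-pooling branch, a depthwise-convolution branch, and a self-attention branch, then fuses the three outputs. It is a simplified illustration, not the released iFormer implementation: the paper's low-frequency path runs attention on a pooled feature map and upsamples afterwards, which is omitted here, and the class name, branch widths, and ratio schedule are chosen purely for illustration.

```python
import torch
import torch.nn as nn


class InceptionMixer(nn.Module):
    """Channel-splitting token mixer in the spirit of the iFormer Inception mixer.

    Channels are split into a high-frequency group (max-pooling branch +
    depthwise-convolution branch) and a low-frequency group (self-attention).
    Simplified sketch; the paper's attention path additionally downsamples
    and upsamples, which is omitted here.
    """

    def __init__(self, dim: int, high_freq_ratio: float = 0.5, num_heads: int = 4):
        super().__init__()
        self.dim_high = int(dim * high_freq_ratio)
        self.dim_low = dim - self.dim_high
        self.dim_pool = self.dim_high // 2              # channels for the max-pool branch
        self.dim_conv = self.dim_high - self.dim_pool   # channels for the conv branch

        # High-frequency branch 1: max-pooling + pointwise projection.
        self.pool_branch = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(self.dim_pool, self.dim_pool, kernel_size=1),
        )
        # High-frequency branch 2: pointwise + depthwise 3x3 convolution.
        self.conv_branch = nn.Sequential(
            nn.Conv2d(self.dim_conv, self.dim_conv, kernel_size=1),
            nn.Conv2d(self.dim_conv, self.dim_conv, kernel_size=3,
                      padding=1, groups=self.dim_conv),
        )
        # Low-frequency branch: global self-attention over flattened tokens.
        self.attn = nn.MultiheadAttention(self.dim_low, num_heads, batch_first=True)
        # Final pointwise fusion of the concatenated branch outputs.
        self.fuse = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (B, C, H, W).
        b, _, h, w = x.shape
        x_pool, x_conv, x_low = torch.split(
            x, [self.dim_pool, self.dim_conv, self.dim_low], dim=1)

        y_pool = self.pool_branch(x_pool)   # local, high-frequency detail
        y_conv = self.conv_branch(x_conv)   # local, high-frequency detail

        tokens = x_low.flatten(2).transpose(1, 2)       # (B, H*W, C_low)
        y_attn, _ = self.attn(tokens, tokens, tokens)   # global, low-frequency context
        y_attn = y_attn.transpose(1, 2).reshape(b, self.dim_low, h, w)

        return self.fuse(torch.cat([y_pool, y_conv, y_attn], dim=1))


if __name__ == "__main__":
    x = torch.randn(2, 96, 56, 56)
    # Frequency-ramp idea: more channels go to the high-frequency paths in
    # early stages and fewer in later stages (ratios here are illustrative).
    for ratio in (0.75, 0.5, 0.25):
        mixer = InceptionMixer(dim=96, high_freq_ratio=ratio)
        print(ratio, mixer(x).shape)   # torch.Size([2, 96, 56, 56])
```

The loop at the bottom gestures at the frequency ramp: early stages would instantiate the mixer with a larger high-frequency ratio and later stages with a smaller one, so channel capacity shifts from local detail toward global context with depth.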

Empirical Evaluation and Results

iFormer demonstrates substantial performance improvements over prior state-of-the-art methods across a spectrum of vision tasks: image classification, object detection, and segmentation. Notably, iFormer-S achieves a top-1 accuracy of 83.4% on ImageNet-1K, outperforming DeiT-S by 3.6% and slightly surpassing the substantially larger Swin-B (83.3%), while using roughly 1/4 of its parameters and 1/3 of its FLOPs.

Empirical evaluations on COCO detection and ADE20K segmentation further confirm iFormer's ability to balance high- and low-frequency information, yielding consistent gains over comparable backbones and establishing it as a robust, general-purpose vision backbone.

Theoretical and Practical Implications

The iFormer architecture advances the state of the art in Transformer models by demonstrating that strategic integration of convolutional elements significantly enhances high-frequency representation learning. This addresses a gap in the original Transformer design, whose global attention mechanism biases it toward low-frequency information when applied to vision tasks.

Practically, iFormer holds promise as a versatile, efficient backbone for a wide range of vision applications where both local detail and global context are crucial, such as fine-grained classification, detection, and segmentation tasks.

Speculative Future Directions

Looking forward, the development of iFormer paves the way for further exploration of hybrid architectures that blend convolutional networks with Transformer-style attention. Future research could explore dynamically allocating channels to the frequency-specific pathways during training, or adapting path selection to the task at hand. Applying similar architectural principles to video and multi-modal data could likewise extend the robustness and flexibility of such models across data modalities.

In conclusion, iFormer stands as a significant contribution to the field of computer vision, providing a practical architectural solution for enhancing Transformer models' capability to learn comprehensive feature representations across frequency domains. This work not only offers improved performance but also brings new insights into the design and application of hybrid neural architectures.
