Focal Modulation Networks
The paper "Focal Modulation Networks" introduces an innovative architecture, FocalNets, which eliminates the reliance on self-attention (SA) by introducing a novel focal modulation module for token interactions in computer vision. This approach is structured around three key components: focal contextualization, gated aggregation, and element-wise affine transformation.
Key Components of Focal Modulation
- Focal Contextualization: Encodes visual contexts from short to long range using a stack of depth-wise convolutional layers whose receptive field grows at each focal level.
- Gated Aggregation: Selectively gathers the multi-level contexts into a single modulator for each query token, with gates conditioned on the query content.
- Element-wise Affine Transformation: Injects the modulator into the query through element-wise operations to produce the output representation (a minimal sketch of the full module follows this list).
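To make the three steps concrete, the following is a minimal PyTorch sketch of a focal modulation layer. It is an illustrative reconstruction rather than the authors' released implementation: the layer names, the number of focal levels, the growing kernel sizes, and the omission of normalization are simplifying assumptions.

```python
# Minimal sketch of a focal modulation layer (illustrative, not the official code).
import torch
import torch.nn as nn

class FocalModulationSketch(nn.Module):
    def __init__(self, dim, focal_levels=3, base_kernel=3):
        super().__init__()
        # Project input tokens into a query, a context map, and per-level gates.
        self.proj_in = nn.Linear(dim, 2 * dim + (focal_levels + 1))
        # Focal contextualization: depth-wise convolutions with growing kernels,
        # capturing short- to long-range visual context level by level.
        self.focal_convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size=base_kernel + 2 * l,
                          padding=(base_kernel + 2 * l) // 2, groups=dim, bias=False),
                nn.GELU(),
            )
            for l in range(focal_levels)
        ])
        self.h = nn.Conv2d(dim, dim, kernel_size=1)   # modulator projection
        self.proj_out = nn.Linear(dim, dim)
        self.focal_levels = focal_levels

    def forward(self, x):                             # x: (B, H, W, C) tokens on a grid
        B, H, W, C = x.shape
        q, ctx, gates = torch.split(self.proj_in(x), [C, C, self.focal_levels + 1], dim=-1)
        ctx = ctx.permute(0, 3, 1, 2)                 # (B, C, H, W)
        gates = gates.permute(0, 3, 1, 2)             # (B, L+1, H, W)

        # Gated aggregation: accumulate each level's context, weighted by an
        # input-dependent gate, into a single modulator per position.
        modulator = 0
        for l, conv in enumerate(self.focal_convs):
            ctx = conv(ctx)
            modulator = modulator + ctx * gates[:, l:l + 1]
        # Global average pooling provides image-level context as the last level.
        global_ctx = ctx.mean(dim=(2, 3), keepdim=True)
        modulator = modulator + global_ctx * gates[:, self.focal_levels:]

        # Element-wise affine transformation: inject the modulator into the query.
        out = q * self.h(modulator).permute(0, 2, 3, 1)
        return self.proj_out(out)

# Usage: a 14x14 grid of 96-dimensional tokens.
layer = FocalModulationSketch(dim=96)
tokens = torch.randn(2, 14, 14, 96)
print(layer(tokens).shape)                            # torch.Size([2, 14, 14, 96])
```

Each focal level reuses the previous level's output, so the receptive field expands progressively, and the final global pooling supplies image-level context as the last level before gating.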
Experimental Validation
Extensive experiments demonstrate that FocalNets deliver superior interpretability and outperform state-of-the-art SA counterparts, such as Swin and Focal Transformers, across tasks like image classification, object detection, and segmentation.
- Image Classification: FocalNets achieve top-1 accuracies of 82.3% and 83.9% at tiny and base model sizes, respectively, on ImageNet-1K. After pretraining on ImageNet-22K, accuracy improves further to 86.5% and 87.3% when finetuning at 224 and 384 resolution, respectively.
- Object Detection: With Mask R-CNN, FocalNet at base size trained with a 1x schedule outperforms its Swin counterpart by 2.1 points and already surpasses Swin trained with a longer 3x schedule.
- Semantic Segmentation: With UPerNet, FocalNet at base size again surpasses Swin, outperforming it by 2.4 points in single-scale evaluation and even exceeding Swin's multi-scale result.
Notably, when paired with advanced methods, FocalNets established new state-of-the-art results: with Mask2former on ADE20K semantic segmentation and COCO panoptic segmentation, and with DINO on COCO object detection.
Theoretical and Practical Implications
The research challenges existing norms by positing that SA can be wholly replaced in vision models without a performance penalty. Because focal modulation scales linearly with the number of tokens, whereas SA scales quadratically, this opens a path toward more computationally efficient models and makes FocalNets a compelling alternative for tasks with high-resolution inputs.
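A back-of-the-envelope comparison illustrates the scaling argument. The snippet below counts approximate per-layer interaction costs under assumed constants (channel dimension, kernel size, number of focal levels); the numbers are illustrative, not measured FLOPs from the paper.

```python
# Rough per-layer cost scaling with token count N (illustrative constants).
def self_attention_cost(n, d):
    # Query-key similarities and value aggregation are both O(N^2 * d).
    return 2 * n * n * d

def focal_modulation_cost(n, d, k=3, levels=3):
    # Depth-wise convolutions and gating grow linearly in N.
    return n * d * (levels * k * k + levels + 1)

for n in (196, 3136, 12544):          # 14x14, 56x56, 112x112 token grids
    sa, fm = self_attention_cost(n, 96), focal_modulation_cost(n, 96)
    print(f"N={n:>6}: SA ~ {sa:.2e}, focal modulation ~ {fm:.2e}")
```

As the grid resolution grows, the quadratic term dominates self-attention's cost while the focal modulation estimate grows only linearly, which is the crux of the high-resolution argument.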
From a theoretical standpoint, the focal modulation mechanism is translation-invariant and input-dependent by construction, presenting new avenues for architecture design in vision. FocalNets also indicate a shift toward architectures that build up layered context via convolutional operations, balancing global awareness with local specificity.
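The contrast with attention can be stated compactly. Self-attention computes query-dependent interactions with every token first and aggregates afterwards, while focal modulation aggregates context around each position first and then interacts with the query element-wise. The formulation below paraphrases the paper's equations with lightly simplified notation:

```latex
% Self-attention: interaction first (query-key similarities), then aggregation over values.
y_i = \sum_{j} \operatorname{softmax}_j\!\big(q(x_i)^{\top} k(x_j)\big)\, v(x_j)

% Focal modulation: aggregation first (gated multi-level contexts z_i^l),
% then an element-wise interaction with the query.
y_i = q(x_i) \odot h\!\Big(\textstyle\sum_{l=1}^{L+1} g_i^{\,l} \cdot z_i^{\,l}\Big)
```

Because the modulator is produced by convolutions and pooling around each position, the operator is translation-invariant without positional embeddings, while the gates and the query projection keep the interaction input-dependent.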
Future Developments
The potential applications of focal modulation extend beyond vision tasks, with possibilities in NLP and multi-modal learning. Future work might explore cross-modulation strategies to strengthen multi-modal learning frameworks, along with architecture refinements that further improve performance and ease integration into pipelines for more complex AI systems.
In conclusion, the introduction of Focal Modulation Networks offers a substantive enhancement over current state-of-the-art methods in vision modeling, promising efficiency and adaptability across diverse application domains.