An Examination of Focal Self-Attention for Enhanced Vision Transformers
The academic landscape of computer vision has been significantly altered by the introduction of Vision Transformers (ViTs). The promise of these models lies in their ability to capture both short- and long-range dependencies naturally, a task previously handled by separate, specialized architectures. However, the self-attention mechanism at the core of ViTs is computationally expensive, particularly for high-resolution tasks such as object detection. In the paper discussed here, the authors propose focal self-attention, a mechanism that integrates local and global attention within ViTs to overcome these computational challenges while improving performance.
Overview of Focal Self-Attention
Focal self-attention is introduced as a mechanism that amalgamates fine-grained local attention with coarse-grained global attention. This dual-focused attention allows each token to attend to its immediate neighbors with high granularity, while also considering distant tokens in a summarized form. Through this method, the paper introduces a new class of Vision Transformers, Focal Transformers, designed to efficiently capture both local and global dependencies.
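To make the idea concrete, below is a minimal single-head sketch in PyTorch under simplifying assumptions: the feature map is split into non-overlapping windows, and every query in a window attends to the fine-grained tokens of its own window plus coarse-grained tokens obtained by average-pooling the whole map. Learned projections, multiple heads, multiple focal levels, and relative position bias are all omitted; the function name focal_attention_sketch and the window/pool sizes are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def focal_attention_sketch(x, window=7, pool=4):
    """Sketch of the focal attention idea (not the paper's exact implementation).

    x: feature map of shape (B, H, W, C); H and W are assumed divisible by
    `window` and `pool`. Returns a tensor of the same shape.
    """
    B, H, W, C = x.shape
    nh, nw = H // window, W // window

    # Coarse-grained tokens: average-pool the whole map so each window can
    # attend to a short summary of distant regions.
    coarse = F.avg_pool2d(x.permute(0, 3, 1, 2), pool)       # (B, C, H/p, W/p)
    coarse = coarse.flatten(2).transpose(1, 2)                # (B, Nc, C)

    # Fine-grained tokens: partition the map into non-overlapping windows.
    win = x.reshape(B, nh, window, nw, window, C)
    win = win.permute(0, 1, 3, 2, 4, 5).reshape(B, nh * nw, window * window, C)

    outs = []
    for i in range(nh * nw):
        q = win[:, i]                                         # (B, w*w, C)
        # Each query sees its own fine-grained window plus the global coarse tokens.
        kv = torch.cat([win[:, i], coarse], dim=1)            # (B, w*w + Nc, C)
        attn = torch.softmax(q @ kv.transpose(1, 2) / C ** 0.5, dim=-1)
        outs.append(attn @ kv)

    # Reassemble the windows back into a (B, H, W, C) feature map.
    out = torch.stack(outs, dim=1).reshape(B, nh, nw, window, window, C)
    return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

# Toy usage: a 56x56, 96-channel feature map, typical of an early ViT stage.
y = focal_attention_sketch(torch.randn(2, 56, 56, 96))
print(y.shape)  # torch.Size([2, 56, 56, 96])
```

In the full model, the set of fine-grained keys is restricted to tokens surrounding each window rather than the window alone, and the pooling is applied at several focal levels; the sketch above keeps only the core local-plus-summarized-global structure.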
The complexity of focal self-attention is kept manageable by partitioning the input feature map into windows, so that only a small number of tokens is processed at fine granularity while distant regions are pooled into coarse summaries. As a result, the computational cost grows roughly linearly with the number of tokens rather than quadratically, because each query attends only to a fixed set of fine-grained neighbors and pooled coarse-grained summaries, a significant gain in efficiency for high-resolution images.
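To make the scaling concrete, a rough comparison under assumed notation (these symbols are not taken from the paper): with $N$ tokens, channel dimension $d$, and a fixed budget of $M$ fine- plus coarse-grained keys visible to each query,

$$\text{standard self-attention: } \mathcal{O}(N^2 d) \qquad \text{vs.} \qquad \text{focal self-attention: } \mathcal{O}(N M d), \quad M \ll N.$$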
Key Results
The Focal Transformer models, equipped with focal self-attention, demonstrated superior performance across various computer vision benchmarks. On ImageNet classification with a $224 \times 224$ input size, Focal Transformers achieved Top-1 accuracies of 83.5% and 83.8% at moderate and larger model scales, respectively, surpassing comparable state-of-the-art ViTs.
Furthermore, when used as backbones for object detection on the COCO dataset, Focal Transformers consistently outperformed the state-of-the-art Swin Transformers across six different object detection methods. The largest variant achieved box mean Average Precision (mAP) scores of 58.7/58.9 and mask mAP scores of 50.9/51.3 on the COCO mini-val/test-dev sets. It also reached 55.4 mean Intersection over Union (mIoU) for semantic segmentation on ADE20K, setting new records across these tasks.
Theoretical and Practical Implications
The proposed focal self-attention offers a framework for capturing intricate local-global interactions efficiently. The results suggest a robust architectural blueprint that addresses the computational inefficiencies of existing ViTs without compromising task performance, especially at high resolutions.
One of the most salient theoretical implications of this work is its demonstration that treating local and global dependencies within a unified framework is effective. Previously, the split between coarse-grained global and fine-grained local attention mechanisms led to suboptimal trade-offs; the focal approach links the two within a single, coherent mechanism.
Future Prospects
Adapting focal self-attention to broader applications within and beyond visual tasks is an exciting prospect. Future research could explore its effectiveness, and possible refinements, in other domains that benefit from attention mechanisms, such as natural language processing, where long-range dependencies are likewise crucial.
There also remains room to further reduce complexity and to tune the parameters governing the effective scope of focal attention, potentially enabling deployment on resource-constrained devices. In addition, extending the architecture toward unsupervised learning could broaden its adoption and versatility.
Conclusion
Focal self-attention marks a significant contribution to computer vision and to the evolution of Transformer models. By offering an efficient means of processing high-resolution data while leveraging both local and global information, this paper sets a new performance baseline for ViTs and provides a robust foundation for future exploration and application in AI.