- The paper introduces Cross-Covariance Attention (XCA) to replace quadratic self-attention with a channel-wise mechanism that scales linearly with the number of image tokens.
- XCiT achieves 86.0% top-1 accuracy on ImageNet and outperforms benchmark models in object detection and semantic segmentation.
- Its scalable design and resolution robustness pave the way for efficient high-resolution image processing and broader applications in computer vision.
Cross-Covariance Image Transformers
The paper presents a novel approach to transformer architectures tailored for computer vision, addressing the computational inefficiencies associated with conventional self-attention mechanisms. Traditional transformers, highly effective in NLP, face scalability challenges in vision tasks due to quadratic complexity in time and memory with respect to the number of tokens (typically image patches).
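For reference, the asymptotic costs that motivate the design compare as follows, with N tokens, d feature dimensions, and h heads (following the paper's complexity analysis; treat the expressions as indicative rather than exact constants):

$$
\text{Self-attention: } \mathcal{O}(N^2 d)\ \text{time},\ \ \mathcal{O}(hN^2 + Nd)\ \text{memory}
\qquad
\text{XCA: } \mathcal{O}(Nd^2/h)\ \text{time},\ \ \mathcal{O}(d^2/h + Nd)\ \text{memory}
$$

For high-resolution images, N grows with the number of patches while d stays fixed, so the quadratic term in N is the dominant cost that XCA removes.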
Key Contributions
- Cross-Covariance Attention (XCA): The authors introduce a "transposed" attention mechanism that operates across feature channels instead of tokens. Rather than computing explicit pairwise interactions between tokens, XCA derives the attention map from the cross-covariance matrix between keys and queries, so that features attend to one another and the cost becomes linear in the number of tokens (a minimal sketch follows this list).
- Scalable Vision Transformers: The proposed Cross-Covariance Image Transformers (XCiT) combine the accuracy of conventional transformers with the scalability of convolutional networks. This inherently scalable design enables efficient processing of high-resolution images, overcoming a key limitation of applying transformers directly to such tasks.
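To make the mechanism concrete, here is a minimal PyTorch sketch of an XCA layer based on the description above. The class and variable names are ours, the hyperparameters are illustrative, and the other components of a full XCiT block (local patch interaction, feed-forward layers, class attention) are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class XCASketch(nn.Module):
    """Minimal sketch of cross-covariance attention (XCA).

    Attention is computed between the d/h feature channels of each head
    (a (d/h) x (d/h) map) rather than between the N tokens (an N x N map),
    so the cost grows linearly with the number of tokens.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        # Learnable temperature per head, as described in the paper.
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) -- B images, N patch tokens, C channels
        B, N, C = x.shape
        head_dim = C // self.num_heads

        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, head_dim)
        q, k, v = qkv.permute(2, 0, 3, 4, 1)  # each: (B, heads, head_dim, N)

        # L2-normalize queries and keys along the token axis so the
        # cross-covariance entries stay bounded.
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)

        # Channel-channel attention map: (B, heads, head_dim, head_dim);
        # its size is independent of the number of tokens N.
        attn = (q @ k.transpose(-2, -1)) * self.temperature
        attn = attn.softmax(dim=-1)

        # Mix value channels with the attention weights: (B, heads, head_dim, N)
        out = attn @ v
        out = out.permute(0, 3, 1, 2).reshape(B, N, C)
        return self.proj(out)
```

The key point is that `attn` has shape `(B, heads, head_dim, head_dim)`: its size depends only on the feature dimension, not on the number of tokens, which is where the linear scaling in N comes from.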
Experimental Results
- Image Classification: XCiT reaches 86.0% top-1 accuracy on ImageNet-1k, competitive with or outperforming state-of-the-art models such as DeiT and CaiT at comparable parameter budgets.
- Dense Prediction Tasks: For object detection and instance segmentation on COCO and semantic segmentation on ADE20k, XCiT matches or surpasses both convolutional backbones and recent transformer models, reporting for example 48.5 AP for COCO object detection and 48.4 mIoU for ADE20k semantic segmentation, ahead of comparable Swin Transformer backbones.
- Self-Supervised Learning: Utilizing DINO for unsupervised feature learning, XCiT achieves a notable 80.9% top-1 accuracy on ImageNet, underscoring its versatility across supervision paradigms.
Technical Advancements
- Block-Diagonal Attention: XCA restricts the channel-channel attention map to a block-diagonal form by partitioning features into groups (heads), which further reduces computational load. The authors draw a parallel with Group Normalization, and the restriction also eases optimization.
- Resolution Robustness: XCiT models are robust to changes in image resolution between training and inference, a significant advantage over previous approaches. Because XCA operates on fixed-size covariance blocks whose dimensions do not depend on the number of tokens, the architecture adapts naturally to varying input sizes (see the shape check after this list).
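Both properties can be verified at the shape level with a toy snippet (random tensors, dimensions chosen arbitrarily, no training): each head produces a fixed (d/h) x (d/h) attention block no matter how many tokens the input image yields.

```python
import torch
import torch.nn.functional as F

d, h = 64, 8                    # embedding dim and number of heads (arbitrary here)
head_dim = d // h

for n_tokens in (196, 784):     # e.g. 14x14 and 28x28 patch grids
    q = torch.randn(1, h, head_dim, n_tokens)   # per-head queries, channel-first
    k = torch.randn(1, h, head_dim, n_tokens)   # per-head keys

    q = F.normalize(q, dim=-1)                  # L2-normalize along the token axis
    k = F.normalize(k, dim=-1)

    attn = (q @ k.transpose(-2, -1)).softmax(dim=-1)
    # One (head_dim x head_dim) block per head -> a block-diagonal d x d map.
    print(n_tokens, attn.shape)   # torch.Size([1, 8, 8, 8]) in both cases
```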
Implications and Future Directions
The introduction of XCiT marks a clear advance in large-scale image processing with transformers, offering a practical way around the computational bottleneck of token-to-token self-attention. Linear scaling with the token count suggests applicability to other domains with high-resolution data, such as video analysis or medical imaging.
Future work could explore the integration of XCiT in multimodal tasks or investigate fusion with other efficient transformer methods for broader applicability and improved performance across diverse datasets.
In sum, the paper takes a significant step towards more efficient vision transformers, preserving their accuracy benefits while easing computational constraints, and provides a useful reference point for future work on scalable attention.