Learned Queries for Efficient Local Attention
The paper "Learned Queries for Efficient Local Attention" introduces a novel local attention layer, named Query and Attend (QnA), aimed at enhancing Vision Transformers (ViTs) by addressing their inherent limitations in processing high-resolution images. ViTs have gained prominence for their ability to capture long-range dependencies in data, yet traditional self-attention mechanisms within these models suffer from high computational complexity and inefficient memory usage. The proposed QnA approach seeks to mitigate these challenges by introducing learned queries, facilitating fast and efficient local attention analogous to convolutional operations.
Methodological Innovations
- QnA Layer Design: The QnA layer draws inspiration from convolutional operations to introduce shift-invariance and locality. Unlike standard self-attention, which computes queries from the input, QnA employs learned queries that are shared across all windows. This reduces the attention cost within each window from quadratic to linear in the window size, improving scalability and efficiency (a minimal sketch appears after this list).
- Utilizing Locality and Overlapping Windows: The QnA mechanism aggregates its input over overlapping windows, similar to convolutions, while retaining the expressiveness of attention. Restricting attention to small, overlapping windows keeps the processing of high-resolution inputs efficient.
- Multiple Queries for Richer Feature Spaces: The model can employ multiple learned queries to capture richer feature subspaces. This extension introduces minimal computational overhead while significantly enhancing the expressiveness of the attention layer.
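As a rough illustration of these three ideas, the PyTorch sketch below implements a QnA-style layer under simplifying assumptions (single head, stride 1, "same" padding, one output projection over all queries). The class and parameter names (QnALayer, window, num_queries) are illustrative and do not come from the authors' reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QnALayer(nn.Module):
    def __init__(self, dim, window=3, num_queries=1):
        super().__init__()
        self.window = window
        self.num_queries = num_queries
        # Learned queries, shared across every window: the input only
        # produces keys and values, never queries.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * dim ** -0.5)
        self.to_kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(num_queries * dim, dim)

    def forward(self, x):                       # x: (B, H, W, C)
        B, H, W, C = x.shape
        k, v = self.to_kv(x).chunk(2, dim=-1)   # each (B, H, W, C)

        # Gather overlapping windows, exactly like a convolution with
        # "same" padding and stride 1.
        def unfold(t):
            t = t.permute(0, 3, 1, 2)                               # (B, C, H, W)
            t = F.unfold(t, self.window, padding=self.window // 2)  # (B, C*w*w, H*W)
            return t.view(B, C, self.window ** 2, H * W).permute(0, 3, 2, 1)

        k, v = unfold(k), unfold(v)              # each (B, H*W, w*w, C)

        # The queries do not depend on a window's content, so the cost per
        # window is linear in the window size (Q x w*w scores, not w*w x w*w).
        attn = torch.einsum('qc,bnkc->bnqk', self.queries, k) * C ** -0.5
        attn = attn.softmax(dim=-1)
        out = torch.einsum('bnqk,bnkc->bnqc', attn, v)               # (B, H*W, Q, C)
        return self.proj(out.reshape(B, H * W, -1)).view(B, H, W, C)

# Usage: multiple learned queries add only a small projection cost.
layer = QnALayer(dim=64, window=3, num_queries=4)
print(layer(torch.randn(2, 14, 14, 64)).shape)   # torch.Size([2, 14, 14, 64])
```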
Empirical Evaluation
The effectiveness of the QnA layer is demonstrated across multiple experiments, which show favorable accuracy-efficiency trade-offs. When integrated into hierarchical vision transformer architectures, QnA delivers accuracy competitive with state-of-the-art models while significantly reducing memory requirements and accelerating inference.
- Image Classification: On the ImageNet-1K benchmark, models using QnA achieve notable throughput improvements at comparable accuracy. For instance, the QnA-base model reports up to a 2x increase in inference speed over competing architectures while maintaining comparable top-1 accuracy.
- Object Detection: Incorporating QnA into the DETR framework for object detection demonstrates notable improvements in handling small objects, emphasizing the utility of efficient local attention mechanisms in downstream tasks.
- Architecture Flexibility: The QnA layer can also serve as an up-sampling or down-sampling mechanism, facilitating its application in diverse tasks, including semantic segmentation and image synthesis (a strided down-sampling variant is sketched after this list).
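To make the down-sampling claim concrete, the toy function below reuses the window-extraction step from the earlier sketch with a stride of 2, the same way a strided convolution halves spatial resolution. The function name and shapes are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def extract_strided_windows(x, window=3, stride=2):
    """Unfold overlapping windows with stride > 1; the output grid shrinks."""
    B, H, W, C = x.shape
    t = x.permute(0, 3, 1, 2)                                    # (B, C, H, W)
    t = F.unfold(t, window, padding=window // 2, stride=stride)  # (B, C*w*w, L)
    Ho = (H + 2 * (window // 2) - window) // stride + 1
    Wo = (W + 2 * (window // 2) - window) // stride + 1
    return t.view(B, C, window * window, Ho * Wo).permute(0, 3, 2, 1), (Ho, Wo)

windows, (Ho, Wo) = extract_strided_windows(torch.randn(1, 8, 8, 32))
print(windows.shape, Ho, Wo)   # torch.Size([1, 16, 9, 32]) 4 4
```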
Implications and Future Directions
The QnA layer provides a compelling solution to some of the computational bottlenecks associated with Vision Transformers. By lowering memory and processing demands without compromising accuracy, it paves the way for more efficient deployment of transformer-based models in industrial and real-time applications. Furthermore, the use of learned queries to inject shift-invariance and adaptivity into attention layers can spur further exploration of hybrid architectures that combine the strengths of convolutional networks and transformers.
In future research, exploring larger receptive fields and applying neural architecture search techniques might yield further gains in efficiency and accuracy. The QnA layer's demonstrated versatility suggests that continued refinement and integration into broader vision frameworks could accelerate progress in the performance of computer vision models.
This work reinvigorates the pursuit of adaptable and efficient model architectures, advocating for a balance between the rich representational capacity of transformers and the computational efficiency of convolutional neural networks.