Focus-DETR: Enhancing Efficiency in DETR Models
The paper presents Focus-DETR, an approach that optimizes Detection Transformer (DETR)-like models by refining the attention mechanisms used in these architectures. DETR models have shown exceptional promise in object detection by leveraging the Transformer architecture for global feature interaction. However, they often suffer computational inefficiency because attention is distributed indiscriminately across all image tokens, many of which carry little useful information, leading to redundant computation.
Key Contributions and Methodology
Focus-DETR addresses these inefficiencies by introducing a focused attention paradigm that selectively enhances information-rich tokens. The core of the approach is a dual attention mechanism that balances computational cost against accuracy:
- Scoring Mechanism: A two-stage scoring mechanism estimates the relevance of each token, combining localization and category-level semantics. Scoring is performed by a Foreground Token Selector (FTS) that operates on multi-scale feature maps for better token discrimination (a minimal sketch of this scoring step follows the list).
- Multi-category Score Predictor: This component refines token selection with finer semantic granularity, allowing the model to identify and prioritize tokens with higher objectness.
- Dual Attention Encoder: Selected tokens are fed into a restructured encoder whose dual attention mechanism enhances the interaction among the fine-grained object tokens (see the encoder sketch below). This preserves the integrity of the encoded features while markedly reducing computational cost compared to prior sparse methods.
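To make the two-stage scoring concrete, the following is a minimal PyTorch-style sketch of how a foreground token selector and a multi-category score predictor might jointly rank tokens. The module names, layer choices, dimensions, and keep ratio are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of two-stage token scoring: a foreground (localization)
# score and a multi-category (semantic) score jointly decide which tokens the
# encoder should keep. Shapes and the keep ratio are illustrative assumptions.
import torch
import torch.nn as nn


class TokenScorer(nn.Module):
    def __init__(self, embed_dim: int = 256, num_classes: int = 80):
        super().__init__()
        # Stage 1: foreground/background score per token.
        self.foreground_head = nn.Linear(embed_dim, 1)
        # Stage 2: per-class semantic score per token.
        self.category_head = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens: torch.Tensor, keep_ratio: float = 0.3):
        # tokens: (batch, num_tokens, embed_dim), flattened multi-scale features.
        fg_score = self.foreground_head(tokens).squeeze(-1).sigmoid()        # (B, N)
        cls_score = self.category_head(tokens).sigmoid().max(dim=-1).values  # (B, N)
        # Combine both cues; tokens with a high joint score are "information-rich".
        joint_score = fg_score * cls_score
        num_keep = max(1, int(keep_ratio * tokens.shape[1]))
        keep_idx = joint_score.topk(num_keep, dim=1).indices                 # (B, K)
        return keep_idx, joint_score
```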
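Similarly, here is a hedged sketch of the dual-attention idea in the encoder: the selected high-score tokens interact through an extra self-attention pass and are scattered back into the full token sequence, while the remaining tokens are left untouched. The regular encoder update (deformable attention in DETR-like models) is omitted, and all names and shapes are assumptions rather than the paper's implementation.

```python
# Hypothetical sketch of the dual-attention encoder step: enhance only the
# selected fine-grained tokens with self-attention, then scatter them back
# into the full token sequence. The standard encoder attention over all
# tokens is abstracted away here.
import torch
import torch.nn as nn


class DualAttentionEncoderLayer(nn.Module):
    def __init__(self, embed_dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.fine_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, tokens: torch.Tensor, keep_idx: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, C); keep_idx: (B, K) indices of the selected tokens.
        B, N, C = tokens.shape
        gather_idx = keep_idx.unsqueeze(-1).expand(-1, -1, C)   # (B, K, C)
        selected = tokens.gather(1, gather_idx)                 # (B, K, C)
        # Enhance the fine-grained tokens with ordinary self-attention.
        enhanced, _ = self.fine_attn(selected, selected, selected)
        enhanced = self.norm(selected + enhanced)
        # Scatter the enhanced tokens back; untouched tokens keep their
        # original features, which is where the computation is saved.
        return tokens.scatter(1, gather_idx, enhanced)
```

A usage pass would chain the two sketches: score the flattened multi-scale tokens, select the top fraction, and run only those through the extra attention, e.g. `keep_idx, _ = TokenScorer()(tokens)` followed by `tokens = DualAttentionEncoderLayer()(tokens, keep_idx)`.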
Numerical Results
The proposed Focus-DETR model demonstrates substantial improvements in both computational efficiency and accuracy. Under equivalent settings, Focus-DETR reaches 50.4 Average Precision (AP) on the COCO dataset, a gain of 2.2 points over state-of-the-art sparse DETR variants. It also roughly halves the computational load of the Transformer portion of the model, underscoring the practical benefit of the dual attention design. These improvements come with a corresponding reduction in GFLOPs and higher FPS, i.e., faster inference.
Implications and Future Directions
The enhancements proposed by Focus-DETR not only refine the performance of DETR-like models but also set a precedent for future sparsity-based approaches in transformers for computer vision tasks. The efficient allocation of attention resources opens avenues for real-time applications such as autonomous driving and video surveillance, where computational resources are at a premium.
Looking ahead, further exploration into hierarchical semantic grading strategies could provide additional performance gains. Exploring methods to unify semantic scoring across different Transformer layers could also drive advancements in efficiency. Given the ongoing development within the field, Focus-DETR offers a promising foundation on which more sophisticated and scalable object detection frameworks can be built.
Overall, this paper makes a valuable contribution to the landscape of efficient object detection models, offering insights that extend beyond computational savings to the broader design of efficient detection architectures.