Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity
The search for efficient object detection models has produced a series of developments in which transformers play a central role. Sparse DETR is a significant step in this direction, addressing the computational inefficiency that afflicts DETR and its successor, Deformable DETR. The paper presents a transformer-based object detector that employs a learnable sparsification strategy, maintaining competitive performance while substantially reducing computational cost.
Sparse DETR evolves from its predecessors by introducing a sparse update mechanism for encoder tokens. The main premise is that not all tokens require refinement to achieve high detection accuracy. This insight stems from the observation that a limited subset of encoder tokens significantly contributes to the model's decision-making. By focusing on these tokens, Sparse DETR minimizes the typical computational overhead encountered in earlier models.
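The core mechanism can be sketched in a few lines of PyTorch. The snippet below is a simplified illustration, not the authors' implementation: `scoring_net` and `encoder_layer` are hypothetical stand-ins, and a standard self-attention layer replaces the deformable attention used in the actual model. It captures the essential idea, which is that only the top fraction of encoder tokens is refined while the rest are passed through unchanged.

```python
import torch
import torch.nn as nn

def sparse_encoder_update(tokens, scoring_net, encoder_layer, rho=0.1):
    """Refine only the most salient encoder tokens.

    tokens: (batch, num_tokens, dim) flattened multi-scale backbone features.
    """
    batch, num_tokens, dim = tokens.shape
    k = max(1, int(rho * num_tokens))            # number of tokens to refine

    scores = scoring_net(tokens).squeeze(-1)     # (batch, num_tokens) saliency
    topk_idx = scores.topk(k, dim=1).indices     # indices of the salient tokens

    refined = tokens.clone()                     # unselected tokens pass through
    for b in range(batch):                       # gather, refine, scatter back
        selected = tokens[b, topk_idx[b]].unsqueeze(0)
        refined[b, topk_idx[b]] = encoder_layer(selected).squeeze(0)
    return refined, topk_idx
```

For example, with tokens of shape (2, 1000, 256), scoring_net = nn.Linear(256, 1), and encoder_layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True), only 100 of the 1,000 tokens per image would be updated by the encoder layer.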
The cornerstone of Sparse DETR is its encoder token sparsification method: a learnable scoring network predicts the saliency of each token, and only the most salient ones are refined by the encoder, so detections are obtained with far less computation. What distinguishes Sparse DETR is how this saliency is supervised, via a decoder cross-attention map (DAM) predictor. The DAM aggregates how strongly the decoder's object queries attend to each encoder token, so training the scoring network to predict it lets the model anticipate which encoder outputs are most likely to be referenced during decoding and narrow refinement to that vital subset of tokens.
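To make the DAM supervision concrete, the sketch below shows one way the scoring network could be trained, following the procedure described in the paper: the decoder's cross-attention is aggregated per encoder token, the top fraction is binarized into pseudo-labels, and a binary cross-entropy loss ties the predicted saliency to those labels. The tensor names and shapes are illustrative assumptions, not the reference code.

```python
import torch
import torch.nn.functional as F

def dam_loss(token_scores, cross_attn_maps, keep_ratio=0.1):
    """BCE loss between predicted token saliency and the binarized DAM.

    token_scores:    (batch, num_tokens) raw logits from the scoring network.
    cross_attn_maps: (batch, num_queries, num_tokens) decoder cross-attention.
    """
    # Aggregate the attention mass each encoder token receives from all queries.
    dam = cross_attn_maps.sum(dim=1)                      # (batch, num_tokens)

    # Binarize: mark the top keep_ratio fraction of tokens as positives.
    k = max(1, int(keep_ratio * dam.shape[1]))
    topk_idx = dam.topk(k, dim=1).indices
    target = torch.zeros_like(dam)
    target.scatter_(1, topk_idx, 1.0)

    # Train the scoring network to predict which tokens the decoder will reference.
    return F.binary_cross_entropy_with_logits(token_scores, target)
```

Because the DAM is computed from the decoder as it trains, the pseudo-labels evolve over time, letting the selector track which regions the decoder actually relies on.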
Evaluating Sparse DETR on the COCO dataset illustrates its strength: the model surpasses Deformable DETR while keeping only 10% of encoder tokens, reduces total computation by 38%, and increases frames per second by 42%. This efficiency comes without sacrificing accuracy (e.g., an Average Precision [AP] of 48.2 at the reduced computational cost).
The implications of Sparse DETR are twofold: it offers a leaner model suited to real-time applications, and it provides a foundation from which more efficient transformer-based object detection architectures can be built. By showing that careful token selection can dramatically reduce computational requirements without sacrificing performance, Sparse DETR opens a line of research on sparsity in neural networks, particularly in computer vision.
Looking ahead, the techniques presented in Sparse DETR can be combined with other innovations in transformer architecture, such as dynamic layer sparsification or attention mechanisms tailored to specific feature extraction tasks. In addition, by leveraging pre-trained dense representations suited to particular domains, Sparse DETR could further improve its adaptability to different datasets and conditions.
In essence, Sparse DETR marks a shift toward more resource-conscious AI models. By pairing strong performance with computational efficiency, it invites further scholarly inquiry and suggests new ways of turning heavyweight algorithms into deployable solutions in environments where computational resources are at a premium.