Deformable DETR: Deformable Transformers for End-to-End Object Detection (2010.04159v4)

Published 8 Oct 2020 in cs.CV

Abstract: DETR has been recently proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance. However, it suffers from slow convergence and limited feature spatial resolution, due to the limitation of Transformer attention modules in processing image feature maps. To mitigate these issues, we proposed Deformable DETR, whose attention modules only attend to a small set of key sampling points around a reference. Deformable DETR can achieve better performance than DETR (especially on small objects) with 10 times less training epochs. Extensive experiments on the COCO benchmark demonstrate the effectiveness of our approach. Code is released at https://github.com/fundamentalvision/Deformable-DETR.

This paper introduces Deformable DETR, an object detection model designed to address key limitations of the original DETR (DEtection TRansformer) model (Carion et al., 2020). The primary issues with DETR were its slow convergence during training (requiring hundreds of epochs) and its relatively poor performance on small objects, due to the high computational and memory complexity of Transformer attention mechanisms when applied to high-resolution feature maps.

The core innovation of Deformable DETR is the Deformable Attention Module. Unlike standard Transformer attention, which computes attention weights between a query and all key elements (pixels) in a feature map, the deformable attention module attends only to a small, fixed number of key sampling points around a reference point for each query. The locations of these sampling points (offsets relative to the reference point) and their corresponding attention weights are predicted dynamically from the query feature itself via linear projections. This sparse sampling mechanism significantly reduces the computational complexity and memory footprint, especially on high-resolution feature maps.

Mathematically, the single-scale Deformable Attention is defined as:

$\text{DeformAttn}(\mathbf{z}_q, \mathbf{p}_q, \mathbf{x}) = \sum_{m=1}^{M} \mathbf{W}_m \Big[ \sum_{k=1}^{K} A_{mqk} \cdot \mathbf{W}'_m\, \mathbf{x}(\mathbf{p}_q + \Delta \mathbf{p}_{mqk}) \Big]$

where $\mathbf{z}_q$ is the query feature, $\mathbf{p}_q$ is the 2D reference point, $\mathbf{x}$ is the input feature map, $M$ is the number of attention heads, $K$ is the number of sampled keys per head, $\Delta \mathbf{p}_{mqk}$ are the learned sampling offsets, $A_{mqk}$ are the learned attention weights, and $\mathbf{W}_m, \mathbf{W}'_m$ are learnable weight matrices. Bilinear interpolation is used to handle the fractional coordinates $\mathbf{p}_q + \Delta \mathbf{p}_{mqk}$.
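
To make the sparse sampling concrete, the following is a minimal single-scale sketch in PyTorch (an assumed framework; the class `DeformAttnSketch`, the tensor layouts, and the use of `grid_sample` for bilinear interpolation are illustrative choices, not the authors' released implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformAttnSketch(nn.Module):
    """Single-scale deformable attention over one flattened feature map (sketch)."""
    def __init__(self, d_model=256, n_heads=8, n_points=4):
        super().__init__()
        self.n_heads, self.n_points = n_heads, n_points
        self.head_dim = d_model // n_heads
        # Offsets Δp_mqk and weights A_mqk are predicted from the query feature alone.
        self.sampling_offsets = nn.Linear(d_model, n_heads * n_points * 2)
        self.attention_weights = nn.Linear(d_model, n_heads * n_points)
        self.value_proj = nn.Linear(d_model, d_model)   # plays the role of W'_m
        self.output_proj = nn.Linear(d_model, d_model)  # plays the role of W_m

    def forward(self, query, reference_points, feat, spatial_shape):
        # query: (N, Lq, C); reference_points: (N, Lq, 2) in [0, 1] as (x, y);
        # feat: (N, H*W, C) row-major flattened feature map; spatial_shape: (H, W).
        N, Lq, C = query.shape
        H, W = spatial_shape
        value = self.value_proj(feat).view(N, H * W, self.n_heads, self.head_dim)
        offsets = self.sampling_offsets(query).view(N, Lq, self.n_heads, self.n_points, 2)
        weights = self.attention_weights(query).view(N, Lq, self.n_heads, self.n_points)
        weights = weights.softmax(-1)                # A_mqk sums to 1 over the K points
        # Sampling locations p_q + Δp_mqk, kept in normalized [0, 1] coordinates.
        locs = reference_points[:, :, None, None, :] + offsets / query.new_tensor([W, H])
        grid = 2 * locs - 1                          # grid_sample expects [-1, 1]
        value = value.permute(0, 2, 3, 1).reshape(N * self.n_heads, self.head_dim, H, W)
        grid = grid.permute(0, 2, 1, 3, 4).reshape(N * self.n_heads, Lq, self.n_points, 2)
        # Bilinear interpolation at the (fractional) sampling locations.
        sampled = F.grid_sample(value, grid, mode='bilinear', align_corners=False)
        sampled = sampled.view(N, self.n_heads, self.head_dim, Lq, self.n_points)
        w = weights.permute(0, 2, 1, 3).unsqueeze(2)  # (N, M, 1, Lq, K)
        out = (sampled * w).sum(-1)                   # weighted sum over the K points
        out = out.permute(0, 3, 1, 2).reshape(N, Lq, C)
        return self.output_proj(out)

# Example: 300 queries attending into a 32x32 feature map with 256 channels.
attn = DeformAttnSketch()
out = attn(torch.randn(2, 300, 256), torch.rand(2, 300, 2),
           torch.randn(2, 32 * 32, 256), (32, 32))   # -> (2, 300, 256)
```

The point the sketch illustrates is that each query gathers only $M \times K$ sampled values instead of forming $H \times W$ attention terms, since the offsets and weights come from the query feature alone.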

This concept is extended to Multi-scale Deformable Attention, allowing the module to aggregate features from different levels of a feature pyramid simultaneously without requiring an explicit structure like FPN. It samples $K$ points per feature level (with $L$ levels in total), and the attention weights are normalized across all $L \times K$ points.

$\text{MSDeformAttn}(\mathbf{z}_q, \hat{\mathbf{p}}_q, \{\mathbf{x}^l\}_{l=1}^{L}) = \sum_{m=1}^{M} \mathbf{W}_m \Big[ \sum_{l=1}^{L} \sum_{k=1}^{K} A_{mlqk} \cdot \mathbf{W}'_m\, \mathbf{x}^l(\phi_{l}(\hat{\mathbf{p}}_q) + \Delta \mathbf{p}_{mlqk}) \Big]$

Here, $\hat{\mathbf{p}}_q$ is the normalized reference coordinate, $\mathbf{x}^l$ is the feature map at level $l$, and $\phi_l$ maps normalized coordinates to the coordinate frame of level $l$.
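
Two details change in the multi-scale case; a brief sketch, again under the PyTorch assumption (the projection layers `weight_proj` and `offset_proj` and all shapes are illustrative): the attention weights are softmax-normalized jointly over all $L \times K$ sampling points, and $\phi_l$ rescales the shared normalized reference point into each level's coordinate frame.

```python
import torch
import torch.nn as nn

def multiscale_weights_and_locations(query_feat, ref_norm, spatial_shapes,
                                     weight_proj, offset_proj, n_heads, n_points):
    # query_feat: (N, Lq, C); ref_norm: (N, Lq, 2) in [0, 1];
    # spatial_shapes: list of (H_l, W_l); weight_proj / offset_proj: nn.Linear layers.
    N, Lq, _ = query_feat.shape
    L = len(spatial_shapes)
    # A_mlqk: a single softmax over all L*K sampling points per head (not per level).
    weights = weight_proj(query_feat).view(N, Lq, n_heads, L * n_points).softmax(-1)
    offsets = offset_proj(query_feat).view(N, Lq, n_heads, L, n_points, 2)
    locations = []
    for l, (H_l, W_l) in enumerate(spatial_shapes):
        # phi_l: map the normalized reference point to level-l pixel coordinates,
        # then add that level's predicted offsets (interpreted in level-l pixels).
        scale = query_feat.new_tensor([W_l, H_l])
        locations.append(ref_norm[:, :, None, None, :] * scale + offsets[:, :, :, l])
    return weights, locations

# Example wiring (sizes are illustrative):
d_model, n_heads, n_points, n_levels = 256, 8, 4, 4
weight_proj = nn.Linear(d_model, n_heads * n_levels * n_points)
offset_proj = nn.Linear(d_model, n_heads * n_levels * n_points * 2)
shapes = [(64, 64), (32, 32), (16, 16), (8, 8)]
w, locs = multiscale_weights_and_locations(torch.randn(2, 300, d_model),
                                           torch.rand(2, 300, 2), shapes,
                                           weight_proj, offset_proj, n_heads, n_points)
```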

In the Deformable DETR architecture:

  1. Encoder: The standard Transformer self-attention modules are replaced with multi-scale deformable attention modules. Multi-scale feature maps (the $C_3$ to $C_5$ backbone outputs plus an additional map downsampled from $C_5$) are fed directly into the encoder; the module's ability to sample across scales handles multi-scale feature fusion without an FPN. Positional embeddings are supplemented with learnable scale-level embeddings. The complexity becomes linear, $O(HWC^2)$, with respect to the spatial size $HW$.
  2. Decoder: The cross-attention modules (where object queries attend to image features) are replaced with multi-scale deformable attention, while the self-attention modules (where object queries attend to each other) remain standard Transformer attention. The reference points $\hat{\mathbf{p}}_q$ for the deformable cross-attention are predicted from the object query embeddings, and bounding boxes are predicted as relative offsets to these reference points, which simplifies optimization (a minimal sketch of this parameterization follows the list).
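
A minimal sketch of the decoder-side box parameterization described in item 2, under the same PyTorch assumption (the head names `ref_point_head` and `bbox_head` are hypothetical, and the width/height handling follows the paper's description rather than the released code):

```python
import torch
import torch.nn as nn

def inverse_sigmoid(x, eps=1e-5):
    x = x.clamp(eps, 1 - eps)
    return torch.log(x / (1 - x))

d_model, n_queries = 256, 300
query_embed = torch.randn(n_queries, d_model)              # learned object query embeddings
ref_point_head = nn.Linear(d_model, 2)                     # predicts the reference point
bbox_head = nn.Linear(d_model, 4)                          # predicts (dx, dy, dw, dh)

reference_points = ref_point_head(query_embed).sigmoid()   # (n_queries, 2), in [0, 1]
decoder_output = torch.randn(n_queries, d_model)           # stand-in for decoder features
delta = bbox_head(decoder_output)
# The box center is a small correction to the reference point, applied in
# inverse-sigmoid space, so the regression target stays close to zero.
center = (delta[:, :2] + inverse_sigmoid(reference_points)).sigmoid()
size = delta[:, 2:].sigmoid()                               # normalized width / height
boxes = torch.cat([center, size], dim=-1)                   # (cx, cy, w, h), all in [0, 1]
```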

The paper also explores two variants to further improve performance:

  1. Iterative Bounding Box Refinement: Each decoder layer refines the bounding box predictions from the previous layer, using the previous box estimate to guide the reference point and sampling offsets for the current layer's cross-attention (sketched after this list).
  2. Two-Stage Deformable DETR: An encoder-only Deformable DETR first generates region proposals by treating each feature map pixel as an object query. The top-scoring proposals are then fed into the standard Deformable DETR decoder (with iterative refinement) as object queries for the second stage.
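
For the first variant, a rough sketch of how the layer-by-layer refinement could be wired (the per-layer `bbox_heads`, the shapes, and the random stand-in for the decoder layer output are assumptions for illustration):

```python
import torch
import torch.nn as nn

def inverse_sigmoid(x, eps=1e-5):
    x = x.clamp(eps, 1 - eps)
    return torch.log(x / (1 - x))

n_layers, n_queries, d_model = 6, 300, 256
bbox_heads = nn.ModuleList([nn.Linear(d_model, 4) for _ in range(n_layers)])
reference = torch.rand(n_queries, 4)   # initial (cx, cy, w, h) estimate in [0, 1]

for layer_idx in range(n_layers):
    # In the real model this would be the output of decoder layer `layer_idx`,
    # whose cross-attention samples around the current `reference` box.
    hs = torch.randn(n_queries, d_model)
    delta = bbox_heads[layer_idx](hs)
    # Refine: correct the previous estimate in inverse-sigmoid space.
    reference = (inverse_sigmoid(reference) + delta).sigmoid()
    # Block gradients between layers so each layer refines, rather than
    # back-propagates through, the previous layer's box estimate.
    reference = reference.detach()
```

Detaching the refined boxes between layers reflects the paper's description that gradients are blocked through the previous box estimates to stabilize training.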

Implementation and Results:

  • Experiments on the COCO dataset show Deformable DETR achieves better performance than DETR, particularly on small objects ($\text{AP}_S$ improves from 22.5 for DETR-DC5 to 26.4 for Deformable DETR).
  • Crucially, Deformable DETR converges significantly faster: trained for only 50 epochs, it surpasses DETR-DC5 trained for 500 epochs (43.8 AP vs. 43.3 AP).
  • The model is computationally more efficient than DETR-DC5 in terms of training time and inference speed (19 FPS vs 12 FPS for DETR-DC5 on a V100 GPU).
  • Ablation studies confirm the benefits of multi-scale features, multi-scale deformable attention (vs. single-scale attention or deformable convolution), and increasing the number of sampling points $K$.
  • The variants with iterative refinement and the two-stage approach further boost performance (reaching 46.2 AP with ResNet-50).
  • With larger backbones (ResNeXt-101 + DCN) and test-time augmentation, Deformable DETR achieves state-of-the-art results (52.3 AP on COCO test-dev).

In conclusion, Deformable DETR effectively addresses the convergence speed and small object detection limitations of DETR by introducing a sparse, dynamic, and multi-scale attention mechanism. This makes Transformer-based end-to-end object detection significantly more practical and efficient. The code is publicly available.

Authors (6)
  1. Xizhou Zhu (73 papers)
  2. Weijie Su (37 papers)
  3. Lewei Lu (55 papers)
  4. Bin Li (514 papers)
  5. Xiaogang Wang (230 papers)
  6. Jifeng Dai (131 papers)
Citations (4,282)