This paper introduces Deformable DETR (Zhu et al., 2020), an object detection model designed to address key limitations of the original DETR (DEtection TRansformer) model (Carion et al., 2020). The primary issues with DETR were its slow convergence during training (requiring hundreds of epochs) and its relatively poor performance on small objects, the latter due to the high computational and memory complexity of Transformer attention when applied to high-resolution feature maps.
The core innovation of Deformable DETR is the Deformable Attention Module. Unlike standard Transformer attention which computes attention weights between a query and all key elements (pixels) in a feature map, the deformable attention module only attends to a small, fixed number of key sampling points around a reference point for each query. The locations of these sampling points (offsets relative to the reference point) and their corresponding attention weights are learned dynamically based on the query feature itself via linear projections. This sparse sampling mechanism significantly reduces the computational complexity and memory footprint, especially for high-resolution features.
Mathematically, the single-scale Deformable Attention is defined as:
$\text{DeformAttn}(z_q, p_q, x) = \sum_{m=1}^{M} W_m \big[\sum_{k=1}^{K} A_{mqk} \cdot W'_m \, x(p_q + \Delta p_{mqk})\big]$
where $z_q$ is the query feature, $p_q$ is the 2D reference point, $x$ is the input feature map, $M$ is the number of attention heads, $K$ is the number of sampled keys per head, $\Delta p_{mqk}$ are the learned sampling offsets, $A_{mqk}$ are the learned attention weights (normalized so that $\sum_{k} A_{mqk} = 1$), and $W_m$, $W'_m$ are learnable weight matrices. Bilinear interpolation is used to handle fractional sampling coordinates.
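To make the sampling mechanism concrete, below is a minimal PyTorch-style sketch of this single-scale module. It is an illustration under simplifying assumptions, not the authors' optimized CUDA implementation: class and variable names are made up, `F.grid_sample` stands in for the bilinear interpolation, and the offset normalization (dividing by the map size) is a simplification of the official scheme.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformAttnSketch(nn.Module):
    """Minimal single-scale deformable attention (illustrative only)."""
    def __init__(self, d_model=256, n_heads=8, n_points=4):
        super().__init__()
        self.n_heads, self.n_points = n_heads, n_points
        self.head_dim = d_model // n_heads
        # Offsets and attention weights are predicted from the query by linear projections.
        self.sampling_offsets = nn.Linear(d_model, n_heads * n_points * 2)
        self.attention_weights = nn.Linear(d_model, n_heads * n_points)
        self.value_proj = nn.Linear(d_model, d_model)   # plays the role of W'_m across heads
        self.output_proj = nn.Linear(d_model, d_model)  # plays the role of W_m across heads

    def forward(self, query, ref_points, x):
        # query:      (N, Lq, C)   query features z_q
        # ref_points: (N, Lq, 2)   reference points p_q, normalized to [0, 1]
        # x:          (N, C, H, W) input feature map
        N, Lq, _ = query.shape
        H, W = x.shape[-2:]
        M, K, D = self.n_heads, self.n_points, self.head_dim

        # Project the values and split them into heads: (N*M, D, H, W).
        value = self.value_proj(x.flatten(2).transpose(1, 2))        # (N, H*W, C)
        value = value.transpose(1, 2).reshape(N * M, D, H, W)

        # Predict sampling offsets and attention weights from the query feature.
        offsets = self.sampling_offsets(query).view(N, Lq, M, K, 2)
        weights = self.attention_weights(query).view(N, Lq, M, K).softmax(-1)

        # Sampling locations: reference point plus offsets (offsets scaled by the map size here).
        scale = torch.tensor([W, H], dtype=query.dtype, device=query.device)
        loc = ref_points[:, :, None, None, :] + offsets / scale      # (N, Lq, M, K, 2)
        grid = (2 * loc - 1).permute(0, 2, 1, 3, 4).reshape(N * M, Lq, K, 2)

        # Bilinear interpolation at fractional locations, then a weighted sum over the K points.
        sampled = F.grid_sample(value, grid, mode='bilinear',
                                padding_mode='zeros', align_corners=False)   # (N*M, D, Lq, K)
        w = weights.permute(0, 2, 1, 3).reshape(N * M, 1, Lq, K)
        out = (sampled * w).sum(-1).view(N, M * D, Lq).transpose(1, 2)       # (N, Lq, C)
        return self.output_proj(out)
```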
This concept is extended to Multi-scale Deformable Attention, allowing the module to aggregate features from different levels of a feature pyramid simultaneously without requiring an explicit structure such as FPN. It samples $K$ points per feature level $l$, and the attention weights are normalized jointly across all $LK$ sampled points.
$\text{MSDeformAttn}(z_q, \hat{p}_q, \{x^l\}_{l=1}^{L}) = \sum_{m=1}^{M} W_m \big[\sum_{l=1}^{L} \sum_{k=1}^{K} A_{mlqk} \cdot W'_m \, x^l(\phi_{l}(\hat{p}_q) + \Delta p_{mlqk})\big]$
Here, $\hat{p}_q \in [0, 1]^2$ is the normalized reference coordinate, $x^l$ is the feature map at level $l$, and $\phi_l$ rescales the normalized coordinates to level $l$'s coordinate system.
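The multi-scale sampling can be sketched in the same spirit. The helper below is an illustration with assumed tensor shapes (not the official API): the attention logits are softmax-normalized jointly over all $L \cdot K$ samples, and because `grid_sample` operates in normalized coordinates, the shared normalized reference point effectively plays the role of $\phi_l$, with only the offsets rescaled per level.

```python
import torch
import torch.nn.functional as F

def ms_deform_sample(value_list, ref_points, offsets, weights):
    """Multi-scale deformable sampling (illustrative; shapes are assumptions, not the official API).

    value_list: list of L tensors (N*M, D, H_l, W_l) -- per-level values, already split into M heads
    ref_points: (N, Lq, 2)          -- normalized reference points (hat p_q) in [0, 1]
    offsets:    (N, Lq, M, L, K, 2) -- sampling offsets predicted from the query
    weights:    (N, Lq, M, L, K)    -- attention logits predicted from the query
    returns:    (N*M, D, Lq)        -- aggregated features per head
    """
    N, Lq, M, L, K, _ = offsets.shape
    # Attention weights are normalized jointly over all L*K sampled points.
    weights = weights.view(N, Lq, M, L * K).softmax(-1).view(N, Lq, M, L, K)

    out = 0
    for l, value in enumerate(value_list):
        H_l, W_l = value.shape[-2:]
        scale = torch.tensor([W_l, H_l], dtype=offsets.dtype, device=offsets.device)
        # grid_sample works in normalized coordinates, so the shared normalized reference
        # point stands in for phi_l; only the offsets need rescaling per level.
        loc = ref_points[:, :, None, None, :] + offsets[:, :, :, l] / scale
        grid = (2 * loc - 1).permute(0, 2, 1, 3, 4).reshape(N * M, Lq, K, 2)
        sampled = F.grid_sample(value, grid, mode='bilinear', align_corners=False)  # (N*M, D, Lq, K)
        w = weights[:, :, :, l].permute(0, 2, 1, 3).reshape(N * M, 1, Lq, K)
        out = out + (sampled * w).sum(-1)
    return out
```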
In the Deformable DETR architecture:
- Encoder: The standard Transformer self-attention modules are replaced with multi-scale deformable attention modules. Multi-scale feature maps (the $C_3$ to $C_5$ outputs of the ResNet backbone, plus an additional map obtained by downsampling $C_5$ with a stride-2 convolution) are fed directly into the encoder. The module's ability to sample across scales handles multi-scale feature fusion, so no FPN is needed. Positional embeddings are supplemented with learnable scale-level embeddings to distinguish the feature levels. The complexity becomes linear in the spatial size $HW$, rather than quadratic as in standard self-attention.
- Decoder: The cross-attention modules (where object queries attend to image features) are replaced with multi-scale deformable attention. The self-attention modules (where object queries attend to each other) remain standard Transformer attention. The reference points for the deformable cross-attention are predicted from the object query embeddings. Bounding boxes are predicted as relative offsets to these reference points, simplifying optimization.
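To make the decoder-side parameterization concrete, here is a minimal sketch of this scheme, assuming single linear layers where the paper uses a small MLP detection head; module and variable names are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn

def inverse_sigmoid(x, eps=1e-5):
    x = x.clamp(eps, 1 - eps)
    return torch.log(x / (1 - x))

class DecoderBoxHeadSketch(nn.Module):
    """Illustrative: reference points from object queries, boxes as relative offsets."""
    def __init__(self, d_model=256):
        super().__init__()
        self.ref_point_head = nn.Linear(d_model, 2)  # predicts the (x, y) reference point
        self.bbox_head = nn.Linear(d_model, 4)       # predicts offsets (the paper uses a small MLP)

    def forward(self, query_embed, decoder_feat):
        # query_embed:  (N, Nq, C) learned object query embeddings
        # decoder_feat: (N, Nq, C) decoder output features
        ref = self.ref_point_head(query_embed).sigmoid()   # reference points, normalized to [0, 1]
        delta = self.bbox_head(decoder_feat)
        # The box center is regressed relative to the reference point in inverse-sigmoid space,
        # so the regression targets stay small, which eases optimization.
        cx_cy = (delta[..., :2] + inverse_sigmoid(ref)).sigmoid()
        w_h = delta[..., 2:].sigmoid()
        return torch.cat([cx_cy, w_h], dim=-1)             # normalized (cx, cy, w, h) boxes
```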
The paper also explores two variants to further improve performance:
- Iterative Bounding Box Refinement: Each decoder layer refines the bounding box predictions of the previous layer, using the previous box estimate to set the reference point and sampling offsets for the current layer's cross-attention (see the update rule after this list).
- Two-Stage Deformable DETR: An encoder-only Deformable DETR first generates region proposals by treating each feature map pixel as an object query. The top-scoring proposals are then fed into the standard Deformable DETR decoder (with iterative refinement) as object queries for the second stage.
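Written compactly (a restatement in the notation of the formulas above, with $d$ indexing decoder layers), the refinement step takes the form
$b_q^{d} = \sigma\big(\Delta b_q^{d} + \sigma^{-1}(b_q^{d-1})\big)$
where $b_q^{d-1}$ is the normalized box predicted by layer $d-1$, $\Delta b_q^{d}$ is the correction predicted by layer $d$, and $\sigma$, $\sigma^{-1}$ are the sigmoid and inverse-sigmoid functions; gradients are blocked through $b_q^{d-1}$ to stabilize training.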
Implementation and Results:
- Experiments on the COCO dataset show that Deformable DETR outperforms DETR, particularly on small objects (small-object AP, AP_S, improves from 22.5 for DETR-DC5 to 26.4 for Deformable DETR).
- Crucially, Deformable DETR converges significantly faster: with only 50 training epochs it surpasses DETR-DC5 trained for 500 epochs (43.8 AP vs. 43.3 AP).
- The model is computationally more efficient than DETR-DC5 in terms of training time and inference speed (19 FPS vs 12 FPS for DETR-DC5 on a V100 GPU).
- Ablation studies confirm the benefits of multi-scale features, multi-scale attention (vs. single-scale attention or deformable convolution), and increasing the number of sampling points $K$.
- The variants with iterative refinement and the two-stage approach further boost performance (reaching 46.2 AP with ResNet-50).
- With larger backbones (ResNeXt-101 + DCN) and test-time augmentation, Deformable DETR achieves state-of-the-art results (52.3 AP on COCO test-dev).
In conclusion, Deformable DETR effectively addresses the convergence speed and small object detection limitations of DETR by introducing a sparse, dynamic, and multi-scale attention mechanism. This makes Transformer-based end-to-end object detection significantly more practical and efficient. The code is publicly available.