
DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection (2203.03605v4)

Published 7 Mar 2022 in cs.CV

Abstract: We present DINO (DETR with Improved deNoising anchOr boxes), a state-of-the-art end-to-end object detector. DINO improves over previous DETR-like models in performance and efficiency by using a contrastive way for denoising training, a mixed query selection method for anchor initialization, and a look forward twice scheme for box prediction. DINO achieves 49.4 AP in 12 epochs and 51.3 AP in 24 epochs on COCO with a ResNet-50 backbone and multi-scale features, yielding a significant improvement of +6.0 AP and +2.7 AP, respectively, compared to DN-DETR, the previous best DETR-like model. DINO scales well in both model size and data size. Without bells and whistles, after pre-training on the Objects365 dataset with a SwinL backbone, DINO obtains the best results on both COCO val2017 (63.2 AP) and test-dev (63.3 AP). Compared to other models on the leaderboard, DINO significantly reduces its model size and pre-training data size while achieving better results. Our code will be available at https://github.com/IDEACVR/DINO.

DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection

The paper "DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection" presents substantial advancements in object detection within the DETR (DEtection TRansformer) framework. It enhances DETR-like models with several innovative techniques aimed at improving both their efficiency and their detection performance.

Key Contributions

The paper introduces three major enhancements to the existing DETR architecture:

  1. Contrastive DeNoising Training: This method improves upon DN-DETR, which introduced denoising techniques to stabilize the training of DETR models. DINO's contrastive denoising training not only stabilizes training but also introduces positive and negative samples to improve the model's ability to reject confusing or uninformative anchors. Positive queries are lightly noised anchors close to ground-truth (GT) boxes, while negative queries are more heavily noised versions expected to predict "no object".
  2. Mixed Query Selection: DINO proposes a query selection mechanism in which positional queries (anchors) are initialized from top-scoring features in the transformer encoder's output. Unlike Deformable DETR, which initializes both positional and content queries from encoder features, DINO keeps the content queries as learnable parameters. This hybrid approach balances the spatial priors provided by the encoder with the content representations learned by the decoder.
  3. Look Forward Twice: This approach to iterative box refinement lets box predictions from later decoder layers also influence the parameters of earlier layers. By updating each box with the undetached output of the preceding layer, so that gradients from a layer's loss flow back through the previous layer's prediction, DINO improves bounding-box accuracy across the decoding layers.
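The contrastive denoising scheme in item 1 can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the helper name `make_cdn_queries`, the single shared noise scale per query, and the default values of `lambda1`/`lambda2` are assumptions for exposition (the paper uses two hyperparameters bounding the noise of positive and negative queries).

```python
import random

def make_cdn_queries(gt_box, lambda1=0.4, lambda2=0.8, seed=0):
    """Sketch of contrastive denoising query generation (hypothetical helper).

    gt_box: (cx, cy, w, h) in normalized coordinates.
    Positive queries get noise with scale below lambda1 and are trained to
    reconstruct the GT box; negative queries get noise with scale between
    lambda1 and lambda2 and are trained to predict "no object".
    """
    rng = random.Random(seed)
    cx, cy, w, h = gt_box

    def noised(scale_lo, scale_hi):
        # Sample one noise scale, then shift the center by up to s*w/2
        # (resp. s*h/2) and jitter the size, each with a random sign.
        s = rng.uniform(scale_lo, scale_hi)
        sign = lambda: rng.choice((-1.0, 1.0))
        return (cx + sign() * s * w / 2,
                cy + sign() * s * h / 2,
                w * (1 + sign() * s),
                h * (1 + sign() * s))

    positive = noised(0.0, lambda1)      # should reconstruct the GT box
    negative = noised(lambda1, lambda2)  # should predict "no object"
    return positive, negative
```

Because the negative query is only moderately noised, it sits near the GT box but is still labeled "no object", which is what makes it a hard negative.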
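The selection logic of item 2 can likewise be sketched. The function name `mixed_query_selection` and the plain-list stand-ins for tensors are hypothetical choices for illustration, not DINO's actual implementation:

```python
def mixed_query_selection(enc_scores, enc_boxes, content_embed, k=3):
    """Sketch of DINO's mixed query selection (hypothetical shapes).

    enc_scores:    per-token classification scores from the encoder output.
    enc_boxes:     per-token box proposals, one 4-vector per token.
    content_embed: k learnable content queries, independent of the image.

    Only the positional queries (anchors) come from the encoder's top-k
    proposals; the content queries stay learnable. Deformable DETR's "pure"
    query selection would instead initialize both from encoder features.
    """
    topk = sorted(range(len(enc_scores)),
                  key=lambda i: enc_scores[i], reverse=True)[:k]
    anchors = [enc_boxes[i] for i in topk]  # positional: from encoder
    return anchors, content_embed           # content: learnable, untouched
```

The key design point is the asymmetry: spatial priors are image-dependent and benefit from encoder proposals, while content queries are left free to specialize during training.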
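The update rule behind item 3 can be illustrated on a single scalar box coordinate. The helper names (`refine`, `inv_sigmoid`) and the `detach_prev` flag are illustrative assumptions; in a real autograd framework the flag would control a `.detach()` call, which is exactly where "look forward once" and "look forward twice" differ:

```python
import math

def inv_sigmoid(x, eps=1e-5):
    """Inverse sigmoid, clamped away from 0 and 1 for numerical safety."""
    x = min(max(x, eps), 1.0 - eps)
    return math.log(x / (1.0 - x))

def refine(b_prev, delta, detach_prev=True):
    """One step of iterative box refinement on a scalar coordinate (sketch).

    detach_prev=True mimics Deformable DETR's "look forward once": the
    incoming box is treated as a constant, so layer i's loss cannot improve
    layer i-1. DINO's "look forward twice" keeps the dependence. Here the
    flag only marks where a framework's .detach() would go; the forward
    value is identical either way -- the difference is purely in gradients.
    """
    b_in = b_prev  # with autograd: b_prev.detach() if detach_prev else b_prev
    return 1.0 / (1.0 + math.exp(-(inv_sigmoid(b_in) + delta)))
```

With `delta = 0` the box passes through unchanged; a positive offset moves the sigmoid-space coordinate forward, and in the look-forward-twice setting the loss on this refined box would also update the layer that produced `b_prev`.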

Experimental Results

The paper provides extensive empirical evidence to support the efficacy of these innovations. DINO shows remarkable performance improvements over previous models on the COCO 2017 dataset.

  • 12-Epoch Evaluation: When trained for 12 epochs, DINO achieves an AP (average precision) of 49.4 with a ResNet-50 backbone and multi-scale features, representing a significant improvement (+6.0 AP) over DN-DETR. The model remains computationally efficient with a modest increase in GFLOPs.
  • Extended Training: With an extended 24 epochs of training, DINO achieves an AP of 50.4 with the 4-scale model and 51.3 with the 5-scale model. This indicates a consistent performance gain over state-of-the-art object detectors, including Deformable DETR and DN-DETR.
  • Scalability and SOTA Performance: After pre-training on the Objects365 dataset and fine-tuning on COCO using the SwinL backbone, DINO achieves a leading AP of 63.2 on COCO val2017 and 63.3 on test-dev. This surpasses all previous models, including those with significantly larger model sizes and pre-training datasets, thereby highlighting DINO’s efficiency and scalability.

Implications and Future Directions

The contributions of DINO extend beyond immediate performance enhancements. By innovating within the framework of DETR-like models, the paper challenges the conventional reliance on hand-designed components like anchor generation and non-maximum suppression (NMS) prevalent in convolutional object detectors. Additionally, the demonstrated scalability of DINO from smaller-scale datasets to larger, more complex ones opens avenues for deploying such models in practical, large-scale applications.

Theoretical implications include validating the effectiveness of introducing hard negative samples in contrastive training for object detection tasks and refining the concept of query selection and anchor box refinement. These insights may spur further research into more intricate and computationally efficient ways to leverage Transformer architectures for a variety of computer vision tasks.

Future developments in this domain could explore even more adaptive query selection techniques, potentially incorporating dynamic adjustments during training for both positional and content queries. Also, further exploration into more sophisticated denoising techniques could enhance model resilience, especially in scenarios with noisy or incomplete data.

In conclusion, the paper "DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection" not only builds upon and significantly improves existing DETR frameworks but also establishes new best practices for training efficacy and model scalability in the field of object detection.

Authors (8)
  1. Hao Zhang (948 papers)
  2. Feng Li (286 papers)
  3. Shilong Liu (60 papers)
  4. Lei Zhang (1689 papers)
  5. Hang Su (224 papers)
  6. Jun Zhu (424 papers)
  7. Lionel M. Ni (20 papers)
  8. Heung-Yeung Shum (32 papers)
Citations (1,049)