
Detection Transformer with Stable Matching (2304.04742v1)

Published 10 Apr 2023 in cs.CV

Abstract: This paper is concerned with the matching stability problem across different decoder layers in DEtection TRansformers (DETR). We point out that the unstable matching in DETR is caused by a multi-optimization path problem, which is highlighted by the one-to-one matching design in DETR. To address this problem, we show that the most important design is to use and only use positional metrics (like IOU) to supervise classification scores of positive examples. Under the principle, we propose two simple yet effective modifications by integrating positional metrics to DETR's classification loss and matching cost, named position-supervised loss and position-modulated cost. We verify our methods on several DETR variants. Our methods show consistent improvements over baselines. By integrating our methods with DINO, we achieve 50.4 and 51.5 AP on the COCO detection benchmark using ResNet-50 backbones under 12 epochs and 24 epochs training settings, achieving a new record under the same setting. We achieve 63.8 AP on COCO detection test-dev with a Swin-Large backbone. Our code will be made available at https://github.com/IDEA-Research/Stable-DINO.

Detection Transformer with Stable Matching: A Comprehensive Overview

The paper, "Detection Transformer with Stable Matching," presents a significant contribution to the field of object detection, particularly focusing on the advancements of DEtection TRansformers (DETR). In this work, the authors investigate the stability issues associated with the matching process across different decoder layers within the DETR architecture, identifying the so-called "multi-optimization path problem" as the root cause. The crucial insight offered is the impact of unstable matching on the overall performance and convergence of DETR models.

Key Contributions and Methodological Innovations

The paper introduces two core modifications to the standard DETR setup in order to mitigate the unstable matching problem:

  1. Position-Supervised Loss: The authors replace the hard classification target for positive examples with a positional metric such as Intersection over Union (IoU). By supervising classification scores with localization quality, they enforce a coupling between localization accuracy and classification confidence, aligning the optimization paths across decoder layers so that well-localized predictions are consistently incentivized.
  2. Position-Modulated Cost: A complementary modification to the cost function used during Hungarian matching. It modulates classification scores with positional metrics, prioritizing predictions with superior localization quality rather than relying solely on classification probability.
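The two modifications above can be sketched in a few lines. The following is a minimal, illustrative NumPy sketch under stated assumptions, not the authors' implementation: the exact focal-style weighting, the transform applied to the IoU, and the hyperparameters `gamma` and `alpha` are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def position_supervised_loss(pos_logits, iou, gamma=2.0):
    """Classification loss for positive (matched) examples whose soft target
    is the positional metric (IoU) rather than a hard 1, so high-IoU
    predictions are pushed toward high classification scores."""
    p = sigmoid(pos_logits)
    t = iou  # soft target supplied by the positional metric
    bce = -(t * np.log(p + 1e-8) + (1.0 - t) * np.log(1.0 - p + 1e-8))
    weight = np.abs(t - p) ** gamma  # focal-style modulation toward the soft target
    return float(np.mean(weight * bce))

def position_modulated_cost(prob, iou_matrix, alpha=0.5):
    """Classification term of a Hungarian matching cost, modulated by a power
    of the IoU so well-localized predictions are preferred over ones with
    high raw classification probability but poor localization."""
    # prob: (num_preds,) classification probabilities
    # iou_matrix: (num_preds, num_gt) pairwise IoUs; lower cost = better match
    return -(prob[:, None] ** (1.0 - alpha)) * (iou_matrix ** alpha)
```

In this sketch, a prediction whose score tracks its IoU incurs a smaller loss than an overconfident but poorly localized one, which is precisely the coupling the paper argues stabilizes matching across decoder layers.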

Empirical Validation and Performance Metrics

The paper demonstrates the efficacy of these methods through comprehensive experiments on the COCO dataset, a standard large-scale benchmark for object detection. The empirical results reveal consistent performance improvements across several DETR variants. For instance, when integrated with the DINO model, the proposed methods achieve $50.4$ AP with a ResNet-50 backbone under a 12-epoch ($1\times$) training schedule and $51.5$ AP under 24 epochs, marking a notable advancement over previous benchmarks.

Moreover, the research also explores the potential of feature fusion methodologies. The work proposes a dense memory fusion technique, which integrates backbone and encoder features, contributing to faster convergence and enhanced utility of pre-trained encoder features.
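As a rough illustration of that idea (the fusion operator, token shapes, and projection below are assumptions for the sketch, not the paper's architecture), integrating flattened backbone features with encoder memory might look like:

```python
import numpy as np

def dense_memory_fusion(backbone_tokens, encoder_memory, proj):
    """Hypothetical dense fusion: concatenate backbone tokens with encoder
    memory along the channel axis, then project back to the model width so
    downstream decoder layers attend to a single fused memory."""
    # backbone_tokens, encoder_memory: (num_tokens, C); proj: (2C, C)
    fused = np.concatenate([backbone_tokens, encoder_memory], axis=-1)  # (num_tokens, 2C)
    return fused @ proj  # (num_tokens, C)
```

The intuition matching the paper's claim is that the decoder can draw directly on pre-trained backbone features instead of relying solely on the encoder output, which is argued to speed up convergence.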

Theoretical Implications and Future Directions

From a theoretical standpoint, the introduction of position-supervised loss emphasizes the importance of aligning task objectives directly with model optimization strategies. It also bridges some methodological gaps between DETR-like models and traditional object detectors, which typically use a multitude of positional cues for optimization. This alignment potentially ushers in a new direction for further research in unifying different paradigms of object detection.

As a prospective trajectory, future research could refine these loss functions for applications beyond 2D object detection, for instance 3D detection tasks or other areas of computer vision where precise localization plays a critical role. Additionally, the interpretability of such detection models could be enhanced through these stabilization strategies, broadening their applicability in real-world scenarios.

In conclusion, this paper contributes a structured, theoretically sound, and empirically validated approach to addressing a critical limitation of DETR-based frameworks. By introducing minor yet impactful changes in loss design and matching cost, the authors significantly enhance model stability, paving the way for more robust object detection systems. The work holds substantial promise for future innovations in leveraging transformer architectures for vision tasks.

Authors (11)
  1. Shilong Liu
  2. Tianhe Ren
  3. Jiayu Chen
  4. Zhaoyang Zeng
  5. Hao Zhang
  6. Feng Li
  7. Hongyang Li
  8. Jun Huang
  9. Hang Su
  10. Jun Zhu
  11. Lei Zhang