DEIM: DETR with Improved Matching for Fast Convergence (2412.04234v1)

Published 5 Dec 2024 in cs.CV and cs.AI

Abstract: We introduce DEIM, an innovative and efficient training framework designed to accelerate convergence in real-time object detection with Transformer-based architectures (DETR). To mitigate the sparse supervision inherent in one-to-one (O2O) matching in DETR models, DEIM employs a Dense O2O matching strategy. This approach increases the number of positive samples per image by incorporating additional targets, using standard data augmentation techniques. While Dense O2O matching speeds up convergence, it also introduces numerous low-quality matches that could affect performance. To address this, we propose the Matchability-Aware Loss (MAL), a novel loss function that optimizes matches across various quality levels, enhancing the effectiveness of Dense O2O. Extensive experiments on the COCO dataset validate the efficacy of DEIM. When integrated with RT-DETR and D-FINE, it consistently boosts performance while reducing training time by 50%. Notably, paired with RT-DETRv2, DEIM achieves 53.2% AP in a single day of training on an NVIDIA 4090 GPU. Additionally, DEIM-trained real-time models outperform leading real-time object detectors, with DEIM-D-FINE-L and DEIM-D-FINE-X achieving 54.7% and 56.5% AP at 124 and 78 FPS on an NVIDIA T4 GPU, respectively, without the need for additional data. We believe DEIM sets a new baseline for advancements in real-time object detection. Our code and pre-trained models are available at https://github.com/ShihuaHuang95/DEIM.

Overview and Contributions

DEIM is a training framework for the DETR family that accelerates the convergence of Transformer-based object detectors. It makes two contributions: a Dense One-to-One (O2O) matching strategy, which mitigates the sparse supervision inherent in DETR's standard one-to-one matching, and the Matchability-Aware Loss (MAL), which handles the low-quality matches that denser matching introduces. Together they improve both training speed and detection accuracy, as validated on the COCO benchmark.

Dense O2O Matching Strategy

The first component of DEIM is the Dense O2O matching strategy. Standard DETR training assigns exactly one query to each ground-truth instance, so an image with few objects yields few positive samples. Dense O2O keeps this one-to-one assignment but increases the number of targets per image using standard data augmentation techniques, which in turn increases the number of positive samples and provides denser supervision during training. The drawback is that the additional matches include many low-quality ones, which could hurt performance. DEIM counters this with the Matchability-Aware Loss.
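
To make the effect on supervision concrete, here is a minimal sketch, assuming a 2x2 mosaic as the augmentation. The text above only says "standard data augmentation techniques", so the choice of mosaic, the helper name `mosaic_targets`, and the pixel box format `(x1, y1, x2, y2)` are all illustrative assumptions. The sketch merges the annotations of four equally sized images into one training sample; since one-to-one matching gives each target exactly one positive query, more targets per sample means more positives per sample.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def mosaic_targets(per_image_boxes: List[List[Box]], size: int) -> List[Box]:
    """Merge the boxes of four size x size images into one 2x2 mosaic.

    Each image is shrunk by half and placed in one quadrant, so the mosaic
    is again size x size. Hypothetical helper, for illustration only.
    """
    if len(per_image_boxes) != 4:
        raise ValueError("a 2x2 mosaic needs exactly four images")
    half = size / 2
    # Top-left, top-right, bottom-left, bottom-right quadrant offsets.
    offsets = [(0.0, 0.0), (half, 0.0), (0.0, half), (half, half)]
    merged: List[Box] = []
    for boxes, (dx, dy) in zip(per_image_boxes, offsets):
        for x1, y1, x2, y2 in boxes:
            merged.append(
                (x1 * 0.5 + dx, y1 * 0.5 + dy, x2 * 0.5 + dx, y2 * 0.5 + dy)
            )
    return merged


# Four toy images with 1, 2, 1 and 3 annotated objects.
images = [
    [(0, 0, 100, 100)],
    [(10, 10, 50, 50), (200, 200, 400, 400)],
    [(320, 0, 640, 640)],
    [(0, 0, 10, 10), (20, 20, 30, 30), (40, 40, 60, 60)],
]
targets = mosaic_targets(images, size=640)

# Under one-to-one matching the number of positives equals the number of targets.
positives_single = len(images[0])  # first image on its own
positives_mosaic = len(targets)    # the stitched sample
```

In this toy case the stitched sample carries seven targets where the first image alone carries one. Some of the shrunken boxes are very small, which hints at why denser targets also produce harder, lower-quality matches.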

Matchability-Aware Loss (MAL)

To address the variable match quality that Dense O2O produces, DEIM introduces the Matchability-Aware Loss (MAL). MAL weights each match by its quality, so that high-quality and low-quality matches are optimized differently rather than treated as equally reliable positives. This limits the damage that noisy matches can do to the training signal while still exploiting the denser positives that Dense O2O provides. According to the authors, the combination yields faster convergence and higher detection accuracy.
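
The summary above does not give the MAL formula, so the following is only a sketch of the general idea of a quality-weighted classification loss, not the authors' definition. It assumes that match quality is the IoU `q` between the matched prediction and its target, that the soft target is `q ** gamma`, and that `gamma = 1.5` is a reasonable illustrative value. The function name `quality_weighted_loss` is hypothetical. The exact form should be checked against the paper or the released code before use.

```python
import math


def quality_weighted_loss(
    p: float, q: float, is_positive: bool, gamma: float = 1.5
) -> float:
    """Binary cross-entropy whose soft target depends on match quality.

    p: predicted class probability for the query.
    q: IoU between the matched box and its target (ignored for negatives).
    Assumed form for illustration; not taken from the summary above.
    """
    eps = 1e-9
    p = min(max(p, eps), 1.0 - eps)  # keep log() finite
    if is_positive:
        t = q ** gamma  # soft target: high-IoU matches push p toward 1
        return -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    # Unmatched queries: confident false positives are penalised more.
    return -(p ** gamma) * math.log(1.0 - p)


# The same confident prediction costs more on a low-quality match
# than on a high-quality one.
low_quality = quality_weighted_loss(p=0.9, q=0.2, is_positive=True)
high_quality = quality_weighted_loss(p=0.9, q=0.9, is_positive=True)
```

With this form, a low-IoU match is not discarded; it simply receives a lower target confidence, so it still contributes a gradient without being pushed toward a score it has not earned.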

Experimental Results on COCO

Extensive experiments on the COCO dataset demonstrate the effectiveness of DEIM. Key performance metrics highlighted include:

Training Efficiency: When integrated with RT-DETR and D-FINE, DEIM reduces training time by 50% while also improving accuracy over those baselines.
Performance Enhancements: Paired with RT-DETRv2, DEIM achieves 53.2% AP in a single day of training on an NVIDIA 4090 GPU.
Real-Time Performance: DEIM-D-FINE-L and DEIM-D-FINE-X reach 54.7% and 56.5% AP at 124 FPS and 78 FPS, respectively, on an NVIDIA T4 GPU.

These results indicate that DEIM-trained models not only train faster but also outperform leading real-time detectors, without requiring additional data.
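
For readers who budget in latency rather than throughput, the reported FPS figures can be converted to per-image time. The sketch below assumes images are processed one at a time, so that latency is simply the reciprocal of FPS; the summary does not state the benchmarking protocol, and the helper name `latency_ms` is hypothetical.

```python
def latency_ms(fps: float) -> float:
    """Mean per-image latency in milliseconds implied by a throughput in FPS."""
    return 1000.0 / fps


# (AP in percent, FPS on an NVIDIA T4) as reported above.
reported = {
    "DEIM-D-FINE-L": (54.7, 124.0),
    "DEIM-D-FINE-X": (56.5, 78.0),
}

for name, (ap, fps) in reported.items():
    print(f"{name}: {ap:.1f}% AP, {latency_ms(fps):.2f} ms per image")
# DEIM-D-FINE-L: 54.7% AP, 8.06 ms per image
# DEIM-D-FINE-X: 56.5% AP, 12.82 ms per image
```

Under that assumption, the larger model buys 1.8 points of AP for roughly 4.8 ms of extra latency per image.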

Implementation Considerations

From an implementation standpoint, integrating DEIM with existing DETR-based systems involves a few key considerations:

Data Augmentation: Dense O2O relies on standard data augmentation techniques to add targets to each image. How aggressively the augmentation is applied affects the mix of high-quality and low-quality matches, so the augmentation pipeline should be validated together with the loss.
Loss Function Integration: The MAL changes how matches are weighted in the loss according to their quality. Any weighting hyperparameters should be checked on the target dataset rather than assumed to transfer.
Resource Requirements: More targets per image means more matched queries and loss terms per iteration, which may raise the per-iteration cost even though total training time falls. The reported single-day training result uses an NVIDIA 4090 GPU; the T4 figures refer to inference speed, not training.
Modular Integration: DEIM is applied on top of existing detectors such as RT-DETR and D-FINE, so integration mainly touches the data pipeline and the loss computation. In practice this means adding a step that evaluates match quality after one-to-one matching and passing that quality to the loss.
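
The match-quality step mentioned in the last item can be sketched as follows. This is a simplified illustration under several assumptions: quality is plain IoU, the one-to-one assignment maximises total IoU only (DETR-style matchers typically use a combined classification and box cost), and the assignment is found by brute force as a stand-in for the Hungarian algorithm. The names `iou`, `one_to_one_match`, and `build_training_targets` are hypothetical and do not come from the DEIM code base.

```python
from itertools import permutations
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def one_to_one_match(queries: List[Box], targets: List[Box]) -> List[Tuple[int, int]]:
    """Brute-force one-to-one assignment maximising total IoU.

    Only practical for tiny inputs. Returns (query_index, target_index)
    pairs, one per target; empty if there are fewer queries than targets.
    """
    best: Tuple[int, ...] = ()
    best_score = -1.0
    for perm in permutations(range(len(queries)), len(targets)):
        score = sum(iou(queries[qi], targets[ti]) for ti, qi in enumerate(perm))
        if score > best_score:
            best, best_score = perm, score
    return [(qi, ti) for ti, qi in enumerate(best)]


def build_training_targets(
    queries: List[Box], targets: List[Box]
) -> List[Tuple[bool, float]]:
    """Per-query (is_positive, quality) pairs for a quality-aware loss."""
    matched = dict(one_to_one_match(queries, targets))
    out = []
    for qi, box in enumerate(queries):
        if qi in matched:
            out.append((True, iou(box, targets[matched[qi]])))
        else:
            out.append((False, 0.0))
    return out


# Three predicted boxes, two ground-truth boxes.
pred_boxes = [(0, 0, 10, 10), (50, 50, 60, 60), (0, 0, 9, 10)]
gt_boxes = [(0, 0, 10, 10), (50, 50, 62, 60)]
matches = one_to_one_match(pred_boxes, gt_boxes)
labels = build_training_targets(pred_boxes, gt_boxes)
```

Here the first two queries become positives with qualities 1.0 and about 0.83, and the third query stays negative even though it overlaps the first target heavily, which is exactly the one-to-one constraint. A quality-aware loss would then consume the `(is_positive, quality)` pairs.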

Deployment and Scaling Considerations

In practical deployments, balancing training efficiency with inference speed is critical. The DEIM approach has demonstrated clear advantages for real-time detection systems by providing high AP scores at competitive FPS rates. When scaling to larger datasets or more complex detection scenarios, leveraging distributed training frameworks and optimizing data pipelines can further enhance the benefits provided by the dense matching strategy. Rigorous validation on in-house datasets prior to deployment is recommended to ensure that the matchability-aware approach generalizes well across varied operational conditions.

Overall, the DEIM framework represents a significant step forward in addressing convergence issues in Transformer-based object detectors through its Dense O2O matching and Matchability-Aware Loss. Its robust performance on the COCO dataset and compatibility with real-time detection systems make it a viable candidate for both research and industrial application in modern computer vision pipelines.

Authors (6)

Shihua Huang
Zhichao Lu
Xiaodong Cun
Yongjun Yu
Xiao Zhou
Xi Shen

GitHub

ShihuaHuang95/DEIM: an advanced training framework for SoTA real-time DETRs.
