DETRs with Collaborative Hybrid Assignments Training (2211.12860v6)

Published 22 Nov 2022 in cs.CV

Abstract: In this paper, we provide the observation that too few queries assigned as positive samples in DETR with one-to-one set matching leads to sparse supervision on the encoder's output, which considerably hurts the discriminative feature learning of the encoder, and vice versa for attention learning in the decoder. To alleviate this, we present a novel collaborative hybrid assignments training scheme, namely $\mathcal{C}$o-DETR, to learn more efficient and effective DETR-based detectors from versatile label assignment manners. This new training scheme can easily enhance the encoder's learning ability in end-to-end detectors by training multiple parallel auxiliary heads supervised by one-to-many label assignments such as ATSS and Faster R-CNN. In addition, we construct extra customized positive queries by extracting the positive coordinates from these auxiliary heads to improve the training efficiency of positive samples in the decoder. In inference, these auxiliary heads are discarded, and thus our method introduces no additional parameters or computational cost to the original detector while requiring no hand-crafted non-maximum suppression (NMS). We conduct extensive experiments to evaluate the effectiveness of the proposed approach on DETR variants, including DAB-DETR, Deformable-DETR, and DINO-Deformable-DETR. The state-of-the-art DINO-Deformable-DETR with Swin-L can be improved from 58.5% to 59.5% AP on COCO val. Surprisingly, incorporated with a ViT-L backbone, we achieve 66.0% AP on COCO test-dev and 67.9% AP on LVIS val, outperforming previous methods by clear margins with much smaller model sizes. Code is available at https://github.com/Sense-X/Co-DETR.

DETRs with Collaborative Hybrid Assignments Training: An Overview

The paper "DETRs with Collaborative Hybrid Assignments Training" by Zhuofan Zong, Guanglu Song, and Yu Liu presents a collaborative hybrid assignments training scheme, called $\mathcal{C}$o-DETR, for improving DETR-based detectors. The scheme addresses a fundamental drawback of one-to-one set matching in the DEtection TRansformer (DETR) framework: sparse supervision.

The authors observe that one-to-one set matching assigns only a few queries as positive samples, so the encoder's output receives sparse supervision, which impedes the learning of discriminative features. To mitigate this limitation, the paper introduces $\mathcal{C}$o-DETR, which collaboratively trains multiple parallel auxiliary heads using versatile one-to-many label assignment strategies, such as those found in ATSS and Faster R-CNN, to enhance the encoder's learning capabilities.
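To make the sparsity argument concrete, the following minimal sketch counts positive samples under the two assignment styles on a toy example. It is not the paper's implementation: the greedy matcher is a simplified stand-in for one-to-one set matching, the fixed IoU threshold of 0.5 is a simplified stand-in for one-to-many rules such as those in Faster R-CNN or ATSS, and all function names and boxes are hypothetical.

```python
from typing import List, Set, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def one_to_one_positives(candidates: List[Box], gts: List[Box]) -> Set[int]:
    """Assumed greedy stand-in for set matching: each ground truth claims one unused candidate."""
    used: Set[int] = set()
    for gt in gts:
        free = (i for i in range(len(candidates)) if i not in used)
        best = max(free, key=lambda i: iou(candidates[i], gt), default=None)
        if best is not None:
            used.add(best)
    return used


def one_to_many_positives(candidates: List[Box], gts: List[Box], thr: float = 0.5) -> Set[int]:
    """Assumed threshold rule: every candidate overlapping some ground truth enough is positive."""
    return {i for i, c in enumerate(candidates) if any(iou(c, g) >= thr for g in gts)}


# Toy data: two objects and seven candidate boxes (hypothetical values).
gts = [(10, 10, 50, 50), (60, 60, 100, 100)]
candidates = [
    (10, 10, 50, 50), (12, 12, 52, 52), (15, 10, 55, 50),  # near the first object
    (60, 60, 100, 100), (65, 62, 105, 102),                # near the second object
    (0, 0, 20, 20), (120, 120, 160, 160),                  # background
]

o2o = one_to_one_positives(candidates, gts)
o2m = one_to_many_positives(candidates, gts)
print(f"one-to-one positives:  {len(o2o)} of {len(candidates)}")
print(f"one-to-many positives: {len(o2m)} of {len(candidates)}")
```

In this toy setting the one-to-one rule yields exactly one positive per object, while the one-to-many rule marks every well-overlapping candidate as positive, which is the denser supervision signal the auxiliary heads are meant to provide.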

Methodology

The methodology underpinning $\mathcal{C}$o-DETR can be summarized as follows:

Collaborative Hybrid Assignments Training:
- Multiple auxiliary heads are integrated with the output from the transformer encoder.
- These auxiliary heads are supervised using one-to-many label assignments, enriching the supervision of the encoder's output and making it more discriminative.
- Because the encoder must also support the convergence of these auxiliary heads, its overall feature learning improves.

Customized Positive Queries Generation:
- Positive samples generated by the auxiliary heads are used to create customized positive queries for the decoder.
- These queries form multiple groups, with each query aligned to a specific ground-truth category and bounding box, which is intended to improve the training efficiency of the decoder.
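The query-generation step can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: the class and function names are hypothetical, the positive samples are hard-coded rather than produced by real auxiliary heads, and each query is reduced to a normalized reference box plus its pre-assigned target (the learned embeddings a real decoder query would carry are omitted).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class PositiveSample:
    box: Box       # positive coordinates selected by an auxiliary head
    gt_box: Box    # ground-truth box the sample was assigned to
    gt_label: int  # ground-truth category the sample was assigned to


@dataclass
class PositiveQuery:
    ref_box: Tuple[float, float, float, float]  # normalized (cx, cy, w, h)
    target_box: Box                             # supervision target, known in advance
    target_label: int


def to_cxcywh(box: Box, img_w: float, img_h: float) -> Tuple[float, float, float, float]:
    """Convert pixel corners to a normalized center/size box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2 / img_w, (y1 + y2) / 2 / img_h,
            (x2 - x1) / img_w, (y2 - y1) / img_h)


def build_query_groups(positives_per_head: Dict[str, List[PositiveSample]],
                       img_w: float, img_h: float) -> Dict[str, List[PositiveQuery]]:
    """One query group per auxiliary head; every query already carries its target."""
    return {
        head: [PositiveQuery(to_cxcywh(s.box, img_w, img_h), s.gt_box, s.gt_label)
               for s in samples]
        for head, samples in positives_per_head.items()
    }


# Hypothetical positives from two auxiliary heads on a 200x100 image.
positives = {
    "atss": [
        PositiveSample((20, 10, 60, 50), (18, 12, 62, 52), gt_label=3),
        PositiveSample((24, 14, 64, 54), (18, 12, 62, 52), gt_label=3),
    ],
    "faster_rcnn": [
        PositiveSample((120, 20, 180, 80), (118, 22, 178, 82), gt_label=7),
    ],
}
groups = build_query_groups(positives, img_w=200, img_h=100)
for head, queries in groups.items():
    print(head, len(queries), "positive queries")
```

Because each query in a group is created from a sample that already has an assigned ground truth, no extra matching step is needed for these groups in this sketch; every such query is a positive sample for the decoder.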

The collaborative hybrid assignments and the customized positive queries work synergistically to enhance both the encoder's feature learning and the cross-attention learning in the decoder, which is central to solving the inefficiencies arising from the sparse supervision in one-to-one set matching.
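One plausible way to write the combined training objective is as a weighted sum of three terms. The symbols below are assumptions introduced for illustration and are not taken from this overview: $\mathcal{L}_{\text{o2o}}$ is the original one-to-one matching loss of the decoder, $\mathcal{L}^{\text{dec}}_{i}$ is the decoder loss on the customized positive queries derived from the $i$-th auxiliary head, $\mathcal{L}^{\text{enc}}_{i}$ is the one-to-many loss of the $i$-th auxiliary head on the encoder output, $K$ is the number of auxiliary heads, and $\lambda_1, \lambda_2$ are balancing weights.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{o2o}}
  + \lambda_1 \sum_{i=1}^{K} \mathcal{L}^{\text{dec}}_{i}
  + \lambda_2 \sum_{i=1}^{K} \mathcal{L}^{\text{enc}}_{i}
```

Under this reading, the $\lambda_2$ term densifies supervision on the encoder, the $\lambda_1$ term adds positive samples for the decoder's cross-attention, and only the first term corresponds to what remains of the detector at inference time.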

Experimental Results

The efficacy of $\mathcal{C}$o-DETR is empirically validated on several DETR variants, including DAB-DETR, Deformable-DETR, and DINO-Deformable-DETR. Noteworthy improvements include:

DINO-Deformable-DETR with Swin-L: Enhanced from 58.5% to 59.5% AP on COCO val.
ViT-L Backbone: Achieved 66.0% AP on COCO test-dev and 67.9% AP on LVIS val, indicating substantial improvements over previous state-of-the-art methods with significantly fewer model parameters.

The improvement in training efficiency and effectiveness is illustrated through various empirical metrics, including faster convergence and higher average precision (AP) scores. For instance, Deformable-DETR showed an improvement from 37.1% to 42.9% AP within 12 epochs.
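As a quick check on the size of these gains, the short snippet below computes the absolute AP differences, in percentage points, for the before/after pairs quoted in this section. Only the numbers come from the text; the dictionary keys are descriptive labels chosen for illustration.

```python
# (baseline AP, AP with Co-DETR training), as quoted in this section
results = {
    "DINO-Deformable-DETR (Swin-L), COCO val": (58.5, 59.5),
    "Deformable-DETR, 12 epochs": (37.1, 42.9),
}

# Absolute gain in AP points, rounded to one decimal to absorb float noise.
gains = {name: round(after - before, 1) for name, (before, after) in results.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain:.1f} AP")
```

The weaker 12-epoch baseline gains 5.8 AP points while the already strong Swin-L model gains 1.0, so the absolute improvement is larger where the baseline has more room to improve.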

Theoretical and Practical Implications

The proposed $\mathcal{C}$o-DETR approach has several implications:

Encoder Efficiency: By increasing the number of positive samples used in training, $\mathcal{C}$o-DETR significantly alleviates the inefficient training caused by sparse supervision. This results in more robust feature representations that are critical for complex object detection tasks.
Decoder Optimization: Customized positive queries contribute to more effective cross-attention learning, which is essential for object detection models relying on transformer-based architectures. The enhanced discriminability of encoder features translates into more efficient and accurate object detection.
Scalability: The approach is scalable and demonstrates significant performance gains with larger backbone models, such as ViT-L and Swin-L, indicating its potential applicability in extensive, real-world object detection scenarios.

Future Directions

Looking forward, several areas could build upon the contributions of this paper:

Exploration of Different Label Assignments: Further exploration and optimization of different one-to-many label assignments could yield additional gains in efficiency and performance.
Applicability to Other Vision Tasks: Extending the collaborative hybrid assignment training to other computer vision tasks, such as instance segmentation or keypoint detection, could provide a comprehensive framework for end-to-end training of various detection models.
Real-Time Applications: Investigating real-time applications of $\mathcal{C}$o-DETR in tasks such as autonomous driving or real-time surveillance, where the balance between accuracy and computational efficiency is critical, is another open direction.

In summary, by introducing a novel training scheme that leverages collaborative hybrid assignments, the authors significantly improve the training efficiency and effectiveness of DETR-based detectors. The strong empirical results, combined with the theoretical advancements, mark $\mathcal{C}$o-DETR as a notable contribution to the field of object detection in computer vision.
