End-to-End Semi-Supervised Object Detection with Soft Teacher (2106.09018v3)

Published 16 Jun 2021 in cs.CV and cs.AI

Abstract: This paper presents an end-to-end semi-supervised object detection approach, in contrast to previous more complex multi-stage methods. The end-to-end training gradually improves pseudo label qualities during the curriculum, and the more and more accurate pseudo labels in turn benefit object detection training. We also propose two simple yet effective techniques within this framework: a soft teacher mechanism where the classification loss of each unlabeled bounding box is weighted by the classification score produced by the teacher network; a box jittering approach to select reliable pseudo boxes for the learning of box regression. On the COCO benchmark, the proposed approach outperforms previous methods by a large margin under various labeling ratios, i.e. 1%, 5% and 10%. Moreover, our approach proves to perform also well when the amount of labeled data is relatively large. For example, it can improve a 40.9 mAP baseline detector trained using the full COCO training set by +3.6 mAP, reaching 44.5 mAP, by leveraging the 123K unlabeled images of COCO. On the state-of-the-art Swin Transformer based object detector (58.9 mAP on test-dev), it can still significantly improve the detection accuracy by +1.5 mAP, reaching 60.4 mAP, and improve the instance segmentation accuracy by +1.2 mAP, reaching 52.4 mAP. Further incorporating with the Objects365 pre-trained model, the detection accuracy reaches 61.3 mAP and the instance segmentation accuracy reaches 53.0 mAP, pushing the new state-of-the-art.


Review of "End-to-End Semi-Supervised Object Detection with Soft Teacher"

In computer vision, semi-supervised learning has become an important way to exploit unlabeled data, particularly for object detection, where box-level annotations are expensive to collect. The paper "End-to-End Semi-Supervised Object Detection with Soft Teacher" introduces an end-to-end method that generates pseudo-labels and trains the detector within a single training process, and makes better use of the pseudo-labels derived from unlabeled images.

Key Contributions

The authors propose an end-to-end framework in place of the multi-stage pipelines common in semi-supervised object detection, which first train a detector on labeled data, then use it to pseudo-label unlabeled images, and finally retrain on the result. Here, pseudo-labeling and detector training happen in the same iteration: a teacher model generates pseudo-labels on unlabeled images while a student model is trained on labeled images together with those pseudo-labels, and the teacher is updated as an exponential moving average (EMA) of the student. This creates a feedback loop in which pseudo-labels drive detection training, and the improving detector in turn produces more accurate pseudo-labels.
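The teacher update in this loop can be sketched as follows. This is a minimal illustration, not the authors' code: model weights are represented by a hypothetical dict of float lists, and the momentum of 0.999 is an assumed default. Only the student would receive gradient updates; the teacher merely tracks it.

```python
from typing import Dict, List

# Hypothetical stand-in for model weights: parameter name -> flat list of floats.
Params = Dict[str, List[float]]


def ema_update(teacher: Params, student: Params, momentum: float = 0.999) -> None:
    """Update the teacher in place: teacher <- momentum * teacher + (1 - momentum) * student."""
    for name, student_vals in student.items():
        teacher_vals = teacher[name]
        for i, s in enumerate(student_vals):
            teacher_vals[i] = momentum * teacher_vals[i] + (1.0 - momentum) * s


if __name__ == "__main__":
    student: Params = {"w": [0.0]}
    teacher: Params = {"w": [0.0]}
    for _ in range(3):
        # Stand-in for one SGD step of the student on the labeled loss plus the
        # unsupervised loss computed against the teacher's pseudo-labels.
        student["w"][0] += 1.0
        ema_update(teacher, student, momentum=0.9)
    print(student["w"][0], round(teacher["w"][0], 4))  # prints 3.0 0.561
```

The teacher lags behind the student, which is the point: its pseudo-labels change smoothly instead of jumping with every noisy student update.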

Two original techniques are presented within this architecture:

Soft Teacher Mechanism: The teacher's classification scores are used to weight the classification loss of each bounding box on unlabeled images. In the paper, the weight applied to a box assigned to the background is the teacher's background score for that box, so negatives the teacher is unsure about contribute less. This is softer than previous approaches, which treat every thresholded pseudo-label as a hard, equally reliable target.
Box Jittering: A method for selecting reliable pseudo boxes for learning box regression. Each pseudo box is randomly jittered several times, the teacher's regression head refines the jittered copies, and the spread of the refined boxes serves as a reliability measure; only pseudo boxes with low variance are used as regression targets.
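A minimal sketch of the soft-teacher classification loss for one unlabeled image follows, assuming the form we read in the paper: candidates matched to a pseudo box (foreground) are averaged as usual, while background candidates are weighted by the teacher's background score r_j, normalized to sum to one over the background set. The function name and its inputs are hypothetical, and the exact normalization should be checked against the paper before relying on it.

```python
import math
from typing import Sequence


def soft_teacher_cls_loss(
    student_probs: Sequence[float],
    is_foreground: Sequence[bool],
    teacher_bg_scores: Sequence[float],
) -> float:
    """Soft-teacher classification loss on one unlabeled image (sketch).

    student_probs[i]: student's probability for box i's target
        (the pseudo-box class if foreground, "background" otherwise).
    is_foreground[i]: whether box i was matched to a teacher pseudo box.
    teacher_bg_scores[i]: teacher's background score r_i for box i;
        only used for background boxes.
    """
    eps = 1e-12
    # Per-box cross-entropy against the pseudo-label target.
    ce = [-math.log(max(p, eps)) for p in student_probs]

    fg = [l for l, f in zip(ce, is_foreground) if f]
    bg = [(l, r) for l, f, r in zip(ce, is_foreground, teacher_bg_scores) if not f]

    # Foreground boxes: plain average, as with hard pseudo-labels.
    fg_term = sum(fg) / len(fg) if fg else 0.0

    # Background boxes: w_j = r_j / sum_k r_k, so boxes the teacher is
    # confident are background contribute more to the loss.
    r_sum = sum(r for _, r in bg)
    bg_term = sum(r / r_sum * l for l, r in bg) if r_sum > 0 else 0.0

    return fg_term + bg_term
```

With uniform teacher scores, the background term reduces to an ordinary average, so the weighting only matters when the teacher is more confident about some negatives than others.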
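Box jittering can be sketched in the same spirit. Here `refine` is a hypothetical stand-in for the teacher's box-regression head, and the jitter scale, the number of jitters (10), the threshold (0.02), and the normalization by the mean of box width and height are assumed values rather than confirmed settings. The reliability score is the average, over the four coordinates, of the standard deviation of the refined jittered boxes.

```python
import random
import statistics
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def jitter_box(box: Box, scale: float, rng: random.Random) -> Box:
    """Randomly shift each coordinate by up to `scale` times the box width/height."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return (
        x1 + rng.uniform(-scale, scale) * w,
        y1 + rng.uniform(-scale, scale) * h,
        x2 + rng.uniform(-scale, scale) * w,
        y2 + rng.uniform(-scale, scale) * h,
    )


def regression_uncertainty(
    box: Box,
    refine: Callable[[Box], Box],
    num_jitter: int = 10,
    scale: float = 0.06,
    seed: int = 0,
) -> float:
    """Average normalized std of the teacher-refined coordinates of jittered copies of `box`."""
    rng = random.Random(seed)
    refined = [refine(jitter_box(box, scale, rng)) for _ in range(num_jitter)]
    x1, y1, x2, y2 = box
    # Assumed normalization: divide by the mean of box width and height.
    size = max(0.5 * ((x2 - x1) + (y2 - y1)), 1e-12)
    stds = [statistics.pstdev([r[k] for r in refined]) / size for k in range(4)]
    return sum(stds) / len(stds)


def select_reliable_boxes(
    boxes: Sequence[Box],
    refine: Callable[[Box], Box],
    threshold: float = 0.02,
    **kwargs,
) -> List[Box]:
    """Keep pseudo boxes whose refined positions are stable under jittering."""
    return [b for b in boxes if regression_uncertainty(b, refine, **kwargs) < threshold]


if __name__ == "__main__":
    def stable_teacher(b: Box) -> Box:
        return (10.0, 10.0, 50.0, 50.0)  # always snaps to the same box

    def echo_teacher(b: Box) -> Box:
        return b  # just returns its (jittered) input

    box = (12.0, 9.0, 48.0, 51.0)
    print(select_reliable_boxes([box], stable_teacher))  # prints [(12.0, 9.0, 48.0, 51.0)]
    print(regression_uncertainty(box, echo_teacher) > 0.0)  # prints True
```

A teacher that maps every jittered copy back to the same box scores zero uncertainty and keeps the pseudo box; a teacher that merely echoes its input inherits the jitter noise and is filtered out at a tight threshold.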

Strong Numerical Results

Empirically, the paper reports substantial improvements over prior methods on the COCO benchmark across various labeling ratios. Notably:

At labeling ratios of 1%, 5%, and 10%, the proposed method outperforms existing methods by a large margin.
Using the full COCO training set as labeled data and adding the 123K unlabeled COCO images improves a 40.9 mAP baseline by +3.6 mAP, to 44.5 mAP.
On a state-of-the-art Swin Transformer based detector, detection accuracy on COCO test-dev improves from 58.9 to 60.4 mAP (+1.5), and instance segmentation accuracy improves by +1.2 mAP to 52.4 mAP. With an Objects365 pre-trained model, the results reach 61.3 mAP for detection and 53.0 mAP for instance segmentation.

Theoretical and Practical Implications

Because the teacher is continually refreshed from the student, pseudo-label quality and detector quality improve together. This suggests that the single-stage teacher-student recipe used in semi-supervised classification can be carried over to detection, provided the detection-specific problems of unreliable background assignments and unreliable box regression targets are handled. Practically, the method reduces annotation cost: it yields large gains when only 1% to 10% of COCO is labeled, and it still improves strong detectors trained on the fully labeled set.

Future Directions

While the paper establishes a strong foundation for end-to-end semi-supervised object detection, several directions remain open:

Enhancing model robustness against erroneous pseudo-labels.
Benchmarking across more complex datasets beyond COCO to validate generalization capacity.
Exploring similar end-to-end architectures in real-time detection scenarios where latency plays a critical role.

In conclusion, "End-to-End Semi-Supervised Object Detection with Soft Teacher" is a significant contribution to computer vision: it shows that a simpler, single-stage training pipeline can advance the state of the art in semi-supervised object detection. The approach may also be adapted to other visual recognition tasks.


Authors (8)

Mengde Xu (8 papers)
Zheng Zhang (486 papers)
Han Hu (196 papers)
Jianfeng Wang (149 papers)
Lijuan Wang (133 papers)
Fangyun Wei (53 papers)
Xiang Bai (221 papers)
Zicheng Liu (153 papers)

Citations (446)
