Unsupervised Pre-training for Detection Transformers
The paper "Unsupervised Pre-training for Detection Transformers" introduces a novel pretext task, termed Random Query Patch Detection, applied to unsupervisedly enhance the capabilities of the Detection Transformer (DETR). DETR, recognized for its application in object detection via a transformer encoder-decoder, typically demands extensive training data and time. The proposed method aims to mitigate these limitations by pre-training transformers using unsupervised learning, analogous to successful models in NLP like BERT and GPT.
Key Contributions
The central contribution of the paper is the Unsupervised Pre-training DETR (UP-DETR) model, which uses random query patch detection as its pre-training task. The method is designed around two key challenges in applying this pre-training to DETR:
- Multi-task Learning Optimization: The pre-training must balance classification and localization. The authors show that keeping the CNN backbone frozen during pre-training is crucial: it preserves the backbone's discriminative classification features while the transformer concentrates on learning spatial localization.
- Multi-query Localization: To detect several query patches simultaneously, the authors divide the object queries into groups, one per patch, and apply an attention mask in the decoder so that the localization of different patches does not interfere; a minimal sketch of both ideas follows this list.
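The sketch below illustrates the two ideas under simplifying assumptions: the backbone is a standard PyTorch module, and the object queries are divided evenly among the query patches. The helper names and the exact grouping scheme are illustrative, not the authors' code:

```python
import torch

def freeze_backbone(backbone):
    """Keep the pre-trained CNN's discriminative features intact during pre-training."""
    for p in backbone.parameters():
        p.requires_grad_(False)

def build_group_attention_mask(num_queries=100, num_patches=10):
    """Block decoder self-attention across query groups so that the
    localization of different patches does not interfere."""
    group_size = num_queries // num_patches
    group_id = (torch.arange(num_queries) // group_size).clamp(max=num_patches - 1)
    # Boolean convention of torch.nn.MultiheadAttention: True = not allowed to attend.
    return group_id.unsqueeze(0) != group_id.unsqueeze(1)  # (num_queries, num_queries)

mask = build_group_attention_mask()  # pass as attn_mask to the decoder's self-attention
```

Each group of object queries receives the feature of its assigned patch, and the mask keeps the groups independent while the decoder localizes all patches in parallel.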
The experiments show that UP-DETR converges faster than DETR and achieves higher average precision (AP) on downstream tasks including object detection, one-shot detection, and panoptic segmentation. The gains are most pronounced on datasets with relatively few training samples, such as PASCAL VOC, where UP-DETR narrows the gap to established detectors like Faster R-CNN.
Experimental Results
The empirical evaluations highlight UP-DETR's impact:
- PASCAL VOC: UP-DETR converges faster and reaches higher precision than DETR, with gains of up to 6.2 AP at 150 training epochs, bringing its results close to those of Faster R-CNN.
- COCO: On the larger COCO dataset, UP-DETR still surpasses DETR even under long training schedules, indicating that the unsupervised pre-training remains useful when abundant labeled data are available.
- One-shot Detection: UP-DETR improves markedly over DETR and achieves state-of-the-art one-shot detection results, benefiting directly from the patch-detection pre-training.
On panoptic segmentation, UP-DETR likewise outperforms the DETR baseline and compares favorably with other methods, illustrating the broad applicability of the pre-training.
Implications and Future Directions
The results point toward a more efficient paradigm for transformer-based detection: by reducing the reliance on large supervised datasets and long training schedules, UP-DETR makes DETR-style models more practical. It also opens a path toward combining localization-oriented pre-training with existing representation-learning methods such as contrastive learning, yielding more comprehensive pre-training frameworks.
The paper also suggests exploring a unified framework that pre-trains the CNN backbone and the transformer together, which could bring further gains on computer vision tasks such as few-shot detection and visual tracking.
In conclusion, "Unsupervised Pre-training for Detection Transformers" presents a compelling approach to overcoming inherent challenges in transformer-based object detection models, paving the way for more efficient and versatile applications in AI.