Mean Teacher DETR with Masked Feature Alignment: A Robust Domain Adaptive Detection Transformer Framework (2310.15646v5)
Abstract: Unsupervised domain adaptation object detection (UDAOD) research on the Detection Transformer (DETR) mainly focuses on feature alignment, and existing methods can be divided into two kinds, each with unresolved issues. One-stage feature alignment methods can easily lead to performance fluctuation and training stagnation. Two-stage feature alignment methods based on mean teacher comprise a pretraining stage followed by a self-training stage, which face problems in obtaining a reliable pretrained model and in achieving consistent performance gains, respectively. Neither kind has yet explored how to use a third related domain, such as a target-like domain, to assist adaptation. To address these issues, we propose a two-stage framework named MTM, i.e., Mean Teacher-DETR with Masked Feature Alignment. In the pretraining stage, we utilize labeled target-like images produced by image style transfer to avoid performance fluctuation. In the self-training stage, we leverage unlabeled target images through pseudo labels based on mean teacher and propose a module called Object Queries Knowledge Transfer (OQKT) to ensure consistent performance gains of the student model. Most importantly, we propose masked feature alignment methods, including Masked Domain Query-based Feature Alignment (MDQFA) and Masked Token-wise Feature Alignment (MTWFA), to alleviate domain shift in a more robust way; these not only prevent training stagnation and lead to a robust pretrained model in the pretraining stage, but also enhance the model's target performance in the self-training stage. Experiments on three challenging scenarios and a theoretical analysis verify the effectiveness of MTM.
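The self-training stage described in the abstract relies on the mean teacher scheme. As a rough illustration of that general scheme only (not the MTM implementation, whose OQKT, MDQFA, and MTWFA modules are not specified here), the sketch below shows the two ingredients mean teacher methods commonly share: an exponential moving average (EMA) update of teacher weights from student weights, and confidence-based filtering of teacher predictions into pseudo labels. The function names, the `decay` and `threshold` values, and the dictionary-of-floats weight format are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def ema_update(
    teacher: Dict[str, float],
    student: Dict[str, float],
    decay: float = 0.999,  # hypothetical value; the source does not state one
) -> Dict[str, float]:
    """Return teacher weights moved toward the student weights by an EMA step."""
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


def filter_pseudo_labels(
    predictions: List[Tuple[str, float]],
    threshold: float = 0.5,  # hypothetical confidence cutoff
) -> List[Tuple[str, float]]:
    """Keep only the (label, score) predictions whose score meets the threshold."""
    return [(label, score) for label, score in predictions if score >= threshold]


# Toy usage: a single scalar "weight" and two teacher predictions.
teacher = ema_update({"w": 1.0}, {"w": 0.0}, decay=0.9)
pseudo = filter_pseudo_labels([("car", 0.92), ("person", 0.31)], threshold=0.5)
```

In this toy run the teacher weight moves one tenth of the way toward the student, and only the high-confidence `car` prediction would be kept as a pseudo label for the student to train on.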
- Weixi Weng
- Chun Yuan