
Exploring Sequence Feature Alignment for Domain Adaptive Detection Transformers (2107.12636v4)

Published 27 Jul 2021 in cs.CV and cs.LG

Abstract: Detection transformers have recently shown promising object detection results and attracted increasing attention. However, how to develop effective domain adaptation techniques to improve its cross-domain performance remains unexplored and unclear. In this paper, we delve into this topic and empirically find that direct feature distribution alignment on the CNN backbone only brings limited improvements, as it does not guarantee domain-invariant sequence features in the transformer for prediction. To address this issue, we propose a novel Sequence Feature Alignment (SFA) method that is specially designed for the adaptation of detection transformers. Technically, SFA consists of a domain query-based feature alignment (DQFA) module and a token-wise feature alignment (TDA) module. In DQFA, a novel domain query is used to aggregate and align global context from the token sequence of both domains. DQFA reduces the domain discrepancy in global feature representations and object relations when deploying in the transformer encoder and decoder, respectively. Meanwhile, TDA aligns token features in the sequence from both domains, which reduces the domain gaps in local and instance-level feature representations in the transformer encoder and decoder, respectively. Besides, a novel bipartite matching consistency loss is proposed to enhance the feature discriminability for robust object detection. Experiments on three challenging benchmarks show that SFA outperforms state-of-the-art domain adaptive object detection methods. Code has been made available at: https://github.com/encounter1997/SFA.

Authors (7)
  1. Wen Wang (144 papers)
  2. Yang Cao (295 papers)
  3. Jing Zhang (731 papers)
  4. Fengxiang He (46 papers)
  5. Zheng-Jun Zha (144 papers)
  6. Yonggang Wen (84 papers)
  7. Dacheng Tao (829 papers)
Citations (83)

Summary

Exploring Sequence Feature Alignment for Domain Adaptive Detection Transformers

The paper "Exploring Sequence Feature Alignment for Domain Adaptive Detection Transformers" addresses the challenge of improving the cross-domain performance of detection transformers, a problem that had not previously been thoroughly explored. Traditional approaches to domain adaptive object detection (DAOD) have primarily adapted detectors such as Faster R-CNN, SSD, or FCOS through adversarial feature alignment applied to the convolutional neural network (CNN) backbone. These methods, however, do not guarantee domain-invariant features in the transformer layers of detection transformers such as DETR or Deformable DETR, which are essential for robust cross-domain predictions.
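The adversarial feature alignment that these prior methods rely on is typically built around a gradient-reversal layer (GRL), as in DANN-style training. The following is a minimal pure-NumPy sketch of that building block for illustration; it is not the paper's implementation, and the class name and `lambd` parameter are assumptions:

```python
# Hedged sketch of a gradient-reversal layer (GRL), the standard building
# block behind adversarial feature alignment. Forward pass is the identity;
# the backward pass negates (and scales) the gradient, so the feature
# extractor is trained to fool the domain discriminator.
import numpy as np

class GradientReversal:
    def __init__(self, lambd=1.0):
        # lambd scales the reversed gradient (an illustrative hyperparameter)
        self.lambd = lambd

    def forward(self, x):
        # Identity in the forward direction
        return x

    def backward(self, grad_output):
        # Reverse and scale the gradient flowing back to the features
        return -self.lambd * grad_output

grl = GradientReversal(lambd=0.5)
x = np.array([1.0, -2.0, 3.0])
assert np.allclose(grl.forward(x), x)                    # identity forward
assert np.allclose(grl.backward(x), [-0.5, 1.0, -1.5])   # reversed gradient
```

In a full training loop this layer would sit between the backbone (or, in SFA, the transformer's token sequence) and a domain discriminator.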

Core Contributions

The principal contribution of this research is the development of a Sequence Feature Alignment (SFA) method designed specifically for detection transformers, which aims to minimize domain discrepancies in both global and local feature representations. SFA comprises two complementary modules:

  1. Domain Query-Based Feature Alignment (DQFA): This module introduces a novel domain query to the transformer model, which aggregates and aligns global context features from both the source and target token sequences. DQFA operates on both encoder and decoder stages, reducing domain discrepancies in global-level features and inter-object relations.
  2. Token-Wise Feature Alignment (TDA): In contrast to DQFA, TDA focuses on the alignment of token features within sequences. It addresses domain gaps at the local and instance levels by aligning features across domains in the transformer's encoder and decoder, respectively.
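The core idea behind DQFA can be sketched in a few lines: a shared learnable "domain query" token attends over the token sequence of each domain, producing a single global-context vector per domain that can then be fed (through a gradient-reversal layer) to a domain discriminator. The names, shapes, and attention form below are illustrative assumptions, not the authors' code:

```python
# Illustrative sketch of the domain-query idea in DQFA: one extra query
# vector aggregates global context from a token sequence via attention.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def domain_query_context(tokens, domain_query):
    """tokens: (L, d) sequence features; domain_query: (d,).
    Returns a (d,) attention-weighted global context vector, standing in
    for what the extra domain-query slot would aggregate."""
    scores = tokens @ domain_query / np.sqrt(tokens.shape[1])  # (L,)
    attn = softmax(scores)                                     # weights over tokens
    return attn @ tokens                                       # (d,)

rng = np.random.default_rng(0)
src_tokens = rng.normal(size=(100, 256))   # source-domain token sequence
tgt_tokens = rng.normal(size=(100, 256))   # target-domain token sequence
dq = rng.normal(size=256)                  # shared learnable domain query

g_src = domain_query_context(src_tokens, dq)
g_tgt = domain_query_context(tgt_tokens, dq)
assert g_src.shape == (256,) and g_tgt.shape == (256,)
# In SFA, aligning such global vectors across domains (adversarially)
# reduces global-level domain discrepancy; TDA instead aligns the
# individual token features themselves.
```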

Moreover, a novel bipartite matching consistency loss is employed to enhance feature discriminability, which in turn supports robust object detection.
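For readers unfamiliar with the mechanism the consistency loss builds on: DETR-style detectors assign predictions to ground-truth objects via minimum-cost bipartite matching (normally the Hungarian algorithm). A tiny brute-force sketch, with made-up cost values for illustration:

```python
# Minimal bipartite-matching sketch: find the assignment of predictions to
# ground-truth boxes that minimizes total cost. Brute force over
# permutations is fine for tiny N; real detectors use the Hungarian
# algorithm (e.g. scipy.optimize.linear_sum_assignment).
from itertools import permutations

def bipartite_match(cost):
    """cost[i][j]: cost of assigning prediction i to ground truth j.
    Returns (best assignment, total cost) minimizing the summed cost."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i][p[i]] for i in range(n)))
    return best, sum(cost[i][best[i]] for i in range(n))

cost = [[0.1, 0.9, 0.8],   # prediction 0 clearly matches ground truth 0
        [0.7, 0.2, 0.9],   # prediction 1 matches ground truth 1
        [0.8, 0.9, 0.3]]   # prediction 2 matches ground truth 2
match, total = bipartite_match(cost)
assert match == (0, 1, 2)
assert abs(total - 0.6) < 1e-9
```

The consistency loss then encourages the detector's outputs to remain consistent under this matching while the adversarial alignment is applied, so aligning features does not erode their discriminability.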

Experimental Insights

The experimental results underscore the efficacy of SFA across three challenging benchmarks: weather adaptation (Cityscapes to Foggy Cityscapes), synthetic-to-real adaptation (Sim10k to Cityscapes), and scene adaptation (Cityscapes to BDD100k). Notably, SFA consistently outperforms existing state-of-the-art DAOD methods, confirming its ability to significantly improve the cross-domain performance of detection transformers. For instance, on the Cityscapes-to-Foggy-Cityscapes benchmark, SFA improves Deformable DETR's mAP by over 12.8% relative to the baseline model.

Theoretical Implications

Theoretical analysis within the paper suggests that the improvements brought by SFA can be attributed to how it addresses the principal components of domain adaptation error. By effectively minimizing feature domain divergence and maintaining feature discriminability across domains, SFA reduces target domain prediction error. Furthermore, the paper introduces a covering bound for the discriminator, demonstrating how a simple discriminator can enhance generalizability in adversarial training setups, subsequently improving domain adaptation performance.
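This style of analysis follows the classic domain adaptation bound of Ben-David et al.; a hedged statement of the standard form (not the paper's exact theorem, which refines such terms via its covering-number analysis of the discriminator) is:

```latex
% Standard-form target-error bound (Ben-David et al., 2010):
% target error <= source error + domain divergence + ideal joint error.
\epsilon_T(h) \;\le\; \epsilon_S(h)
  \;+\; \tfrac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T)
  \;+\; \lambda^{*},
```

where $\epsilon_S(h)$ and $\epsilon_T(h)$ are the source and target errors of hypothesis $h$, $d_{\mathcal{H}\Delta\mathcal{H}}$ measures the divergence between the source and target feature distributions, and $\lambda^{*}$ is the error of the ideal joint hypothesis. Minimizing the divergence term (via alignment) while keeping $\lambda^{*}$ small (via discriminability) is exactly the trade-off SFA's design targets.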

Future Directions

The paper opens several pathways for future research. One key area is further optimization of sequence alignment processes to bolster feature invariance across domains without compromising detection accuracy. Additionally, leveraging these alignment strategies in conjunction with other transformer-based models in different vision tasks could be a promising direction.

In summary, "Exploring Sequence Feature Alignment for Domain Adaptive Detection Transformers" contributes significantly to domain adaptation literature by pioneering methods tailored explicitly for detection transformers, thus laying a strong foundation for subsequent advancements in the field.