- The paper introduces the Dynamic Dual-Attentive Aggregation Learning (DDAG) framework, utilizing both intra-modality part attention and cross-modality graph attention integrated through a dynamic strategy.
- The DDAG framework achieved state-of-the-art results on the SYSU-MM01 dataset (54.75% rank-1, 53.02% mAP) and the RegDB dataset (69.34% rank-1, 63.46% mAP visible-to-infrared).
- This research significantly advances cross-modality Re-ID techniques and demonstrates the potential of sophisticated attention models for other multimodal and multitask learning problems.
An Overview of Dynamic Dual-Attentive Aggregation Learning for VI-ReID
Visible-infrared person re-identification (VI-ReID) remains challenging because it requires matching people across two very different sensing modalities: visible-light images and infrared images. This paper introduces Dynamic Dual-Attentive Aggregation Learning (DDAG), a framework that tackles this cross-modality gap with a dual-attention design that mines both intra-modality part-level context and cross-modality graph-level context.
Key Contributions
The paper delineates several notable contributions in the VI-ReID domain through the DDAG framework:
- Intra-modality Weighted-Part Attention (IWPA): This component improves the discriminative power of part-level feature representations within each modality by exploiting contextual relationships among body parts. IWPA applies a part-attention mechanism followed by a residual BatchNorm integration, producing part-aggregated features that are robust to background clutter and noisy body parts (a minimal sketch appears after this list).
- Cross-modality Graph Structured Attention (CGSA): Complementing IWPA, the CGSA component uses graph-based attention to reinforce feature learning by capturing cross-modality contextual relations. By treating samples as graph nodes and computing attentive weights among them, the model reduces the feature-level discrepancy between the visible and infrared modalities (see the second sketch below).
- Dynamic Dual Aggregation Learning: To integrate IWPA and CGSA in a single end-to-end framework, a dynamic dual aggregation strategy is adopted. It first emphasizes instance-level part aggregation to stabilize early training and then progressively introduces graph-level attention to refine the learned representations (see the scheduling sketch below).
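To make the IWPA idea concrete, the following minimal PyTorch sketch shows part-level self-attention followed by a residual BatchNorm step. The module name, layer choices, and the default number of parts are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartAttention(nn.Module):
    """Illustrative intra-modality weighted-part attention (assumed design)."""

    def __init__(self, dim: int, num_parts: int = 6):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        # BatchNorm over the flattened part-aggregated feature.
        self.bn = nn.BatchNorm1d(dim * num_parts)

    def forward(self, parts: torch.Tensor) -> torch.Tensor:
        # parts: (batch, num_parts, dim), e.g. from striped pooling of a backbone map
        q, k, v = self.query(parts), self.key(parts), self.value(parts)
        attn = F.softmax(q @ k.transpose(1, 2) / parts.size(-1) ** 0.5, dim=-1)
        refined = attn @ v                     # each part attends to all other parts
        out = (parts + refined).flatten(1)     # residual connection, then flatten parts
        return self.bn(out)                    # BatchNorm on the aggregated feature
```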
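Similarly, here is a minimal sketch in the spirit of CGSA, where every visible or infrared sample in a mini-batch becomes a graph node. The single-head GAT-style scoring and the same-identity edge mask are simplifying assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalityGraphAttention(nn.Module):
    """Illustrative graph attention over a mini-batch of both modalities."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)
        self.attn = nn.Linear(2 * dim, 1, bias=False)

    def forward(self, feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # feats: (n, dim) pooled features of visible and infrared samples in the batch
        h = self.proj(feats)
        n = h.size(0)
        # GAT-style pairwise logits from concatenated node features.
        pair = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                          h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        logits = F.leaky_relu(self.attn(pair)).squeeze(-1)   # (n, n)
        # Keep only edges between samples sharing an identity (an assumption here).
        mask = labels.unsqueeze(0) == labels.unsqueeze(1)
        logits = logits.masked_fill(~mask, float('-inf'))
        weights = F.softmax(logits, dim=-1)
        return weights @ h                                    # graph-attended features
```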
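Finally, the dynamic dual aggregation strategy can be pictured as an epoch-dependent weighting of the two objectives: early training leans on the instance-level part loss, and the graph-level loss is phased in gradually. The linear warm-up below is an assumed schedule for illustration, not the paper's exact update rule.

```python
def dual_aggregation_loss(part_loss, graph_loss, epoch, warmup_epochs=20):
    # Ramp the graph-level term from 0 to 1 over the warm-up period
    # (ramp shape and warm-up length are assumptions).
    ramp = min(1.0, epoch / warmup_epochs)
    return part_loss + ramp * graph_loss
```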
Numerical Evaluation
The proposed DDAG framework is validated through extensive experiments on two standard benchmarks. On the SYSU-MM01 dataset, DDAG achieves a rank-1 accuracy of 54.75% and a mean Average Precision (mAP) of 53.02%. On the RegDB dataset, it attains a rank-1 accuracy of 69.34% with a mAP of 63.46% in the visible-to-infrared setting. These results surpass state-of-the-art methods such as AGW, AlignGAN, and Xmodal, demonstrating the robustness and efficacy of the proposed approach.
Implications and Future Work
This research has significant implications for advancing cross-modality Re-ID technologies. The dual-attentive mechanism not only improves accuracy but could also benefit other multimodal learning tasks. Future work could extend these ideas to other domain adaptation problems in computer vision, such as video analysis or autonomous vehicle perception under varying lighting conditions.
Furthermore, the dynamic training strategy paves the way for improved multitask learning frameworks, especially in areas where the quality of one task's result influences another. The residual BatchNorm application in attention mechanisms might also encourage further exploration in noise-robust representation learning.
Conclusion
In conclusion, the introduction of a dynamic dual-attentive aggregation framework presents a compelling solution to the challenges posed by VI-ReID. By innovatively leveraging intra-modality part relationships and cross-modality graphs, the DDAG framework underscores the potential for sophisticated attention models to facilitate more accurate and robust person re-identification across disparate sensing modalities.