Multi-scale Deep Learning Architectures for Person Re-identification (1709.05165v1)

Published 15 Sep 2017 in cs.CV

Abstract: Person Re-identification (re-id) aims to match people across non-overlapping camera views in a public space. It is a challenging problem because many people captured in surveillance videos wear similar clothes. Consequently, the differences in their appearance are often subtle and detectable only at the right location and scales. Existing re-id models, particularly the recently proposed deep learning based ones, match people at a single scale. In contrast, in this paper, a novel multi-scale deep learning model is proposed. Our model is able to learn deep discriminative feature representations at different scales and automatically determine the most suitable scales for matching. The importance of different spatial locations for extracting discriminative features is also learned explicitly. Experiments are carried out to demonstrate that the proposed model outperforms the state of the art on a number of benchmarks.

Citations (267)

Summary

  • The paper introduces MuDeep, a multi-scale deep model using a Siamese architecture with saliency-based fusion for improved person re-identification.
  • It employs multi-scale stream layers to extract both global and local features, enhancing the model's ability to differentiate similar appearances.
  • Experimental results on datasets like CUHK03 and CUHK01 demonstrate superior Rank-1 accuracy compared to prior deep re-id models.

Analysis of "Multi-scale Deep Learning Architectures for Person Re-identification"

The paper "Multi-scale Deep Learning Architectures for Person Re-identification" presents a novel approach to improve the performance of person re-identification systems. The authors propose a multi-scale deep learning model, referred to as MuDeep, which aims to address the challenges posed by existing single-scale models by incorporating multi-scale feature learning and saliency-based attention mechanisms.

Methodology

The proposed MuDeep architecture is structured around a Siamese network, tailored to learn discriminative features across multiple spatial scales. It introduces two key innovations: multi-scale stream layers and a saliency-based learning fusion layer.

  1. Multi-scale Stream Layers: These layers capture spatial features by analyzing person images at multiple scales in parallel. They enable the model to extract both global and local discriminative features, which is crucial for telling apart subjects who wear similar clothing.
  2. Saliency-based Learning Fusion Layer: This component selectively emphasizes informative channels, determined by the saliency of the extracted features. The mechanism automatically weights the different scales by their relevance, enhancing the discriminative power of the learned features (a minimal sketch of both components follows this list).
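
To make the two components concrete, the PyTorch sketch below shows one plausible realization: parallel convolution branches with different kernel sizes stand in for the multi-scale streams, and a learned, softmax-normalized per-channel weight across scales stands in for the saliency-based fusion. The branch count, kernel sizes, and weighting scheme are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MultiScaleStream(nn.Module):
    """Parallel conv branches with different receptive fields (sketch)."""
    def __init__(self, in_channels, branch_channels=64):
        super().__init__()
        # Each branch analyzes the input at a different spatial scale;
        # padding k // 2 keeps all outputs the same spatial size.
        self.branches = nn.ModuleList([
            nn.Conv2d(in_channels, branch_channels, kernel_size=k, padding=k // 2)
            for k in (1, 3, 5)
        ])
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # One feature map per scale.
        return [self.relu(branch(x)) for branch in self.branches]

class SaliencyFusion(nn.Module):
    """Learned per-scale channel weights that gate each branch (sketch)."""
    def __init__(self, num_scales=3, channels=64):
        super().__init__()
        # Hypothetical realization: a learnable weight per (scale, channel).
        self.scale_weights = nn.Parameter(torch.zeros(num_scales, channels))

    def forward(self, scale_maps):
        # Softmax across the scale dimension, then a weighted sum of maps.
        w = torch.softmax(self.scale_weights, dim=0)
        return sum(w[i].view(1, -1, 1, 1) * fmap
                   for i, fmap in enumerate(scale_maps))
```

In this reading, the fusion layer learns, per channel, which scale is most salient; a channel dominated by fine texture would receive most of its weight from the small-kernel branch, and vice versa.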

The architecture leverages these components to jointly optimize feature representation learning and distance metric learning, a formulation shared with many deep re-id models. MuDeep's layers are trained end to end by combining a verification loss with intermediate classification losses, which encourages robust feature learning at every scale.
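
As a rough illustration of such a combined objective, the sketch below pairs a contrastive verification term with auxiliary cross-entropy (identity) terms, one per intermediate head. The margin, the auxiliary weight, and the contrastive form of the verification loss are assumptions; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def combined_loss(emb_a, emb_b, same_id, aux_logits, labels,
                  margin=1.0, aux_weight=0.5):
    """Verification loss plus intermediate classification losses (sketch).

    emb_a, emb_b: embeddings of the two Siamese branches, shape (B, D)
    same_id:      1 where the pair shares an identity, else 0, shape (B,)
    aux_logits:   list of per-scale identity logits for image A, (B, C) each
    labels:       identity labels for image A, shape (B,)
    """
    # Verification: pull matched pairs together, push mismatched apart.
    dist = F.pairwise_distance(emb_a, emb_b)
    verif = torch.where(same_id.bool(),
                        dist.pow(2),
                        F.relu(margin - dist).pow(2)).mean()
    # Intermediate classification losses from the auxiliary heads.
    aux = sum(F.cross_entropy(logits, labels) for logits in aux_logits)
    return verif + aux_weight * aux
```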

Experimental Evaluation

The authors conduct comprehensive experiments on standard benchmarks like CUHK03, CUHK01, and VIPeR. The empirical results underscore the effectiveness of the MuDeep model, showing superior performance compared to both traditional and existing deep learning-based approaches. Specifically, MuDeep achieved notable gains in Rank-1 accuracy on datasets like CUHK03-Detected and CUHK01, demonstrating its capability to leverage scale and saliency for enhanced re-id performance.
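
For reference, Rank-1 accuracy is the fraction of queries whose single nearest gallery image has the correct identity. A minimal computation from a query-gallery distance matrix looks like this; real benchmark protocols additionally filter out gallery entries from the query's own camera:

```python
import numpy as np

def rank1_accuracy(dist_matrix, query_ids, gallery_ids):
    """Rank-1 from a (num_queries, num_gallery) distance matrix (sketch)."""
    nearest = dist_matrix.argmin(axis=1)  # closest gallery index per query
    return float(np.mean(gallery_ids[nearest] == query_ids))
```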

Implications and Future Directions

The introduction of multi-scale learning in MuDeep lays the groundwork for future research in person re-identification, emphasizing the model's potential in scenarios with visual ambiguity caused by similar attire. Moreover, the success of the saliency-based fusion layer suggests promising applications in other domains where disentangling relevant features from noise is critical.

Looking forward, combining multi-scale and multi-resolution methodologies might yield further advances. Video-based re-id, where temporal dynamics matter, could benefit from extending MuDeep with recurrent architectures, while unsupervised learning paradigms could improve scalability across varied surveillance environments.

In conclusion, this paper proposes a refined approach to person re-identification, bridging gaps in existing methodologies through novel architectural components. MuDeep represents a meaningful contribution to deep learning-based computer vision, with clear utility for improving the robustness of automated surveillance systems.