Gated Siamese Convolutional Neural Network Architecture for Human Re-Identification (1607.08378v2)

Published 28 Jul 2016 in cs.CV

Abstract: Matching pedestrians across multiple camera views, known as human re-identification, is a challenging research problem that has numerous applications in visual surveillance. With the resurgence of Convolutional Neural Networks (CNNs), several end-to-end deep Siamese CNN architectures have been proposed for human re-identification with the objective of projecting the images of similar pairs (i.e. same identity) to be closer to each other and those of dissimilar pairs to be distant from each other. However, current networks extract fixed representations for each image regardless of other images which are paired with it and the comparison with other images is done only at the final level. In this setting, the network is at risk of failing to extract finer local patterns that may be essential to distinguish positive pairs from hard negative pairs. In this paper, we propose a gating function to selectively emphasize such fine common local patterns by comparing the mid-level features across pairs of images. This produces flexible representations for the same image according to the images they are paired with. We conduct experiments on the CUHK03, Market-1501 and VIPeR datasets and demonstrate improved performance compared to a baseline Siamese CNN architecture.

Summary

The paper introduces a novel gated Siamese CNN framework that integrates feature extraction and adaptive matching via a differentiable Matching Gate.
It demonstrates performance improvements on benchmarks like Market-1501 with Rank-1 accuracy increasing from 62.32% to 65.88% in single-query settings.
The architecture enhances local feature discrimination by dynamically emphasizing pertinent features, essential for robust human re-identification in surveillance.

Gated Siamese Convolutional Neural Network Architecture for Human Re-Identification

The paper presents an architecture for human re-identification built on a Siamese Convolutional Neural Network (S-CNN) extended with a gating function. Human re-identification is a core task in visual surveillance: matching pedestrians across multiple, non-overlapping camera views. Variations in illumination, pose, and appearance between camera views make the task difficult.

Key Contributions

The paper makes three main contributions:

Baseline Siamese CNN Architecture: The authors propose a robust baseline S-CNN architecture for human re-identification, which outperforms many existing deep learning and hand-crafted feature methods. This architecture integrates feature extraction and metric learning into a unified framework, optimized using contrastive loss.
Matching Gate (MG) Mechanism: The proposed architecture improves upon traditional S-CNNs by addressing a critical limitation: the inability to adaptively emphasize pertinent, fine-grained local features during the matching process. The Matching Gate mechanism evaluates mid-level feature similarities across image pairs, selectively amplifying common local patterns.
End-to-End Learning Framework: The MG is designed as a differentiable component facilitating end-to-end training. By comparing horizontal stripe features, the network learns to enhance relevant local patterns, thus boosting the discriminative capabilities of the propagated features.
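The contrastive loss used to train the baseline can be sketched as follows. This is a minimal NumPy version of the standard contrastive loss formulation, not code from the paper; the function name, the `margin` value of 1.0, and the factor of 0.5 are assumptions made for illustration.

```python
import numpy as np

def contrastive_loss(f1, f2, same, margin=1.0):
    """Standard contrastive loss over a batch of embedding pairs.

    f1, f2 : arrays of shape (batch, dim), embeddings from the two branches.
    same   : array of shape (batch,), 1 for same identity, 0 otherwise.
    margin : hypothetical margin; the value used in the paper is not given here.
    """
    d = np.linalg.norm(f1 - f2, axis=-1)                    # Euclidean distance per pair
    pos = same * d ** 2                                     # pull positive pairs together
    neg = (1 - same) * np.maximum(0.0, margin - d) ** 2     # push negatives beyond the margin
    return 0.5 * float(np.mean(pos + neg))

f1 = np.zeros((2, 2))
f2 = np.array([[3.0, 4.0], [0.6, 0.0]])
same = np.array([1, 0])
loss = contrastive_loss(f1, f2, same)
```

Negative pairs that are already farther apart than the margin contribute nothing, so training effort concentrates on hard negatives.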

Architecture and Implementation

Baseline Siamese CNN

The baseline architecture consists of asymmetric convolutional filters aimed at preserving spatial information while progressively reducing the width of intermediate feature maps. This ensures that local features are effectively captured and propagated through the network.
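To illustrate how an asymmetric filter can shrink the width of a feature map while leaving the number of rows intact, here is a minimal NumPy sketch. The 1×3 kernel and the 8×6 input are assumptions chosen for illustration, not the paper's layer configuration, and `valid_correlate2d` is a hypothetical helper.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def valid_correlate2d(feature_map, kernel):
    """Valid (no padding) 2-D cross-correlation of a single-channel map."""
    windows = sliding_window_view(feature_map, kernel.shape)
    return (windows * kernel).sum(axis=(-2, -1))

feature_map = np.arange(48, dtype=float).reshape(8, 6)  # 8 rows, 6 columns
kernel = np.ones((1, 3))                                # assumed asymmetric 1x3 filter

once = valid_correlate2d(feature_map, kernel)   # width 6 -> 4, rows unchanged
twice = valid_correlate2d(once, kernel)         # width 4 -> 2, rows unchanged
print(feature_map.shape, once.shape, twice.shape)  # (8, 6) (8, 4) (8, 2)
```

Because the row count survives each layer, every row of the output still corresponds to a horizontal band of the input image.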

Matching Gate Integration

The Matching Gate is inserted between the convolutional layers, comparing and boosting mid-level features. It performs three major functions:

Feature Summarization: Aggregates features along horizontal stripes to address pose variations, facilitating effective pairwise comparisons.
Feature Similarity Computation: Computes the Euclidean distance between summarized features, normalized using a Gaussian activation function to represent similarity scores.
Feature Filtering and Boosting: Applies the similarity scores to amplify common local features, enhancing the discriminative power of the final embeddings.

These operations result in a dynamically adaptive network capable of focusing on critical local details pertinent for distinguishing between positive and hard-negative pairs.
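The three steps above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the summarization is a plain mean over the width (a learned summarization is not modeled), the Gaussian width `p` is a fixed hypothetical parameter, one scalar gate is computed per horizontal stripe, and boosting is modeled as adding the gated features back to the input.

```python
import numpy as np

def matching_gate(x1, x2, p=1.0):
    """Simplified matching gate for two feature maps of shape (rows, cols, channels).

    Returns the boosted feature maps and the per-stripe gate values.
    """
    # 1. Feature summarization: collapse each horizontal stripe (row) over its width.
    y1 = x1.mean(axis=1)                      # (rows, channels)
    y2 = x2.mean(axis=1)

    # 2. Feature similarity: Euclidean distance per stripe, passed through a
    #    Gaussian so that identical stripes give 1 and distant stripes approach 0.
    dist = np.linalg.norm(y1 - y2, axis=-1)   # (rows,)
    gate = np.exp(-(dist ** 2) / (p ** 2))    # (rows,), values in (0, 1]

    # 3. Filtering and boosting: scale each stripe by its gate value and add
    #    the result back to the input (assumed residual-style boosting).
    g = gate[:, None, None]
    return x1 + g * x1, x2 + g * x2, gate

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3, 2))
same1, same2, gate_same = matching_gate(x, x)
diff1, diff2, gate_diff = matching_gate(x, x + 5.0)
```

With identical inputs every gate value is 1 and the features are doubled; when the stripes differ strongly the gate approaches 0 and the features pass through almost unchanged, so only patterns common to both images are amplified.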

Experimental Results

The proposed architecture was evaluated on three benchmark datasets: Market-1501, CUHK03, and VIPeR.

Market-1501

For the Market-1501 dataset, the baseline S-CNN achieved a Rank-1 accuracy of 62.32% in the single-query (SQ) setting and 72.92% in the multi-query (MQ) setting. Integrating the Matching Gate improved these to 65.88% (SQ) and 76.04% (MQ), gains of 3.56 and 3.12 percentage points respectively. The mean Average Precision (mAP) also improved, indicating better retrieval across the full ranked list.
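For readers unfamiliar with the two metrics, the sketch below shows how Rank-1 and average precision are commonly computed for a single query; mAP is the mean of the average precision over all queries. The function and variable names are hypothetical, and the sketch omits benchmark-specific protocol details such as filtering gallery images by camera.

```python
import numpy as np

def rank1_and_ap(dist, gallery_ids, query_id):
    """Rank-1 hit and average precision for one query.

    dist        : distances from the query to each gallery image, shape (n,).
    gallery_ids : identity label of each gallery image, shape (n,).
    """
    order = np.argsort(dist)                        # closest gallery image first
    matches = gallery_ids[order] == query_id        # True where the identity is correct
    rank1 = bool(matches[0])                        # is the top result a correct match?
    if not matches.any():
        return rank1, 0.0
    hits = np.cumsum(matches)                       # correct matches seen so far
    ranks = np.flatnonzero(matches) + 1             # 1-based ranks of correct matches
    ap = float(np.mean(hits[matches] / ranks))      # mean precision at each correct match
    return rank1, ap

dist = np.array([0.9, 0.1, 0.5, 0.3])
gallery_ids = np.array([7, 3, 7, 5])
rank1, ap = rank1_and_ap(dist, gallery_ids, query_id=7)
```

In this toy example the two correct matches sit at ranks 3 and 4, so Rank-1 is a miss and the average precision is (1/3 + 2/4) / 2.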

CUHK03

On the CUHK03 dataset (detected setting), the baseline achieved a Rank-1 accuracy of 58.1%, whereas the incorporation of the Matching Gate boosted performance to 61.8%. Multi-query settings showed even higher gains, with the final architecture achieving a Rank-1 accuracy of 68.1%.

VIPeR

Despite the relatively small size of the VIPeR dataset, transfer learning from larger datasets combined with the proposed architecture showed promising results. The baseline S-CNN achieved a Rank-1 accuracy of 36.2%, improving to 37.8% with the Matching Gate, a gain of 1.6 percentage points.

Analysis and Implications

The Matching Gate improves the discriminative capability of the S-CNN on all three benchmarks reported above. Because the gate adjusts feature importance at run time according to the paired image, it addresses a limitation of Siamese networks that compute a fixed representation for each image, and it lets the network pick up the finer local patterns that separate positive pairs from hard negatives.

Practical Implications

In practical visual surveillance systems, enhanced retrieval accuracy has direct implications for security, enabling more reliable tracking and identification of persons of interest across large camera networks. The improvements in mAP indicate better retrieval performance, which is crucial for applications involving large-scale surveillance data.

Speculations on Future Developments

Future advancements could explore further improvements in gating mechanisms, potentially incorporating attention models to dynamically weigh feature importance more precisely. Additionally, cross-domain transfer learning could be investigated to enhance model generalization across varied surveillance environments.

In summary, the Gated Siamese Convolutional Neural Network improves on its baseline in Rank-1 accuracy on Market-1501, CUHK03, and VIPeR, with mAP gains also reported on Market-1501. Because the Matching Gate is differentiable and sits between convolutional layers, it offers a reusable building block for future research and for practical visual surveillance systems.
