- The paper's main contribution is a three-stage semi-supervised deep attribute learning framework that fuses labeled attributes and person IDs for enhanced multi-camera person re-identification.
- It employs a novel attributes triplet loss during fine-tuning to ensure consistent feature representation across diverse camera views despite pose and illumination variations.
- Experimental evaluations on VIPeR, PRID, GRID, and Market-1501 show significant Rank-1 accuracy gains while reducing annotation cost and improving generalization across datasets.
An Analysis of Deep Attributes Driven Multi-Camera Person Re-identification
The paper proposes a semi-supervised approach to the Person Re-Identification (ReID) task that extracts robust mid-level human attributes with a Deep Convolutional Neural Network (dCNN). The authors establish a three-stage training protocol that strategically combines attribute-labeled data and person-ID-labeled data to produce what they term 'deep attributes': features that discriminate well across different camera settings despite pose variation, illumination changes, and other visual discrepancies.
Technical Approach
The proposed Semi-supervised Deep Attribute Learning (SSDAL) framework incorporates three primary stages:
- Initial dCNN Training: In the first stage, a fully supervised dCNN is trained on an independent dataset with labeled attributes. The architecture parallels AlexNet but replaces the softmax classifier with a sigmoid cross-entropy loss to handle multi-label classification over a set of human attributes (a minimal sketch follows this list).
- Fine-tuning with Attributes Triplet Loss: The second stage refines the dCNN on a dataset labeled only with person IDs, using a novel attributes triplet loss. This loss encourages the network to predict similar attributes for images of the same person and dissimilar attributes for different persons. By tying attribute predictions to person identity, it strengthens the correlation between a person's appearance and their attribute profile, sharpening the discriminative capacity of the dCNN (see the second sketch below).
- Final Fine-tuning: In the last stage, the attribute-labeled dataset is combined with the ID-labeled dataset, now annotated with attributes predicted by the fine-tuned network, to supervise a final round of training. This composite dataset exploits both ground-truth attribute labels and refined attribute predictions for better accuracy and generalization in person ReID (see the third sketch below).
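To make the first stage concrete, here is a minimal PyTorch sketch of a multi-label attribute classifier. It is illustrative rather than the authors' implementation: the attribute count `K`, the torchvision AlexNet backbone, and the `train_step` structure are assumptions, but the sigmoid cross-entropy objective (`BCEWithLogitsLoss`) mirrors the loss described in the paper.

```python
import torch
import torch.nn as nn
from torchvision import models

# Hypothetical stage-one setup: an AlexNet-style backbone whose final
# layer predicts K binary attributes. K is a placeholder; the real count
# depends on the auxiliary attribute-labeled dataset.
K = 64

backbone = models.alexnet(weights=None)
backbone.classifier[6] = nn.Linear(4096, K)  # swap the 1000-way ImageNet head

criterion = nn.BCEWithLogitsLoss()  # sigmoid cross-entropy over K labels

def train_step(images, attribute_labels, optimizer):
    """One supervised update on attribute-labeled images.
    images: (B, 3, 224, 224); attribute_labels: (B, K) with 0/1 entries."""
    optimizer.zero_grad()
    logits = backbone(images)  # (B, K) raw attribute scores
    loss = criterion(logits, attribute_labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```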
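The attributes triplet loss of the second stage can be sketched as follows. This is one plausible formulation, not the paper's exact one: the squared Euclidean distance and the margin of 0.5 are assumptions. The three logits tensors come from forward passes of the shared dCNN over an anchor image, a positive image (same person ID), and a negative image (different ID).

```python
import torch
import torch.nn.functional as F

def attributes_triplet_loss(anchor_logits, pos_logits, neg_logits, margin=0.5):
    """Triplet loss computed on sigmoid attribute predictions.
    Distance measure and margin value are illustrative assumptions."""
    a = torch.sigmoid(anchor_logits)   # (B, K) attribute scores for the anchor
    p = torch.sigmoid(pos_logits)      # same person ID as the anchor
    n = torch.sigmoid(neg_logits)      # different person ID
    d_ap = (a - p).pow(2).sum(dim=1)   # pull same-ID attribute vectors together
    d_an = (a - n).pow(2).sum(dim=1)   # push different-ID vectors apart
    return F.relu(d_ap - d_an + margin).mean()
```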
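Finally, the third stage can be approximated as self-labeling followed by joint training. The sketch below reuses `backbone` and `K` from the first sketch; the placeholder tensors and the 0.5 binarization threshold are assumptions made for illustration.

```python
import torch
from torch.utils.data import ConcatDataset, TensorDataset

@torch.no_grad()
def pseudo_label(model, images, threshold=0.5):
    # Use the stage-two network to predict binary attributes for images
    # that carry only person-ID labels (the threshold is an assumption).
    model.eval()
    return (torch.sigmoid(model(images)) > threshold).float()

# Placeholder data standing in for the two training sources.
labeled_images = torch.randn(8, 3, 224, 224)         # attribute-labeled set
labeled_attrs = torch.randint(0, 2, (8, K)).float()  # ground-truth attributes
id_only_images = torch.randn(8, 3, 224, 224)         # ID-labeled set

# Merge ground-truth and self-predicted attribute labels, then fine-tune
# again with the same sigmoid cross-entropy objective as in stage one.
combined = ConcatDataset([
    TensorDataset(labeled_images, labeled_attrs),
    TensorDataset(id_only_images, pseudo_label(backbone, id_only_images)),
])
```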
Experimental Results
The empirical evaluation spans four widely recognized datasets—VIPeR, PRID, GRID, and Market-1501. Noteworthy findings include:
- On the two-camera PRID dataset, the approach surpasses contemporary methods with a Rank-1 accuracy of 20.1%, a notable improvement over existing metric-learning and deep-learning baselines.
- On the multi-camera Market-1501 dataset, SSDAL outperforms other state-of-the-art methods, achieving Rank-1 accuracies of 40.1% and 48.2% in the single-query and multi-query settings, respectively.
- Across datasets, the framework improves matching accuracy by a considerable margin without requiring dataset-specific fine-tuning.
Implications and Future Directions
The SSDAL approach is a practical advance for ReID because it removes the need for extensive manually annotated attribute data on the target datasets. This reduces the annotation burden and highlights the potential of semi-supervised frameworks to generalize across heterogeneous environments.
Moreover, using mid-level deep attributes as the primary representation frees ReID systems from a heavy reliance on hand-crafted local features, simplifying the pipeline by avoiding complex feature-extraction processes.
For future research, modeling the spatial interdependencies of attributes could further improve feature quality. Integrating tracking algorithms to automatically generate labeled data could also yield an adaptive framework for real-time surveillance and security applications.
This paper underscores the promise of deep learning for attribute detection within ReID tasks, suggesting a trajectory for continued exploration and improvement in person-based visual recognition systems.