Deep Feature Learning with Relative Distance Comparison for Person Re-identification (1512.03622v1)

Published 11 Dec 2015 in cs.CV

Abstract: Identifying the same individual across different scenes is an important yet difficult task in intelligent video surveillance. Its main difficulty lies in how to preserve similarity of the same person against large appearance and structure variation while discriminating different individuals. In this paper, we present a scalable distance driven feature learning framework based on the deep neural network for person re-identification, and demonstrate its effectiveness in handling the existing challenges. Specifically, given the training images with the class labels (person IDs), we first produce a large number of triplet units, each of which contains three images, i.e., one person with a matched reference and a mismatched reference. Treating the units as the input, we build the convolutional neural network to generate the layered representations, followed by the $L_2$ distance metric. By means of parameter optimization, our framework tends to maximize the relative distance between the matched pair and the mismatched pair for each triplet unit. Moreover, a nontrivial issue arising with the framework is that the triplet organization cubically enlarges the number of training triplets, as one image can be involved in several triplet units. To overcome this problem, we develop an effective triplet generation scheme and an optimized gradient descent algorithm, so that the computational load depends mainly on the number of original images instead of the number of triplets. On several challenging databases, our approach achieves very promising results and outperforms other state-of-the-art approaches.

Authors (4)

Shengyong Ding
Liang Lin
Guangrun Wang
Hongyang Chao

Citations (662)

Summary

The paper introduces a CNN-based deep feature learning framework that uses triplet loss to enhance person re-identification.
An efficient triplet generation scheme, paired with an optimized gradient descent algorithm, makes the computational cost depend mainly on the number of distinct images rather than the number of triplets.
Experiments on the iLIDS and VIPeR datasets yielded rank-1 accuracies of 52.1% and 40.5%, respectively, suggesting robustness to appearance variations.

Deep Feature Learning with Relative Distance Comparison for Person Re-identification

This paper presents a deep learning framework aimed at addressing the person re-identification (re-ID) problem. The task involves identifying a person across different camera views, which is challenging due to varying illumination, viewpoints, and poses.

Methodology

The framework uses a convolutional neural network (CNN) to learn a feature representation in which a matched image pair lies closer together than a mismatched pair. The network is trained on triplet units, each consisting of a query image, a matched reference (the same person), and a mismatched reference (a different person). The objective is to make the $L_2$ distance between the query and the matched reference smaller than the $L_2$ distance between the query and the mismatched reference.
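The relative distance objective can be illustrated with a small NumPy sketch. It is not the paper's exact loss: the hinge form, the `margin` parameter, and the function name `triplet_hinge_loss` are assumptions chosen for illustration, and the feature vectors stand in for the CNN outputs.

```python
import numpy as np

def triplet_hinge_loss(query, matched, mismatched, margin=1.0):
    """Per-triplet hinge loss on squared L2 distances (illustrative form).

    Each argument is an array of shape (num_triplets, feature_dim).
    The loss is zero once the matched pair is closer than the
    mismatched pair by at least `margin` (an assumed hyperparameter).
    """
    d_pos = np.sum((query - matched) ** 2, axis=1)     # query vs. same person
    d_neg = np.sum((query - mismatched) ** 2, axis=1)  # query vs. other person
    return np.maximum(d_pos - d_neg + margin, 0.0)
```

A triplet whose matched reference is already much closer than its mismatched reference contributes zero loss, so only triplets that violate the ordering drive the parameter updates.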

Because one image can appear in many triplet units, the number of triplets grows cubically with the number of training images. To keep training tractable, the authors propose an efficient triplet generation scheme together with an optimized gradient descent algorithm that operates on the distinct images within the triplets. As a result, the computational load depends mainly on the number of input images rather than the number of triplets.
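One way to see why the cost can follow the number of images is to accumulate, for each distinct image, the loss gradient contributed by every triplet it appears in, so that a network would need only one backward pass per image. The sketch below shows this accumulation at the feature level only. The hinge loss with `margin`, the function name `feature_gradients`, and the index-array layout of `triplets` are assumptions for illustration, not the paper's algorithm as published.

```python
import numpy as np

def feature_gradients(features, triplets, margin=1.0):
    """Accumulate d(loss)/d(feature) per distinct image.

    features: float array, shape (num_images, feature_dim)
    triplets: int array, shape (num_triplets, 3) holding image indices
              in the order (query, matched, mismatched)
    The loss per triplet is max(d_pos - d_neg + margin, 0) on squared
    L2 distances (an assumed form).
    """
    q = features[triplets[:, 0]]
    p = features[triplets[:, 1]]
    n = features[triplets[:, 2]]
    d_pos = np.sum((q - p) ** 2, axis=1)
    d_neg = np.sum((q - n) ** 2, axis=1)
    # Only triplets that violate the margin contribute a gradient.
    active = (d_pos - d_neg + margin > 0)[:, None].astype(features.dtype)

    grads = np.zeros_like(features)
    # np.add.at sums contributions correctly when an index repeats.
    np.add.at(grads, triplets[:, 0], active * 2.0 * (n - p))
    np.add.at(grads, triplets[:, 1], active * 2.0 * (p - q))
    np.add.at(grads, triplets[:, 2], active * 2.0 * (q - n))
    return grads
```

The returned array has one row per image regardless of how many triplets were formed, which is the property the summary describes: the expensive per-image work scales with the image count, while the per-triplet work is limited to cheap distance arithmetic.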

Results

Experiments were conducted on the iLIDS and VIPeR datasets. Measured by the cumulative match characteristic (CMC) curve, the proposed method outperformed the state-of-the-art approaches it was compared against. On iLIDS, it achieved a rank-1 accuracy of 52.1%, ahead of various traditional methods. On VIPeR, it attained a rank-1 accuracy of 40.5%, which suggests robustness to significant variations in appearance.
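For readers unfamiliar with the metric, the following sketch shows how rank-k accuracies on a CMC curve are typically computed from a query-to-gallery distance matrix. It assumes a single-shot setting in which matches are identified by person ID; the function name `cmc_curve` and its parameters are illustrative and do not come from the paper.

```python
import numpy as np

def cmc_curve(dist, query_ids, gallery_ids, max_rank=5):
    """Fraction of queries whose true match appears within the top k.

    dist:        array (num_queries, num_gallery) of distances
    query_ids:   array (num_queries,) of person IDs
    gallery_ids: array (num_gallery,) of person IDs
    Returns an array of length max_rank; entry 0 is rank-1 accuracy.
    """
    order = np.argsort(dist, axis=1)                    # nearest first
    matches = gallery_ids[order] == query_ids[:, None]  # True where IDs agree
    has_match = matches.any(axis=1)
    first_hit = np.argmax(matches, axis=1)              # position of first match
    ranks = np.arange(max_rank)
    hits = has_match[:, None] & (first_hit[:, None] <= ranks[None, :])
    return hits.mean(axis=0)
```

Under this convention, a rank-1 accuracy of 52.1% means that for 52.1% of query images the nearest gallery image in feature space shows the same person.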

Implications

Combining deep feature extraction with a triplet-based relative distance loss enhances the discriminative capability of the model. The efficient triplet sampling strategy addresses scalability, making the approach viable for large-scale applications in intelligent surveillance systems.
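The summary does not spell out the sampling procedure, so the sketch below is only one plausible reading of a scheme that keeps the number of distinct images per batch small: pick a few identities, then build triplets only among their images. The function name `sample_triplets`, its parameters, and the uniform random choices are all assumptions.

```python
import random

def sample_triplets(labels, classes_per_batch=2, triplets_per_image=2, seed=0):
    """Build triplets of image indices from a few sampled identities.

    labels: list of person IDs, one per image (index = image id)
    Returns a list of (query, matched, mismatched) index tuples.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    # An identity needs at least two images to supply a matched pair.
    eligible = [c for c, idxs in by_class.items() if len(idxs) >= 2]
    chosen = rng.sample(eligible, classes_per_batch)

    triplets = []
    for c in chosen:
        # Mismatched references come only from the other chosen identities.
        others = [i for o in chosen if o != c for i in by_class[o]]
        for query in by_class[c]:
            positives = [i for i in by_class[c] if i != query]
            for _ in range(triplets_per_image):
                triplets.append(
                    (query, rng.choice(positives), rng.choice(others))
                )
    return triplets
```

Because every index in a batch comes from the chosen identities, the number of distinct images stays fixed even as `triplets_per_image` grows, which is what lets the per-image cost dominate.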

Future Directions

Further research could explore extending the model to leverage more complex architectures or additional data augmentation techniques, potentially enhancing accuracy and generalizability. Investigating the integration of multi-modal data, such as combining RGB with infrared data, could also provide improvements, particularly in varied lighting conditions.

In summary, this framework provides a scalable and effective solution to the person re-identification problem, marking significant improvements in handling intra-class variations while maintaining computational efficiency.
