- The paper introduces GLAD, a descriptor that integrates global and local features via a four-stream CNN using robust human keypoints.
- It follows a two-step approach: body parts are first located with the DeeperCut pose estimator, and discriminative descriptors are then learned over them to enhance Re-ID performance.
- The retrieval framework employs two-fold divisive clustering to efficiently group images, reducing search space and improving real-time retrieval scalability.
Overview of GLAD: Global-Local-Alignment Descriptor for Pedestrian Retrieval
This paper presents a novel approach to addressing the challenges inherent in person Re-Identification (Re-ID) systems, particularly the variability in human poses and the misalignment of detected pedestrian images. The proposed solution involves two core innovations: the Global-Local-Alignment Descriptor (GLAD) and an efficient retrieval framework designed to enhance system performance in large dataset environments typical of video surveillance applications.
Methodology and Contributions
GLAD is designed to create a discriminative feature representation that effectively combines global and local features within pedestrian images. It employs a two-step approach:
- Part Extraction: Unlike methods that rely on rigid, fixed-grid partitions, GLAD uses four robustly detectable human keypoints to define three significant regions—the head, upper-body, and lower-body. The keypoints are estimated with the DeeperCut model, which handles variations in pose and viewpoint well.
- Descriptor Learning: A four-stream Convolutional Neural Network (CNN) is utilized to learn descriptors from both global and local regions. The CNN consists of shared convolutional layers that are optimized across multiple learning tasks corresponding to different body parts. This results in a feature vector, the GLAD, that is both high-dimensional and rich in discriminative cues.
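The two steps above can be sketched in a few lines. The function names, the choice of neck/hip heights as the horizontal cut lines, and the simple list concatenation are illustrative assumptions for this summary, not the paper's exact partition rules or fusion scheme:

```python
def split_regions(image_height, keypoints):
    """Partition a pedestrian image into head / upper-body / lower-body
    row ranges using keypoint y-coordinates (a simplified sketch; the
    paper derives these cuts from four DeeperCut keypoints).

    keypoints: dict with assumed 'neck' and 'hip' y-coordinates.
    Returns a dict mapping region name -> (row_start, row_end).
    """
    return {
        "head":  (0, keypoints["neck"]),
        "upper": (keypoints["neck"], keypoints["hip"]),
        "lower": (keypoints["hip"], image_height),
    }

def glad_descriptor(global_feat, head_feat, upper_feat, lower_feat):
    """Concatenate the global and three part features into one
    GLAD-style vector (each stream's feature is a list of floats)."""
    return global_feat + head_feat + upper_feat + lower_feat
```

In the paper the four streams share convolutional layers and are trained jointly; the sketch only illustrates the region split and the final concatenation into a single high-dimensional descriptor.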
The paper contrasts this method with current strategies, which often rely on fine-grained part extraction. GLAD instead optimizes this process by leveraging only those parts most reliably detected in diverse conditions, thereby avoiding the pitfalls of part detection noise and boosting system robustness.
Retrieval Framework
To complement GLAD, the authors propose a hierarchical indexing and retrieval framework. This framework incorporates a Two-fold Divisive Clustering (TDC) mechanism, effectively grouping redundant samples of individuals in the gallery set to minimize search space and accelerate retrieval processes. This indexing method clusters similar images without necessitating a pre-defined number of groups, thus optimizing both speed and scalability for real-time applications.
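The key property of TDC is that clusters are split recursively in two until they are compact, so no cluster count has to be fixed in advance. A minimal sketch of that idea, using a farthest-pair seeding and a maximum-diameter stopping rule as stand-ins for the paper's actual split and stopping criteria:

```python
import math

def _diameter(points):
    # Largest pairwise distance within a cluster.
    return max((math.dist(p, q) for p in points for q in points), default=0.0)

def _split_two(points):
    # Seed with the farthest pair, then assign each point to its nearest seed.
    s1, s2 = max(((p, q) for p in points for q in points),
                 key=lambda pq: math.dist(*pq))
    left, right = [], []
    for p in points:
        (left if math.dist(p, s1) <= math.dist(p, s2) else right).append(p)
    return left, right

def divisive_cluster(points, max_diameter):
    """Recursively bisect a set of descriptors until every cluster's
    diameter is below the threshold; no pre-set number of groups is
    required (a sketch of TDC's spirit, not its exact criterion)."""
    if len(points) <= 1 or _diameter(points) <= max_diameter:
        return [points]
    left, right = _split_two(points)
    if not left or not right:  # degenerate split: stop here
        return [points]
    return divisive_cluster(left, max_diameter) + divisive_cluster(right, max_diameter)
```

Grouping near-duplicate gallery images this way lets the online search touch one group at a time instead of the whole gallery.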
The retrieval process is twofold: first, relevant image groups are quickly identified using a lower-dimensional representation of GLAD, and then a detailed ranking of images is performed using the full descriptor.
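This coarse-to-fine search can be sketched as follows. Truncating the descriptor to its first few dimensions stands in for whatever compact representation the system actually uses for group matching, which is an assumption of this sketch:

```python
def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def coarse_to_fine_search(query, groups, low_dims=2):
    """Two-stage retrieval sketch: pick the closest group using a
    truncated (low-dimensional) descriptor, then rank that group's
    members with the full descriptor."""
    shrink = lambda v: v[:low_dims]

    def centroid(members):
        # Mean of the truncated descriptors in a group.
        shrunk = [shrink(m) for m in members]
        return [sum(dim) / len(shrunk) for dim in zip(*shrunk)]

    # Stage 1: coarse group selection on the compact representation.
    best_group = min(groups, key=lambda g: l2(shrink(query), centroid(g)))
    # Stage 2: fine ranking within the selected group on the full descriptor.
    return sorted(best_group, key=lambda m: l2(query, m))
```

Only the winning group's members are ranked with the full high-dimensional GLAD, which is where the speedup over exhaustive search comes from.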
Experimental Results
GLAD demonstrated superior performance across several leading datasets including Market1501, CUHK03, and VIPeR. Particularly noteworthy are its mAP and Rank-1 accuracy scores, which outperformed existing state-of-the-art methods by significant margins. This performance is largely attributable to GLAD’s balanced integration of global and local features and the novel retrieval framework's capacity to efficiently handle large-scale datasets.
Implications and Future Directions
The results of this research are pivotal for advancing the practical deployment of Re-ID systems in real-world environments. The gains in retrieval speed and accuracy from GLAD and its associated indexing framework mark a step forward for surveillance applications where managing large data volumes is critical.
Future work could explore deeper integration of contextual metadata, such as temporal and geographical information, to further refine Re-ID accuracy and expand the applicability of these models. Additionally, the development of more efficient TDC algorithms could further optimize offline processing workloads, supporting the scalable deployment of high-performance Re-ID systems.
In conclusion, this paper provides a solid advancement in pedestrian retrieval, showcasing innovations in both descriptor learning and retrieval methodology that hold substantial promise for future developments in artificial intelligence and computer vision domains.