Pose Invariant Embedding for Deep Person Re-identification (1701.07732v1)

Published 26 Jan 2017 in cs.CV

Abstract: Pedestrian misalignment, which mainly arises from detector errors and pose variations, is a critical problem for a robust person re-identification (re-ID) system. With bad alignment, the background noise will significantly compromise the feature learning and matching process. To address this problem, this paper introduces the pose invariant embedding (PIE) as a pedestrian descriptor. First, in order to align pedestrians to a standard pose, the PoseBox structure is introduced, which is generated through pose estimation followed by affine transformations. Second, to reduce the impact of pose estimation errors and information loss during PoseBox construction, we design a PoseBox fusion (PBF) CNN architecture that takes the original image, the PoseBox, and the pose estimation confidence as input. The proposed PIE descriptor is thus defined as the fully connected layer of the PBF network for the retrieval task. Experiments are conducted on the Market-1501, CUHK03, and VIPeR datasets. We show that PoseBox alone yields decent re-ID accuracy and that when integrated in the PBF network, the learned PIE descriptor produces competitive performance compared with the state-of-the-art approaches.

Authors

Liang Zheng
Yujia Huang
Huchuan Lu
Yi Yang

Summary

Person re-identification (re-ID) is hampered by pedestrian misalignment, which arises mainly from detector errors and pose variations. The paper "Pose Invariant Embedding for Deep Person Re-identification" addresses this problem by proposing a new pedestrian descriptor, the pose invariant embedding (PIE).

Core Contributions

The paper introduces the PoseBox, a structure that aligns a pedestrian to a standard pose. It is generated by running pose estimation on the image and then applying affine transformations. Aligning pedestrians in this way reduces the background noise and misalignment that otherwise compromise feature learning and matching.
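The affine alignment step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the joint coordinates, and the target rectangle size are all hypothetical, and the sketch only shows how an affine map can be recovered from three point correspondences (for example, three corners of a body-part region found by a pose estimator) and then applied.

```python
import numpy as np

def estimate_affine(src, dst):
    """Return the 2x3 matrix A with A @ [x, y, 1] ~= dst point for each src point."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    # Homogeneous source coordinates, one row per point: [x, y, 1].
    X = np.hstack([src, np.ones((src.shape[0], 1))])
    # Solve X @ A.T = dst; exact for three non-collinear points,
    # least-squares if more correspondences are supplied.
    A_T, *_ = np.linalg.lstsq(X, dst, rcond=None)
    return A_T.T

def apply_affine(A, pts):
    """Map an (n, 2) array of points through the 2x3 affine matrix A."""
    pts = np.asarray(pts, dtype=float)
    return pts @ A[:, :2].T + A[:, 2]

# Hypothetical corners of a torso region in the input image (pixels).
torso_src = [(40.0, 30.0), (80.0, 35.0), (38.0, 110.0)]
# Hypothetical corners of the canonical torso rectangle in the aligned image.
torso_dst = [(0.0, 0.0), (64.0, 0.0), (0.0, 96.0)]

A = estimate_affine(torso_src, torso_dst)
aligned = apply_affine(A, torso_src)
```

In a full pipeline the same matrix would be used to resample pixels rather than just corner points, and one such transform would be needed per aligned region.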

To reduce the impact of pose estimation errors and of the information lost while constructing the PoseBox, the paper designs a PoseBox fusion (PBF) CNN architecture. The PBF network takes three inputs: the original image, the PoseBox, and the pose estimation confidence. The PIE descriptor is defined as the fully connected layer of the PBF network and is used for the retrieval task.
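The three-input fusion can be sketched as follows. This is a simplified NumPy illustration of the general idea, not the PBF network itself. It assumes that each image stream has already been reduced to a feature vector, that the confidence input is a vector with one score per joint (the value 14 is an assumption), and that every dimension and weight below is a placeholder.

```python
import numpy as np

FEAT_DIM = 8    # placeholder size of each stream's feature vector
N_JOINTS = 14   # assumed number of pose confidence scores
CONF_DIM = 4    # placeholder size of the projected confidence feature
EMBED_DIM = 6   # placeholder size of the fused embedding

def init_params(seed=0):
    """Random placeholder weights; a real network would learn these."""
    rng = np.random.default_rng(seed)
    fused_dim = 2 * FEAT_DIM + CONF_DIM
    return {
        "W_conf": rng.normal(scale=0.1, size=(CONF_DIM, N_JOINTS)),
        "b_conf": np.zeros(CONF_DIM),
        "W_fuse": rng.normal(scale=0.1, size=(EMBED_DIM, fused_dim)),
        "b_fuse": np.zeros(EMBED_DIM),
    }

def fuse(image_feat, posebox_feat, confidence, params):
    """Concatenate the three inputs and project them to one embedding."""
    # Project the raw confidence scores through a small ReLU layer.
    conf_feat = np.maximum(params["W_conf"] @ confidence + params["b_conf"], 0.0)
    fused = np.concatenate([image_feat, posebox_feat, conf_feat])
    # The final fully connected layer plays the role of the descriptor.
    return params["W_fuse"] @ fused + params["b_fuse"]

rng = np.random.default_rng(1)
image_feat = rng.normal(size=FEAT_DIM)       # stand-in for the image stream
posebox_feat = rng.normal(size=FEAT_DIM)     # stand-in for the PoseBox stream
confidence = rng.uniform(0.0, 1.0, size=N_JOINTS)

pie = fuse(image_feat, posebox_feat, confidence, init_params())
print(pie.shape)  # (6,)
```

Because the confidence vector enters the same fully connected layer as the two image features, a trained layer could in principle learn to weight the PoseBox feature less when the scores are low.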

Experimental Validation

Experiments are conducted on the Market-1501, CUHK03, and VIPeR datasets. The results indicate that the PoseBox alone yields decent re-ID accuracy and that, when it is integrated in the PBF network, the learned PIE descriptor is competitive with state-of-the-art approaches.

Specifically, on Market-1501 with ResNet-50, PIE achieves a rank-1 accuracy of 78.65%. The baselines trained solely on original images and solely on PoseBoxes achieve 73.02% and 64.49%, respectively, so PIE gains 5.63 and 14.16 percentage points over them.
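For readers unfamiliar with the metric, rank-1 accuracy is the fraction of queries whose nearest gallery descriptor has the correct identity. The sketch below is a simplified illustration with made-up embeddings and identity labels. It uses Euclidean distance as an assumption and omits the same-camera filtering that the standard Market-1501 protocol applies.

```python
import numpy as np

def rank1_accuracy(query_emb, query_ids, gallery_emb, gallery_ids):
    """Fraction of queries whose closest gallery item shares their identity."""
    query_emb = np.asarray(query_emb, dtype=float)
    gallery_emb = np.asarray(gallery_emb, dtype=float)
    query_ids = np.asarray(query_ids)
    gallery_ids = np.asarray(gallery_ids)
    # Pairwise squared Euclidean distances, shape (n_query, n_gallery).
    diff = query_emb[:, None, :] - gallery_emb[None, :, :]
    dist = np.sum(diff ** 2, axis=2)
    # Index of the nearest gallery descriptor for each query.
    nearest = np.argmin(dist, axis=1)
    return float(np.mean(gallery_ids[nearest] == query_ids))

# Toy data: two queries, three gallery items, 2-D embeddings.
gallery_emb = [[0.1, 0.0], [0.9, 1.1], [5.0, 5.0]]
gallery_ids = [0, 1, 2]
query_emb = [[0.0, 0.0], [1.0, 1.0]]

all_correct = rank1_accuracy(query_emb, [0, 1], gallery_emb, gallery_ids)
half_correct = rank1_accuracy(query_emb, [0, 2], gallery_emb, gallery_ids)
print(all_correct, half_correct)  # 1.0 0.5
```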

Implications and Future Directions

The paper shows that combining pose normalization with fusion makes re-ID systems more robust to misalignment. Feeding the pose estimation confidence into the network provides a fallback: the network can adjust how much it relies on the PoseBox according to how reliable the pose estimate is.

Future research may explore more accurate pose estimation, which could improve PoseBox construction. End-to-end learning of PoseBox generation may also lead to higher re-ID performance.

The work also suggests that pose information may be useful beyond re-ID, for example in action recognition and biometric systems. The PIE framework further motivates research into multi-stream fusion networks and their extension to other visual recognition tasks.
