
Semantics-Aligned Representation Learning for Person Re-identification (1905.13143v3)

Published 30 May 2019 in cs.CV

Abstract: Person re-identification (reID) aims to match person images to retrieve the ones with the same identity. This is a challenging task, as the images to be matched are generally semantically misaligned due to the diversity of human poses and capture viewpoints, incompleteness of the visible bodies (due to occlusion), etc. In this paper, we propose a framework that drives the reID network to learn semantics-aligned feature representation through delicate supervision designs. Specifically, we build a Semantics Aligning Network (SAN) which consists of a base network as encoder (SA-Enc) for re-ID, and a decoder (SA-Dec) for reconstructing/regressing the densely semantics aligned full texture image. We jointly train the SAN under the supervisions of person re-identification and aligned texture generation. Moreover, at the decoder, besides the reconstruction loss, we add Triplet ReID constraints over the feature maps as the perceptual losses. The decoder is discarded in the inference and thus our scheme is computationally efficient. Ablation studies demonstrate the effectiveness of our design. We achieve the state-of-the-art performances on the benchmark datasets CUHK03, Market1501, MSMT17, and the partial person reID dataset Partial REID. Code for our proposed method is available at: https://github.com/microsoft/Semantics-Aligned-Representation-Learning-for-Person-Re-identification.

Semantics-Aligned Representation Learning for Person Re-identification

The paper "Semantics-Aligned Representation Learning for Person Re-identification" introduces a framework that tackles the semantic misalignment inherent in person re-identification (reID). The authors propose a Semantics Aligning Network (SAN) that drives the model to learn aligned feature representations, which are crucial for reliable matching across varied poses, camera viewpoints, and occlusions.

Methodology

The proposed SAN consists of a base network serving as the encoder (SA-Enc) and a decoder (SA-Dec). The encoder extracts discriminative person features for the reID task, while the decoder reconstructs a densely semantics-aligned full texture image from these features. This structure imposes a semantics-alignment constraint on the encoder, promoting the learning of features that are implicitly aligned across different camera views and conditions.
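The encoder/decoder split, and the fact that the decoder is used only during training, can be illustrated with a minimal stand-in sketch. The class names, dimensions, and linear maps here are hypothetical placeholders; the actual SA-Enc is a CNN backbone and the SA-Dec an upsampling network:

```python
import numpy as np

rng = np.random.default_rng(0)

class SAEnc:
    """Stand-in encoder: maps an image summary to a reID feature vector."""
    def __init__(self, feat_dim=8):
        self.w = rng.standard_normal((3, feat_dim))

    def __call__(self, image_rgb_mean):
        return image_rgb_mean @ self.w

class SADec:
    """Stand-in decoder: regresses an 'aligned texture' from the feature.
    Only used at training time to impose the alignment constraint."""
    def __init__(self, feat_dim=8, tex_dim=6):
        self.w = rng.standard_normal((feat_dim, tex_dim))

    def __call__(self, feat):
        return feat @ self.w

enc, dec = SAEnc(), SADec()
feat = enc(np.array([0.5, 0.4, 0.3]))  # reID feature, kept at inference
texture = dec(feat)                    # training-only reconstruction target
```

At inference only `enc` is retained and matching reduces to comparing feature distances, which is why the scheme adds no test-time cost.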

Alignment is achieved by regressing a dense, surface-based texture representation of the person, which maps different viewpoints onto a unified layout. At the decoder, Triplet ReID constraints are added over the feature maps as perceptual losses, which helps preserve identity information and encourages consistent feature learning throughout the network.
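The Triplet ReID constraint follows the standard margin-based triplet formulation applied to feature representations; a minimal sketch (the margin value here is illustrative, not the paper's setting):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge-style triplet loss: the anchor-positive distance should be
    smaller than the anchor-negative distance by at least `margin`."""
    d_ap = np.linalg.norm(anchor - positive)  # same identity
    d_an = np.linalg.norm(anchor - negative)  # different identity
    return max(0.0, d_ap - d_an + margin)
```

In the SAN, such constraints are computed not only on the encoder's embedding but also over the decoder's intermediate feature maps, serving as identity-aware perceptual losses.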

To handle the lack of ground-truth full texture images in existing reID datasets, the authors generate pseudo ground truths with a SAN-PG network pre-trained on a synthesized dataset. This dataset, named Paired-Image-Texture (PIT), is built from synthetic images derived from the SURREAL dataset, ensuring the availability of aligned texture images for training the SAN model.
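Joint training then combines the reID supervision with the reconstruction and perceptual terms. A sketch of the overall objective, with hypothetical weight names rather than the paper's notation:

```python
def san_total_loss(reid_loss, recon_loss, perceptual_losses,
                   w_recon=1.0, w_perc=1.0):
    """Weighted sum of the supervision signals driving the SA-Enc.

    reid_loss         -- identification/triplet supervision on the encoder
    recon_loss        -- texture-regression error at the decoder output
    perceptual_losses -- Triplet ReID terms computed over decoder feature maps
    """
    return reid_loss + w_recon * recon_loss + w_perc * sum(perceptual_losses)
```

The relative weights balance discriminative power against the alignment constraint; the values used in the paper's experiments are not reproduced here.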

Results

Empirically, the SAN framework demonstrates significant improvements over baseline methods on several benchmarks, including the CUHK03, Market1501, and MSMT17 datasets. The results show gains of 4% to 6.6% in Rank-1 accuracy and mAP over established baselines. The paper also reports competitive performance on challenging partial reID datasets such as Partial REID and Partial-iLIDS, demonstrating robustness to images with incomplete visibility.

The SAN approach effectively addresses the limitations posed by semantic misalignments, achieving fine-grained alignment within the non-rigid parts of the human body across images. The state-of-the-art performance achieved by the proposed framework highlights the significance of incorporating dense semantic alignment in reID systems, especially when dealing with diverse datasets where traditional alignment techniques may fall short.

Implications and Future Directions

The implications of this paper are significant for both theoretical and practical advancement in person re-identification. The semantic-alignment constraint embedded in the SAN design paves the way for more sophisticated models that can transition between site-specific and cross-modal re-identification tasks. Additionally, the introduction of a synthesized paired dataset to facilitate dense semantic learning is a methodological innovation that can be extended to other domains where real-world paired data is scarce or challenging to obtain.

Future research may explore the application of this semantics-aligned approach in other computer vision tasks such as vehicle re-identification, where similar challenges of misalignment are prevalent. Moreover, improvements in synthetic dataset diversity and realism could further elevate the effectiveness and generality of the SAN framework. As the landscape of AI continues to evolve, the integration of semantics-aligned representation learning could be a fundamental aspect of next-generation re-identification systems, offering rich, robust, and transferable feature representations.

Authors (5)
  1. Xin Jin (285 papers)
  2. Cuiling Lan (60 papers)
  3. Wenjun Zeng (130 papers)
  4. Guoqiang Wei (14 papers)
  5. Zhibo Chen (176 papers)
Citations (129)