Part-Aligned Bilinear Representations for Person Re-identification (1804.07094v1)

Published 19 Apr 2018 in cs.CV

Abstract: We propose a novel network that learns a part-aligned representation for person re-identification. It handles the body part misalignment problem, that is, body parts are misaligned across human detections due to pose/viewpoint change and unreliable detection. Our model consists of a two-stream network (one stream for appearance map extraction and the other one for body part map extraction) and a bilinear-pooling layer that generates and spatially pools a part-aligned map. Each local feature of the part-aligned map is obtained by a bilinear mapping of the corresponding local appearance and body part descriptors. Our new representation leads to a robust image matching similarity, which is equivalent to an aggregation of the local similarities of the corresponding body parts combined with the weighted appearance similarity. This part-aligned representation reduces the part misalignment problem significantly. Our approach is also advantageous over other pose-guided representations (e.g., extracting representations over the bounding box of each body part) by learning part descriptors optimal for person re-identification. For training the network, our approach does not require any part annotation on the person re-identification dataset. Instead, we simply initialize the part sub-stream using a pre-trained sub-network of an existing pose estimation network, and train the whole network to minimize the re-identification loss. We validate the effectiveness of our approach by demonstrating its superiority over the state-of-the-art methods on the standard benchmark datasets, including Market-1501, CUHK03, CUHK01 and DukeMTMC, and standard video dataset MARS.

Overview of "Part-Aligned Bilinear Representations for Person Re-identification"

The paper "Part-Aligned Bilinear Representations for Person Re-identification" addresses body part misalignment in person re-identification across disjoint camera views. The problem stems from pose and viewpoint variation and from unreliable person detection, both of which limit the effectiveness of conventional global or grid-based image representations.

Synopsis of the Approach

The authors propose a novel two-stream deep network architecture designed to extract part-aligned bilinear representations, which effectively handle these misalignments. The network employs two distinct streams: one dedicated to appearance map extraction and the other to body part map extraction. These extracted maps are then combined through a bilinear pooling mechanism to produce a part-aligned representation that significantly mitigates misalignment issues.
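As a rough illustration of the pooling step described above, the following NumPy sketch combines an appearance map and a part map by taking the outer product of the two descriptors at each location and pooling over locations. The function name, the array shapes, the use of average pooling, and the final L2 normalization are assumptions made for illustration; the overview does not specify these details.

```python
import numpy as np

def part_aligned_descriptor(appearance, part):
    """Bilinear-pool two spatially aligned feature maps (illustrative sketch).

    appearance: array of shape (H, W, Ca), one appearance descriptor per location.
    part:       array of shape (H, W, Cp), one part descriptor per location.
    Returns an L2-normalized vector of length Ca * Cp.
    """
    h, w, ca = appearance.shape
    cp = part.shape[2]
    a = appearance.reshape(h * w, ca)
    p = part.reshape(h * w, cp)
    # a.T @ p sums the outer products a_xy p_xy^T over all locations;
    # dividing by the number of locations gives average pooling (assumed).
    pooled = (a.T @ p) / (h * w)
    f = pooled.reshape(-1)
    norm = np.linalg.norm(f)
    return f / norm if norm > 0 else f

# Usage with random stand-ins for the outputs of the two streams.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3, 5))   # hypothetical appearance map
P = rng.normal(size=(4, 3, 2))   # hypothetical part map
f = part_aligned_descriptor(A, P)
```

In a trained network the two inputs would come from the appearance and part streams; random arrays are used here only to show the shapes involved.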

Each local feature of the part-aligned map is computed by a bilinear mapping of the appearance descriptor and the body part descriptor at the same location. The resulting image similarity is equivalent to aggregating local appearance similarities weighted by how well the corresponding body part descriptors match, so the representation needs no explicit alignment step or part bounding box prediction. Unlike pose-guided methods that extract representations over the bounding box of each pre-defined body part, this approach learns part descriptors optimized for re-identification.
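The abstract states that matching with this representation is equivalent to aggregating local similarities between corresponding body parts, combined with a weighted appearance similarity. For outer-product features this follows from the identity <vec(a p^T), vec(a' p'^T)> = (a . a')(p . p'). The sketch below checks the identity numerically; it uses plain sum pooling without normalization, and all names and shapes are hypothetical.

```python
import numpy as np

def bilinear_pool(appearance, part):
    """Sum over locations of vec(a_xy p_xy^T); inputs are (H, W, C) arrays."""
    a = appearance.reshape(-1, appearance.shape[-1])
    p = part.reshape(-1, part.shape[-1])
    return (a.T @ p).reshape(-1)

def aggregated_local_similarity(app1, part1, app2, part2):
    """Sum over all location pairs of (appearance sim) * (part sim)."""
    a1 = app1.reshape(-1, app1.shape[-1])
    p1 = part1.reshape(-1, part1.shape[-1])
    a2 = app2.reshape(-1, app2.shape[-1])
    p2 = part2.reshape(-1, part2.shape[-1])
    appearance_sim = a1 @ a2.T   # (N1, N2) appearance dot products
    part_sim = p1 @ p2.T         # (N1, N2) part dot products, acting as weights
    return float((appearance_sim * part_sim).sum())

rng = np.random.default_rng(0)
app1, part1 = rng.normal(size=(4, 2, 5)), rng.normal(size=(4, 2, 3))
app2, part2 = rng.normal(size=(4, 2, 5)), rng.normal(size=(4, 2, 3))

# Similarity computed on the pooled descriptors...
global_sim = float(bilinear_pool(app1, part1) @ bilinear_pool(app2, part2))
# ...matches the aggregation of part-weighted local appearance similarities.
local_sim = aggregated_local_similarity(app1, part1, app2, part2)
```

Under this reading, two locations contribute to the image similarity only to the extent that their part descriptors agree, which is one way to see why no explicit alignment is needed.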

One notable aspect of this research is that training requires no part annotations on the re-identification dataset. Instead, the part stream is initialized from a pre-trained sub-network of an existing pose estimation network, and the whole network is then trained to minimize the re-identification loss, so the part descriptors adapt to the re-identification task without part-level supervision.

Empirical Evaluation

The proposed model was validated on several benchmark datasets: Market-1501, CUHK03, CUHK01, and DukeMTMC for image-based re-identification, and MARS for video-based re-identification. The authors report results superior to state-of-the-art methods in terms of both rank-1 accuracy and mean average precision (mAP), in the image-based and the video-based settings alike.
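For readers unfamiliar with the two metrics, here is a minimal sketch of how rank-1 accuracy and mAP can be computed from a query-by-gallery distance matrix. It is a simplification: the function name is hypothetical, and benchmark protocols typically apply extra filtering (for example, discarding gallery images taken by the same camera as the query), which is omitted here.

```python
import numpy as np

def rank1_and_map(dist, query_ids, gallery_ids):
    """dist[i, j] is the distance between query i and gallery item j."""
    query_ids = np.asarray(query_ids)
    gallery_ids = np.asarray(gallery_ids)
    rank1_hits, average_precisions = [], []
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i], kind="stable")      # nearest gallery items first
        matches = gallery_ids[order] == query_ids[i]
        if not matches.any():
            continue                                    # query has no true match
        rank1_hits.append(float(matches[0]))
        hit_positions = np.flatnonzero(matches)         # 0-based ranks of true matches
        # Precision at each true match: (matches found so far) / (rank of this match).
        precisions = np.arange(1, len(hit_positions) + 1) / (hit_positions + 1)
        average_precisions.append(precisions.mean())
    return float(np.mean(rank1_hits)), float(np.mean(average_precisions))

# Toy example: two queries, three gallery images.
dist = np.array([[0.1, 0.9, 0.5],
                 [0.2, 0.8, 0.3]])
rank1, mean_ap = rank1_and_map(dist, query_ids=[1, 2], gallery_ids=[1, 2, 1])
```

In the toy example the first query ranks both of its true matches first, while the second query's only true match is ranked last, giving a rank-1 accuracy of 0.5 and an mAP of 2/3.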

Theoretical and Practical Implications

Theoretically, the introduction of a bilinear pooling strategy that combines detailed appearance and part maps offers a potentially new direction for reducing the bias introduced by fixed grid alignments. It highlights a promising avenue for further research into fine-grained feature aggregation techniques in computer vision tasks that suffer from misalignment issues.

Practically, the implications are substantial for video surveillance systems, where robust person re-identification across non-overlapping camera networks is crucial. The technique's ability to train without manual part annotations makes it particularly attractive for large-scale systems.

Possible Future Developments

Future developments might explore the integration of this approach with temporal aggregation mechanisms for improved video sequence analysis. Stronger pose estimation models could also provide a better initialization for the part stream and so further refine the part-aligned descriptors. Additionally, extending this framework to re-identification tasks in other domains presents intriguing opportunities.

Overall, this paper introduces a refined computational approach that addresses one of the pivotal challenges in re-identification tasks and sets the stage for new methodologies in mitigating alignment issues within visual recognition systems.

Authors (5)

Yumin Suh (16 papers)
Jingdong Wang (236 papers)
Siyu Tang (86 papers)
Tao Mei (209 papers)
Kyoung Mu Lee (107 papers)

Citations (499)
