- The paper introduces a spatiotemporal attention mechanism that automatically discovers distinctive local regions, mitigating occlusion and spatial misalignment in video sequences.
- It employs a diversity regularization term based on the Hellinger distance so that the multiple spatial attention models focus on different body parts rather than redundantly on the same one.
- Experimental evaluations show state-of-the-art performance with rank-1 accuracies of 93.2% on PRID2011, 80.2% on iLIDS-VID, and 82.3% on MARS.
Diversity Regularized Spatiotemporal Attention for Video-based Person Re-identification
The paper introduces a novel approach for video-based person re-identification, focusing on the challenges of occlusion and spatial misalignment inherent in video sequences. Prior methods typically aggregate frame-level features over the whole sequence (for example, by average pooling), which leaves the representation vulnerable to interference from occluded or corrupted frames. This work instead proposes a spatiotemporal attention model that automatically identifies distinctive body parts across video frames, improving the robustness and accuracy of identification.
Key Contributions
- Spatiotemporal Attention Model: Multiple spatial attention models automatically localize distinctive image regions such as the face or torso in each frame, without requiring part annotations; the features they pool are later aggregated over time by temporal attention (see the first sketch after this list).
- Diversity Regularization: To prevent the multiple spatial attention models from collapsing onto the same body part, a diversity regularization term based on the Hellinger distance penalizes overlap between attention maps, yielding a diverse set of region detections and a richer representation of the person (see the second sketch after this list).
- Organized Feature Extraction: Features extracted from the attended local regions are grouped by spatial attention model and then combined across frames with temporal attention. This makes use of the reliable information in every frame and yields a compact, informative representation of the person over the whole video sequence (see the third sketch after this list).
- State-of-the-art Performance: Comprehensive evaluations on the PRID2011, iLIDS-VID, and MARS datasets show notable improvements in rank-1 accuracy and mean average precision (mAP) over prior methods.
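The sketches below illustrate the three pieces in PyTorch. They are minimal illustrations, not the authors' released architecture: the module names, layer sizes, and the number of attention heads (`num_regions`) are assumptions. First, multi-head spatial attention that scores every location of a per-frame feature map and pools one descriptor per region:

```python
# A minimal sketch of multi-head spatial attention, assuming a backbone
# that yields per-frame conv feature maps of shape (batch, channels, H, W).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    def __init__(self, in_channels: int, num_regions: int = 6):
        super().__init__()
        # A 1x1 conv scores every spatial location for each attention head.
        self.score = nn.Conv2d(in_channels, num_regions, kernel_size=1)

    def forward(self, feats: torch.Tensor):
        # feats: (B, C, H, W) per-frame feature maps from the backbone.
        logits = self.score(feats).flatten(2)       # (B, K, H*W)
        attn = F.softmax(logits, dim=-1)            # each map sums to 1
        # Attention-weighted pooling: one C-dim descriptor per region.
        region_feats = torch.bmm(attn, feats.flatten(2).transpose(1, 2))  # (B, K, C)
        return region_feats, attn
```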
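Next, a hedged sketch of the Hellinger-style diversity penalty. The squared Hellinger distance between distributions p and q is 1 − Σ√(p·q); penalizing the distance between the Gram matrix of the square-rooted attention maps and the identity therefore pushes the maps toward disjoint regions. The exact weighting of this term in the training loss is not reproduced here:

```python
# A sketch of a Hellinger-distance diversity penalty over the K spatial
# attention maps produced above.
import torch

def diversity_penalty(attn: torch.Tensor) -> torch.Tensor:
    # attn: (B, K, N) attention maps, each row a distribution over N locations.
    q = attn.clamp_min(1e-12).sqrt()             # rows of Q are sqrt(a_k)
    gram = torch.bmm(q, q.transpose(1, 2))       # (B, K, K); diagonal == 1
    eye = torch.eye(attn.size(1), device=attn.device).expand_as(gram)
    # Off-diagonal entries equal 1 - H^2(a_i, a_j); driving them to zero
    # maximizes the pairwise Hellinger distance between attention maps.
    return ((gram - eye) ** 2).sum(dim=(1, 2)).mean()
```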
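Finally, temporal attention pooling, which weights each frame's region descriptor by a learned score before averaging; the single-layer scorer is an illustrative stand-in for the paper's temporal attention network:

```python
# A minimal sketch of temporal attention pooling, assuming per-frame,
# per-region descriptors of shape (batch, time, K, C).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttention(nn.Module):
    def __init__(self, feat_dim: int):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)  # scalar score per frame/region

    def forward(self, region_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (B, T, K, C); weight each frame per region, then pool.
        weights = F.softmax(self.score(region_feats), dim=1)  # softmax over T
        pooled = (weights * region_feats).sum(dim=1)          # (B, K, C)
        # Concatenate the K region descriptors into the clip-level embedding.
        return pooled.flatten(1)                              # (B, K*C)
```

Chaining the three pieces per clip (spatial attention on each frame, the diversity penalty added to the training loss, temporal pooling per region) reproduces the overall flow described above.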
Numerical Results
The experimental results show that the proposed method handles occlusion and misalignment well: it achieves rank-1 accuracies of 93.2% on PRID2011, 80.2% on iLIDS-VID, and 82.3% on MARS, surpassing the prior state of the art on all three benchmarks. These results indicate that the spatiotemporal attention approach is well suited to the complexities of video re-identification.
Implications and Future Directions
This paper's contribution lies in its integration of spatiotemporal attention with a diversity regularization that keeps the learned attention models complementary. As the field progresses, this methodology could serve as a basis for more sophisticated models that incorporate other cues for human recognition, such as movement patterns and environmental context. Future work might investigate the transferability of these techniques to broader applications, including multi-view learning, cross-modal re-identification, and real-time surveillance systems.
In conclusion, this work offers a compelling path for enhancing re-identification in video sequences, providing a robust framework that could be expanded and adjusted for various advanced applications in computer vision and beyond.