Person Re-Identification by Discriminative Selection in Video Ranking
This article reviews a paper on person re-identification (ReID) that leverages discriminative selection within a video ranking framework. Conventional ReID approaches typically rely on spatial appearance information extracted from single image frames of individuals captured across surveillance scenarios. This reliance limits a model's robustness to adverse visual conditions such as occlusion, varying camera viewpoints, and background clutter. To address these challenges, the authors propose a framework that integrates both appearance and space-time dynamics into the person ReID process using video sequences.
Model Overview
The proposed framework introduces a novel discriminative video fragment selection and ranking methodology. Each video sequence is segmented into candidate fragments from which space-time and appearance features are derived. A key component of the approach is video fragmentation guided by motion energy profiling: the motion energy profile of a walking person is used to automatically locate and extract the most informative fragments, from which space-time features are computed for a robust ReID process.
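The paper's reference code is not reproduced here, but the fragmentation idea can be sketched as follows. This is a minimal illustration assuming grayscale uint8 frames as NumPy arrays; the function names (`motion_energy_profile`, `extract_fragments`) and the choice of Farneback optical flow are my own, not the authors'.

```python
import numpy as np
import cv2  # OpenCV, for dense optical flow

def motion_energy_profile(frames):
    """Per-frame motion energy: total optical-flow magnitude
    between consecutive grayscale uint8 frames."""
    energy = []
    for prev, curr in zip(frames[:-1], frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(
            prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        mag = np.linalg.norm(flow, axis=2)  # per-pixel |(u, v)|
        energy.append(mag.sum())
    return np.asarray(energy)

def extract_fragments(frames, half_len=10):
    """Cut short fragments centred on local extrema of the energy
    profile (roughly one gait half-cycle per fragment)."""
    e = motion_energy_profile(frames)
    extrema = [t for t in range(1, len(e) - 1)
               if (e[t] < e[t - 1] and e[t] < e[t + 1])
               or (e[t] > e[t - 1] and e[t] > e[t + 1])]
    return [frames[max(0, t - half_len): t + half_len]
            for t in extrema]
```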
Fragment Representation: Each fragment is represented with both color appearance and HOG3D descriptors, forming a compound feature that captures both static and dynamic properties of appearance. This is crucial for fragments containing gait or other motion-specific cues that cannot be represented in a single static frame.
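A rough sketch of such a compound descriptor follows, assuming BGR uint8 frames and an external HOG3D extractor (OpenCV does not provide one); `fragment_feature` and `hog3d_fn` are hypothetical names for illustration, and any spatio-temporal gradient descriptor could stand in.

```python
import numpy as np
import cv2

def color_descriptor(frames, bins=16):
    """HSV colour histogram averaged over the fragment's frames."""
    hists = []
    for f in frames:  # f: BGR uint8 image
        hsv = cv2.cvtColor(f, cv2.COLOR_BGR2HSV)
        h = cv2.calcHist([hsv], [0, 1, 2], None,
                         [bins, bins, bins],
                         [0, 180, 0, 256, 0, 256]).flatten()
        hists.append(h / (h.sum() + 1e-8))  # L1-normalise
    return np.mean(hists, axis=0)

def fragment_feature(frames, hog3d_fn):
    """Compound descriptor: static colour + dynamic HOG3D.
    `hog3d_fn` is an external HOG3D extractor returning a 1-D
    vector for the fragment."""
    return np.concatenate([color_descriptor(frames),
                           hog3d_fn(frames)])
```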
Learning Algorithm
A multi-instance ranking (MIR) strategy with an embedded discriminative selection mechanism is introduced. Through multi-instance selection and ranking, the framework learns a re-identification function that is insensitive to noise in video data, such as occlusions and incomplete sequences. The model iteratively selects the most informative video fragment pairings between sequences from non-overlapping camera views to train an identity-discriminative ranking function.
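A toy version of this alternating select-then-rank loop might look like the following. It is my own simplification of the idea (a linear score on absolute feature differences, plain subgradient updates, hard-negative fragment selection), not the authors' exact algorithm.

```python
import numpy as np

def train_dvr_sketch(X_probe, X_gallery, triplets,
                     epochs=20, lr=1e-3, lam=1e-4):
    """Toy multi-instance ranking loop.

    X_probe[i], X_gallery[j] : lists of fragment feature vectors
    triplets                 : (i, j_pos, j_neg) sequence-index triples
    """
    w = np.zeros(X_probe[0][0].shape[0])

    def score(xp, xg):
        # Similarity from the absolute feature difference; higher = closer.
        return -w @ np.abs(xp - xg)

    for _ in range(epochs):
        for i, jp, jn in triplets:
            # Selection: the matched fragment pair the current model
            # explains best (the most informative instance)...
            xp, xg_pos = max(
                ((p, g) for p in X_probe[i] for g in X_gallery[jp]),
                key=lambda pg: score(*pg))
            # ...and the hardest non-matched gallery fragment for it.
            xg_neg = max(X_gallery[jn], key=lambda g: score(xp, g))
            # Ranking: max-margin hinge on the selected pairs.
            if score(xp, xg_pos) - score(xp, xg_neg) < 1.0:
                grad = np.abs(xp - xg_pos) - np.abs(xp - xg_neg)
                w -= lr * (grad + lam * w)  # hinge subgradient step
    return w
```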
The selection mechanism minimizes the cumulative loss over fragment-pair instances within a robust max-margin learning framework. By concentrating on the most discriminative fragment-pair selections, the algorithm strengthens the function's ability to rank individuals consistently across camera views.
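In simplified notation (mine, abstracting details of the paper's exact formulation), the learning problem can be written as:

```latex
% P_i: candidate matched fragment-pair features for person i;
% N_i: non-matched fragment-pair features. The inner min performs
% the discriminative selection of the most informative matched pair,
% while the hinge enforces the ranking margin against non-matches.
\min_{\mathbf{w}}\;
\frac{\lambda}{2}\,\lVert \mathbf{w} \rVert^{2}
\;+\;
\sum_{i}\,
\min_{x^{+} \in \mathcal{P}_{i}}
\sum_{x^{-} \in \mathcal{N}_{i}}
\max\!\bigl(0,\; 1 - \mathbf{w}^{\top}(x^{+} - x^{-})\bigr)
```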
Empirical Findings
The authors conducted experiments on three benchmark datasets: iLIDS-VID, PRID2011, and HDA+. The evaluations show substantial improvements over relevant competing methods, including those based on gait recognition, dynamic time warping for sequence matching, and static spatial representations.
The strength of the proposed approach is particularly evident on datasets presenting severe challenges such as heavy occlusion and low frame rates, which degrade traditional single-shot and multi-shot models. The results also show that complementing existing spatial feature-based models with the proposed DVR (Discriminative Video Ranking) model significantly improves re-identification rates, highlighting the model's potential to augment existing frameworks with more robust, dynamic information.
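As a rough illustration of this complementarity (my own sketch; the paper's actual combination scheme may differ), scores from the two models can be fused at ranking time, with the weight `alpha` tuned on validation data:

```python
import numpy as np

def fused_ranking(dvr_scores, spatial_scores, alpha=0.5):
    """Late fusion of a DVR-style score with a spatial-appearance
    score for one probe against the gallery.
    Both inputs: 1-D arrays, higher = better match."""
    # Z-normalise so the two score scales are comparable.
    z = lambda s: (s - s.mean()) / (s.std() + 1e-8)
    fused = (alpha * z(np.asarray(dvr_scores))
             + (1 - alpha) * z(np.asarray(spatial_scores)))
    return np.argsort(-fused)  # gallery indices, best match first
```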
Implications and Future Work
The empirical evidence suggests that dynamic, temporally aware features extracted from complementary video fragments are an underutilized yet potent source of discriminative information for superior ReID performance. As surveillance networks modernize, with higher frame rates and interconnected large-scale systems, the DVR approach can harness growing streams of data beyond conventional single-camera, static-image methodologies.
Future directions could investigate scalability to more extensive networks with larger camera setups, integration with semantic attribute-labeling systems, and handling of inter-person appearance variability such as clothing changes. Extending the framework to open-world scenarios, where people appear in novel contexts without pre-existing entries in the gallery set, also holds promise for further research. These enhancements would address prevailing practical challenges in real-world deployments and move the field closer to fully autonomous multi-camera surveillance systems.