Revisiting Temporal Modeling for Video-based Person ReID
The paper "Revisiting Temporal Modeling for Video-based Person ReID" by Jiyang Gao and Ram Nevatia presents a comprehensive paper on the effectiveness of various temporal modeling methods in the context of video-based person re-identification (ReID). With the rising demand for surveillance and the extensive use of camera networks, the challenge of re-identifying individuals across different video sequences has gained substantial importance.
Core Components and Methodologies
The typical architecture for video-based person ReID comprises three components (see the sketch after this list):
- An image-level feature extractor, usually a Convolutional Neural Network (CNN).
- A temporal modeling method for aggregating features over time.
- A loss function for training.
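A minimal PyTorch sketch of this three-part pipeline, using temporal pooling as the default aggregator; the class name, input resolution, and identity count are illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class VideoReIDBaseline(nn.Module):
    """Frame-level CNN + temporal aggregation + classification head."""
    def __init__(self, num_identities, feat_dim=2048):
        super().__init__()
        resnet = models.resnet50(weights=None)
        # Drop the final FC layer, keeping the 2048-d pooled feature.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        self.classifier = nn.Linear(feat_dim, num_identities)

    def forward(self, clips):
        # clips: (batch, frames, 3, H, W)
        b, t, c, h, w = clips.shape
        feats = self.backbone(clips.view(b * t, c, h, w)).view(b, t, -1)
        clip_feat = feats.mean(dim=1)  # temporal pooling over frames
        return clip_feat, self.classifier(clip_feat)

# Clip features feed the triplet loss; logits feed the cross-entropy loss.
model = VideoReIDBaseline(num_identities=625)  # 625 training identities in MARS
feats, logits = model(torch.randn(2, 8, 3, 256, 128))
```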
While there have been numerous proposals for temporal modeling methods, comparing them has been difficult because different studies use different feature extractors and loss functions. This paper controls for those variables by fixing the feature extractor (ResNet-50) and the loss functions (triplet loss plus softmax cross-entropy loss). Four temporal modeling techniques are evaluated: temporal pooling, temporal attention, Recurrent Neural Networks (RNNs), and 3D Convolutional Neural Networks (3D ConvNets).
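The shared training objective can be sketched as the sum of a triplet loss on the aggregated clip features and a softmax cross-entropy loss on the identity logits; the margin value below is a common default, not necessarily the paper's exact hyperparameter:

```python
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=0.3)  # margin is an assumed value
xent = nn.CrossEntropyLoss()

def reid_loss(anchor, positive, negative, logits, labels):
    # anchor/positive/negative: clip features; logits: identity predictions
    return triplet(anchor, positive, negative) + xent(logits, labels)
```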
Proposed Model and Results
In addition to comparing existing methods, the authors propose a new attention generation network incorporating temporal convolutions to harness temporal information between frames. All methods are assessed using the MARS dataset, the largest available dataset for video-based person ReID.
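The attention generator can be approximated as a 1-D convolution across the sequence of frame features that yields one score per frame, followed by a softmax over time and a weighted average; the layer widths and kernel size here are assumptions, as the paper may configure them differently:

```python
import torch
import torch.nn as nn

class TemporalConvAttention(nn.Module):
    """Scores each frame with a temporal convolution, then returns a
    softmax-weighted average of the frame features over time."""
    def __init__(self, feat_dim=2048, hidden=256, kernel_size=3):
        super().__init__()
        self.score = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size, padding=kernel_size // 2),
        )

    def forward(self, feats):
        # feats: (batch, frames, feat_dim)
        scores = self.score(feats.transpose(1, 2))           # (batch, 1, frames)
        weights = torch.softmax(scores, dim=-1)              # attention over time
        return (weights.transpose(1, 2) * feats).sum(dim=1)  # (batch, feat_dim)
```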
The performance evaluation indicates that:
- Temporal pooling outperforms the baseline model (which uses no temporal aggregation) by 3%, highlighting its efficacy in capturing temporal features.
- Temporal attention delivers similar improvements, slightly surpassing temporal pooling, particularly with the proposed temporal convolution-based attention mechanism.
- RNN-based approaches perform worse, even falling below the image-level baseline, which suggests that RNNs do not efficiently capture the relevant temporal dependencies in this setting.
- 3D ConvNets also exhibit comparatively lower performance, underscoring the importance of effective temporal aggregation rather than merely extending spatial convolutions.
The proposed temporal-conv-based attention model achieves the highest accuracy among the tested methodologies.
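For contrast with the pooling and attention sketches above, the RNN aggregator that underperforms here typically runs a recurrent layer over the frame features and averages its outputs; a minimal sketch, with the hidden size assumed:

```python
import torch.nn as nn

class RNNAggregator(nn.Module):
    """Runs an LSTM over per-frame features and averages the hidden
    states into a single clip-level feature."""
    def __init__(self, feat_dim=2048, hidden=512):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)

    def forward(self, feats):
        # feats: (batch, frames, feat_dim)
        outputs, _ = self.rnn(feats)  # (batch, frames, hidden)
        return outputs.mean(dim=1)    # (batch, hidden)
```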
Implications and Future Directions
This research underscores the significance of selecting an appropriate temporal modeling strategy for video-based person ReID. The demonstrated superiority of temporal pooling and attention-based methods suggests that simple frame-level aggregation captures the useful temporal information without resorting to complex recurrent architectures.
From a practical standpoint, the findings advocate for deploying simpler temporal aggregation methods such as pooling or attention, which are computationally cheaper than RNNs or 3D ConvNets while delivering higher accuracy.
As future directions, exploring methods to aggregate information across longer temporal horizons, such as entire videos rather than individual clips, remains a promising avenue. Additionally, further refinement of attention mechanisms may offer incremental gains without substantial increases in computational overhead. Understanding the nuanced temporal dynamics that affect video-based person ReID could unlock further potential in surveillance and related domains.