Online Multi-Object Tracking Using CNN-based Single Object Tracker with Spatial-Temporal Attention Mechanism (1708.02843v2)

Published 9 Aug 2017 in cs.CV

Abstract: In this paper, we propose a CNN-based framework for online MOT. This framework utilizes the merits of single object trackers in adapting appearance models and searching for target in the next frame. Simply applying single object tracker for MOT will encounter the problem in computational efficiency and drifted results caused by occlusion. Our framework achieves computational efficiency by sharing features and using ROI-Pooling to obtain individual features for each target. Some online learned target-specific CNN layers are used for adapting the appearance model for each target. In the framework, we introduce spatial-temporal attention mechanism (STAM) to handle the drift caused by occlusion and interaction among targets. The visibility map of the target is learned and used for inferring the spatial attention map. The spatial attention map is then applied to weight the features. Besides, the occlusion status can be estimated from the visibility map, which controls the online updating process via weighted loss on training samples with different occlusion statuses in different frames. It can be considered as temporal attention mechanism. The proposed algorithm achieves 34.3% and 46.0% in MOTA on challenging MOT15 and MOT16 benchmark dataset respectively.

Summary

The paper introduces a CNN-based framework for online multi-object tracking that uses a spatial-temporal attention mechanism (STAM) to reduce drift caused by occlusion and target interaction.
It shares convolutional features across targets, extracts per-target features with ROI-Pooling, and adapts each target's appearance model through online-learned, target-specific CNN layers.
On the MOT15 and MOT16 benchmarks, the approach reports MOTA scores of 34.3% and 46.0%, respectively.

Online Multi-Object Tracking Using CNN-based Single Object Tracker with Spatial-Temporal Attention Mechanism

The paper "Online Multi-Object Tracking Using CNN-based Single Object Tracker with Spatial-Temporal Attention Mechanism" addresses online multi-object tracking (MOT) in scenes where targets occlude and interact with one another. The proposed framework adapts a convolutional neural network (CNN) based single object tracker to the multi-target setting and adds a spatial-temporal attention mechanism (STAM) to limit the drift that occlusion causes.

Framework Introduction

The proposed algorithm extends CNN-based single object trackers to MOT. Running one independent tracker per target would be computationally expensive, so the framework computes convolutional features once per frame, shares them among all targets, and uses ROI-Pooling to extract a fixed-size feature for each individual target. Target-specific CNN layers, learned online, adapt the appearance model of each target. Occlusion and interaction among targets, the main sources of drift, are handled separately by the attention mechanism described next.
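The feature-sharing idea can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function name `roi_max_pool`, the 3x3 output grid, and the random stand-in for the shared feature map are all assumptions made for illustration. The point is only that the feature map is computed once and each target then pools its own region from it.

```python
import numpy as np

def roi_max_pool(feature_map, box, out_size=(3, 3)):
    """Max-pool one region of a shared feature map onto a fixed-size grid.

    feature_map: array of shape (C, H, W), computed once per frame.
    box: (x0, y0, x1, y1) in feature-map cells, inside the map, x1 > x0, y1 > y0.
    """
    x0, y0, x1, y1 = box
    out_h, out_w = out_size
    channels = feature_map.shape[0]
    # Bin edges along each axis; every bin is widened to cover at least one cell.
    ys = np.linspace(y0, y1, out_h + 1)
    xs = np.linspace(x0, x1, out_w + 1)
    pooled = np.empty((channels, out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        r0 = int(np.floor(ys[i]))
        r1 = max(int(np.ceil(ys[i + 1])), r0 + 1)
        for j in range(out_w):
            c0 = int(np.floor(xs[j]))
            c1 = max(int(np.ceil(xs[j + 1])), c0 + 1)
            pooled[:, i, j] = feature_map[:, r0:r1, c0:c1].max(axis=(1, 2))
    return pooled

# Hypothetical data: one shared map (a stand-in for CNN output) and two target boxes.
rng = np.random.default_rng(0)
shared = rng.random((8, 20, 30))
boxes = [(2, 3, 10, 15), (12, 1, 25, 18)]
per_target = [roi_max_pool(shared, b) for b in boxes]
```

Every target ends up with a feature of the same shape regardless of its box size, which is what lets the target-specific layers operate on a fixed input.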

Spatial-Temporal Attention Mechanism

The central contribution of this work is the spatial-temporal attention mechanism, which has a spatial and a temporal component. For spatial attention, the framework learns a visibility map of the target, infers a spatial attention map from it, and uses that map to weight the target's features, so that occluded regions contribute less. This reduces drift caused by occlusion and interaction among targets. For temporal attention, the occlusion status estimated from the visibility map controls the online update: training samples collected in different frames, under different occlusion statuses, receive different weights in the loss.
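The two attention components can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the paper's formulation: here the attention map is obtained by simply normalizing the visibility map, the occlusion status is taken as the mean visibility, and the loss weight of each sample equals the visibility of the frame it came from. The paper learns these mappings; all function names below are hypothetical.

```python
import numpy as np

def spatial_attention(visibility_map):
    """Normalize a visibility map (H, W) into attention weights that sum to 1.

    Assumption: plain normalization stands in for the learned mapping.
    """
    v = np.asarray(visibility_map, dtype=float)
    total = v.sum()
    return v / total if total > 0 else np.zeros_like(v)

def apply_attention(features, attention):
    """Weight features (C, H, W) by a spatial attention map (H, W)."""
    return features * attention[None, :, :]

def occlusion_status(visibility_map):
    """Scalar visibility in [0, 1]; assumed here to be the mean of the map."""
    return float(np.mean(visibility_map))

def weighted_loss(per_sample_losses, sample_visibility):
    """Weighted mean of per-sample losses.

    Samples from frames where the target was more visible count more,
    which is one simple reading of the temporal attention idea.
    """
    losses = np.asarray(per_sample_losses, dtype=float)
    weights = np.asarray(sample_visibility, dtype=float)
    total = weights.sum()
    return float((weights * losses).sum() / total) if total > 0 else 0.0

# Hypothetical target whose right half is occluded.
visibility = np.zeros((4, 4))
visibility[:, :2] = 1.0
features = np.ones((2, 4, 4))

attention = spatial_attention(visibility)
attended = apply_attention(features, attention)
status = occlusion_status(visibility)

# Two stored samples: one from a fully visible frame, one from a fully occluded frame.
loss = weighted_loss([1.0, 3.0], [1.0, 0.0])
```

In this toy case the occluded half of the feature map is zeroed out, and the sample from the occluded frame does not influence the update at all.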

Numerical Results and Implications

The algorithm was evaluated on the MOT15 and MOT16 benchmarks, where it reaches MOTA scores of 34.3% and 46.0%, respectively. MOTA (multiple object tracking accuracy) combines false negatives, false positives, and identity switches into a single score, and higher is better.
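For readers unfamiliar with the metric, the standard CLEAR MOT definition of MOTA can be written as a short function. The per-frame counts in the example are made up for illustration and are unrelated to the paper's results.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with each term summed over all frames."""
    errors = sum(false_negatives) + sum(false_positives) + sum(id_switches)
    return 1.0 - errors / sum(num_gt)

# Hypothetical two-frame sequence with 10 ground-truth objects per frame.
score = mota(
    false_negatives=[2, 1],
    false_positives=[1, 0],
    id_switches=[0, 1],
    num_gt=[10, 10],
)
print(f"MOTA = {score:.2%}")
```

Because the error count is not bounded by the number of ground-truth objects, MOTA can be negative for a tracker that produces many false positives.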

In practical terms, sharing convolutional features across targets keeps the cost of running one adaptive tracker per target manageable, which is relevant to occlusion-rich applications such as video surveillance and autonomous driving. The abstract reports accuracy rather than speed, so suitability for real-time use should be checked against the runtime figures in the paper. On the methodological side, the work shows how online-adapted CNN appearance models can be combined with occlusion-aware attention to track multiple objects simultaneously.

Future Directions

Future work could extend the approach to other tracking modalities or incorporate additional contextual information. Hybrid frameworks that combine pre-trained object detection models with online-adapted trackers are another possible direction.

In conclusion, the paper combines CNN-based single object trackers, shared features with ROI-Pooling, and a spatial-temporal attention mechanism into a single online MOT framework. Its visibility-driven handling of occlusion, both in feature weighting and in model updating, is the part most likely to transfer to other tracking systems.