Robust Visual Tracking via Hierarchical Convolutional Features (1707.03816v2)

Published 12 Jul 2017 in cs.CV

Abstract: In this paper, we propose to exploit the rich hierarchical features of deep convolutional neural networks to improve the accuracy and robustness of visual tracking. Deep neural networks trained on object recognition datasets consist of multiple convolutional layers. These layers encode target appearance with different levels of abstraction. For example, the outputs of the last convolutional layers encode the semantic information of targets and such representations are invariant to significant appearance variations. However, their spatial resolutions are too coarse to precisely localize the target. In contrast, features from earlier convolutional layers provide more precise localization but are less invariant to appearance changes. We interpret the hierarchical features of convolutional layers as a nonlinear counterpart of an image pyramid representation and explicitly exploit these multiple levels of abstraction to represent target objects. Specifically, we learn adaptive correlation filters on the outputs from each convolutional layer to encode the target appearance. We infer the maximum response of each layer to locate targets in a coarse-to-fine manner. To further handle the issues with scale estimation and re-detecting target objects from tracking failures caused by heavy occlusion or out-of-the-view movement, we conservatively learn another correlation filter, that maintains a long-term memory of target appearance, as a discriminative classifier. We apply the classifier to two types of object proposals: (1) proposals with a small step size and tightly around the estimated location for scale estimation; and (2) proposals with large step size and across the whole image for target re-detection. Extensive experimental results on large-scale benchmark datasets show that the proposed algorithm performs favorably against state-of-the-art tracking methods.

Citations (182)

View on Semantic Scholar

Summary

The paper introduces a novel tracking approach using hierarchical convolutional features to improve robustness and precise localization under appearance changes.
The method employs multi-layer CNN features to adaptively train correlation filters, balancing semantic richness and spatial precision for effective object tracking.
Extensive evaluations on benchmarks like OTB and VOT demonstrate that the approach outperforms state-of-the-art trackers in accuracy and resilience.

Robust Visual Tracking via Hierarchical Convolutional Features

The paper presents an innovative approach to visual object tracking by utilizing hierarchical convolutional features extracted from deep convolutional neural networks (CNNs). The authors aim to address the challenges in visual tracking caused by substantial appearance changes due to deformation, occlusion, abrupt motion, and cluttered backgrounds. Their method exploits the rich hierarchical representations of objects offered by CNNs to enhance tracking robustness and accuracy.

Methodology

The approach leverages the different levels of abstraction provided by multiple convolutional layers within a CNN. The deeper layers encode semantic information and are invariant to significant appearance variations, which is invaluable for maintaining track robustness. Conversely, the earlier layers retain higher spatial resolution, beneficial for precise localization of the object. Consequently, the authors interpret the hierarchical features across convolutional layers as non-linear counterparts of an image pyramid representation. This multi-layer feature extraction serves as the foundation for adaptive learning of correlation filters.

The tracking is formulated as a classification problem, where correlation filters are trained on CNN feature maps. Specifically, the algorithm infers the maximum response from each layer in a coarse-to-fine manner, thus fine-tuning localization with precise spatial details while maintaining robustness via semantically rich features from deeper layers.

The paper also addresses critical problems in tracking systems such as scale variation and re-detection of targets, particularly when they are heavily occluded or move out of the field of view. For these scenarios, another correlation filter is learned to retain a long-term memory of the target appearance, functioning as a discriminative classifier. This filter is applied to different types of object proposals, ensuring reliable scale estimation and re-detection following tracking failures.

Experimental Results

The proposed tracking algorithm is evaluated on large-scale benchmark datasets, including OTB2013, OTB2015, VOT2014, and VOT2015. The extensive experimental results demonstrate that the proposed method outperforms several state-of-the-art tracking algorithms in terms of both accuracy and robustness. The method exhibits superior tracking performance under various challenging conditions, such as background clutter, scale variations, and occlusions.

Implications and Future Directions

The proposed approach showcases significant advancement in visual object tracking by integrating deep learning with traditional model updating strategies through hierarchical correlation features. The hybrid methodology combining CNN-derived features with adaptive correlation filters provides a robust framework capable of handling complex tracking scenarios.

The implications of this work extend to various practical applications in computer vision, such as video surveillance, autonomous vehicles, and augmented reality, where reliable real-time tracking is critical. The integration of CNNs for feature extraction in tracking paradigms marks a step towards more sophisticated and capable visual systems.

Future research could explore more dynamic model updating mechanisms that automatically adapt parameters in real-time to further enhance robustness and speed. Additionally, evaluating the flexibility of this approach to various architectures and datasets beyond those tested could offer insights into its generalizability and potential in broader applications.

In conclusion, this research demonstrates a viable path for advancing visual tracking through the leverage of hierarchical convolutional features, setting a benchmark for accuracy and robustness in this domain.

PDF Markdown