End-to-end representation learning for Correlation Filter based tracking

Published 20 Apr 2017 in cs.CV and cs.LG | (1704.06036v1)

Abstract: The Correlation Filter is an algorithm that trains a linear template to discriminate between images and their translations. It is well suited to object tracking because its formulation in the Fourier domain provides a fast solution, enabling the detector to be re-trained once per frame. Previous works that use the Correlation Filter, however, have adopted features that were either manually designed or trained for a different task. This work is the first to overcome this limitation by interpreting the Correlation Filter learner, which has a closed-form solution, as a differentiable layer in a deep neural network. This enables learning deep features that are tightly coupled to the Correlation Filter. Experiments illustrate that our method has the important practical benefit of allowing lightweight architectures to achieve state-of-the-art performance at high framerates.

Authors (5)

Citations (1,380)

Summary

The paper introduces a novel end-to-end training approach that embeds correlation filters as a differentiable layer to optimize CNN features for tracking.
The methodology employs a fully-convolutional Siamese network, enabling efficient and robust online tracking even with shallow architectures.
Experimental results show that CFNet reaches competitive accuracy at real-time speeds with lightweight networks, which makes it a good fit for resource-constrained applications.

End-to-end Representation Learning for Correlation Filter Based Tracking

The paper "End-to-end Representation Learning for Correlation Filter Based Tracking" by Valmadre et al. combines Correlation Filters (CF) with deep learning for object tracking. Earlier CF trackers relied on features that were either hand-designed or trained for a different task; this paper instead trains deep features specifically for CF-based tracking.

Introduction and Motivation

Object tracking in video sequences is a fundamental problem in computer vision. A prominent challenge is the need to adapt a model to a previously unseen object using very little data at test time. Existing approaches either adapt a pre-trained deep convolutional neural network (CNN) online with techniques such as stochastic gradient descent (SGD), or rely on a feature embedding learned offline. The former is computationally expensive, while the latter fails to exploit video-specific cues.

Correlation Filters (CF) provide an efficient alternative for online learning because they have a simple, fast solution in the Fourier domain. Prior works combined CF with pre-trained CNN features, but the features were not trained jointly with the CF.

Contributions

The core contribution of this paper is the reinterpretation of the CF learner as a differentiable layer in a deep neural network, facilitating end-to-end training. This integration allows for the joint optimization of CNN features and CF, leading to robust and lightweight tracking architectures capable of state-of-the-art performance.

Methodology

The authors present CFNet, an architecture that integrates the CF as a layer within a fully-convolutional Siamese network. The CF learner is normally treated as a separate optimization problem; because it has a closed-form solution, it can be differentiated through, which enables end-to-end learning.
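The scoring step of a Siamese tracker can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `embed` is a hypothetical identity stand-in for the shared CNN branch, `score_map` and `locate` are names chosen here, and the scale and bias parameters are assumptions about how a response map is typically calibrated. In CFNet the raw exemplar features would first pass through the CF layer before being used as the template.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def embed(image):
    # Hypothetical stand-in for the shared CNN branch; a real tracker
    # would apply learned convolutional layers here.
    return np.asarray(image, dtype=np.float64)


def score_map(template, search_feat, scale=1.0, bias=0.0):
    """Slide `template` over `search_feat` ("valid" cross-correlation)."""
    windows = sliding_window_view(search_feat, template.shape)
    return scale * np.einsum("ijkl,kl->ij", windows, template) + bias


def locate(template, search_image):
    """Return the (row, col) offset with the highest response."""
    response = score_map(template, embed(search_image))
    peak = np.unravel_index(np.argmax(response), response.shape)
    return tuple(int(i) for i in peak)


rng = np.random.default_rng(0)
exemplar = rng.standard_normal((5, 5))
search = np.zeros((12, 12))
search[4:9, 6:11] = exemplar      # plant the target at offset (4, 6)

template = embed(exemplar)        # CFNet would apply the CF layer here
offset = locate(template, search)
```

On this toy input the response peaks where the exemplar was planted, because the inner product of the template with itself exceeds its inner product with any partially overlapping window.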

Mathematical Formulation

In the single-channel case, the CF solves a regularized deconvolution problem: it finds the template that minimizes the squared error between the predicted and target responses over all circular shifts of the training patch, plus a quadratic penalty on the template. The solution is computed efficiently with Fourier transforms, where it reduces to element-wise operations. The multi-channel case extends this to CNN feature maps: each channel gets its own filter, and the response is the sum of the per-channel correlations.
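A minimal NumPy sketch of this closed-form solution is given below. It assumes the objective (1/2n)·||response − y||² + (λ/2)·||w||² over all n circular shifts, with features stored as an array of shape (C, H, W); the function names `cf_template` and `cf_response` and the parameter `lam` are chosen here for illustration and are not taken from the paper's code.

```python
import numpy as np


def cf_template(x, y, lam):
    """Closed-form CF template for features x of shape (C, H, W).

    y is the desired (H, W) response and lam the ridge regularizer.
    """
    n = y.size                                   # number of circular shifts
    x_hat = np.fft.fft2(x)                       # FFT over the last two axes
    y_hat = np.fft.fft2(y)
    k_hat = (np.abs(x_hat) ** 2).sum(axis=0) / n + lam   # summed over channels
    alpha_hat = y_hat / (n * k_hat)
    w_hat = np.conj(alpha_hat) * x_hat           # one filter per channel
    return np.fft.ifft2(w_hat).real


def cf_response(w, z):
    """Circular cross-correlation of w with z, summed over channels."""
    r_hat = (np.conj(np.fft.fft2(w)) * np.fft.fft2(z)).sum(axis=0)
    return np.fft.ifft2(r_hat).real


rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))       # toy 3-channel feature map
y = np.zeros((8, 8))
y[0, 0] = 1.0                            # desired peak at zero shift
w = cf_template(x, y, lam=1e-3)
peak = np.unravel_index(np.argmax(cf_response(w, x)), y.shape)
peak = tuple(int(i) for i in peak)
```

Applying the learned template back to its own training features produces a response that peaks at zero shift, as the target `y` requests. Every step is an element-wise operation on FFTs, so the cost is dominated by the transforms themselves.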

Implementation

To integrate the CF into the network, the authors define the back-propagation map through the CF solution. This involves differentiating through the regularized deconvolution problem using properties of the Discrete Fourier Transform (DFT).
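The idea can be illustrated with a hand-derived backward pass for the single-channel case. The sketch below is a derivation made for this example, not the paper's back-propagation equations: it assumes the template is ŵ = x̂ · conj(ŷ) / (|x̂|² + nλ), treats the target `y` and regularizer `lam` as constants, and uses the hypothetical names `cf_forward` and `cf_backward`. A finite-difference check is included because a manual gradient like this is easy to get wrong.

```python
import numpy as np


def cf_forward(x, y, lam):
    """Single-channel CF template plus values reused by the backward pass."""
    n = x.size
    x_hat, y_hat = np.fft.fft2(x), np.fft.fft2(y)
    d = np.abs(x_hat) ** 2 + n * lam             # real, strictly positive
    w = np.fft.ifft2(x_hat * np.conj(y_hat) / d).real
    return w, (x_hat, y_hat, d, n * lam)


def cf_backward(grad_w, cache):
    """Map dL/dw to dL/dx, treating y and lam as constants."""
    x_hat, y_hat, d, n_lam = cache
    gy_hat = np.fft.fft2(grad_w) * y_hat
    # Follows from differentiating w_hat w.r.t. x_hat and conj(x_hat),
    # then moving back to the spatial domain with Parseval's identity.
    q_hat = (n_lam * gy_hat - np.conj(gy_hat) * x_hat ** 2) / d ** 2
    return np.fft.ifft2(q_hat).real


def numerical_grad(x, y, lam, grad_w, eps=1e-6):
    """Central finite differences of L(x) = sum(grad_w * w(x))."""
    out = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        step = np.zeros_like(x)
        step[idx] = eps
        hi = np.sum(grad_w * cf_forward(x + step, y, lam)[0])
        lo = np.sum(grad_w * cf_forward(x - step, y, lam)[0])
        out[idx] = (hi - lo) / (2 * eps)
    return out


rng = np.random.default_rng(0)
x, y, g = (rng.standard_normal((4, 4)) for _ in range(3))
w, cache = cf_forward(x, y, lam=0.1)
analytic = cf_backward(g, cache)
numeric = numerical_grad(x, y, 0.1, g)
gradients_match = bool(np.allclose(analytic, numeric, rtol=1e-4, atol=1e-6))
```

Like the forward pass, the backward pass needs only FFTs and element-wise arithmetic, so no linear system has to be solved explicitly. In practice an automatic-differentiation framework with FFT support could produce the same gradient without a manual derivation.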

Experimental Analysis

Experiments conducted demonstrate the efficacy of CFNet across various benchmarks. Key findings include:

Performance with Shallow Networks: CFNet significantly outperforms baseline Siamese networks when using shallow architectures. Shallow networks with CF layers achieve comparable accuracy to deeper networks while maintaining high framerates.
Comparison with State-of-the-art: CFNet at various depths achieves competitive performance on OTB benchmarks, underscoring the benefit of end-to-end training for CF-based tracking.
Practical Efficiency: Lightweight CFNet models operate at real-time speeds, making them suitable for embedded systems with restricted computational resources.

Implications and Future Directions

The integration of CF as a differentiable layer within CNNs opens new avenues for enhancing online learning in object tracking. The theoretical and practical benefits demonstrated in CFNet suggest future research directions:

Temporal Adaptation: Enhancing CFNet to adapt over video sequences dynamically.
Meta-learning: Using the gradient propagation through CF to fine-tune models for specific tracking tasks dynamically.
Domain Adaptation: Applying similar integrated learning techniques to other computer vision tasks that necessitate rapid model adaptation with limited data.

Conclusion

Valmadre et al. bring end-to-end learning to correlation filter-based tracking. CFNet combines the efficiency of the CF with the representational power of deep networks, and shows that shallow architectures can reach accuracy comparable to deeper ones while running at high framerates. The work connects fast online learning with the discriminative power of learned deep features.
