A Twofold Siamese Network for Real-Time Object Tracking (1802.08817v1)

Published 24 Feb 2018 in cs.CV

Abstract: Observing that Semantic features learned in an image classification task and Appearance features learned in a similarity matching task complement each other, we build a twofold Siamese network, named SA-Siam, for real-time object tracking. SA-Siam is composed of a semantic branch and an appearance branch. Each branch is a similarity-learning Siamese network. An important design choice in SA-Siam is to separately train the two branches to keep the heterogeneity of the two types of features. In addition, we propose a channel attention mechanism for the semantic branch. Channel-wise weights are computed according to the channel activations around the target position. While the inherited architecture from SiamFC allows our tracker to operate beyond real-time, the twofold design and the attention mechanism significantly improve the tracking performance. The proposed SA-Siam outperforms all other real-time trackers by a large margin on OTB-2013/50/100 benchmarks.

Summary

The paper introduces a twofold Siamese network whose separately trained appearance and semantic branches improve tracking accuracy while keeping real-time speed.
It adds a channel attention mechanism to the semantic branch that weights feature channels by their activations around the target position, contributing to the reported gains on the OTB benchmarks.
The approach tracks robustly without online training, which makes it a candidate for applications such as surveillance and autonomous navigation.

A Twofold Siamese Network for Real-Time Object Tracking

The paper "A Twofold Siamese Network for Real-Time Object Tracking" introduces SA-Siam, a two-branch Siamese architecture for real-time visual object tracking. The method combines complementary semantic and appearance features to improve tracking accuracy while preserving real-time speed.

Network Architecture

The SA-Siam network is composed of two distinct branches: an appearance branch and a semantic branch. Each branch is a fully convolutional Siamese network that computes similarity scores between a target patch and a search region. The appearance branch learns its features in a similarity-matching task, while the semantic branch uses deep features pre-trained for image classification.
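The similarity computation in a fully convolutional Siamese branch can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function name `cross_correlate`, the feature shapes, and the planted-pattern toy data are assumptions made for illustration, and real branches would first embed both images with a convolutional network.

```python
import numpy as np

def cross_correlate(target_feat, search_feat):
    """Slide a (C, h, w) target feature over a (C, H, W) search feature.

    Returns an (H-h+1, W-w+1) response map whose entries are the inner
    products between the target and each same-sized search window.
    """
    c, h, w = target_feat.shape
    c2, H, W = search_feat.shape
    if c != c2:
        raise ValueError("channel counts must match")
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = search_feat[:, i:i + h, j:j + w]
            out[i, j] = np.sum(target_feat * window)
    return out

# Toy data (hypothetical): plant a positive pattern in an otherwise empty
# search feature and use that same pattern as the target.
rng = np.random.default_rng(0)
pattern = rng.uniform(0.5, 1.5, size=(4, 3, 3))
search = np.zeros((4, 10, 10))
search[:, 3:6, 5:8] = pattern

response = cross_correlate(pattern, search)
peak = tuple(int(k) for k in np.unravel_index(np.argmax(response), response.shape))
```

In this toy setup the response map peaks at the position where the pattern was planted, which is how a Siamese tracker reads the target location off the similarity map.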

A key design choice is to train these branches separately, preserving the heterogeneity of the features. This separation allows the network to leverage high-level semantic information along with appearance-based features that have more immediate discriminative power.
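Because the branches are trained separately, their outputs have to be combined at tracking time. One simple way to do that, sketched here under stated assumptions, is a weighted average of the two response maps. The function name `fuse_responses`, the weight `lam`, and its default value are placeholders chosen for illustration rather than details given in this summary.

```python
import numpy as np

def fuse_responses(h_app, h_sem, lam=0.3):
    """Weighted average of appearance and semantic response maps.

    lam is a hypothetical mixing weight in [0, 1]; in practice it would be
    chosen on validation data.
    """
    if h_app.shape != h_sem.shape:
        raise ValueError("response maps must have the same shape")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * h_app + (1.0 - lam) * h_sem

# Toy maps (hypothetical): the branches disagree about one location.
h_app = np.array([[1.0, 0.0],
                  [0.0, 0.5]])
h_sem = np.array([[0.0, 0.0],
                  [0.0, 1.0]])

fused = fuse_responses(h_app, h_sem, lam=0.3)
best = tuple(int(k) for k in np.unravel_index(np.argmax(fused), fused.shape))
```

Fusing at the response-map level keeps the two feature extractors independent, which is consistent with the goal of preserving their heterogeneity.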

Channel Attention Mechanism

To further enhance the semantic branch, the authors introduce a channel attention mechanism. It computes channel-wise weights from the activations around the target position, which gives the tracker a limited form of target adaptation and places more emphasis on the feature channels relevant to the current target. Together with the twofold design, the mechanism contributes to SA-Siam outperforming other real-time trackers on the OTB-2013/50/100 benchmarks.
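The idea of weighting channels by their activations around the target can be sketched as follows. This is a simplified stand-in, not the paper's attention module: the function name `channel_weights`, the central-window pooling, the `center_frac` parameter, and the mean-centred sigmoid are all assumptions used in place of whatever learned mapping the authors apply.

```python
import numpy as np

def channel_weights(feat, center_frac=0.5):
    """Per-channel weights in (0, 1) from activations near the map centre.

    feat has shape (C, H, W) and the target is assumed to sit at the centre.
    Each channel is summarised by its maximum inside a central window; a
    sigmoid of the mean-centred summaries stands in for a learned mapping.
    """
    c, h, w = feat.shape
    dh = max(1, int(h * center_frac))
    dw = max(1, int(w * center_frac))
    top, left = (h - dh) // 2, (w - dw) // 2
    center = feat[:, top:top + dh, left:left + dw]
    descriptor = center.reshape(c, -1).max(axis=1)
    centered = descriptor - descriptor.mean()
    return 1.0 / (1.0 + np.exp(-centered))

# Toy feature map (hypothetical): channel 0 fires on the target, channel 1
# fires only in a corner, channel 2 is silent.
feat = np.zeros((3, 8, 8))
feat[0, 3:5, 3:5] = 4.0
feat[1, 0, 0] = 4.0

weights = channel_weights(feat)
weighted = feat * weights[:, None, None]  # broadcast weights over H and W
```

Here the channel that responds at the target position receives the largest weight, while a channel that responds only in the background is treated the same as a silent one.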

Empirical Results

Experiments reveal that combining the semantic and appearance features markedly improves tracking performance. On the OTB benchmarks, SA-Siam achieved significant gains in area-under-curve (AUC) and precision compared to standalone semantic or appearance models. The performance enhancements are more pronounced when multilevel features and the attention mechanism are utilized.

The SA-Siam tracker also shows competitive performance on VOT benchmarks, achieving high accuracy and robustness scores while maintaining real-time speed.

Implications and Future Directions

The implications of this work are both practical and theoretical. Practically, the proposed method enables more robust real-time object tracking without requiring online training, making it well-suited for applications like surveillance and autonomous navigation. Theoretically, it highlights the utility of preserving feature heterogeneity and utilizing attention mechanisms to improve tracking reliability.

Future developments could explore enhancing the fusion mechanism between branches further, or extending this architecture to different domains or modalities. The success of the channel-wise attention model may also encourage more refined adaptation techniques tailored to object tracking tasks.

In conclusion, the SA-Siam network demonstrates a promising direction for deep learning architectures in real-time tracking: it improves accuracy at real-time speed by integrating complementary feature sets.
