- The paper introduces a multi-domain CNN architecture (MDNet) that employs shared and domain-specific layers to enhance target tracking accuracy.
- It presents an online tracking algorithm that adapts to appearance variations using strategies like hard negative mining and bounding box regression.
- Experimental results on OTB and VOT benchmarks confirm the method's superior precision and robustness across challenging tracking scenarios.
An Overview of "Learning Multi-Domain Convolutional Neural Networks for Visual Tracking"
The paper "Learning Multi-Domain Convolutional Neural Networks for Visual Tracking" by Hyeonseob Nam and Bohyung Han presents a robust visual tracking algorithm leveraging the representational power of Convolutional Neural Networks (CNNs). This algorithm tackles the complexities inherent in visual tracking by employing a multi-domain learning strategy to enhance target representation and adaptability in diverse tracking scenarios.
Core Contributions
The paper's primary contributions revolve around:
- Multi-Domain Learning Framework: The paper introduces a CNN architecture named Multi-Domain Network (MDNet), consisting of shared layers that capture generic features and a separate domain-specific branch per training sequence for binary (target vs. background) classification. This design learns a shared representation from multiple annotated video sequences while keeping domain-independent information separate from domain-specific features.
- Online Tracking Algorithm: Leveraging the MDNet, the authors present an online tracking framework where the pre-trained shared layers are combined with a new binary classification layer for the target in any new sequence. This new layer is updated online to adapt to the target’s appearance variations, ensuring robustness and accuracy.
- Efficient Online Adaptation Strategies: Various strategies are introduced for robust online adaptation. These include hard negative mining, which focuses on identifying and utilizing the most informative negative samples during tracking, and a bounding box regression step which refines target localization.
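The hard negative mining idea from the contributions above can be sketched in a few lines; `score_fn`, the sample array, and all sizes below are illustrative placeholders, not the authors' implementation:

```python
import numpy as np

def mine_hard_negatives(neg_samples, score_fn, n_hard):
    """Select the negatives the classifier finds most target-like.

    neg_samples : candidate features, shape (N, D)
    score_fn    : maps one feature vector -> positive-class score
    n_hard      : number of hard negatives to keep
    """
    scores = np.array([score_fn(x) for x in neg_samples])
    # The highest-scoring negatives are the ones the model currently
    # confuses with the target -- the most informative training samples.
    hard_idx = np.argsort(scores)[::-1][:n_hard]
    return neg_samples[hard_idx]

# Toy demonstration: a linear scorer picks the negatives most aligned
# with a hypothetical "target" direction w.
rng = np.random.default_rng(0)
negatives = rng.normal(size=(50, 4))
w = np.array([1.0, 0.0, 0.0, 0.0])
hard = mine_hard_negatives(negatives, lambda x: float(w @ x), n_hard=8)
print(hard.shape)  # (8, 4)
```

During tracking, these hard negatives would replace randomly drawn ones in the minibatch used to update the classifier.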
Key Methodological Details
The network architecture is deliberately kept small for visual tracking, balancing representational power against efficiency. The shared part consists of five hidden layers—three convolutional layers followed by two fully connected layers—with an additional domain-specific layer for each training video sequence. Training uses Stochastic Gradient Descent (SGD), iterating over the sequence-specific branches so that the shared layers converge to a generic target representation.
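The shared-plus-branches structure can be illustrated schematically. This is not the authors' code: the three convolutional layers are abstracted into plain fully connected layers, and all dimensions are invented for the sketch.

```python
import numpy as np

class MDNetSketch:
    """Shared layers feed K domain-specific binary heads (one per training video)."""

    def __init__(self, in_dim, hidden, n_domains, seed=0):
        rng = np.random.default_rng(seed)
        # Shared stack: stands in for conv1-3 plus the two fully connected layers.
        self.shared = [rng.normal(scale=0.1, size=(in_dim, hidden)),
                       rng.normal(scale=0.1, size=(hidden, hidden))]
        # One 2-way (target vs. background) head per domain.
        self.branches = [rng.normal(scale=0.1, size=(hidden, 2))
                         for _ in range(n_domains)]

    def forward(self, x, k):
        h = x
        for W in self.shared:
            h = np.maximum(h @ W, 0.0)              # ReLU
        logits = h @ self.branches[k]               # domain-k classifier only
        e = np.exp(logits - logits.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)    # softmax over {target, bg}

net = MDNetSketch(in_dim=16, hidden=8, n_domains=3)
probs = net.forward(np.ones((4, 16)), k=1)
print(probs.shape)  # (4, 2)
```

The key structural point is that a forward pass touches the shared stack plus exactly one branch, which is what makes per-domain training and later branch replacement straightforward.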
Offline Pretraining
The pretraining step involves learning from multiple video sequences, each considered a separate domain. This multi-domain pretraining separates the shared representations (captured in the shared layers) from the domain-specific branches, which are updated iteratively.
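The round-robin schedule—each SGD iteration uses one domain's branch while the shared weights accumulate gradients from all domains—can be sketched as follows. The linear model, layer sizes, and toy data are illustrative assumptions, not the paper's actual network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multidomain_sgd(domains, n_iters=500, lr=0.1, hid=6, seed=0):
    """Round-robin multi-domain SGD: every step updates the shared weights
    plus only the branch belonging to the current domain."""
    rng = np.random.default_rng(seed)
    K = len(domains)
    in_dim = domains[0][0].shape[1]
    W = rng.normal(scale=0.1, size=(in_dim, hid))            # shared layers
    V = [rng.normal(scale=0.1, size=hid) for _ in range(K)]  # domain branches
    for t in range(n_iters):
        k = t % K                        # cycle through the K domains
        X, y = domains[k]
        h = X @ W                        # shared representation (linear, for brevity)
        p = sigmoid(h @ V[k])            # branch-k target probability
        err = p - y                      # d(binary cross-entropy)/d(logit)
        gV = h.T @ err / len(y)
        gW = X.T @ np.outer(err, V[k]) / len(y)
        V[k] -= lr * gV                  # only this domain's branch moves...
        W -= lr * gW                     # ...but the shared weights always move
    return W, V

# Two toy "videos" that share the same underlying target rule (x[0] > 0),
# standing in for domain-independent structure the shared layers should absorb.
rng = np.random.default_rng(1)
X1 = rng.normal(size=(40, 16)); y1 = (X1[:, 0] > 0).astype(float)
X2 = rng.normal(size=(40, 16)); y2 = (X2[:, 0] > 0).astype(float)
W, V = multidomain_sgd([(X1, y1), (X2, y2)])
```

After pretraining, the branches are discarded and only the shared weights carry over to a new sequence.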
Online Tracking and Updates
During online tracking, a new branch is created for the target in the new video sequence. This branch and the fully connected layers are fine-tuned online, while the convolutional layers remain fixed. At each frame, the tracker draws candidate windows around the previous target state, evaluates them with the network, and selects the candidate with the highest score. Long-term updates are performed at regular intervals, while short-term updates are triggered when a tracking failure is suspected, keeping the model adapted to both gradual and abrupt appearance changes.
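A single candidate-evaluate-select step can be sketched as below; the Gaussian sampling parameters and the `score_fn` stand-in for the network head are assumptions made for illustration.

```python
import numpy as np

def track_step(prev_box, score_fn, n_candidates=256,
               trans_sigma=10.0, scale_sigma=0.05, seed=0):
    """One tracking step: sample boxes around the previous state, score, take the best.

    prev_box : (x, y, w, h) of the previous target state
    score_fn : maps a candidate box -> target score (stand-in for the network)
    """
    rng = np.random.default_rng(seed)
    x, y, w, h = prev_box
    # Translation perturbations are additive; scale perturbations multiplicative.
    cand = np.column_stack([
        x + trans_sigma * rng.normal(size=n_candidates),
        y + trans_sigma * rng.normal(size=n_candidates),
        w * np.exp(scale_sigma * rng.normal(size=n_candidates)),
        h * np.exp(scale_sigma * rng.normal(size=n_candidates)),
    ])
    scores = np.array([score_fn(b) for b in cand])
    best = int(np.argmax(scores))
    # A low best score would signal a suspected failure and trigger a
    # short-term update; long-term updates run at fixed intervals regardless.
    return cand[best], float(scores[best])

# Toy scorer: prefers boxes whose center is near a hypothetical target at (105, 98).
target = np.array([105.0, 98.0])
box, s = track_step((100, 100, 40, 60),
                    lambda b: -np.hypot(b[0] - target[0], b[1] - target[1]))
print(box, s)
```

In the real tracker the selected box would then be refined by the bounding box regressor before being reported as the frame's estimate.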
Experimental Validation
The algorithm was tested on two widely-recognized benchmarks: Object Tracking Benchmark (OTB) and VOT2014. Results indicate the proposed MDNet outperforms existing state-of-the-art tracking methods significantly on both benchmarks. Key observations include:
- OTB Results: In both OTB50 and OTB100 datasets, MDNet demonstrates superior performance, achieving the highest precision and success scores compared to other trackers. The success plots for different challenge attributes indicate MDNet's robustness across various challenging conditions like occlusion, rotation, and scale variation.
- VOT2014 Results: MDNet consistently ranks highest or near-highest in both accuracy and robustness. It handles the benchmark's imprecise (noisy) initializations effectively, suggesting potential for long-term tracking applications if combined with a re-detection module.
Implications and Future Directions
The proposed MDNet approach marks a significant advance in visual tracking by effectively separating domain-specific from domain-independent information through a well-designed CNN architecture. Practically, it delivers improved tracking accuracy and robustness under varying conditions, as demonstrated across multiple benchmarks; theoretically, the clean separation of shared and domain-specific features contributes to the continued evolution of multi-domain learning strategies within the AI community.
Future developments could delve into optimizing online updates further or incorporating additional contextual features to improve tracking stability. Additionally, exploring the use of deeper or more complex network architectures may yield further performance improvements, leveraging advances in CNNs' representational capabilities while balancing computational efficiency.
In conclusion, the MDNet framework demonstrates a substantial step forward in visual tracking performance, offering a compelling combination of efficiency, adaptability, and accuracy suitable for a broad range of applications in computer vision.