Siamese Instance Search for Tracking (1605.05863v1)

Published 19 May 2016 in cs.CV

Abstract: In this paper we present a tracker, which is radically different from state-of-the-art trackers: we apply no model updating, no occlusion detection, no combination of trackers, no geometric matching, and still deliver state-of-the-art tracking performance, as demonstrated on the popular online tracking benchmark (OTB) and six very challenging YouTube videos. The presented tracker simply matches the initial patch of the target in the first frame with candidates in a new frame and returns the most similar patch by a learned matching function. The strength of the matching function comes from being extensively trained generically, i.e., without any data of the target, using a Siamese deep neural network, which we design for tracking. Once learned, the matching function is used as is, without any adapting, to track previously unseen targets. It turns out that the learned matching function is so powerful that a simple tracker built upon it, coined Siamese INstance search Tracker, SINT, which only uses the original observation of the target from the first frame, suffices to reach state-of-the-art performance. Further, we show the proposed tracker even allows for target re-identification after the target was absent for a complete video shot.

Summary

The paper introduces a Siamese network that learns an invariant matching function for object tracking, with no model updating after the first frame.
It utilizes a two-stream architecture trained on the diverse ALOV dataset to robustly address occlusions, scale variations, and illumination changes.
Evaluated on the OTB dataset, the approach achieves competitive, state-of-the-art performance using a simple nearest-neighbor search mechanism.

Siamese Instance Search for Tracking: A Comprehensive Overview

The paper "Siamese Instance Search for Tracking," presented at CVPR 2016 by Ran Tao, Efstratios Gavves, and Arnold W.M. Smeulders, takes an unusual approach to visual object tracking. Unlike conventional trackers, the method relies on a robust learned matching function and applies no model updating and no explicit occlusion detection.

Overview

The central innovation in this work is the use of a Siamese deep neural network to learn a generic and powerful matching function. Tracking then reduces to matching the target patch from the first frame against candidate patches in each new frame and returning the most similar one. Unlike traditional methods, this approach applies no model updating and does not combine multiple tracking algorithms. The tracker, named Siamese INstance search Tracker (SINT), uses the learned matching function as is, without adapting it to new, previously unseen targets.

Methodology

The proposed method involves a two-stream Siamese network, designed specifically for tracking purposes. This network is trained on a rich external dataset (ALOV), showcasing diverse appearance variations. By training on such data, the matching function becomes highly invariant to common challenging factors in a tracking context, including scale changes, occlusions, and varying illumination conditions.
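As a rough illustration of how a two-stream network can be trained on pairs of patches, the sketch below implements a margin-based contrastive loss, a common choice for Siamese networks. The function names, the `margin` value, and the use of plain Python lists as embeddings are assumptions made for illustration; the summary above does not specify the loss, so this should not be read as the authors' exact objective.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(emb_a, emb_b, same_target, margin=1.0):
    """Margin-based contrastive loss for one pair of embeddings.

    emb_a, emb_b : outputs of the two network streams (hypothetical here)
    same_target  : True if both patches show the same object
    margin       : assumed hyperparameter; dissimilar pairs farther apart
                   than this contribute no loss
    """
    d = euclidean_distance(emb_a, emb_b)
    if same_target:
        # Pull matching pairs together.
        return 0.5 * d ** 2
    # Push non-matching pairs apart until they exceed the margin.
    return 0.5 * max(0.0, margin - d) ** 2
```

Under this loss, pairs showing the same object under different scale, occlusion, or illumination are pulled together in embedding space, which is one way the invariance described above can emerge from training data.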

The network architecture is defined by its depth and the strategic use of layers. Two variations are explored: one similar to AlexNet and another inspired by VGGNet. Key to this architecture is the deliberate reduction of max pooling layers, allowing for increased localization precision. Furthermore, the architecture incorporates features from multiple layers to enhance its discriminative power.
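One simple way to combine features from multiple layers is to L2-normalize each layer's feature vector and concatenate the results, so that no single layer dominates because of its scale. The sketch below assumes this scheme and uses hypothetical helper names (`l2_normalize`, `fuse_layers`); the actual fusion used in the paper may differ in its details.

```python
import math

def l2_normalize(vec, eps=1e-12):
    """Scale a feature vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]

def fuse_layers(layer_features):
    """Concatenate per-layer features after normalizing each one.

    layer_features : list of feature vectors, one per network layer
                     (assumed to be already pooled to fixed length)
    """
    fused = []
    for feats in layer_features:
        fused.extend(l2_normalize(feats))
    return fused

# Two hypothetical layers with very different magnitudes.
fused = fuse_layers([[3.0, 4.0], [0.0, 200.0]])
```

After normalization both layers contribute vectors of unit length, so the large activations of the second layer do not swamp the first.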

Results and Performance

The authors evaluated SINT on the OTB dataset, which includes 50 sequences capturing a wide range of tracking challenges. SINT achieved state-of-the-art performance, closely matching or outperforming other leading trackers such as MUSTer and MEEM. This achievement is notable considering the simplicity of the tracking inference utilized: essentially a sophisticated nearest-neighbor search guided by the learned matching function.
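The inference step described above can be sketched as a nearest-neighbor search: embed the first-frame target once, embed every candidate box in the new frame, and return the candidate with the highest similarity. In the sketch below the embeddings are assumed to be precomputed by the learned network, the dot product stands in for the matching function, and the names `track_frame` and `dot` are hypothetical.

```python
def dot(a, b):
    """Dot-product similarity between two embedding vectors."""
    return sum(x * y for x, y in zip(a, b))

def track_frame(query_emb, candidates):
    """Pick the candidate box most similar to the first-frame target.

    query_emb  : embedding of the target patch from the first frame;
                 it is never updated during tracking
    candidates : list of (box, embedding) pairs for the current frame,
                 where box is (x, y, width, height)
    Returns (best_box, best_score).
    """
    best_box, best_score = None, float("-inf")
    for box, emb in candidates:
        score = dot(query_emb, emb)
        if score > best_score:
            best_box, best_score = box, score
    return best_box, best_score

# Hypothetical embeddings for two candidate boxes in a new frame.
query = [1.0, 0.0]
frame_candidates = [
    ((0, 0, 10, 10), [0.0, 1.0]),
    ((5, 5, 10, 10), [0.9, 0.1]),
]
box, score = track_frame(query, frame_candidates)
```

Because the query embedding stays fixed, the same comparison works in every frame, with no per-sequence state beyond the first-frame patch.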

Implications and Future Work

SINT's ability to reach state-of-the-art performance without any model updating has practical implications. A tracker of this kind avoids the complexity of online adaptation, which could be advantageous in applications where simplicity and computational efficiency are crucial.

Additionally, SINT's ability to re-identify a target after it has been absent for a complete video shot (demonstrated on complex sequences, including those from YouTube and Star Wars) suggests applications in long-term tracking scenarios. Future work may focus on further refining the matching function and on integrating it with complementary techniques such as optical flow for improved robustness.

Conclusion

"Siamese Instance Search for Tracking" provides an innovative perspective on visual tracking, prioritizing a well-trained, invariant matching function. This paper suggests valuable directions for future research, particularly in optimizing tracking systems to function effectively without cumbersome model adjustments. The advancements herein could influence ongoing developments in AI and machine learning, particularly those aimed at enhancing autonomous perception and recognition systems.
