
Deep Learning for Visual Tracking: A Comprehensive Survey (1912.00535v2)

Published 2 Dec 2019 in cs.CV, cs.LG, and eess.IV

Abstract: Visual target tracking is one of the most sought-after yet challenging research topics in computer vision. Given the ill-posed nature of the problem and its popularity in a broad range of real-world scenarios, a number of large-scale benchmark datasets have been established, on which considerable methods have been developed and demonstrated with significant progress in recent years -- predominantly by recent deep learning (DL)-based methods. This survey aims to systematically investigate the current DL-based visual tracking methods, benchmark datasets, and evaluation metrics. It also extensively evaluates and analyzes the leading visual tracking methods. First, the fundamental characteristics, primary motivations, and contributions of DL-based methods are summarized from nine key aspects of: network architecture, network exploitation, network training for visual tracking, network objective, network output, exploitation of correlation filter advantages, aerial-view tracking, long-term tracking, and online tracking. Second, popular visual tracking benchmarks and their respective properties are compared, and their evaluation metrics are summarized. Third, the state-of-the-art DL-based methods are comprehensively examined on a set of well-established benchmarks of OTB2013, OTB2015, VOT2018, LaSOT, UAV123, UAVDT, and VisDrone2019. Finally, by conducting critical analyses of these state-of-the-art trackers quantitatively and qualitatively, their pros and cons under various common scenarios are investigated. It may serve as a gentle use guide for practitioners to weigh when and under what conditions to choose which method(s). It also facilitates a discussion on ongoing issues and sheds light on promising research directions.

Citations (263)

Summary

  • The paper presents a comprehensive survey that analyzes deep learning approaches for visual tracking through nine key aspects and over 200 benchmarked trackers.
  • It systematically compares methods, highlighting Siamese-based networks and hybrid offline-online training across datasets like OTB2015 and VOT2018.
  • The findings offer practical guidelines for selecting optimal trackers and suggest future directions in custom architectures, meta-learning, and few-shot strategies.

Deep Learning for Visual Tracking: An Analysis

The paper, "Deep Learning for Visual Tracking: A Comprehensive Survey," presents a detailed examination of deep learning (DL)-based approaches to visual target tracking, one of the most active research topics in computer vision. The paper stands out by offering a systematic, thorough investigation of state-of-the-art methodologies, benchmark datasets, and evaluation metrics, together with a critical analysis of leading trackers. The discussion is organized around nine key aspects of DL-based visual tracking methods: network architecture, network exploitation, network training for visual tracking, network objective, network output, exploitation of correlation filter advantages, aerial-view tracking, long-term tracking, and online tracking.

Strong Numerical Results and Bold Findings

The survey meticulously evaluates a broad spectrum of DL-based visual tracking methods. Specifically, it compares over 200 state-of-the-art visual trackers on benchmark datasets including OTB2013, OTB2015, VOT2018, LaSOT, UAV123, UAVDT, and VisDrone2019. These evaluations are reported using quantitative metrics such as precision, success rate, and failure count, highlighting the top-performing methods along each axis.
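The precision and success metrics used by these benchmarks have simple definitions: precision is the fraction of frames whose predicted-vs-ground-truth center distance falls below a pixel threshold (commonly 20 px), and success is the fraction of frames whose bounding-box overlap (IoU) exceeds a threshold (commonly 0.5). A minimal sketch, assuming boxes stored as `(x, y, w, h)` arrays:

```python
import numpy as np

def center_error(pred, gt):
    """Euclidean distance between predicted and ground-truth box centers.
    Boxes are (N, 4) arrays in (x, y, w, h) format."""
    pc = pred[:, :2] + pred[:, 2:] / 2.0
    gc = gt[:, :2] + gt[:, 2:] / 2.0
    return np.linalg.norm(pc - gc, axis=1)

def iou(pred, gt):
    """Intersection-over-union for (x, y, w, h) boxes."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / union

def precision_at(pred, gt, thresh=20.0):
    """Fraction of frames with center error at or below `thresh` pixels."""
    return float(np.mean(center_error(pred, gt) <= thresh))

def success_at(pred, gt, thresh=0.5):
    """Fraction of frames with IoU above `thresh`."""
    return float(np.mean(iou(pred, gt) > thresh))
```

Sweeping the threshold and plotting the resulting fractions yields the familiar precision and success plots from which benchmark rankings (e.g. area under the success curve) are derived.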

The examination reveals that Siamese-based networks currently hold prominence due to their robust performance and computational efficiency, particularly in real-time applications. However, the authors argue that carefully combining offline and online training, with deep networks specifically optimized for visual tracking, significantly enhances tracking performance.
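The efficiency of Siamese trackers comes from reducing tracking to a single similarity computation: a template embedding is cross-correlated with the search-region embedding, and the peak of the response map locates the target. A minimal sketch of that correlation step, assuming `(channels, H, W)` feature maps as a fully-convolutional backbone would produce (real trackers implement this as a batched convolution on GPU):

```python
import numpy as np

def cross_correlation(search_feat, template_feat):
    """Slide the template embedding over the search-region embedding and
    return a response map; the argmax locates the target."""
    c, th, tw = template_feat.shape
    _, sh, sw = search_feat.shape
    out = np.empty((sh - th + 1, sw - tw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Inner product between the template and the current window.
            window = search_feat[:, i:i + th, j:j + tw]
            out[i, j] = np.sum(window * template_feat)
    return out
```

Because the template branch is computed once per sequence and never updated in the basic scheme, the per-frame cost is a single forward pass plus this correlation, which is what makes real-time rates achievable.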

Implications for Research and Future Investigations

The implications of this research extend both practically and theoretically. Practically, it provides concrete guidance for selecting DL-based visual tracking methods according to specific application requirements, highlighting the strengths and limitations of each method under varying conditions. Methods such as ASRCF, UPDT, DRT, and DeepSTRCF show that, despite relying on pre-trained models for feature extraction, well-engineered DCF-based trackers remain highly competitive.

Theoretically, the paper prompts further exploration into adapting deep networks for precise target model updates, a crucial step for managing occlusion and significant appearance variations. Additionally, it suggests exploiting richer representations beyond standard semantic feature maps, indicating an ongoing need to better leverage temporal, contextual, and auxiliary feature spaces.
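One widely used baseline for the model-update problem mentioned above is an exponential moving average of the target template, gated on tracker confidence so the model does not drift toward an occluder. A minimal sketch (the `conf_thresh` gating value and learning rate here are illustrative assumptions, not values from the paper):

```python
import numpy as np

def update_template(template, observation, confidence,
                    lr=0.01, conf_thresh=0.5):
    """Exponential-moving-average template update, skipped when the
    response peak is weak (a crude occlusion signal). A small `lr`
    keeps the model stable at the cost of slower adaptation."""
    if confidence < conf_thresh:
        return template  # likely occluded: freeze the model
    return (1.0 - lr) * template + lr * observation
```

The paper's point is precisely that such fixed-rate heuristics are fragile, motivating learned update strategies that decide when and how strongly to adapt the target model.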

Speculation on Future Directions in AI

The paper notes several potential directions for future work; chief among them is the need for custom architectures that jointly optimize robustness, accuracy, and computational efficiency. It also encourages exploration of meta-learning and few-shot learning strategies in visual tracking, which could let networks adapt quickly to dynamic and unseen tracking scenarios.

Furthermore, addressing real-world aerial-view and long-term tracking challenges presents an intriguing avenue. Real-world conditions demand that trackers not only be robust to visual distractors and appearance changes but also efficiently handle re-detection and the broader spatial contexts typical of aerial imagery.

In conclusion, the survey sets a foundational benchmark for ongoing and future research within the field of DL-based visual tracking. It sheds light on existing challenges and highlights promising areas for innovation in developing adaptive, robust, and real-time capable tracking solutions. Researchers and practitioners in this field are well-positioned to build upon this comprehensive examination to advance the capabilities of visual trackers.