SiamFC++: Towards Robust and Accurate Visual Tracking with Target Estimation Guidelines (1911.06188v4)

Published 14 Nov 2019 in cs.CV

Abstract: Visual tracking problem demands to efficiently perform robust classification and accurate target state estimation over a given target at the same time. Former methods have proposed various ways of target state estimation, yet few of them took the particularity of the visual tracking problem itself into consideration. After a careful analysis, we propose a set of practical guidelines of target state estimation for high-performance generic object tracker design. Following these guidelines, we design our Fully Convolutional Siamese tracker++ (SiamFC++) by introducing both classification and target state estimation branch(G1), classification score without ambiguity(G2), tracking without prior knowledge(G3), and estimation quality score(G4). Extensive analysis and ablation studies demonstrate the effectiveness of our proposed guidelines. Without bells and whistles, our SiamFC++ tracker achieves state-of-the-art performance on five challenging benchmarks(OTB2015, VOT2018, LaSOT, GOT-10k, TrackingNet), which proves both the tracking and generalization ability of the tracker. Particularly, on the large-scale TrackingNet dataset, SiamFC++ achieves a previously unseen AUC score of 75.4 while running at over 90 FPS, which is far above the real-time requirement. Code and models are available at: https://github.com/MegviiDetection/video_analyst .

Citations (752)

Summary

  • The paper introduces four key guidelines that decouple classification and state estimation to enhance tracking accuracy.
  • It designs SiamFC++ with anchor-free scoring and a quality assessment head to reduce ambiguity in target estimation.
  • Experiments on multiple benchmarks, including TrackingNet and VOT2018, validate its superior accuracy and real-time performance.

The paper "SiamFC++: Towards Robust and Accurate Visual Tracking with Target Estimation Guidelines" by Yinda Xu, Zeyu Wang, Zuoxin Li, Ye Yuan, and Gang Yu provides a systematic approach to improving visual tracking performance through the introduction of practical guidelines for target state estimation. The authors propose the Fully Convolutional Siamese tracker++ (SiamFC++), which integrates these guidelines and demonstrates state-of-the-art performance across multiple benchmarks.

Overview of Contributions

The primary contributions of this paper are threefold:

  1. Guideline Development: The authors identify four practical guidelines for designing high-performance generic object trackers:
    • G1: Decomposition of classification and state estimation tasks.
    • G2: Use of non-ambiguous scoring directly representing target confidence.
    • G3: Avoidance of prior knowledge like scale/ratio distribution to ensure generalization.
    • G4: Assessment of estimation quality using a separate score independent of classification.
  2. Tracker Design: The proposed SiamFC++ is built on Fully Convolutional Siamese networks and incorporates a classification head and a regression head aligned with these guidelines. The design eschews anchor boxes to avoid ambiguity and prior assumptions (G2 and G3), and introduces an estimation quality assessment to improve bounding box selection (G4).
  3. Empirical Validation: Extensive experiments and ablations validate the effectiveness of the proposed guidelines and demonstrate the superior performance of SiamFC++ on five challenging benchmarks: OTB2015, VOT2018, LaSOT, GOT-10k, and TrackingNet.
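As an illustrative sketch of how guidelines G2–G4 play out at inference time (not the authors' actual implementation; function and variable names here are hypothetical), an anchor-free head outputs a per-pixel classification score, a per-pixel quality score, and per-pixel box distances. The final box is decoded directly from the best-scoring pixel, with no anchor boxes or scale/ratio priors:

```python
import numpy as np

def decode_anchor_free(cls_score, quality, ltrb, stride=8, offset=0):
    """Pick the best box from per-pixel predictions (illustrative sketch).

    cls_score, quality: (H, W) arrays in [0, 1]
    ltrb: (H, W, 4) distances (left, top, right, bottom) in image pixels
    Each feature-map pixel (i, j) maps to the image point
    (offset + j * stride, offset + i * stride); no anchors are involved.
    """
    # G4: a separate quality score reweights the classification score,
    # so box selection is not driven by classification confidence alone.
    final = cls_score * quality
    i, j = np.unravel_index(np.argmax(final), final.shape)
    # G2: the score at (i, j) unambiguously rates exactly one sub-window.
    cx = offset + j * stride
    cy = offset + i * stride
    l, t, r, b = ltrb[i, j]
    # G3: the box is decoded from raw per-pixel distances, with no
    # pre-defined scale/ratio distribution baked in.
    return (cx - l, cy - t, cx + r, cy + b), final[i, j]
```

In a real tracker the `final` map would typically also be modulated by a cosine penalty window before the argmax; that step is omitted here for brevity.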

Implementation and Numerical Results

SiamFC++ leverages a Siamese-based framework for feature extraction and matching. Feature maps derived from a shared backbone are processed in parallel by the classification and regression heads, and each pixel of the resulting feature maps corresponds directly to a sub-window of the input image. Notably, the method avoids pre-defined anchor settings, thereby preventing matching ambiguity (G2) and dependence on a prior data distribution (G3).
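The Siamese matching step can be sketched as a cross-correlation of the template (exemplar) feature map over the search-region feature map. The single-channel, naive loop below is only for clarity; SiamFC-style trackers actually perform this per channel over deep feature maps (depthwise correlation), usually via an optimized convolution routine:

```python
import numpy as np

def xcorr(search, template):
    """Naive cross-correlation of a template over a search feature map.

    search: (Hs, Ws) array, template: (Ht, Wt) array.
    Returns an (Hs-Ht+1, Ws-Wt+1) response map: each output pixel is the
    similarity between the template and one sub-window of the search
    region, which is exactly the per-pixel score the heads operate on.
    """
    Hs, Ws = search.shape
    Ht, Wt = template.shape
    out = np.empty((Hs - Ht + 1, Ws - Wt + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Inner product of the template with one sub-window
            out[i, j] = np.sum(search[i:i + Ht, j:j + Wt] * template)
    return out
```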

The authors provide detailed numerical results showcasing the model's performance:

  • TrackingNet: SiamFC++ achieves an AUC score of 75.4 with over 90 FPS.
  • VOT2018: SiamFC++ attains an EAO of 0.426, with a robustness score of 0.183 (lower is better).
  • GOT-10k: With an AO of 59.5, it surpasses many contemporary methods.

The use of extensive ablation studies reveals that the combination of better data sources, improved head structures, and stronger backbones contribute significantly to the performance increments. For instance, the transition from AlexNet to GoogLeNet as the backbone resulted in notable performance improvements across different benchmarks.

Observations and Implications

The paper's proposition of guidelines grounded in the nature of visual tracking represents a structured approach to tackling common issues, such as scale variation and target estimation challenges. The elimination of anchor-based scoring reduces the propensity for false positives, which is critical for robust tracking.

The practical implications of this research extend to applications requiring real-time visual tracking, such as UAV navigation and surveillance. The high tracking accuracy combined with real-time processing capability (90+ FPS) suggests that deployment on embedded or otherwise resource-constrained systems is viable.

Future Directions

This work opens the door for further exploration into other domains within visual tracking:

  • Robustness to Environmental Variations: Investigating the effectiveness of these guidelines under drastic environmental changes or in domains with less controlled illumination.
  • Generalization to Multiple Object Tracking (MOT): Extending the non-ambiguous scoring and quality assessment frameworks to MOT scenarios.
  • Integration with Other Model Architectures: Applying the proposed guidelines to more advanced backbone architectures (e.g., EfficientNet) and hybrid models integrating attention mechanisms.

Conclusion

"SiamFC++: Towards Robust and Accurate Visual Tracking with Target Estimation Guidelines" sets a new benchmark in visual tracking by systematically addressing the challenges inherent in object state estimation. The results substantiate that the considered guidelines lead to a robust, general-purpose tracker outperforming prior methods across major benchmarks. Future work could extend these insights to even broader applications, continuing to bridge theoretical developments with practical implementations in computer vision.
