
Siamese Box Adaptive Network for Visual Tracking (2003.06761v2)

Published 15 Mar 2020 in cs.CV

Abstract: Most of the existing trackers usually rely on either a multi-scale searching scheme or pre-defined anchor boxes to accurately estimate the scale and aspect ratio of a target. Unfortunately, they typically call for tedious and heuristic configurations. To address this issue, we propose a simple yet effective visual tracking framework (named Siamese Box Adaptive Network, SiamBAN) by exploiting the expressive power of the fully convolutional network (FCN). SiamBAN views the visual tracking problem as a parallel classification and regression problem, and thus directly classifies objects and regresses their bounding boxes in a unified FCN. The no-prior box design avoids hyper-parameters associated with the candidate boxes, making SiamBAN more flexible and general. Extensive experiments on visual tracking benchmarks including VOT2018, VOT2019, OTB100, NFS, UAV123, and LaSOT demonstrate that SiamBAN achieves state-of-the-art performance and runs at 40 FPS, confirming its effectiveness and efficiency. The code will be available at https://github.com/hqucv/siamban.

Citations (621)

Summary

  • The paper presents SiamBAN, an anchor-free framework unifying classification and regression for streamlined visual tracking.
  • It achieves state-of-the-art accuracy on benchmarks like VOT2018, VOT2019, and OTB100 while operating at 40 FPS.
  • The approach leverages end-to-end training with a modified ResNet-50 backbone and atrous convolution for enhanced spatial feature extraction.

Siamese Box Adaptive Network for Visual Tracking: A Detailed Overview

The paper discusses a novel visual tracking framework known as the Siamese Box Adaptive Network (SiamBAN), which leverages the unified classification and regression capabilities of fully convolutional networks (FCNs) to enhance visual tracking performance. Traditional tracking methods often rely on multi-scale search schemes or predefined anchor boxes, but these approaches entail significant computational overhead and require heuristic configuration. SiamBAN circumvents these limitations by adopting an anchor-free design, thus providing a more streamlined and flexible solution.

Core Contributions

The authors delineate three key contributions of their work:

  1. Design of SiamBAN: The proposed framework eliminates the need for prior anchor boxes, simplifying the parameter space and enabling more general applications. This is achieved through an innovative use of FCNs that integrates classification and regression into a singular, cohesive task.
  2. Framework Efficiency: The architecture achieves state-of-the-art tracking results, demonstrated through rigorous testing across multiple benchmarks such as VOT2018, VOT2019, and OTB100, among others. Additionally, SiamBAN operates at a robust 40 FPS, highlighting its efficiency for real-time deployment.
  3. End-to-End Training: SiamBAN can be trained in a fully end-to-end manner on large-scale, annotated datasets, facilitating deep convolutional neural networks to optimize effectively for visual tracking tasks.

Technical Overview

SiamBAN's framework is constructed on a Siamese network backbone, consisting of two identical branches that process the template patch and the search patch. The backbone is a ResNet-50 modified with atrous (dilated) convolution, which enlarges the receptive field and preserves the spatial resolution of deep features without adding parameters.
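The effect of atrous convolution can be illustrated with a minimal NumPy sketch (not the paper's implementation, which modifies ResNet-50 stages): dilation inserts gaps between kernel taps, so a 3x3 kernel covers a 5x5 region at dilation 2 while the number of weights stays the same.

```python
import numpy as np

def dilated_conv2d(x, kernel, dilation=1):
    """Valid-mode 2-D convolution with a dilated (atrous) kernel.

    Dilation inserts (dilation - 1) zeros between kernel taps, enlarging
    the receptive field without extra parameters or coarser output.
    """
    kh, kw = kernel.shape
    # Effective kernel extent after dilation.
    eh = (kh - 1) * dilation + 1
    ew = (kw - 1) * dilation + 1
    H, W = x.shape
    out = np.zeros((H - eh + 1, W - ew + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Sample the input with stride `dilation` inside the window.
            patch = x[i:i + eh:dilation, j:j + ew:dilation]
            out[i, j] = np.sum(patch * kernel)
    return out
```

With a 3x3 kernel on a 7x7 input, dilation 1 yields a 5x5 output while dilation 2 yields 3x3, reflecting the larger 5x5 effective footprint of each kernel application.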

The innovative box adaptive head comprises a dual-module approach: one for classification and another for regression. This method uses depth-wise cross-correlation to perform dense predictions directly from the convolutional features, thus transforming the tracking problem into a classification-regression problem without the dependency on candidate boxes.
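As a rough sketch of the depth-wise cross-correlation step (shapes and naming are illustrative, not the authors' code), each channel of the template feature map acts as a correlation kernel on the matching channel of the search feature map, producing a multi-channel response map that the classification and regression modules then consume:

```python
import numpy as np

def depthwise_xcorr(search, template):
    """Per-channel (depth-wise) cross-correlation.

    search:   (C, Hs, Ws) feature map of the search region
    template: (C, Ht, Wt) feature map of the template patch
    Returns a (C, Hs - Ht + 1, Ws - Wt + 1) response map where channel c
    is the correlation of search[c] with template[c].
    """
    C, Hs, Ws = search.shape
    _, Ht, Wt = template.shape
    Ho, Wo = Hs - Ht + 1, Ws - Wt + 1
    out = np.zeros((C, Ho, Wo))
    for c in range(C):
        for i in range(Ho):
            for j in range(Wo):
                out[c, i, j] = np.sum(search[c, i:i + Ht, j:j + Wt] * template[c])
    return out
```

Because channels are never mixed, the operation is far cheaper than a full cross-correlation and keeps channel-wise semantics intact for the downstream heads.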

The paper also introduces a novel approach to sample label assignment, enhancing the precision of target boundary predictions. Instead of relying on simple geometric shapes, a more nuanced method involving ellipses is used to demarcate training samples, thereby improving the framework's efficacy in distinguishing between foreground and background.
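The ellipse-based assignment can be sketched as follows. This is a hedged illustration: the ratios `r1` and `r2` below (outer and inner ellipse axes as fractions of the box size) are assumed values for demonstration, and the point/box conventions are simplified relative to the paper's implementation. Points inside the inner ellipse are positives, points outside the outer ellipse are negatives, and points in between are ignored during training.

```python
import numpy as np

def ellipse_labels(points, box, r1=0.5, r2=0.25):
    """Assign classification labels using two concentric ellipses.

    points: (N, 2) array of (x, y) sample locations in image coordinates.
    box:    ground-truth target box as (cx, cy, w, h).
    Returns an (N,) label array: 1 = positive (inside the inner ellipse,
    axes r2*w and r2*h), 0 = negative (outside the outer ellipse, axes
    r1*w and r1*h), -1 = ignored (between the two ellipses).
    """
    cx, cy, w, h = box
    dx = points[:, 0] - cx
    dy = points[:, 1] - cy
    inner = (dx / (r2 * w)) ** 2 + (dy / (r2 * h)) ** 2
    outer = (dx / (r1 * w)) ** 2 + (dy / (r1 * h)) ** 2
    labels = np.full(len(points), -1)
    labels[outer > 1] = 0   # clearly background
    labels[inner <= 1] = 1  # clearly foreground
    return labels
```

Compared with a plain rectangular assignment, the ellipses exclude box corners (which are often background for non-rectangular targets) from the positive set, reducing label noise near the target boundary.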

Experimental Results

SiamBAN's performance was evaluated against prominent state-of-the-art trackers. On the VOT2018 benchmark, it attained the highest Expected Average Overlap (EAO) and demonstrated competitive robustness and accuracy. Similarly, on VOT2019, it outperformed existing real-time trackers in EAO and accuracy.

In the OTB100 benchmark, SiamBAN achieved competitive results akin to trackers deploying more complex designs. Furthermore, the evaluation on UAV123 confirmed SiamBAN's suitability for real-time UAV applications due to its high precision and adaptability across dynamic scenarios.

Theoretical and Practical Implications

This work signifies a step forward in simplifying visual tracking frameworks by adopting an anchor-free methodology. The resultant reduction in hyper-parameter tuning and computational complexity is noteworthy. The adaptable architecture suggests several potential directions for future research, particularly in enhancing model robustness to occlusions and variable appearance changes in targets. Additionally, the insights garnered can inform the development of more efficient models applicable across diverse computer vision tasks.

Future Directions

The proposed SiamBAN opens pathways for further exploration into anchor-free frameworks within AI and computer vision. Future research may involve refining the feature extraction components to better handle environmental variability and integrating advanced prediction mechanisms that can dynamically adjust to real-time data streams.

In conclusion, the SiamBAN framework offers a promising direction for advancing robust and efficient visual tracking, contributing valuable insights into end-to-end, anchor-free network design.
