- The paper introduces SiamVGG, a novel architecture that integrates a modified VGG-16 backbone into a Siamese network to enhance feature discrimination.
- It achieves robust real-time performance, reporting strong results on the OTB benchmarks and a real-time Expected Average Overlap (EAO) of 0.275 on VOT2017 at 33.15 FPS.
- The study demonstrates a practical balance between deep semantic feature extraction and computational efficiency, ideal for embedded tracking systems.
SiamVGG: Visual Tracking using Deeper Siamese Networks
The paper "SiamVGG: Visual Tracking using Deeper Siamese Networks" presents a novel approach to visual object tracking, leveraging advances in deep neural networks (DNNs) and Siamese network architectures. Visual object tracking is an integral component of systems requiring real-time analysis and decision-making, such as surveillance, UAVs, and autonomous vehicles. The core challenge lies in achieving high precision and real-time performance within tight computational constraints.
Summary of Contributions
SiamVGG incorporates a modified VGG-16 network as the backbone of a Siamese network, addressing limitations of earlier frameworks such as SiamFC, which relied on the shallower AlexNet. The modified VGG-16 is intended to improve feature discrimination, and thereby tracking accuracy, without sacrificing computational efficiency. The paper evaluates SiamVGG on several established benchmarks (OTB-2013, OTB-50, OTB-100, and the VOT datasets) and reports superior performance on accuracy metrics such as Expected Average Overlap (EAO) while sustaining the frames per second (FPS) required for real-time applications.
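One reason padding matters in this setting can be seen from standard convolution arithmetic: "same"-padded layers preserve spatial size but inject artificial zeros at the borders, while unpadded layers trim the map instead. A minimal sketch (illustrative layer counts and sizes, not the paper's exact configuration):

```python
def conv_out_size(in_size, kernel, stride=1, padding=0):
    """Spatial size of a conv output: floor((in + 2*pad - kernel)/stride) + 1."""
    return (in_size + 2 * padding - kernel) // stride + 1

# With 'same' padding (pad=1 for 3x3 kernels) the map never shrinks,
# but border positions are computed over padded zeros:
size = 127
for _ in range(5):
    size = conv_out_size(size, kernel=3, padding=1)
print(size)  # 127

# Without padding, each 3x3 conv trims a 1-pixel border, keeping every
# feature grounded in real image content at the cost of a smaller map:
size = 127
for _ in range(5):
    size = conv_out_size(size, kernel=3, padding=0)
print(size)  # 117
```

The smaller, artifact-free maps are what the paper leans on to keep correlation responses clean.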
Methodology Overview
The crux of the SiamVGG approach is the integration of VGG-16, a deeper architecture with strong transfer-learning performance in other tasks, into the Siamese network structure used for tracking. By removing padding operations, which are known to introduce boundary artifacts into feature maps, SiamVGG keeps its feature maps compact, improving tracking precision while reducing computation. The network is trained end-to-end with a SoftMargin loss on the ILSVRC and YouTube-BB datasets, ensuring robust learning across diverse visual scenarios.
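The Siamese matching step itself (inherited in spirit from SiamFC) reduces to a dense cross-correlation between the exemplar's feature map and the search region's feature map, producing a response map whose peak indicates the target location. A minimal NumPy sketch, with illustrative tensor shapes that are not taken from the paper:

```python
import numpy as np

def xcorr_response(template_feat, search_feat):
    """Dense cross-correlation: slide the template feature map over the
    search feature map, summing channel-wise products at each offset."""
    c, th, tw = template_feat.shape
    _, sh, sw = search_feat.shape
    out_h, out_w = sh - th + 1, sw - tw + 1
    response = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = search_feat[:, i:i + th, j:j + tw]
            response[i, j] = np.sum(window * template_feat)
    return response

rng = np.random.default_rng(0)
z = rng.standard_normal((8, 6, 6))    # exemplar (template) features
x = rng.standard_normal((8, 22, 22))  # search-region features
r = xcorr_response(z, x)
print(r.shape)  # (17, 17)
```

In practice this loop is implemented as a single convolution with the template as the kernel; the sketch only makes the sliding-window semantics explicit.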
Results and Performance Analysis
The paper provides compelling evidence that SiamVGG achieves top-tier performance across tracking benchmarks. On the OTB datasets, SiamVGG ranks among the best trackers by AUC in OPE success plots, indicating robust tracking precision. On the VOT2017 real-time challenge, it establishes itself as a leading competitor with a real-time EAO of 0.275 at 33.15 FPS, outperforming other Siamese network-based trackers by substantial margins on both overlap and failure metrics.
Implications for Theory and Practice
From a theoretical standpoint, SiamVGG's utilization of deeper network architectures challenges prior assumptions regarding the sufficiency of simpler architectures like AlexNet for feature discrimination in tracking. Practically, SiamVGG's design adeptly balances the need for sophisticated semantic feature extraction with the necessity for reduced computation time, making it suitable for deployment in embedded systems and edge computing devices.
Future Directions
Future work might explore further optimization of SiamVGG for lighter-weight deployment, or extend the architecture's robustness to varied illumination and occlusion conditions by integrating dynamic adjustments into the training phase. Further exploration of parallelizing the SiamVGG pipeline and adapting it to novel hardware accelerators could push its real-time capabilities even further.
In conclusion, SiamVGG marks a significant advance in visual tracking methodology, improving accuracy while remaining feasible for real-time applications, and demonstrating the potential of deeper networks for efficient, real-time semantic analysis.