
Deeper and Wider Siamese Networks for Real-Time Visual Tracking (1901.01660v3)

Published 7 Jan 2019 in cs.CV

Abstract: Siamese networks have drawn great attention in visual tracking because of their balanced accuracy and speed. However, the backbone networks used in Siamese trackers are relatively shallow, such as AlexNet [18], which does not fully take advantage of the capability of modern deep neural networks. In this paper, we investigate how to leverage deeper and wider convolutional neural networks to enhance tracking robustness and accuracy. We observe that direct replacement of backbones with existing powerful architectures, such as ResNet [14] and Inception [33], does not bring improvements. The main reasons are that 1) large increases in the receptive field of neurons lead to reduced feature discriminability and localization precision; and 2) the network padding for convolutions induces a positional bias in learning. To address these issues, we propose new residual modules to eliminate the negative impact of padding, and further design new architectures using these modules with controlled receptive field size and network stride. The designed architectures are lightweight and guarantee real-time tracking speed when applied to SiamFC [2] and SiamRPN [20]. Experiments show that solely due to the proposed network architectures, our SiamFC+ and SiamRPN+ obtain up to 9.8%/5.7% (AUC), 23.3%/8.8% (EAO) and 24.4%/25.0% (EAO) relative improvements over the original versions [2, 20] on the OTB-15, VOT-16 and VOT-17 datasets, respectively.

Citations (810)

Summary

  • The paper introduces Cropping-Inside Residual (CIR) units to mitigate padding effects and improve spatial localization in deep tracking networks.
  • It designs tailored deeper and wider architectures (CIResNet, CIResInception, CIResNeXt) that optimize receptive field and stride, boosting efficiency.
  • Experimental results on OTB and VOT-15/16/17 show up to a 25.0% relative EAO improvement, demonstrating enhanced tracking accuracy and real-time performance.

Deeper and Wider Siamese Networks for Real-Time Visual Tracking

This paper presents a systematic study and subsequent redesign of deeper and wider convolutional neural networks (CNNs) for Siamese trackers in real-time visual tracking. The primary focus is on enhancing tracking accuracy and robustness by addressing the limitations of the shallow backbones currently employed, such as AlexNet.

Key Insights and Contributions

  1. Critical Analysis of Deeper and Wider Networks:
    • The authors analyzed the effects of deeper and wider networks like ResNet and Inception on visual tracking tasks.
    • Experimentally, they observed that naive substitution of these architectures led to degraded tracking performance due to large receptive fields and positional biases introduced by padding in convolutions.
  2. Introduction of Cropping-Inside Residual (CIR) Units:
    • To counteract the downsides of large receptive fields and padding-induced biases, the paper introduces novel CIR units.
    • CIR units are designed by incorporating cropping operations within residual blocks to eliminate padding effects and limit the receptive field.
    • This innovation maintains feature map fidelity by removing padding-affected features, ensuring more precise feature embeddings and spatial localization.
  3. Design of Deeper/Wider Network Architectures:
    • The paper details the construction of various deeper (CIResNet series) and wider networks (CIResInception and CIResNeXt) tailored for Siamese trackers.
    • These architectures carefully monitor and control network stride, receptive field sizes, and feature sizes to optimize both computational efficiency and tracking accuracy.
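The cropping idea behind the CIR units above can be sketched numerically: run a padded convolution inside a residual block, add the identity branch, then crop the border pixels whose values were influenced by zero padding. The averaging kernel and feature-map sizes below are illustrative stand-ins, not the paper's trained layers.

```python
import numpy as np

def conv3x3_same(x):
    """Naive 3x3 averaging 'same' convolution with zero padding,
    standing in for the padded conv layers inside a residual block."""
    p = np.pad(x, 1)                      # zero padding preserves spatial size
    out = np.zeros_like(x, dtype=float)
    h, w = x.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = p[i:i + 3, j:j + 3].mean()
    return out

def cir_unit(x):
    """Cropping-Inside Residual unit (sketch): residual addition
    followed by cropping the padding-affected outer ring of pixels."""
    residual = conv3x3_same(x)            # padded convolution branch
    out = x + residual                    # identity + residual
    return out[1:-1, 1:-1]                # crop features touched by zero padding

x = np.ones((8, 8))
y = cir_unit(x)                           # spatial size shrinks from 8x8 to 6x6
```

On this constant input, every surviving pixel equals 2.0: the cropped ring is exactly the set of positions whose convolution window overlapped the zero padding, which is the positional bias the paper aims to remove.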

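The receptive-field and stride bookkeeping mentioned in point 3 follows the standard recursion for stacked convolutions: a layer with kernel size k grows the receptive field by (k - 1) times the accumulated stride. A minimal sketch, using a hypothetical layer list rather than an actual CIResNet configuration:

```python
def receptive_field(layers):
    """Accumulate receptive field size and total stride for a chain
    of (kernel_size, stride) convolution/pooling layers."""
    rf, jump = 1, 1          # rf: receptive field; jump: accumulated stride
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf, jump

# hypothetical four-layer stem: (kernel, stride) per layer
rf, total_stride = receptive_field([(7, 2), (3, 2), (3, 1), (3, 1)])
# → rf = 27, total_stride = 4
```

Tracking designs like those in the paper keep the total stride small (4-8) and the receptive field moderate relative to the exemplar size; this recursion makes those constraints easy to check when composing blocks.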
Experimental Validation

  1. Application to Siamese Trackers:
    • Implementing these network architectures in the SiamFC and SiamRPN frameworks showed considerable performance gains.
    • In experiments on the OTB, VOT-15, VOT-16, and VOT-17 datasets, the enhanced architectures outperformed the AlexNet baselines by significant margins, with up to a 9.8% relative improvement in AUC and 25.0% in EAO.
  2. Ablation Studies:
    • An extensive ablation analysis validated the essential role of CIR units and appropriate receptive field sizes.
    • Shifting from traditional residual units to CIR units and optimizing the network stride yielded consistent performance improvements across different network configurations.

Performance Metrics and Practical Implications

  • The results showed that the newly designed deeper and wider networks significantly boost the tracking performance without compromising real-time processing capabilities.
  • For instance, the enhanced SiamRPN+ built on a CIResNet-22 backbone achieved a 25.0% relative EAO improvement on the VOT-17 dataset over the original SiamRPN. These advancements suggest that such refined network designs can serve practical applications in surveillance, robotics, and human-computer interaction, where real-time decision-making is crucial.
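The "relative improvement" figures quoted throughout are ratios of the metric gain to the baseline value. A one-line helper makes the arithmetic explicit (the EAO numbers below are chosen for illustration, not taken from the paper's tables):

```python
def relative_improvement(baseline, new):
    """Relative gain of a metric over its baseline, as a fraction."""
    return (new - baseline) / baseline

# e.g. a hypothetical baseline EAO of 0.24 improving to 0.30
gain = relative_improvement(0.24, 0.30)   # 0.25, i.e. a 25% relative gain
```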

Theoretical and Future Perspectives

  • The paper emphasizes the importance of designing network architectures tailored to the specific needs of visual tracking. It bridges the gap between purely classification-oriented deep networks and tracking-specific requirements.
  • Future research directions suggested by the paper include further optimization of network architecture parameters and exploration of hybrid models that combine deep feature extraction with real-time processing.

This paper's systematic approach to understanding and addressing the limitations of Siamese networks in visual tracking sets a significant precedent for how deep learning architectures might be custom-tailored for specific high-performance applications while maintaining operational efficiency.