- The paper shows that strict translation invariance is crucial in Siamese trackers, and that the padding in very deep networks breaks it; a spatial-aware sampling strategy is proposed to compensate.
- The study introduces architectural enhancements, layer-wise feature aggregation and depth-wise cross-correlation, that boost both accuracy and efficiency.
- Empirical results on benchmarks such as VOT2018 and OTB2015 confirm that the tracker achieves state-of-the-art accuracy at real-time speeds.
SiamRPN++: Evolution of Siamese Visual Tracking with Very Deep Networks
The research paper, "SiamRPN++: Evolution of Siamese Visual Tracking with Very Deep Networks," addresses a central challenge in visual tracking: leveraging the feature extraction power of very deep networks while preserving the strict translation invariance that Siamese network-based trackers rely on. The work combines theoretical analysis with empirical validation to close the performance gap between conventional Siamese trackers and state-of-the-art algorithms.
Among its significant contributions, the paper identifies the root cause of the inefficiency of existing Siamese trackers, namely their inability to benefit from deep backbones like ResNet-50: the padding in deeper networks destroys strict translation invariance. The authors propose a spatial-aware sampling strategy to remove this bottleneck and introduce several architectural advances that improve both computational efficiency and tracking accuracy.
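To make the sampling strategy concrete, a minimal training-time sketch is given below. This is illustrative rather than the authors' code: `spatial_aware_sample`, `shift_max`, and the uniform jitter are assumptions, though the paper does report experiments with shift ranges of 0, 16, 32, and 64 pixels.

```python
import random

def spatial_aware_sample(center_xy, shift_max=64):
    """Jitter the target away from the exact center of the training
    search region. Without this, padding-induced loss of strict
    translation invariance lets the network learn a center bias.
    shift_max is a hypothetical knob; the paper studies shift
    ranges of 0, 16, 32, and 64 pixels."""
    cx, cy = center_xy
    dx = random.randint(-shift_max, shift_max)
    dy = random.randint(-shift_max, shift_max)
    return cx + dx, cy + dy

# Example: crop the training search region around the shifted center.
cx, cy = spatial_aware_sample((255 // 2, 255 // 2))
```

Larger shift ranges spread the positive samples over the whole response map, which is what allows a padded deep backbone to be trained without inheriting a degenerate center prior.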
Core Contributions
- Deep Analysis of Translation Invariance: The paper examines the strict translation invariance inherent in Siamese networks and shows that padding in deeper networks destroys this property, which is pivotal for localization accuracy. The authors validate their hypothesis through simulation experiments demonstrating that, without countermeasures, the learned model develops a strong positional bias toward the image center, and that spatial-aware sampling during training eliminates this bias.
- Architectural Enhancements: The paper outlines two main architectural enhancements:
- Layer-wise Feature Aggregation: This leverages features from multiple layers of a deep network. The paper argues that earlier layers capture low-level spatial detail crucial for precise localization, while deeper layers encode high-level semantics valuable for distinguishing object appearance under challenging conditions.
- Depth-wise Cross-Correlation (DW-XCorr): The up-channel cross-correlation used in earlier Siamese trackers is replaced by a depth-wise correlation that matches template and search features channel by channel. This reduces parameter redundancy, stabilizes training by balancing the two branches of the network, and improves computational efficiency without sacrificing accuracy (a minimal sketch of both operations follows this list).
- Empirical Validation: The authors report results across several standard benchmarks, including OTB2015, VOT2018, UAV123, LaSOT, and TrackingNet. The SiamRPN++ tracker achieves top results on metrics such as Expected Average Overlap (EAO) and tracking accuracy, clearly demonstrating the efficacy of the proposed method.
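To make the two enhancements concrete, the sketch below shows a minimal PyTorch implementation of depth-wise cross-correlation and a softmax-weighted layer fusion. The shapes (7×7 template, 31×31 search features) follow the paper's setup, but the fusion weights and the overall wiring are illustrative assumptions; the full model also includes adjust layers and RPN heads.

```python
import torch
import torch.nn.functional as F

def depthwise_xcorr(search: torch.Tensor, kernel: torch.Tensor) -> torch.Tensor:
    """Depth-wise cross-correlation (DW-XCorr): each template channel is
    correlated only with the matching channel of the search features."""
    b, c, hs, ws = search.shape
    hk, wk = kernel.shape[2:]
    # Fold the batch into the channel axis so one grouped convolution
    # performs all per-sample, per-channel correlations at once.
    search = search.reshape(1, b * c, hs, ws)
    kernel = kernel.reshape(b * c, 1, hk, wk)
    out = F.conv2d(search, kernel, groups=b * c)
    return out.reshape(b, c, out.shape[2], out.shape[3])

def layerwise_fusion(responses, weights):
    """Layer-wise aggregation: combine response maps from several backbone
    stages using softmax-normalized (learnable) weights."""
    w = torch.softmax(weights, dim=0)
    return sum(wi * r for wi, r in zip(w, responses))

# Illustrative shapes, following the paper's 7x7 / 31x31 feature maps.
template = torch.randn(2, 256, 7, 7)      # exemplar-branch features
search = torch.randn(2, 256, 31, 31)      # search-branch features
resp = depthwise_xcorr(search, template)  # -> (2, 256, 25, 25)
fused = layerwise_fusion([resp, resp, resp], torch.zeros(3))
print(resp.shape, fused.shape)
```

Because the grouped convolution keeps one filter per channel, the response retains the full channel dimension; the parameter-heavy up-channel convolution of the original SiamRPN is avoided, which is where the training-stability and efficiency gains come from.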
Numerical Results and Benchmarks
The experimental results highlight the robustness and efficiency of the SiamRPN++ tracker. On VOT2018, it achieves an EAO of 0.414, surpassing state-of-the-art trackers such as LADCF (0.389) and DaSiamRPN (0.326). On OTB2015, it attains a success (AUC) score of 0.696, a significant improvement over its predecessors. These metrics illustrate the practical applicability of the proposed enhancements. Furthermore, analysis of computational efficiency shows that SiamRPN++ maintains real-time performance, running at 35 FPS on a GPU, with faster variants reaching up to 70 FPS.
Practical and Theoretical Implications
From a practical standpoint, the SiamRPN++ tracker sets a new benchmark for real-time, highly accurate visual tracking. It enables applications where both accuracy and speed are critical, such as autonomous vehicles, UAVs, and real-time surveillance systems.
Theoretically, the contributions extend the use of very deep networks in visual tracking and demonstrate that the strict translation-invariance constraint can be relaxed when an adequate compensation mechanism, here spatial-aware sampling, is applied during training. The introduction of DW-XCorr also suggests that cross-correlation layers may generalize beyond basic tracking to more complex vision tasks, presenting an interesting avenue for future research.
Speculations on Future Developments
In light of the advancements presented, future developments could explore:
- Integration with other deep learning paradigms, such as combining SiamRPN++ with reinforcement learning to dynamically adapt to changing target appearances.
- Exploration of more efficient network architectures tailored for mobile or edge devices.
- Extending the spatial-aware sampling strategy to handle rotation and scale variation for more robust tracking under perspective changes.
Conclusion
SiamRPN++ marks a crucial stride in unlocking the full potential of very deep networks for visual tracking. By addressing the breakdown of translation invariance and proposing substantive architectural innovations, the paper contributes significantly to the field. Its broader impact is evident not only in its empirical performance but also in the groundwork it lays for future research that extends deep learning methodologies in visual tracking and beyond.