Deformable Siamese Attention Networks for Visual Object Tracking (2004.06711v2)

Published 14 Apr 2020 in cs.CV

Abstract: Siamese-based trackers have achieved excellent performance on visual object tracking. However, the target template is not updated online, and the features of the target template and search image are computed independently in a Siamese architecture. In this paper, we propose Deformable Siamese Attention Networks, referred to as SiamAttn, by introducing a new Siamese attention mechanism that computes deformable self-attention and cross-attention. The self-attention learns strong context information via spatial attention, and selectively emphasizes interdependent channel-wise features with channel attention. The cross-attention is capable of aggregating rich contextual inter-dependencies between the target template and the search image, providing an implicit manner to adaptively update the target template. In addition, we design a region refinement module that computes depth-wise cross correlations between the attentional features for more accurate tracking. We conduct experiments on six benchmarks, where our method achieves new state-of-the-art results, outperforming the strong baseline, SiamRPN++ [24], by 0.464->0.537 and 0.415->0.470 EAO on VOT 2016 and 2018. Our code is available at: https://github.com/msight-tech/research-siamattn.

Citations (294)

Summary

  • The paper introduces SiamAttn, a model that integrates deformable self- and cross-attention to dynamically refine target representations for more robust tracking.
  • It employs a region refinement module with depth-wise cross correlation, resulting in significant improvements in expected average overlap across benchmarks.
  • Experimental results reveal that SiamAttn outperforms SiamRPN++ (e.g., VOT2016 EAO increased from 0.464 to 0.537), suggesting enhanced tracking accuracy under challenging conditions.

Deformable Siamese Attention Networks for Visual Object Tracking: An Expert Overview

The paper "Deformable Siamese Attention Networks for Visual Object Tracking" introduces a noteworthy advancement in visual object tracking by integrating Siamese networks with a novel attention mechanism. The proposed architecture, dubbed SiamAttn, addresses two intrinsic constraints of conventional Siamese-based trackers: the target template is not updated online, and the features of the target template and search image are computed independently.

Core Contributions and Methodology

SiamAttn introduces a deformable Siamese attention mechanism that combines self-attention and cross-attention to produce robust, adaptive feature representations. The self-attention module captures spatial and channel-wise dependencies, enabling a richer contextual understanding. The cross-attention module, in turn, aggregates interdependencies between the target template and the search image, dynamically refining the target template and enhancing its discriminability against complex backgrounds and close distractors, which are common challenges in visual tracking tasks.
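The channel-wise gating ideas above can be sketched in a few lines of NumPy. This is a simplified, squeeze-style illustration under assumed tensor shapes, not the paper's exact formulation: SiamAttn's attention also includes spatial attention and deformable sampling, and the function names here (`channel_attention`, `cross_channel_attention`) are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat):
    """Self channel attention (simplified): global average pooling gives
    one descriptor per channel; sigmoid-gated weights re-scale channels."""
    # feat: (C, H, W)
    weights = sigmoid(feat.mean(axis=(1, 2)))          # (C,)
    return feat * weights[:, None, None]

def cross_channel_attention(template, search):
    """Cross-attention (simplified): channel statistics derived from the
    target template gate the search-image features, so template context
    flows into the search branch instead of being computed independently."""
    # template: (C, Ht, Wt), search: (C, Hs, Ws), same channel count C
    t_weights = sigmoid(template.mean(axis=(1, 2)))    # (C,)
    return search * t_weights[:, None, None]
```

Applying the same gating in the opposite direction (search statistics re-weighting the template) gives the implicit template update the abstract describes.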

The architecture also includes a region refinement module that performs depth-wise cross correlation between the attentional features to refine tracking predictions. This module is designed to operate alongside the main Siamese network, producing bounding-box predictions and segmentation masks with greater precision.
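The depth-wise cross correlation used by the refinement module can be illustrated with a minimal NumPy sketch: each channel of the template feature map acts as a convolution kernel for the matching channel of the search feature map. This naive triple loop is for clarity only; in practice such an operation runs on deep attentional features and is implemented as a grouped convolution.

```python
import numpy as np

def depthwise_xcorr(search, template):
    """Depth-wise cross-correlation: correlate each search channel with
    the corresponding template channel, yielding one response map per
    channel (no valid-region padding)."""
    # search: (C, Hs, Ws), template: (C, Ht, Wt), Ht <= Hs, Wt <= Ws
    c, hs, ws = search.shape
    _, ht, wt = template.shape
    out_h, out_w = hs - ht + 1, ws - wt + 1
    out = np.zeros((c, out_h, out_w))
    for ch in range(c):
        for i in range(out_h):
            for j in range(out_w):
                # Sliding-window inner product within a single channel
                out[ch, i, j] = np.sum(
                    search[ch, i:i + ht, j:j + wt] * template[ch]
                )
    return out
```

Because the correlation is per-channel rather than across channels, the output keeps the channel dimension, which lets downstream heads read out localization cues channel by channel.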

Experimental Validation and Results

The SiamAttn model was rigorously evaluated across six benchmarks: OTB-2015, VOT2016, VOT2018, UAV123, LaSOT, and TrackingNet. The results indicate that SiamAttn consistently outperforms existing state-of-the-art methods like SiamRPN++, particularly in expected average overlap (EAO) measured on the VOT benchmarks. For instance, on VOT2016, the proposed architecture achieved a notable EAO improvement from 0.464 in SiamRPN++ to 0.537. Such performance enhancements are attributed to the newly introduced attention mechanisms which provide significant robustness against variations in object appearance and challenging tracking scenarios.

Implications and Future Prospects

The introduction of a deformable Siamese attention network in visual object tracking deepens the understanding and application of attention mechanisms in tracking contexts, suggesting a shift toward more adaptive and contextually aware models. The improved robustness in challenging scenarios indicates potential applications in real-world tasks such as autonomous driving and human-computer interactions, where real-time and precise tracking is imperative.

Future research directions would plausibly involve expanding the adaptability of the SiamAttn framework to more diverse environments and further optimizing the computational efficiency. As the field progresses, it's reasonable to anticipate more refined models that seamlessly integrate such attention mechanisms with real-time applications, potentially influencing adjacent domains such as augmented reality and robotics.

In conclusion, the proposed SiamAttn model emerges as a significant contribution to the field of visual object tracking, offering both practical enhancements in performance and theoretical advancements in network design, opening avenues for further exploration and application in adaptive tracking methods.