- The paper introduces a novel Transformer encoder-decoder architecture that captures global contextual dependencies, enhancing template-based visual tracking.
- The design features shape-agnostic anchoring in classification and regression, moving beyond traditional, shape-dependent methods.
- Evaluations on multiple benchmarks, including VOT and LaSOT, demonstrate accuracy and robustness competitive with state-of-the-art trackers.
Overview of TrTr: Visual Tracking with Transformer
Visual tracking has evolved significantly with advances in neural network architectures, most notably Siamese networks that use cross-correlation between template and search features to reach state-of-the-art performance. This paper introduces "TrTr", a visual tracking approach built on a Transformer encoder-decoder architecture that extends traditional template-based discriminative tracking. The Transformer's attention mechanisms are used to capture global, rich contextual interdependencies, addressing a limitation of cross-correlation, which models only local relationships between patches.
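To illustrate the locality issue, the following is a minimal sketch (not the paper's code) of SiamFC-style cross-correlation: the template feature map acts as a fixed convolution kernel, so the response at each position depends only on the patch directly under it. The feature sizes are hypothetical.

```python
import torch
import torch.nn.functional as F

def cross_correlation(template_feat, search_feat):
    """template_feat: (1, C, Ht, Wt), search_feat: (1, C, Hs, Ws) -> (1, 1, H', W')."""
    # Treat the template feature map as a single convolution kernel over the search features.
    return F.conv2d(search_feat, template_feat)

template_feat = torch.randn(1, 256, 6, 6)    # hypothetical template feature size
search_feat = torch.randn(1, 256, 22, 22)    # hypothetical search feature size
response = cross_correlation(template_feat, search_feat)
print(response.shape)  # torch.Size([1, 1, 17, 17]) -- one local similarity score per position
```

Attention-based matching, by contrast, lets every search position weigh information from every template position, which is the global context the paper targets.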
Key Contributions
TrTr is among the first trackers to adopt a Transformer architecture, with a design developed for the tracking task rather than derived from existing models. The key innovations include:
- Transformer Encoder-Decoder Architecture: The architecture processes template image features with a self-attention module in the encoder, facilitating the capture of strong context information which is crucial for robust tracking. The encoder's output, rich in global contextual information, interfaces with the decoder to perform cross-attention with similarly processed search image features.
- Shape-Agnostic Anchoring: The classification and regression heads operate on the Transformer's output and localize the target using shape-agnostic anchors rather than predefined anchor boxes with fixed shapes and aspect ratios. This departs from conventional, shape-dependent anchor-based methods and offers greater flexibility and accuracy across diverse object shapes (a minimal sketch of both components follows this list).
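The sketch below makes the two components concrete: template features pass through a self-attention encoder, search features attend to the encoder output in a decoder, and per-position heads predict a score and a box without anchor boxes. It is an illustration under assumed tensor shapes and layer sizes, not the authors' implementation; details such as positional encodings, the backbone, and the exact head structure differ in the actual TrTr model.

```python
import torch
import torch.nn as nn

class TrackingTransformerSketch(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)  # self-attention over template tokens
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)  # cross-attention with search tokens
        self.cls_head = nn.Linear(d_model, 1)  # foreground score per search location
        self.reg_head = nn.Linear(d_model, 4)  # center offset + width/height per location (no anchor boxes)

    def forward(self, template_feat, search_feat):
        # template_feat: (B, C, Ht, Wt), search_feat: (B, C, Hs, Ws)
        B, C, Hs, Ws = search_feat.shape
        t = template_feat.flatten(2).permute(2, 0, 1)  # (Ht*Wt, B, C) token sequence
        s = search_feat.flatten(2).permute(2, 0, 1)    # (Hs*Ws, B, C) token sequence
        memory = self.encoder(t)                       # global context of the template
        out = self.decoder(s, memory)                  # search tokens attend to template context
        scores = self.cls_head(out).permute(1, 2, 0).reshape(B, 1, Hs, Ws)
        boxes = self.reg_head(out).permute(1, 2, 0).reshape(B, 4, Hs, Ws)
        return scores, boxes

model = TrackingTransformerSketch()
scores, boxes = model(torch.randn(1, 256, 6, 6), torch.randn(1, 256, 22, 22))
print(scores.shape, boxes.shape)  # (1, 1, 22, 22) (1, 4, 22, 22)
```

Because the heads predict a box directly at every search position, no set of predefined anchor shapes is needed, which is the sense in which the anchoring is shape-agnostic.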
Evaluation and Results
The efficacy of TrTr is validated through extensive evaluations on VOT2018, VOT2019, OTB-100, UAV, NfS, TrackingNet, and LaSOT. Across these benchmarks, TrTr consistently performs favorably against state-of-the-art tracking algorithms, demonstrating robust and accurate tracking.
Implications and Future Directions
TrTr's approach using Transformers for visual tracking has several implications:
- Practical Enhancements: The ability of Transformer architectures to capture global dependencies can lead to more efficient and reliable tracking systems, particularly in complex and dynamic environments where traditional methods may falter.
- Theoretical Advancements: Applying attention mechanisms to tracking can stimulate research into how rich feature interdependencies can be exploited for tasks beyond tracking, potentially influencing broader image and video analysis techniques.
Future research should explore the scaling of such architectures to accommodate multi-object tracking scenarios and real-time applications, where computational efficiency becomes paramount. Additionally, extending such attention-based mechanisms to learn from fewer labeled instances could open doors to more generalized solutions across varied computer vision tasks. The availability of training code and pretrained models on GitHub presents an opportunity for the research community to build upon and further refine the TrTr model.