Learning Spatio-Temporal Transformer for Visual Tracking (2103.17154v1)

Published 31 Mar 2021 in cs.CV

Abstract: In this paper, we present a new tracking architecture with an encoder-decoder transformer as the key component. The encoder models the global spatio-temporal feature dependencies between target objects and search regions, while the decoder learns a query embedding to predict the spatial positions of the target objects. Our method casts object tracking as a direct bounding box prediction problem, without using any proposals or predefined anchors. With the encoder-decoder transformer, the prediction of objects just uses a simple fully-convolutional network, which estimates the corners of objects directly. The whole method is end-to-end, does not need any postprocessing steps such as cosine window and bounding box smoothing, thus largely simplifying existing tracking pipelines. The proposed tracker achieves state-of-the-art performance on five challenging short-term and long-term benchmarks, while running at real-time speed, being 6x faster than Siam R-CNN. Code and models are open-sourced at https://github.com/researchmm/Stark.

Citations (610)

Summary

  • The paper introduces a transformer model that directly predicts bounding boxes without relying on proposals, streamlining the tracking process.
  • The encoder-decoder structure with a corner-based prediction head captures global spatio-temporal dependencies, yielding gains of 3.9% in average overlap (AO) on GOT-10k and 2.3% in success rate on LaSOT over Siam R-CNN.
  • The method runs in real time at about 30 fps on a Tesla V100 GPU, offering a simpler and more efficient pipeline that is roughly six times faster than Siam R-CNN.

Learning Spatio-Temporal Transformer for Visual Tracking

This paper introduces a visual tracking architecture built around an encoder-decoder transformer. The method directly predicts the bounding box of the target object without relying on proposals or predefined anchors, which significantly simplifies the conventional tracking pipeline.

Key Components and Contributions

The tracking model is composed of an encoder, a decoder, and a corner-based prediction head. The encoder captures global spatio-temporal dependencies by jointly processing the entire input sequence, which consists of the initial target template, the current search region, and a dynamically updated template. The decoder learns a single query embedding that attends to the encoder output to predict the spatial position of the target in the search region. The prediction head is a fully convolutional network that directly estimates the corners of the bounding box.
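
The following is a minimal PyTorch sketch of this encoder-decoder design. Layer sizes, the feature-map resolution, and module names are illustrative assumptions, not the authors' exact configuration; the full model in the Stark repository also includes a backbone, positional encodings, and a score head that are omitted here.

```python
# Minimal sketch of an encoder-decoder transformer tracker with a corner head.
# All hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn


class CornerHead(nn.Module):
    """Fully-convolutional head predicting top-left / bottom-right corner heatmaps."""

    def __init__(self, dim: int = 256, feat_size: int = 20):
        super().__init__()
        self.feat_size = feat_size
        self.tl = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
                                nn.Conv2d(dim, 1, 1))
        self.br = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
                                nn.Conv2d(dim, 1, 1))

    def forward(self, feat):                       # feat: (B, dim, H, W) search-region features
        H = W = self.feat_size
        ys = torch.arange(H, dtype=torch.float32, device=feat.device).repeat_interleave(W)
        xs = torch.arange(W, dtype=torch.float32, device=feat.device).repeat(H)

        def soft_argmax(logits):                   # expected (x, y) under a softmax heatmap
            p = logits.flatten(1).softmax(-1)      # (B, H*W)
            return torch.stack([(p * xs).sum(-1) / W, (p * ys).sum(-1) / H], dim=-1)

        tl = soft_argmax(self.tl(feat))
        br = soft_argmax(self.br(feat))
        return torch.cat([tl, br], dim=-1)         # (B, 4) normalized (x1, y1, x2, y2)


class SketchTracker(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8, layers: int = 6, search_size: int = 20):
        super().__init__()
        self.search_size = search_size
        self.transformer = nn.Transformer(d_model=dim, nhead=heads,
                                          num_encoder_layers=layers,
                                          num_decoder_layers=layers,
                                          batch_first=True)
        self.query = nn.Embedding(1, dim)          # a single learned target query
        self.head = CornerHead(dim, search_size)

    def forward(self, template_feat, dyn_template_feat, search_feat):
        # Inputs are flattened backbone features, each of shape (B, N_i, dim).
        B, n_search = search_feat.size(0), search_feat.size(1)
        seq = torch.cat([template_feat, dyn_template_feat, search_feat], dim=1)
        memory = self.transformer.encoder(seq)                 # joint spatio-temporal encoding
        query = self.query.weight.unsqueeze(0).expand(B, -1, -1)
        _ = self.transformer.decoder(query, memory)            # decoded target query (used further in the full model; omitted here)
        search_mem = memory[:, -n_search:, :]                  # keep only the search-region tokens
        fmap = search_mem.transpose(1, 2).reshape(B, -1, self.search_size, self.search_size)
        return self.head(fmap)                                 # direct corner prediction, no anchors or proposals
```

With, for example, 64 template tokens, 64 dynamic-template tokens, and 400 search tokens per frame, a forward pass returns one normalized box per batch element.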

Spatio-Temporal Integration

The integration of spatial and temporal information is a pivotal aspect of the proposed methodology. Whereas traditional Siamese trackers rely chiefly on spatial matching and online methods treat temporal adaptation as a separate module, this model uses a transformer to unify both aspects and learn spatio-temporal features jointly. The dynamically updated template helps the tracker adapt to appearance variations over time, improving accuracy and reliability when tracking moving objects.
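
To make the template-update logic concrete, here is a small sketch (not the authors' code) of a confidence-gated update loop. The `predict_fn` and `crop_fn` callables, the update interval, and the threshold are assumptions for illustration; the paper gates updates with a learned confidence score so that occlusion or drift does not corrupt the dynamic template.

```python
# Sketch of a confidence-gated dynamic-template update loop (illustrative only).
from typing import Callable, List

def track_sequence(frames: List,                      # sequence of video frames
                   init_template,                     # crop of the target in the first frame
                   predict_fn: Callable,              # (init_tpl, dyn_tpl, frame) -> (box, confidence)
                   crop_fn: Callable,                 # (frame, box) -> new template crop
                   update_interval: int = 200,
                   conf_thresh: float = 0.5) -> List:
    dyn_template = init_template                      # the dynamic template starts as the initial one
    boxes = []
    for t, frame in enumerate(frames):
        box, conf = predict_fn(init_template, dyn_template, frame)
        boxes.append(box)
        # Refresh the dynamic template only at fixed intervals AND when the
        # prediction is deemed reliable, so unreliable frames are skipped.
        if (t + 1) % update_interval == 0 and conf > conf_thresh:
            dyn_template = crop_fn(frame, box)
    return boxes
```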

Numerical Results and Performance

The proposed method demonstrates superior performance on five benchmarks, achieving state-of-the-art results in both short-term and long-term tracking. Compared with Siam R-CNN, it gains 3.9% in average overlap (AO) on GOT-10k and 2.3% in success rate on LaSOT. The tracker also runs in real time at around 30 fps on a Tesla V100 GPU, roughly six times faster than Siam R-CNN, a substantial efficiency improvement.

Simplified Tracking Pipeline

One of the method's strengths is the elimination of post-processing steps such as cosine-window penalties and bounding-box smoothing. By adopting an end-to-end learning approach with the transformer architecture, the tracking pipeline becomes notably simpler while performance is maintained or even improved.
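
As a sketch of how lightweight inference becomes, the predicted corners for the search crop can be mapped back to image coordinates directly; no window penalty or trajectory smoothing is applied. The `model` interface and crop parametrization below are assumptions matching the sketch given earlier.

```python
# Illustrative readout step, assuming a model like SketchTracker above that
# returns normalized (x1, y1, x2, y2) corners for the search crop.
import torch

@torch.no_grad()
def predict_box(model, template_feat, dyn_template_feat, search_feat, crop_xywh):
    """Rescale normalized corners to image coordinates; this is the only 'post-processing'."""
    x1, y1, x2, y2 = model(template_feat, dyn_template_feat, search_feat)[0].tolist()
    cx, cy, cw, ch = crop_xywh       # top-left corner and size of the search crop in the image
    return (cx + x1 * cw, cy + y1 * ch, cx + x2 * cw, cy + y2 * ch)
```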

Theoretical and Practical Implications

The paper’s findings have substantial implications for both theoretical research and practical application in AI-driven visual tracking. The employment of transformer architectures in computer vision tasks may inspire further exploration into leveraging such models for other complex visual tasks. Practically, the real-time tracking capabilities and accurate object localization without extensive post-processing make this model highly applicable in scenarios requiring efficient and precise tracking.

Future Prospects

Looking forward, this work sets a foundational framework that could be expanded upon with advancements in transformer designs or further spatio-temporal feature fusion techniques. The exploration of even more lightweight architectures or the integration of additional modalities could push the boundaries of current tracking capabilities.

This paper contributes significantly to the visual tracking field, not only by demonstrating the efficacy of transformers in this domain but also by paving the way for future research that exploits the rich potential of spatio-temporal data integration.