
Joint Feature Learning and Relation Modeling for Tracking: A One-Stream Framework

Published 22 Mar 2022 in cs.CV (arXiv:2203.11991v4)

Abstract: The current popular two-stream, two-stage tracking framework extracts the template and the search region features separately and then performs relation modeling, thus the extracted features lack the awareness of the target and have limited target-background discriminability. To tackle the above issue, we propose a novel one-stream tracking (OSTrack) framework that unifies feature learning and relation modeling by bridging the template-search image pairs with bidirectional information flows. In this way, discriminative target-oriented features can be dynamically extracted by mutual guidance. Since no extra heavy relation modeling module is needed and the implementation is highly parallelized, the proposed tracker runs at a fast speed. To further improve the inference efficiency, an in-network candidate early elimination module is proposed based on the strong similarity prior calculated in the one-stream framework. As a unified framework, OSTrack achieves state-of-the-art performance on multiple benchmarks, in particular, it shows impressive results on the one-shot tracking benchmark GOT-10k, i.e., achieving 73.7% AO, improving the existing best result (SwinTrack) by 4.3%. Besides, our method maintains a good performance-speed trade-off and shows faster convergence. The code and models are available at https://github.com/botaoye/OSTrack.

Citations (337)

Summary

  • The paper introduces OSTrack as a one-stream framework that concurrently performs feature extraction and relation modeling for improved tracking.
  • The transformer-based architecture enhances target discrimination via bidirectional guidance and early candidate elimination of background noise.
  • OSTrack achieves state-of-the-art results, including 73.7% average overlap on the one-shot GOT-10k benchmark, with the OSTrack-384 variant running at 58.1 FPS.

Overview of One-Stream Framework for Object Tracking

The paper introduces a novel one-stream framework for visual object tracking, termed OSTrack, that aims to overcome limitations inherent in two-stream, two-stage tracking pipelines. Existing approaches typically rely on separate stages for feature extraction and relation modeling, so the extracted features lack awareness of the target and offer limited target-background discriminability. OSTrack unifies feature learning and relation modeling in a single framework, enabling a bidirectional flow of information between the template and search image pairs. This integration yields discriminative, target-oriented features through mutual guidance, resulting in improved performance and efficiency.

Technical Contributions

OSTrack is fundamentally a transformer-based approach that leverages the capability of Vision Transformers (ViT) for joint feature extraction and relation modeling. Key technical advancements include:

  1. One-Stream Architecture: In contrast to traditional two-stream trackers, OSTrack performs feature extraction and relation modeling concurrently by feeding the concatenated template and search-region tokens through the layers of a ViT. This reduces architectural complexity and enhances target-background discriminability by making feature extraction target-aware from the earliest layers.
  2. Inference Efficiency: The framework’s design eliminates the need for a heavyweight relation modeling module, fostering a highly parallelizable implementation. This contributes to significant improvements in inference speed without compromising tracking accuracy.
  3. Early Candidate Elimination: An innovative in-network module is introduced to progressively eliminate potential background candidates based on similarity scores acquired during initial tracking stages. This strategy not only enhances speed but also prevents background noise from skewing feature learning.
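The concatenate-then-self-attend idea in point 1 can be sketched in a few lines. The sketch below is illustrative rather than the paper's implementation: it uses NumPy, a single attention head, and randomly initialized projection matrices (`w_q`, `w_k`, `w_v` and `joint_attention` are placeholder names). The key point is that one softmax attention over the concatenated template and search tokens lets information flow in both directions in a single pass, with no separate relation-modeling module.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(z_tokens, x_tokens, w_q, w_k, w_v):
    """One self-attention layer over concatenated template (z) and
    search (x) tokens: every token attends to every other token, so
    template and search features guide each other simultaneously."""
    tokens = np.concatenate([z_tokens, x_tokens], axis=0)  # (Nz+Nx, d)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))         # (Nz+Nx, Nz+Nx)
    out = attn @ v
    n_z = z_tokens.shape[0]
    return out[:n_z], out[n_z:], attn  # updated template, search, weights

rng = np.random.default_rng(0)
d = 16
z = rng.normal(size=(4, d))    # 4 template tokens (toy sizes)
x = rng.normal(size=(9, d))    # 9 search-region tokens
wq, wk, wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
z_out, x_out, attn = joint_attention(z, x, wq, wk, wv)
```

Because the attention matrix spans both token groups, its off-diagonal blocks carry the template-to-search and search-to-template interactions that a two-stream pipeline would only compute in a later, separate relation-modeling stage.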

Results and Performance

OSTrack demonstrates state-of-the-art performance across multiple benchmarks, including challenging datasets such as GOT-10k, LaSOT, TrackingNet, and others. In the challenging GOT-10k one-shot setting, OSTrack achieves a high average overlap (AO) of 73.7%, outperforming previous models by a notable margin. Furthermore, the model achieves a commendable balance between accuracy and speed, with OSTrack-384 running at 58.1 FPS and maintaining robust tracking capabilities.
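The early candidate elimination module that contributes to these speed figures can be sketched as a token-pruning step. This is a simplified illustration, not the paper's exact rule: it scores each search token by the mean attention it receives from all template tokens (the paper scores candidates using attention from the template's central part and prunes progressively across several layers), then keeps only the top-scoring fraction; `eliminate_candidates` and `keep_ratio` are names invented for this sketch.

```python
import numpy as np

def eliminate_candidates(x_tokens, attn, n_z, keep_ratio=0.7):
    """Keep only the search tokens that receive the most attention
    from the template tokens; the rest are treated as background
    candidates and dropped before later transformer layers."""
    # attn is a row-normalised joint attention map of shape
    # (Nz+Nx, Nz+Nx); rows 0..n_z are template queries.
    scores = attn[:n_z, n_z:].mean(axis=0)            # (Nx,) per-token score
    k = max(1, int(round(keep_ratio * x_tokens.shape[0])))
    keep = np.sort(np.argsort(scores)[::-1][:k])      # top-k, original order
    return x_tokens[keep], keep

rng = np.random.default_rng(1)
n_z, n_x, d = 4, 9, 16
attn = rng.random((n_z + n_x, n_z + n_x))
attn = attn / attn.sum(axis=-1, keepdims=True)        # mimic softmax rows
x_tokens = rng.normal(size=(n_x, d))
kept_tokens, kept_idx = eliminate_candidates(x_tokens, attn, n_z)
```

Shrinking the search-token set inside the network reduces the cost of every subsequent layer, which is why pruning improves speed without a separate post-processing step.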

Implications and Future Directions

The proposed one-stream framework sets a new precedent in the domain of visual object tracking by efficiently integrating feature extraction and relation modeling. The results illustrate the potential for transformers in tracking tasks, emphasizing the efficiency and accuracy improvements gained through concurrent processing.

Moving forward, this one-stream paradigm may inspire further exploration of its application to other complex tasks in computer vision. The in-network candidate elimination module likewise opens possibilities for dynamic feature selection and real-time tracking adaptations. The simplified yet flexible architecture could also be extended to more specialized tracking scenarios or augmented with multi-modal inputs.

Overall, the OSTrack approach presents a substantial advancement in object tracking methodologies, offering valuable insights into the power of transformer architectures and integrated model designs. Its practical deployment in advanced real-world scenarios, such as autonomous navigation or robotic vision systems, indicates promising opportunities for further research and application.
