- The paper presents a unified tracking framework that integrates feature extraction and target processing via a mixed attention module.
- It details the design of two tracker variants, MixCvT and MixViT, which enhance efficiency and achieve top benchmark performance.
- The paper also explores novel pre-training strategies and online template updates to improve tracking robustness in dynamic environments.
Overview of MixFormer: End-to-End Tracking with Iterative Mixed Attention
The paper presents MixFormer, a framework for visual object tracking that unifies feature extraction and target information processing within a single model. Unlike traditional methods, which employ multi-stage pipelines, MixFormer leverages a transformer-based architecture built on a Mixed Attention Module (MAM) to streamline the tracking pipeline, achieving state-of-the-art performance across various benchmarks.
Core Contributions
1. Unified Tracking Framework:
The primary contribution of MixFormer is its compact design, which eliminates the conventional separation between feature extraction and target integration. Built on MAM, MixFormer processes the target template and search region jointly, enhancing target-specific feature extraction and improving communication between the target and search areas. This design yields a cleaner, more efficient pipeline that supports end-to-end training.
2. Mixed Attention Module (MAM):
MAM is the core architectural block of MixFormer, performing dual-purpose attention: self-attention for extracting features within the target and search areas, and cross-attention for exchanging information between them. Because both operations share a single attention computation, every MAM layer simultaneously extracts features and fuses target information (see the mixed-attention sketch after this list).
3. Trackers Design:
Two variants of MixFormer are introduced: MixCvT and MixViT. The former is a hierarchical model built on the Wide MAM, using progressive downsampling for integrated local-global feature learning. The latter, MixViT, adopts a simpler non-hierarchical structure with the Slimming MAM, optimized for speed and adaptability (an asymmetric-attention sketch follows this list).
4. Pre-training Techniques:
The paper explores various pre-training strategies, both supervised and self-supervised. Of particular interest is TrackMAE, which trains masked autoencoders directly on tracking datasets and achieves competitive results without requiring large-scale classification datasets such as ImageNet (a masking sketch follows this list).
5. Online Template Update:
For online tracking, the framework introduces a Score Prediction Module that estimates the reliability of candidate templates and selects high-quality ones for updates, keeping the tracker robust to object deformation and appearance variation (a minimal update loop is sketched below).
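
The sketch below illustrates the mixed-attention idea from item 2: target and search tokens are concatenated so that a single attention computation covers self-attention within each area and cross-attention between them. This is a minimal single-head PyTorch sketch; the class and parameter names are illustrative, not the released MixFormer code.

```python
import torch
import torch.nn as nn

class MixedAttention(nn.Module):
    """Joint self- and cross-attention over concatenated target/search tokens.

    Illustrative sketch only; not the paper's exact implementation.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)   # shared q/k/v projection
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, target: torch.Tensor, search: torch.Tensor):
        # target: (B, N_t, C), search: (B, N_s, C)
        n_t = target.shape[1]
        tokens = torch.cat([target, search], dim=1)      # (B, N_t + N_s, C)
        q, k, v = self.qkv(tokens).chunk(3, dim=-1)
        # One attention map covers all four interactions at once:
        # target->target and search->search (self-attention),
        # target->search and search->target (cross-attention).
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = self.proj(attn.softmax(dim=-1) @ v)
        return out[:, :n_t], out[:, n_t:]                # split back into streams
```

Stacking such layers is what lets the backbone extract features and integrate target information at the same time, rather than in separate stages.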
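For online efficiency, the paper also describes an asymmetric mixed-attention scheme in which target queries attend only to target keys, so template features can be computed once and cached across frames. A hedged functional sketch, assuming tensors of shape (B, N, C):

```python
import torch

def asymmetric_mixed_attention(q_t, k_t, v_t, q_s, k_s, v_s, scale):
    # Target stream: self-attention only, so the template's keys/values
    # can be computed once and reused during online tracking.
    attn_t = (q_t @ k_t.transpose(-2, -1)) * scale
    out_t = attn_t.softmax(dim=-1) @ v_t
    # Search stream: attends to target + search tokens, which is where
    # target information is mixed into the search features.
    k = torch.cat([k_t, k_s], dim=1)
    v = torch.cat([v_t, v_s], dim=1)
    attn_s = (q_s @ k.transpose(-2, -1)) * scale
    out_s = attn_s.softmax(dim=-1) @ v
    return out_t, out_s
```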
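Item 4's TrackMAE follows the masked-autoencoder recipe of reconstructing images from a sparse subset of visible patches. The helper below sketches only the random patch-masking step; the mask ratio and function name are assumptions, since the summary does not give TrackMAE's exact settings.

```python
import torch

def random_mask(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """Keep a random subset of patch tokens; return them with their indices."""
    B, N, C = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=tokens.device)   # one random score per patch
    keep = noise.argsort(dim=1)[:, :n_keep]          # indices of patches to keep
    kept = torch.gather(tokens, 1, keep.unsqueeze(-1).expand(-1, -1, C))
    return kept, keep      # the decoder is trained to reconstruct the rest
```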
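Finally, item 5's score-gated template update can be pictured as the loop below. Everything here (`tracker.step`, `update_interval`, `score_thresh`) is hypothetical scaffolding to show the control flow, not the released MixFormer API.

```python
def track_sequence(tracker, frames, first_template,
                   update_interval=200, score_thresh=0.5):
    """Yield one predicted box per frame, refreshing the online template."""
    online_template = first_template
    best_score, best_candidate = 0.0, first_template
    for t, frame in enumerate(frames):
        # Hypothetical API: returns the box, the predicted reliability score,
        # and a candidate template cropped from the current frame.
        box, score, candidate = tracker.step(frame, online_template)
        if score > best_score:                # remember the most reliable crop
            best_score, best_candidate = score, candidate
        if (t + 1) % update_interval == 0 and best_score > score_thresh:
            online_template = best_candidate  # swap in the best recent template
            best_score = 0.0                  # restart the search window
        yield box
```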
Numerical Results and Benchmarks
MixFormer delivers significant improvements over existing tracking paradigms, achieving top performance on benchmarks such as LaSOT, TrackingNet, VOT2020, and GOT-10k. Notably, the MixViT-L model reaches an AUC of 73.3% on LaSOT, demonstrating its effectiveness in large-scale, long-term tracking scenarios.
Implications and Future Directions
From a practical standpoint, MixFormer's streamlined design offers both efficiency and accuracy, making it well suited to real-time applications. The integration of transformers into tracking paves the way for further exploration of attention mechanisms in other computer vision tasks, and the promising results of TrackMAE point to domain-specific pre-training as a direction worth pursuing.
Theoretically, MixFormer's architecture encourages the development of more unified AI models, potentially influencing broader machine learning and computer vision applications. As future work, extensions to multiple object tracking or enhancement of template update mechanisms could provide even more robust solutions.
By advancing transformer-based tracking, MixFormer sets a new standard, offering compelling insights into the design of efficient, high-performance computational models.