- The paper introduces MixFormer, a framework that unifies feature extraction and target information integration through an iterative Mixed Attention Module (MAM).
- It leverages an asymmetric attention scheme and a transformer architecture to achieve state-of-the-art results, including 79.9% normalized precision (NP) on LaSOT and 88.9% NP on TrackingNet.
- The work has practical value for real-time tracking, running at 25 FPS, and presents a compact, efficient model whose design may transfer to broader computer vision tasks.
The paper "MixFormer: End-to-End Tracking with Iterative Mixed Attention" presents a novel framework in the field of visual object tracking, a core aspect of computer vision. The authors propose a transformative approach that integrates both feature extraction and target information assimilation into a unified process using a transformer-based architecture. The work addresses the traditional multi-stage tracking pipelines that involve separate stages for feature extraction and target integration.
Core Design and Methodology
The MixFormer framework exploits the flexibility of the transformer's attention mechanism to perform feature extraction and target-search interaction simultaneously across multiple processing stages. Its central component is the Mixed Attention Module (MAM), which combines self-attention and cross-attention in a single operation: it extracts target-specific discriminative features while concurrently enabling communication between the target template and the search area. Stacking these modules, together with a progressive patch embedding and a simple localization head, yields a compact, end-to-end tracking model.
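To make the mechanism concrete, the sketch below shows how a single attention call over the concatenated template and search tokens covers both self-attention within each token set and cross-attention between them. This is a minimal PyTorch illustration under stated assumptions, not the authors' code: the paper's MAM is built on convolutional projections in a CvT-style backbone, which are omitted here, and all names are illustrative.

```python
import torch
import torch.nn as nn


class MixedAttention(nn.Module):
    """Minimal mixed attention sketch: one attention call over the
    concatenated template and search tokens performs self-attention
    within each set and cross-attention between them at once."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, template: torch.Tensor, search: torch.Tensor):
        # template: (B, N_t, C) target tokens; search: (B, N_s, C) search tokens
        n_t = template.shape[1]
        tokens = torch.cat([template, search], dim=1)
        mixed, _ = self.attn(tokens, tokens, tokens)  # mixed self/cross attention
        tokens = self.norm(tokens + mixed)            # residual + norm
        return tokens[:, :n_t], tokens[:, n_t:]       # split the two streams back


# Example: 64 template tokens and 256 search tokens, 256-dim embeddings.
block = MixedAttention(dim=256)
t, s = block(torch.randn(2, 64, 256), torch.randn(2, 256, 256))
```

Because both token sets pass through the same attention operation, each stacked stage refines the features and deepens the target-search correlation at the same time, which is the "iterative mixed attention" the title refers to.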
To handle multiple target templates during online tracking, the authors introduce an asymmetric attention scheme within MAM that reduces computational cost, along with a score prediction module that selects high-quality online templates. Under the asymmetric scheme, template tokens attend only to other template tokens, so their features can be computed once and reused across frames, while search tokens still attend to both streams; a sketch follows below.
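The following sketch, under the same assumptions as above (plain PyTorch 2.x, illustrative names, convolutional projections omitted), shows the asymmetric variant: restricting template queries to template keys makes the template branch independent of the current frame, so it can be precomputed and cached during online tracking.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AsymmetricMixedAttention(nn.Module):
    """Asymmetric mixed attention sketch: template queries see only
    template keys; search queries see both streams."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def _heads(self, x: torch.Tensor) -> torch.Tensor:
        # (B, N, C) -> (B, H, N, C // H)
        b, n, _ = x.shape
        return x.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

    def _merge(self, x: torch.Tensor) -> torch.Tensor:
        b, h, n, d = x.shape
        return x.transpose(1, 2).reshape(b, n, h * d)

    def forward(self, template: torch.Tensor, search: torch.Tensor):
        q_t, k_t, v_t = (self._heads(t) for t in self.qkv(template).chunk(3, dim=-1))
        q_s, k_s, v_s = (self._heads(t) for t in self.qkv(search).chunk(3, dim=-1))

        # Template attends to itself only: its output does not depend on the
        # search frame, so it can be computed once and cached online.
        t_out = F.scaled_dot_product_attention(q_t, k_t, v_t)

        # Search attends to template + search, keeping cross-attention in the
        # direction that matters for localizing the target.
        k = torch.cat([k_t, k_s], dim=2)
        v = torch.cat([v_t, v_s], dim=2)
        s_out = F.scaled_dot_product_attention(q_s, k, v)

        return self.proj(self._merge(t_out)), self.proj(self._merge(s_out))
```

The caching this enables is what keeps the cost of carrying multiple online templates manageable: only the search-side attention must be recomputed per frame, and the score prediction module then decides which cached templates are reliable enough to keep.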
Numerical Results and Benchmarks
The MixFormer framework demonstrates superior performance across several key tracking benchmarks, setting new state-of-the-art results. On LaSOT, MixFormer-L achieves an NP score of 79.9%, and on TrackingNet it reaches 88.9% NP. On the VOT2020 challenge, it attains an Expected Average Overlap (EAO) of 0.555. These results represent a significant improvement over prior transformer-based trackers and demonstrate the efficacy of MixFormer's integrated approach.
Theoretical and Practical Implications
The implications of this research are manifold. Practically, MixFormer offers a compact, robust tracking framework capable of real-time operation, running at 25 FPS on a GTX 1080Ti GPU. Theoretically, it highlights the potential of transformer-based architectures to move beyond conventional object tracking paradigms towards more integrated, end-to-end models. Its refinement of attention mechanisms for simultaneous feature and correlation extraction could be adapted to computer vision tasks beyond tracking.
Future Directions
This work opens several avenues for future exploration. Since MixFormer effectively combines feature extraction and target integration, future research could investigate its application to other sequential modeling tasks in vision. Exploring adaptive mechanisms for dynamic environments could further broaden the framework's applicability, and pairing MixFormer with more sophisticated localization heads could improve its accuracy and robustness.
In summary, MixFormer represents a clear advancement in object tracking, using iterative mixed attention to streamline and strengthen the tracking pipeline, and it may lay a foundation for future developments in AI-driven computer vision.