Efficient Visual Tracking with Exemplar Transformers
The paper introduces a novel approach to visual object tracking built around Exemplar Transformers. It addresses the trade-off between accuracy and computational efficiency, focusing on real-time performance without sacrificing tracking quality. The authors propose a transformer module, the Exemplar Transformer, that operates with a single instance-level attention layer termed Exemplar Attention. The module is designed to make tracking fast on computationally limited hardware, such as standard CPUs, while improving tracking accuracy.
The core of this research is the Exemplar Transformer, which differs from standard transformer architectures by using an instance-specific attention mechanism. Standard transformers are computationally expensive because self-attention is computed between all spatial locations of the feature map, with a cost that grows quadratically in the number of locations. The Exemplar Transformer instead reduces the attention computation to a single query, pooled from the whole input, attending over a small set of learned exemplar keys that effectively act as a shared memory across dataset samples. This allows the tracker to maintain an expressive object representation while significantly reducing the computational burden.
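To make the mechanism concrete, the following is a minimal PyTorch sketch of such an instance-level attention layer, assuming an average-pooled global query, a small learned bank of exemplar keys, and exemplar convolution kernels as values; the class name, parameter names, and hyper-parameters are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ExemplarAttention(nn.Module):
        """Sketch of an instance-level attention layer.

        A single query is pooled from the whole input feature map and attends
        over a small, learned set of exemplar keys; the attention weights then
        mix a matching bank of exemplar convolution kernels that are applied
        to the input. Names and sizes are assumptions for illustration.
        """

        def __init__(self, channels: int, num_exemplars: int = 4,
                     kernel_size: int = 3, query_dim: int = 64):
            super().__init__()
            self.channels = channels
            self.kernel_size = kernel_size
            self.query_dim = query_dim
            # Project the globally pooled feature into a compact query.
            self.query_proj = nn.Linear(channels, query_dim)
            # Small bank of learned exemplar keys shared across all samples.
            self.exemplar_keys = nn.Parameter(torch.randn(num_exemplars, query_dim))
            # Matching bank of exemplar value kernels (one conv filter set each).
            self.exemplar_values = nn.Parameter(
                torch.randn(num_exemplars, channels, channels, kernel_size, kernel_size))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            b, c, h, w = x.shape
            # One global query per sample instead of one per spatial location.
            q = self.query_proj(x.mean(dim=(2, 3)))                      # (B, D)
            attn = F.softmax(q @ self.exemplar_keys.t()
                             / self.query_dim ** 0.5, dim=-1)            # (B, S)
            # Mix the exemplar kernels per sample according to the attention.
            kernels = torch.einsum('bs,scdkl->bcdkl',
                                   attn, self.exemplar_values)           # (B, C, C, k, k)
            # Apply the per-sample kernels via a grouped convolution trick.
            x_flat = x.reshape(1, b * c, h, w)
            k_flat = kernels.reshape(b * c, c, self.kernel_size, self.kernel_size)
            out = F.conv2d(x_flat, k_flat, padding=self.kernel_size // 2, groups=b)
            return out.reshape(b, c, h, w)

Because the keys and value kernels are shared across all samples and only one query is computed per image, the cost of the attention itself does not grow with spatial resolution, which is what makes this kind of layer practical on a CPU.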
The authors implement this innovation within a tracking pipeline referred to as E.T.Track. The system integrates the Exemplar Transformer layer into a Siamese tracking architecture, replacing the convolutional blocks in the tracker head to increase the model's expressiveness and accuracy. Notably, E.T.Track runs at 47 FPS on a CPU, up to eight times faster than other transformer-based trackers, while outperforming state-of-the-art lightweight trackers on benchmarks such as LaSOT, OTB-100, and TrackingNet.
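As a rough illustration of where such a layer sits in the pipeline, the sketch below replaces the convolutional blocks of a Siamese tracker head with the ExemplarAttention module defined above. The branch layout, layer count, and output parameterization are assumptions made for illustration, not the paper's exact head design.

    import torch
    import torch.nn as nn

    class ExemplarTrackerHead(nn.Module):
        """Hypothetical tracker head in the spirit of E.T.Track: the usual
        convolutional blocks in the classification and regression branches
        are swapped for ExemplarAttention layers (see the sketch above)."""

        def __init__(self, channels: int = 256, num_layers: int = 3):
            super().__init__()

            def branch() -> nn.Sequential:
                layers = []
                for _ in range(num_layers):
                    # Conv block replaced by the instance-level attention layer.
                    layers += [ExemplarAttention(channels),
                               nn.BatchNorm2d(channels),
                               nn.ReLU(inplace=True)]
                return nn.Sequential(*layers)

            self.cls_branch = branch()
            self.reg_branch = branch()
            # Per-location outputs: foreground score and (l, t, r, b) box offsets.
            self.cls_out = nn.Conv2d(channels, 1, kernel_size=3, padding=1)
            self.reg_out = nn.Conv2d(channels, 4, kernel_size=3, padding=1)

        def forward(self, fused_features: torch.Tensor):
            # fused_features: correlation of template and search-region features
            # produced by the Siamese backbone (not shown here).
            cls_score = self.cls_out(self.cls_branch(fused_features))
            bbox = self.reg_out(self.reg_branch(fused_features))
            return cls_score, bbox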
The experimental results presented in the paper show consistent improvements across several benchmarks. E.T.Track posts notable gains in the area-under-the-curve (AUC) metric on the evaluated datasets, indicating robust performance and generalization. In particular, it reports 59.1% AUC on LaSOT, a new state of the art among real-time CPU trackers. The Exemplar Transformer thereby narrows the performance gap between lightweight convolutional trackers and transformer models that are typically limited to more powerful hardware.
The implications of this research are twofold. Practically, it enables real-time object tracking on consumer-grade hardware, broadening the accessibility and applicability of high-performance tracking in fields such as robotics, autonomous driving, and human-computer interaction. Theoretically, it opens new avenues for tailored transformer architectures across computer vision tasks, encouraging application-specific attention mechanisms that balance computational efficiency and task performance.
Future work could extend the Exemplar Transformer framework to multi-object tracking, further reduce its computational overhead, and integrate such models into broader AI systems for episodic and continual learning. Cross-disciplinary applications that depend on real-time scene understanding could likewise benefit from the speed and efficiency of this approach, potentially shaping new industry standards in visual tracking.