- The paper introduces a novel transformer-based approach that redefines multi-object tracking as a frame-to-frame set prediction problem.
- It employs static object queries for track initialization and autoregressive queries to maintain trajectories without complex post-processing.
- The method achieves state-of-the-art results on MOT17, MOT20, and MOTS20 benchmarks by streamlining the detection and tracking pipeline.
TrackFormer: Multi-Object Tracking with Transformers
Overview
The paper presents TrackFormer, a Transformer-based approach to multi-object tracking (MOT). The authors redefine MOT as a frame-to-frame set prediction problem, introducing a paradigm they name tracking-by-attention. Unlike traditional tracking-by-detection or tracking-by-regression methods, TrackFormer leverages an encoder-decoder Transformer to achieve end-to-end trainability for both detection and tracking. Its key innovations are static object queries that initialize new tracks and autoregressive track queries that follow and maintain the trajectories of existing tracks.
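Casting tracking as set prediction means each frame's outputs are matched one-to-one against ground-truth objects during training, DETR-style, via bipartite matching. A minimal sketch of that matching step, using a plain L1 box cost (the actual training objective also includes classification and IoU terms, and the function name here is illustrative):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_predictions_to_targets(pred_boxes, gt_boxes):
    """Bipartite (Hungarian) matching between predicted and ground-truth
    boxes. Boxes are (N, 4) arrays; the cost here is a simple L1 distance,
    a stand-in for the full class + box matching cost."""
    # cost[i, j] = L1 distance between prediction i and ground truth j
    cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    pred_idx, gt_idx = linear_sum_assignment(cost)
    return list(zip(pred_idx, gt_idx))
```

Because the assignment is solved globally per frame, no non-maximum suppression or greedy matching heuristics are needed to decide which prediction explains which object.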
Methodology
TrackFormer processes each frame through a series of well-defined steps built around a Transformer model:
- Feature Extraction: A convolutional neural network (CNN) processes frame-level features.
- Encoding: A Transformer encoder applies self-attention to these features.
- Decoding: A Transformer decoder uses a combination of self- and encoder-decoder attention to transform queries into output embeddings.
- Prediction: These outputs are mapped to bounding boxes and class predictions via multilayer perceptrons (MLPs).
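The four steps above can be sketched as a single forward pass. This is a minimal illustration, not the paper's configuration: the tiny convolution stands in for the ResNet backbone, and all dimensions, layer counts, and head designs are assumptions.

```python
import torch
import torch.nn as nn

class TrackFormerSketch(nn.Module):
    """Illustrative sketch of the CNN -> encoder -> decoder -> MLP pipeline."""
    def __init__(self, d_model=256, num_queries=100, num_classes=2):
        super().__init__()
        # 1) Feature extraction: a single strided conv stands in for the CNN backbone
        self.backbone = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        # 2) + 3) Transformer encoder (self-attention over features) and
        # decoder (self- and encoder-decoder attention over queries)
        self.transformer = nn.Transformer(d_model=d_model, nhead=8,
                                          num_encoder_layers=2,
                                          num_decoder_layers=2,
                                          batch_first=True)
        # Static object queries: learned embeddings that initialize new tracks
        self.object_queries = nn.Embedding(num_queries, d_model)
        # 4) Prediction heads: class logits (+1 for background) and a box MLP
        self.class_head = nn.Linear(d_model, num_classes + 1)
        self.box_head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                      nn.Linear(d_model, 4), nn.Sigmoid())

    def forward(self, frame, track_queries=None):
        b = frame.size(0)
        feats = self.backbone(frame).flatten(2).transpose(1, 2)  # (B, HW, C)
        queries = self.object_queries.weight.unsqueeze(0).expand(b, -1, -1)
        if track_queries is not None:
            # Autoregressive track queries from the previous frame join the
            # static object queries in the decoder
            queries = torch.cat([queries, track_queries], dim=1)
        hs = self.transformer(feats, queries)       # (B, N, C) output embeddings
        return self.class_head(hs), self.box_head(hs), hs
```

Each decoder output embedding is mapped independently to a class and a normalized box, so the number of queries bounds the number of objects per frame.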
The central contribution, track queries, performs frame-to-frame data association without additional graph optimization or hand-crafted matching heuristics, yielding a seamless, attention-based tracking mechanism.
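At a high level, the autoregressive mechanism amounts to feeding the output embeddings of confident predictions back in as the next frame's track queries. The loop below is a hedged sketch of that idea only: `model` is assumed to return `(class_logits, boxes, embeddings)` per frame, the threshold is illustrative, and the paper's track re-identification and removal logic is omitted.

```python
import torch

def track_video(model, frames, keep_thresh=0.5):
    """Frame-to-frame tracking loop (batch size 1): embeddings of confident
    predictions become the autoregressive track queries for the next frame."""
    track_queries, tracks = None, []
    for frame in frames:
        logits, boxes, embeds = model(frame, track_queries)
        # Foreground confidence: max class probability, excluding background
        scores = logits.softmax(-1)[..., :-1].max(-1).values  # (B, N)
        keep = scores > keep_thresh          # surviving and newly born tracks
        track_queries = embeds[keep].unsqueeze(0)  # (1, K, C) for next frame
        tracks.append(boxes[keep])           # (K, 4) boxes for this frame
    return tracks
```

Because identity is carried implicitly by each track query, the association step requires no explicit matching between consecutive frames.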
Results
The model achieves state-of-the-art performance on the MOT17 and MOT20 benchmarks, along with strong results on the MOTS20 segmentation challenge. Specifically:
- MOT17: Achieves 74.1 MOTA and 68.0 IDF1 with pre-training on the CrowdHuman dataset, surpassing prior methods that required additional data and heuristics.
- MOT20: Competitively matches or exceeds methods trained on substantially larger datasets.
- MOTS20: Provides top results in both segmentation accuracy (MOTSA) and identity preservation (IDF1).
Implications
The TrackFormer framework demonstrates the efficacy of Transformers in addressing MOT by casting tracking as a set prediction problem. It challenges conventional object tracking paradigms that rely heavily on post-processing steps for association and identity management. This approach simplifies the MOT pipeline and highlights the potential for Transformers in delivering holistic solutions across tasks typically handled separately.
Future Directions
TrackFormer opens new avenues for leveraging attention mechanisms in temporal sequence management:
- Extended Transformer Capabilities: Incorporating more advanced forms of spatial and temporal reasoning could further improve identity consistency and track robustness.
- Scalability and Complexity: Addressing computational demands by optimizing attention computations may enhance real-time applicability.
- Integration with Segmentation Tasks: Instance-level segmentation opens opportunities for tracking-by-segmentation approaches that exploit detailed pixel-level information.
In summary, TrackFormer exemplifies a step towards unified and simplified multi-object tracking solutions and sets a new baseline for future research leveraging deep learning architectures for complex temporal tasks.