- The paper presents the Global Tracking Transformer (GTR), which uses trajectory queries to group detections across a window of video frames into holistic object trajectories.
- It seamlessly integrates state-of-the-art object detectors into an end-to-end system with a lightweight transformer architecture for rapid inference.
- The model achieves 75.3 MOTA on MOT17 and a 7.7-point tracking mAP improvement on TAO, demonstrating significant gains in multi-object tracking.
Global Tracking Transformers: A Technical Overview
The paper "Global Tracking Transformers" introduces a novel transformer-based framework designed for global multi-object tracking (MOT). Developed by researchers at The University of Texas at Austin and Apple, this approach is a significant stride in leveraging transformers for tracking applications.
Core Contributions
The key innovation of this research is the Global Tracking Transformer (GTR), which takes a sequence of video frames as input and outputs object trajectories directly. Unlike traditional methods that build tracks from pairwise or combinatorial associations between neighboring frames, GTR uses trajectory queries, derived from object features in a single frame, to group detection features from the whole temporal window into holistic trajectories.
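To make the grouping concrete, here is a minimal sketch of the per-frame assignment idea: each trajectory query scores every detection in the window, and within each frame it claims at most one detection (or none). The plain dot-product score and the zero threshold below are simplifications standing in for the paper's learned association scores and background slot; the function and variable names are illustrative.

```python
import torch

def group_into_trajectories(query_feats, det_feats, frame_ids):
    """Score every (trajectory query, detection) pair and pick, per frame,
    the best-matching detection for each query (or none).

    query_feats: (M, D) object features from one frame, reused as queries
    det_feats:   (N, D) features of all detections in the temporal window
    frame_ids:   (N,)   frame index of each detection
    """
    scores = query_feats @ det_feats.t()               # (M, N) association scores
    trajectories = [[] for _ in range(query_feats.shape[0])]
    for t in frame_ids.unique():
        mask = frame_ids == t
        frame_scores = scores[:, mask]                 # (M, N_t) this frame only
        best = frame_scores.argmax(dim=1)              # best detection per query
        keep = frame_scores.max(dim=1).values > 0      # "no match" if score <= 0
        det_idx = mask.nonzero(as_tuple=True)[0]
        for q in range(query_feats.shape[0]):
            if keep[q]:
                trajectories[q].append(int(det_idx[best[q]]))
    return trajectories
```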
Methodology
The framework integrates seamlessly with state-of-the-art object detectors, turning them into a joint detection-and-tracking system. GTR processes detections from multiple frames: a transformer encoder contextualizes the object features, and a cross-attention decoder links them to trajectory queries, eliminating the need for intermediate pairwise association steps. The network is supervised with ground-truth trajectories and trained end-to-end.
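The following is a rough PyTorch sketch of this encode-then-cross-attend pattern, not the paper's actual implementation: a single encoder layer contextualizes detection features from the whole window, and trajectory queries cross-attend to them to produce association logits. The dimensions and layer choices here are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MiniGTR(nn.Module):
    """Minimal sketch: encode all detection features in the window, then let
    trajectory queries cross-attend to them and score associations."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                  batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score = nn.Linear(dim, dim)

    def forward(self, det_feats, query_feats):
        # det_feats:   (1, N, D) features of all detections in the window
        # query_feats: (1, M, D) trajectory queries (object features of one frame)
        memory = self.encoder(det_feats)                         # contextualize detections
        attended, _ = self.cross_attn(query_feats, memory, memory)
        logits = self.score(attended) @ memory.transpose(1, 2)   # (1, M, N) association logits
        return logits
```

In the training scheme the paper describes, such logits would be normalized per frame and supervised against ground-truth trajectory assignments; the loss is omitted here for brevity.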
The model operates on a temporal window of 32 frames and uses a sliding-window scheme at inference time to produce trajectories over full videos. The transformer itself is lightweight, with a single encoder layer and a single decoder layer, which keeps inference fast.
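As a rough illustration of the sliding-window scheme, the sketch below shows one plausible way to carry identities across overlapping windows: a trajectory from the new window is merged into an existing track when the two share detections in the overlapping frames. The paper's exact linking rule may differ; this merge-by-shared-detection helper is an assumption for illustration.

```python
def link_windows(tracks, window_trajs):
    """Merge trajectories from the latest window into accumulated tracks.

    tracks:       list of sets of global detection ids accumulated so far
    window_trajs: list of sets of global detection ids from the latest window
    """
    for traj in window_trajs:
        for track in tracks:
            if track & traj:          # shared detections in overlapping frames
                track |= traj
                break
        else:
            tracks.append(set(traj))  # no overlap -> start a new track
    return tracks
```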
Results
On the MOT17 benchmark, GTR achieves 75.3 MOTA and 59.1 HOTA. On the TAO dataset, it improves tracking mAP by 7.7 points over prior work. The paper also highlights that the model pairs naturally with large-vocabulary detectors, giving it the flexibility to track diverse object categories.
Implications and Future Directions
The practical implications of GTR are significant for applications such as robotics and autonomous systems, where understanding dynamic environments is crucial. Theoretically, this work advances the integration of transformers in tracking tasks, traditionally dominated by convolutional architectures.
Future developments in AI might see expanded use of such architectures, potentially exploring larger temporal windows and richer feature extraction methods. Additionally, integrating learned positional embeddings and optimizing multi-class tracking training sets could provide further gains.
Conclusion
The introduction of Global Tracking Transformers marks a meaningful development in MOT, balancing performance with efficiency. By addressing key drawbacks of current approaches, such as reliance on pairwise associations, this research sets the stage for more robust and flexible tracking solutions.