TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking (2104.00194v2)

Published 1 Apr 2021 in cs.CV

Abstract: Tracking multiple objects in videos relies on modeling the spatial-temporal interactions of the objects. In this paper, we propose a solution named TransMOT, which leverages powerful graph transformers to efficiently model the spatial and temporal interactions among the objects. TransMOT effectively models the interactions of a large number of objects by arranging the trajectories of the tracked objects as a set of sparse weighted graphs, and constructing a spatial graph transformer encoder layer, a temporal transformer encoder layer, and a spatial graph transformer decoder layer based on the graphs. TransMOT is not only more computationally efficient than the traditional Transformer, but it also achieves better tracking accuracy. To further improve the tracking speed and accuracy, we propose a cascade association framework to handle low-score detections and long-term occlusions that require large computational resources to model in TransMOT. The proposed method is evaluated on multiple benchmark datasets including MOT15, MOT16, MOT17, and MOT20, and it achieves state-of-the-art performance on all the datasets.

Citations (160)

Summary

  • The paper introduces TransMOT, a novel approach using a spatial-temporal graph transformer and sparse weighted graphs to model object interactions for Multiple Object Tracking.
  • TransMOT employs distinct spatial and temporal transformer encoders plus a spatial graph transformer decoder, augmented by a cascade association framework that better handles low-confidence detections and long-term occlusions.
  • Evaluations show TransMOT achieves state-of-the-art tracking accuracy (IDF1, MOTA) and improves efficiency on multiple benchmarks compared to existing methods.

Overview of TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking

The paper introduces TransMOT, a novel approach to enhance the efficiency and accuracy of Multiple Object Tracking (MOT) using a spatial-temporal graph transformer model. The key challenge addressed is the need to effectively capture both spatial and temporal interactions among multiple objects in video sequences. TransMOT innovatively organizes object trajectories into sparse weighted graphs, integrating a spatial graph transformer encoder, a temporal transformer encoder, and a spatial graph transformer decoder. This design results in a model that is more computationally efficient than traditional transformer models and yields improved tracking accuracy.
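As a concrete illustration of the sparse weighted graphs, one frame's spatial graph can be built by connecting only the objects whose bounding boxes overlap and using the overlap as the edge weight. The NumPy helper below is a minimal sketch under that assumption; it is not taken from the paper's implementation, and the IoU-based weighting and self-loops are illustrative choices.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def sparse_spatial_graph(boxes):
    """Sparse weighted adjacency for the objects of a single frame.

    Nodes are tracked objects; an edge is created only when two bounding
    boxes overlap, weighted by their IoU, so spatially distant objects do
    not interact. Self-loops are added so every node attends to itself.
    """
    n = len(boxes)
    adj = np.zeros((n, n), dtype=np.float32)
    for i in range(n):
        for j in range(i + 1, n):
            w = iou(boxes[i], boxes[j])
            if w > 0:  # no edge between disjoint boxes keeps the graph sparse
                adj[i, j] = adj[j, i] = w
    np.fill_diagonal(adj, 1.0)
    return adj
```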

Methodological Contributions

  1. Graph-Based Representation: TransMOT leverages a graph-based approach wherein object trajectories are arranged as a series of sparse weighted graphs. This methodology enables efficient modeling of interactions among a potentially large number of objects by precisely capturing their spatial and temporal relationships.
  2. Spatial-Temporal Graph Transformer: The model includes a spatial graph transformer encoder that encodes spatial relationships, a temporal transformer encoder that captures temporal dependencies, and a spatial graph transformer decoder that maps detection candidates to the tracked trajectories (a simplified attention sketch follows this list).
  3. Cascade Association Framework: The cascade association framework significantly improves the handling of low-confidence detections and long-term occlusions, which would be expensive to model inside the transformer. By decomposing the association process into stages, it reduces the computational load and improves both tracking speed and accuracy (a matching-stage sketch also follows this list).
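The spatial graph transformer encoder in item 2 can be approximated by self-attention that is restricted and reweighted by the sparse adjacency, so that only neighboring objects exchange information. The single-head PyTorch layer below is a simplified sketch of that idea, not the paper's exact layer; it assumes the adjacency matrix includes self-loops (as in the graph helper above).

```python
import torch
import torch.nn as nn

class SpatialGraphAttention(nn.Module):
    """Graph-masked self-attention over the objects of one frame.

    Attention logits are kept only where the sparse spatial graph has an
    edge, and the resulting weights are rescaled by the edge weights, so
    disjoint objects never attend to each other. A simplified, single-head
    illustration of the spatial graph transformer encoder idea.
    """

    def __init__(self, dim):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, adj):
        # x: (N, dim) object features; adj: (N, N) weighted adjacency with self-loops
        q, k, v = self.q(x), self.k(x), self.v(x)
        logits = (q @ k.t()) / x.shape[-1] ** 0.5
        logits = logits.masked_fill(adj == 0, float("-inf"))  # keep graph edges only
        attn = torch.softmax(logits, dim=-1) * adj             # reweight by edge weight
        attn = attn / (attn.sum(-1, keepdim=True) + 1e-9)
        return self.norm(x + self.out(attn @ v))
```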
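The cascade association idea in item 3 can also be sketched independently of the transformer. The two-stage matcher below is a generic illustration with assumed thresholds (high_thresh, max_cost) and caller-supplied cost functions; the paper's full framework additionally handles long-term occlusions and track initiation/termination, which are omitted here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(tracks, dets, cost_fn, max_cost):
    """Optimal track-detection assignment; pairs costlier than max_cost are rejected."""
    if not tracks or not dets:
        return [], list(range(len(tracks))), list(range(len(dets)))
    cost = np.array([[cost_fn(t, d) for d in dets] for t in tracks])
    rows, cols = linear_sum_assignment(cost)
    pairs = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]
    matched_t, matched_d = {r for r, _ in pairs}, {c for _, c in pairs}
    return (pairs,
            [i for i in range(len(tracks)) if i not in matched_t],
            [j for j in range(len(dets)) if j not in matched_d])

def cascade_associate(tracks, dets, scores, learned_cost, fallback_cost,
                      high_thresh=0.6, max_cost=0.7):
    """Stage 1 matches confident detections with the learned (e.g. transformer-
    derived) cost; stage 2 gives leftover tracks a second chance against
    low-score detections using a cheaper geometric cost, which helps recover
    partially occluded objects the detector scored poorly."""
    high = [i for i, s in enumerate(scores) if s >= high_thresh]
    low = [i for i, s in enumerate(scores) if s < high_thresh]

    stage1, unmatched, _ = hungarian_match(
        tracks, [dets[i] for i in high], learned_cost, max_cost)
    stage2, still_unmatched, _ = hungarian_match(
        [tracks[i] for i in unmatched], [dets[i] for i in low],
        fallback_cost, max_cost)
    return stage1, stage2, still_unmatched
```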

Results and Performance

Experimental evaluations on multiple benchmark datasets (MOT15, MOT16, MOT17, and MOT20) show that TransMOT achieves state-of-the-art performance on several key metrics, including IDF1 and MOTA. Notably, it tracks reliably under both public and private detection protocols, indicating robustness to variations in detection quality and in the density of tracked objects.

  • Effectiveness: TransMOT consistently outperforms existing models, improving IDF1 scores and reducing identity switches (IDS). The sparse graph representation and the separation of spatial and temporal processing appear to contribute significantly to these improvements (the metric definitions are sketched after this list).
  • Efficiency: The sparse graph structure and the cascade association framework let TransMOT model interactions among many objects with less computation than a conventional Transformer, making it viable for real-time applications where computational resources are constrained.
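For reference, the two headline metrics have standard closed-form definitions that can be written directly as code. This is a generic illustration of MOTA and IDF1, not code or numbers from the paper; the example counts are made up.

```python
def mota(fn, fp, idsw, num_gt):
    """CLEAR-MOT accuracy: 1 minus the rate of misses, false positives,
    and identity switches, accumulated over all frames."""
    return 1.0 - (fn + fp + idsw) / float(num_gt)

def idf1(idtp, idfp, idfn):
    """Identity F1: harmonic mean of ID precision and ID recall, computed
    from identity-level true/false positives and false negatives."""
    return 2.0 * idtp / (2.0 * idtp + idfp + idfn)

# Made-up counts accumulated over a sequence, for illustration only.
print(round(mota(fn=1200, fp=800, idsw=150, num_gt=20000), 4))  # 0.8925
print(round(idf1(idtp=17000, idfp=1500, idfn=1800), 4))         # 0.9115
```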

Implications and Future Directions

TransMOT represents a significant advance in the domain of MOT by integrating transformer architectures with graph-based modeling of spatial-temporal relationships. The implications are wide-ranging, from enhanced video surveillance systems to improved autonomous vehicle navigation.

Future research could explore:

  • Integration with Advanced Object Detectors: Exploring interoperability with cutting-edge object detectors could further enhance detection and tracking synergy.
  • Expansion to Other Types of Objects and Scenes: Generalizing the model to track various object types in diverse environments beyond pedestrian tracking.
  • Scalability and Resource Efficiency: Further optimization of computational efficiency and scalability for handling ultra high-resolution data or extremely dense scenes.

TransMOT effectively demonstrates the potential of graph-enhanced transformer models in achieving high-performance MOT, paving the way for future innovations in dynamic scene understanding and autonomous environmental interaction.