MOTR: End-to-End Multiple-Object Tracking with Transformer (2105.03247v4)

Published 7 May 2021 in cs.CV

Abstract: Temporal modeling of objects is a key challenge in multiple object tracking (MOT). Existing methods track by associating detections through motion-based and appearance-based similarity heuristics. The post-processing nature of association prevents end-to-end exploitation of temporal variations in video sequence. In this paper, we propose MOTR, which extends DETR and introduces track query to model the tracked instances in the entire video. Track query is transferred and updated frame-by-frame to perform iterative prediction over time. We propose tracklet-aware label assignment to train track queries and newborn object queries. We further propose temporal aggregation network and collective average loss to enhance temporal relation modeling. Experimental results on DanceTrack show that MOTR significantly outperforms state-of-the-art method, ByteTrack by 6.5% on HOTA metric. On MOT17, MOTR outperforms our concurrent works, TrackFormer and TransTrack, on association performance. MOTR can serve as a stronger baseline for future research on temporal modeling and Transformer-based trackers. Code is available at https://github.com/megvii-research/MOTR.

Authors (6)

Fangao Zeng
Bin Dong
Yuang Zhang
Tiancai Wang
Xiangyu Zhang
Yichen Wei

Summary

An Analysis of "MOTR: End-to-End Multiple-Object Tracking with Transformer"

The paper "MOTR: End-to-End Multiple-Object Tracking with Transformer" addresses the challenge of temporal modeling in multiple-object tracking (MOT). The authors propose an end-to-end tracking framework that extends the Detection Transformer (DETR) architecture. Specifically, the framework introduces a "track query" mechanism that enables iterative prediction of object trajectories across video frames.

Key Contributions and Methodology

The paper treats MOT as a sequence prediction problem and solves it with a fully end-to-end approach. The key contributions include:

Introduction of Track Query:
- The authors extend DETR by incorporating track queries, which serve as hidden states for object tracks. This allows the representation of tracked instances to be updated frame-by-frame in a coherent manner.
Tracklet-Aware Label Assignment (TALA):
- TALA supervises track queries with bounding-box sequences that keep a consistent identity. It splits label assignment into two parts: detect queries are matched only to newborn objects, while each track query keeps the target it was assigned in earlier frames. Identity association is thus learned during training rather than delegated to the post-processing heuristics that previous methods relied on.
Temporal Aggregation Network (TAN):
- TAN is designed to enhance temporal relation modeling by aggregating historical information from previous states of track queries. This addition aims to enrich contextual priors which are crucial for accurate tracking.
Collective Average Loss (CAL):
- CAL facilitates the video clip-based training of MOTR, optimizing the model based on the aggregated loss from multiple frames. This training strategy captures long-range object dynamics more effectively than traditional frame-to-frame approaches.
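
The interplay between track queries and TALA described above can be illustrated with a toy, pure-Python sketch. It is not the authors' implementation: queries are reduced to identity labels, the matching cost table `cost` is hypothetical and held fixed across frames, and newborn matching is brute-forced over permutations where a real system would use the Hungarian algorithm on costs computed from predictions. The function names `match_newborn`, `tala_step`, and `run_clip` are illustrative only.

```python
from itertools import permutations


def match_newborn(cost, num_detect, newborn):
    """One-to-one matching of detect queries to newborn objects (brute force).

    cost[q][obj_id] is a hypothetical matching cost between detect query q
    and ground-truth object obj_id.
    """
    best, best_total = {}, float("inf")
    for queries in permutations(range(num_detect), len(newborn)):
        total = sum(cost[q][obj] for q, obj in zip(queries, newborn))
        if total < best_total:
            best_total, best = total, dict(zip(queries, newborn))
    return best  # {detect query index: object id}


def tala_step(track_ids, gt_ids, num_detect, cost):
    """Assign targets for one frame."""
    # Track queries keep their earlier identity; a track whose object is
    # absent from this frame is supervised as empty (None).
    track_targets = [i if i in gt_ids else None for i in track_ids]
    # Detect queries compete only for objects that no track query owns.
    newborn = [i for i in gt_ids if i not in track_ids]
    detect_targets = match_newborn(cost, num_detect, newborn)
    return track_targets, detect_targets


def run_clip(frames, num_detect, cost):
    """Pass track queries frame by frame through a clip."""
    track_ids = []  # the first frame has no track queries
    history = []
    for gt_ids in frames:
        track_targets, detect_targets = tala_step(
            track_ids, gt_ids, num_detect, cost
        )
        history.append((track_targets, detect_targets))
        # Surviving tracks plus matched newborn objects form the track
        # queries of the next frame.
        track_ids = [i for i in track_targets if i is not None]
        track_ids += [detect_targets[q] for q in sorted(detect_targets)]
    return history


# Hypothetical clip: object 3 appears in frame 1, object 1 leaves in frame 2.
frames = [[1, 2], [1, 2, 3], [2, 3]]
cost = [
    {1: 0.1, 2: 0.9, 3: 0.5},
    {1: 0.8, 2: 0.2, 3: 0.6},
    {1: 0.7, 2: 0.7, 3: 0.3},
]
history = run_clip(frames, num_detect=3, cost=cost)
# history[0] == ([], {0: 1, 1: 2})
# history[1] == ([1, 2], {2: 3})
# history[2] == ([None, 2, 3], {})
```

In this sketch, once an object has been picked up by a detect query it is excluded from later bipartite matching, which is the property that lets identities persist without a separate association step.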

Experimental Results

The empirical evaluation of MOTR is conducted on multiple datasets, including DanceTrack, MOT17, and BDD100k, with specific attention to diverse motion patterns and object interactions.

DanceTrack: MOTR outperforms existing methodologies such as ByteTrack and TransTrack, with a notable 6.5% improvement over ByteTrack on the Higher Order Tracking Accuracy (HOTA) metric.
MOT17: The results on MOT17 are less striking, primarily because this dataset favors high detection performance, but MOTR still shows better association performance than other Transformer-based approaches such as TrackFormer and TransTrack.
BDD100k: In this multi-class setting, MOTR outperforms prior methods in mean MOTA and significantly reduces ID switches, which demonstrates its generalizability.
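
For readers less familiar with the metrics named above, the following sketch shows the standard definitions of MOTA and of HOTA at a single localization threshold (the reported HOTA averages this value over several thresholds). All counts and scores in the example are hypothetical and are not taken from the paper; the function names are illustrative.

```python
import math


def mota(false_negatives, false_positives, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / number of ground-truth boxes."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def hota_at_threshold(det_a, ass_a):
    """HOTA at one threshold: geometric mean of detection and association accuracy."""
    return math.sqrt(det_a * ass_a)


# Hypothetical numbers: two trackers with the same detection accuracy
# but different association accuracy.
strong_association = hota_at_threshold(det_a=0.64, ass_a=0.49)
weak_association = hota_at_threshold(det_a=0.64, ass_a=0.36)
example_mota = mota(false_negatives=120, false_positives=60,
                    id_switches=20, num_gt=1000)
```

Because MOTA is dominated by detection errors (FN and FP usually far outnumber ID switches), while HOTA weighs association equally with detection, a tracker whose main strength is association tends to stand out more on HOTA than on MOTA. This is consistent with the contrast between the DanceTrack and MOT17 results summarized above.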

Theoretical and Practical Implications

The theoretical implications of MOTR are multi-faceted:

Unified Architecture: MOTR unifies detection and tracking within a single Transformer architecture, minimizing the dependency on heuristic-based post-processing.
Temporal and Spatial Context: The integration of TAN and CAL underscores the importance of both spatial and temporal coherency in tracking models, promoting more robust performance across varied scenarios.
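
The clip-level idea behind CAL can be sketched as follows. This is a minimal illustration under one assumption: that the summed per-frame losses are normalized by the total number of objects in the whole clip rather than frame by frame. The function names and the numbers are hypothetical, and each entry of `frame_losses` stands for a loss already summed over the objects of that frame.

```python
def collective_average_loss(frame_losses, frame_object_counts):
    """Sum losses over the clip, then divide by the clip's total object count."""
    total_objects = sum(frame_object_counts)
    if total_objects == 0:
        return 0.0  # empty clip: nothing to supervise
    return sum(frame_losses) / total_objects


def per_frame_average_loss(frame_losses, frame_object_counts):
    """Baseline for comparison: normalize each frame separately, then average."""
    normalized = [
        loss / count
        for loss, count in zip(frame_losses, frame_object_counts)
        if count > 0
    ]
    return sum(normalized) / len(normalized) if normalized else 0.0


# Hypothetical three-frame clip with 1, 3, and 4 objects.
frame_losses = [2.0, 6.0, 4.0]
frame_object_counts = [1, 3, 4]
clip_loss = collective_average_loss(frame_losses, frame_object_counts)   # 12 / 8 = 1.5
frame_loss = per_frame_average_loss(frame_losses, frame_object_counts)  # (2 + 2 + 1) / 3
```

Under clip-level normalization every object contributes equally regardless of which frame it appears in, whereas per-frame normalization gives more weight to objects in sparsely populated frames.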

Practically, the implications are extensive:

Application Versatility: The adaptability of MOTR to multi-class datasets like BDD100k indicates its potential utility in diverse real-world applications, from autonomous driving to surveillance.
Reduced Post-Processing: The end-to-end nature of MOTR simplifies deployment pipelines by eliminating the need for separate motion and appearance matching post-processing steps.

Future Directions

Several avenues for future research are suggested by the limitations and insights derived from this work:

Improved Detection Mechanisms: To enhance the model’s ability to detect newborn objects robustly, more sophisticated integration of detection and tracking queries or additional auxiliary detection heads could be explored.
Parallel Query Passing: Training efficiency could be improved by revisiting techniques from VisTR and adapting them to frame-to-frame query passing without compromising the range of motion modeling.

Conclusion

MOTR marks a significant step toward more efficient and effective multiple-object tracking with Transformers. By unifying motion and appearance modeling within an end-to-end framework, MOTR offers a strong methodological advance and lays the groundwork for future innovations in tracking systems.

GitHub

GitHub - megvii-research/MOTR: [ECCV2022] MOTR: End-to-End Multiple-Object Tracking with TRansformer (625 stars)