
MeMOTR: Long-Term Memory-Augmented Transformer for Multi-Object Tracking (2307.15700v3)

Published 28 Jul 2023 in cs.CV

Abstract: As a video task, Multiple Object Tracking (MOT) is expected to capture temporal information of targets effectively. Unfortunately, most existing methods only explicitly exploit the object features between adjacent frames, while lacking the capacity to model long-term temporal information. In this paper, we propose MeMOTR, a long-term memory-augmented Transformer for multi-object tracking. Our method is able to make the same object's track embedding more stable and distinguishable by leveraging long-term memory injection with a customized memory-attention layer. This significantly improves the target association ability of our model. Experimental results on DanceTrack show that MeMOTR impressively surpasses the state-of-the-art method by 7.9% and 13.0% on HOTA and AssA metrics, respectively. Furthermore, our model also outperforms other Transformer-based methods on association performance on MOT17 and generalizes well on BDD100K. Code is available at https://github.com/MCG-NJU/MeMOTR.


Summary

  • The paper introduces MeMOTR, a memory-augmented Transformer that significantly enhances long-term association in multi-object tracking.
  • It employs a customized memory-attention layer and adaptive aggregation strategy to maintain stable track embeddings over extended sequences.
  • Experimental results on datasets like DanceTrack, MOT17, and BDD100K show significant improvements in HOTA and AssA metrics compared to state-of-the-art methods.

Analyzing MeMOTR: Long-Term Memory-Augmented Transformer for Multi-Object Tracking

The research paper introduces MeMOTR, a novel approach to multi-object tracking (MOT) leveraging a long-term memory-augmented Transformer. Traditional methods in MOT primarily focus on associating object features between adjacent frames, often neglecting long-term temporal information that can enhance tracking accuracy and stability. The proposed MeMOTR architecture integrates long-term memory with Transformer mechanisms to improve the robustness of object association over time.

Methodology and Architectural Insights

MeMOTR advances the MOT task by addressing the limitations of short-term feature association and stabilizing each target's track embedding through long-term memory integration. The core contributions of the paper include:

  • Long-Term Memory Incorporation: The method maintains a long-term memory for each tracked object using an exponential recursion update algorithm. This approach allows for a more stable track embedding by injecting this memory into the model, ultimately enhancing the model's ability to distinguish and associate tracked objects over extended sequences.
  • Customized Memory-Attention Layer: A memory-attention layer is employed to generate a distinguishable representation of objects. By interacting with long-term memory, this layer reduces abrupt changes in track embeddings between frames, which is crucial for maintaining consistent object tracking, especially in complex scenes with many similar objects.
  • Adaptive Aggregation: The model utilizes an adaptive aggregation strategy, fusing object features from adjacent frames to enhance tracking robustness. This strategy serves to alleviate issues such as occlusion and blur by dynamically adjusting the influence of the current and previous frame's outputs.
  • Improved Detection-Tracking Alignment: Addressing the potential semantic gap between detection and tracking queries, the paper introduces an additional layer within the Transformer architecture specifically tailored for initial object detection. This layer aids in better aligning detection with the existing tracked targets, facilitating more accurate tracking.
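The exponential recursion update and adaptive aggregation described above can be sketched in a few lines. The sketch below is illustrative only: the update rate `lam`, the fusion weight `w`, and the function names are assumptions rather than the paper's exact formulation (in MeMOTR the fusion weight is predicted by the network, and the memory is injected through a dedicated memory-attention layer).

```python
import numpy as np

def update_long_term_memory(memory, track_embed, lam=0.01):
    """Exponential recursion update: the new memory blends the previous
    memory with the current track embedding, so the representation
    changes slowly over time. `lam` is a hypothetical update rate."""
    return (1.0 - lam) * memory + lam * track_embed

def adaptive_aggregate(curr, prev, w):
    """Fuse adjacent-frame outputs with a per-object weight w in [0, 1];
    here w is passed in for illustration instead of being predicted."""
    return w * curr + (1.0 - w) * prev

# Toy example: a 4-dim track embedding updated over three frames.
memory = np.zeros(4)
for t in range(3):
    embed = np.ones(4) * (t + 1)
    memory = update_long_term_memory(memory, embed)
```

Because `lam` is small, the memory moves only slightly toward each new embedding, which is what keeps the track representation stable across abrupt per-frame changes.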

Experimental Evaluation and Results

The experimental results substantiate the effectiveness of the MeMOTR model. On the DanceTrack dataset, recognized for its association challenges, MeMOTR outperformed state-of-the-art methods with notable improvements in Higher Order Tracking Accuracy (HOTA) and Association Accuracy (AssA). The model also demonstrated superior association performance on MOT17 and generalized well on BDD100K, further confirming the efficacy of the proposed long-term memory mechanism.

Notably, MeMOTR achieved improvements of 7.9% in HOTA and 13.0% in AssA over the previous leading method on DanceTrack. These results reflect the impact of a memory-augmented Transformer on association reliability in MOT, particularly in complex scenarios such as tracking group dancers or sports players.

Implications and Future Directions

The theoretical and practical implications of this research are substantial. Theoretically, the work extends the capabilities of Transformers in temporal modeling within the vision domain, opening an avenue for applying long-term memory to extract more informative features in sequence-based tasks. Practically, MeMOTR could be pivotal in applications requiring precise and consistent tracking over time, such as autonomous driving and intelligent surveillance systems.

Future developments in this area might explore optimizing the long-term memory update strategy for different datasets, examining alternative memory structures, or integrating additional cues (e.g., motion estimation models) to further refine tracking accuracy. Moreover, adapting the MeMOTR framework to work seamlessly with other object detection paradigms or backbone architectures could further extend its applicability and effectiveness across varying MOT scenarios.

Overall, the introduction of MeMOTR represents a significant stride towards leveraging long-term temporal dynamics in multi-object tracking, providing a strong foundation for future research and applications in this domain.
