
STT: Stateful Tracking with Transformers for Autonomous Driving (2405.00236v1)

Published 30 Apr 2024 in cs.RO, cs.AI, cs.CV, and cs.LG

Abstract: Tracking objects in three-dimensional space is critical for autonomous driving. To ensure safety while driving, the tracker must be able to reliably track objects across frames and accurately estimate their states such as velocity and acceleration in the present. Existing works frequently focus on the association task while either neglecting the model performance on state estimation or deploying complex heuristics to predict the states. In this paper, we propose STT, a Stateful Tracking model built with Transformers, that can consistently track objects in the scenes while also predicting their states accurately. STT consumes rich appearance, geometry, and motion signals through long term history of detections and is jointly optimized for both data association and state estimation tasks. Since the standard tracking metrics like MOTA and MOTP do not capture the combined performance of the two tasks in the wider spectrum of object states, we extend them with new metrics called S-MOTA and MOTP_S that address this limitation. STT achieves competitive real-time performance on the Waymo Open Dataset.


Summary

  • The paper introduces a unified transformer-based model that combines data association and state estimation for accurate 3D object tracking.
  • It proposes novel evaluation metrics, S-MOTA and MOTP_S, which extend MOTA and MOTP to assess state-prediction precision alongside detection and association quality.
  • Experimental results on the Waymo Open Dataset demonstrate competitive real-time tracking performance and improved state estimation for autonomous driving.

Understanding Stateful 3D Object Tracking with Transformers

Introduction to STT: Stateful Tracking with Transformers

Tracking objects in real-world 3D scenes is an essential capability for autonomous driving, where precise object tracking directly contributes to safety and operational efficiency. Object tracking comprises two sub-tasks: data association (linking the same object across frames) and state estimation (estimating each object's state, such as position, velocity, and acceleration). Most prior models have not integrated these two tasks effectively, often at the expense of state-estimation accuracy.

STT introduces a new approach by unifying these two functionalities into a single model architecture using transformers. This integration promises improvements in both tracking accuracy and state reliability, crucial for dynamic environments like those encountered in autonomous driving.
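
The paper describes STT as jointly optimized for data association and state estimation. As a minimal sketch of what such a joint objective might look like, the snippet below weights a classification-style association loss against a regression-style state loss; the specific loss functions and the `state_loss_weight` hyperparameter are illustrative assumptions, not the paper's exact formulation.

```python
import torch.nn.functional as F

def joint_loss(assoc_logits, assoc_targets, state_pred, state_target,
               state_loss_weight=1.0):
    # Association treated as classification over candidate detections
    # per track; state estimation treated as regression. Both choices
    # are plausible stand-ins, not the paper's exact losses.
    assoc_loss = F.cross_entropy(assoc_logits, assoc_targets)
    state_loss = F.smooth_l1_loss(state_pred, state_target)
    return assoc_loss + state_loss_weight * state_loss
```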

Key Innovations and Findings

Architecture Overview:

STT utilizes a Transformer-based architecture containing separate but interconnected modules for both data association and state estimation:

  1. Track-Detection Interaction (TDI) Module: Handles data association by modeling the contextual relationships between existing tracks and newly detected objects across frames.
  2. Track State Decoder (TSD): Estimates the state of each track, particularly velocity and acceleration, from frame to frame.

These components are designed to interact seamlessly, with the transformer encoding a long history of detections to predict current object states more accurately; a rough sketch of the two modules follows.
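
Since the paper does not include reference code, the PyTorch sketch below only illustrates one plausible shape for the two modules: cross-attention between track-history and detection embeddings for association, and self-attention over a track's history for state regression. All class names, dimensions, and design choices here are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TrackDetectionInteraction(nn.Module):
    """Sketch of a TDI-style module: track-history embeddings attend to
    current-frame detection embeddings, yielding association scores."""
    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, track_emb, det_emb):
        # track_emb: (B, T, d_model) queries from track histories
        # det_emb:   (B, D, d_model) current-frame detections
        attended, _ = self.cross_attn(track_emb, det_emb, det_emb)
        # Pairwise affinity logits via dot products (one simple choice);
        # these would feed a downstream matching step.
        return torch.einsum("btd,bkd->btk", attended, det_emb)  # (B, T, D)

class TrackStateDecoder(nn.Module):
    """Sketch of a TSD-style module: self-attention over a track's
    detection history, regressing the present state (e.g. velocity
    and acceleration)."""
    def __init__(self, d_model=256, nhead=8, num_layers=2, state_dim=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, state_dim)

    def forward(self, history_emb):
        # history_emb: (B, H, d_model) embeddings of past detections
        encoded = self.encoder(history_emb)
        # Read out the most recent timestep to predict the current state.
        return self.head(encoded[:, -1])  # (B, state_dim)
```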

Newly Proposed Evaluation Metrics:

Given STT's dual objectives, traditional metrics like MOTA and MOTP fall short: they assess detection and association quality but do not evaluate the accuracy of estimated states such as velocity and acceleration. To address this, the paper introduces two new metrics:

  • S-MOTA: Extends MOTA by incorporating a threshold on state-estimation accuracy, so that a match counts in the tracker's favor only when the associated state estimate is sufficiently accurate.
  • MOTP_S: Measures the precision of the state estimates directly, reporting the prediction error for each state type, such as velocity and acceleration (a numerical sketch of both metrics follows this list).
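
To make the definitions concrete, here is a minimal numerical sketch assuming standard MOTA bookkeeping, where a matched pair whose state error exceeds the threshold is simply demoted to an error; the paper's exact accounting may differ.

```python
def s_mota(num_gt, num_fn, num_fp, num_idsw, state_errors, threshold):
    """S-MOTA-style score: the usual MOTA error terms, plus a penalty
    for matched pairs whose state estimate misses the threshold.
    state_errors holds the per-match error for the state of interest,
    e.g. velocity error in m/s."""
    bad_state = sum(1 for e in state_errors if e > threshold)
    return 1.0 - (num_fn + num_fp + num_idsw + bad_state) / num_gt

def motp_s(state_errors):
    """MOTP_S-style score: mean state error over matched pairs."""
    return sum(state_errors) / len(state_errors) if state_errors else 0.0

# Toy usage: 100 ground-truth objects, a handful of errors, and
# per-match velocity errors in m/s gated at 1.0 m/s.
errors = [0.3, 1.2, 0.1]
print(s_mota(num_gt=100, num_fn=5, num_fp=3, num_idsw=1,
             state_errors=errors, threshold=1.0))  # 0.90
print(motp_s(errors))  # ~0.53
```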

Performance Results:

STT was evaluated on the Waymo Open Dataset, where it performed competitively against other models on traditional metrics and set new reference points on the proposed state-focused metrics. It achieved a MOTA of 58.2 while showing superior state-estimation accuracy, illustrating the benefit of jointly optimizing tracking and state prediction.

Implications and Future Prospects

STT could advance autonomous vehicle technologies by improving real-time decision-making through more accurate state predictions. The model's ability to predict object interactions and movements in three-dimensional space supports more reliable vehicle navigation and operation.

Furthermore, while this paper focused on autonomous driving, the application of such models could be extended to other areas of robotics and motion analysis where precise tracking and state estimation are crucial.

Looking ahead, the integration of even more diverse data inputs and the refinement of transformer models could enhance the robustness and versatility of tracking systems. Additionally, as state estimation becomes more accurate and reliable, we might see autonomous systems capable of more nuanced interactions and decisions in increasingly complex environments.

In conclusion, the STT model represents a significant step forward in object tracking, in both practical application and methodology. Continued exploration and expansion of these capabilities promise further contributions to the field of autonomous systems.
