EchoTrack: Auditory Referring Multi-Object Tracking for Autonomous Driving (2402.18302v2)
Abstract: This paper introduces Auditory Referring Multi-Object Tracking (AR-MOT), a challenging task for autonomous driving in which specific objects in a video sequence are dynamically tracked according to audio expressions. Due to the difficulty of semantically modeling audio together with video, existing works have focused mainly on text-based referring multi-object tracking, which often comes at the cost of tracking quality, interaction efficiency, and even the safety of assistance systems, limiting the applicability of such methods to autonomous driving. In this paper, we approach AR-MOT from the perspective of audio-video fusion and audio-video tracking. We put forward EchoTrack, an end-to-end AR-MOT framework built on dual-stream vision transformers. The two streams are intertwined through our Bidirectional Frequency-domain Cross-attention Fusion Module (Bi-FCFM), which fuses audio and video features bidirectionally in both the frequency and spatiotemporal domains. Moreover, we propose the Audio-visual Contrastive Tracking Learning (ACTL) regime, which learns homogeneous semantic features shared between audio expressions and the visual objects they refer to. Beyond the architectural design, we establish the first set of large-scale AR-MOT benchmarks: Echo-KITTI, Echo-KITTI+, and Echo-BDD. Extensive experiments on these benchmarks demonstrate the effectiveness of EchoTrack and its components. The source code and datasets are available at https://github.com/lab206/EchoTrack.
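To make the two components named in the abstract more concrete, below is a minimal PyTorch sketch of (1) bidirectional audio-video cross-attention applied to frequency-domain (FFT-magnitude) token features and (2) an InfoNCE-style audio-visual contrastive loss. This is an illustrative assumption, not the released EchoTrack implementation: the class and function names, tensor shapes, and the specific FFT and loss formulations are hypothetical stand-ins for Bi-FCFM and ACTL.

```python
# Illustrative sketch only; names and design details are assumptions,
# not the authors' released code (see https://github.com/lab206/EchoTrack).
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiFrequencyCrossFusion(nn.Module):
    """Bidirectional cross-attention between audio and video token sequences,
    computed on FFT-magnitude (frequency-domain) representations."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.audio_to_video = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.video_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)

    @staticmethod
    def freq(x: torch.Tensor) -> torch.Tensor:
        # Real FFT over the token axis; magnitudes keep the tensor real-valued.
        return torch.fft.rfft(x, dim=1).abs()

    def forward(self, audio: torch.Tensor, video: torch.Tensor):
        # audio: (B, Na, C), video: (B, Nv, C)
        fa, fv = self.freq(audio), self.freq(video)
        # Video queries attend to audio keys/values and vice versa (bidirectional).
        video_fused, _ = self.audio_to_video(query=fv, key=fa, value=fa)
        audio_fused, _ = self.video_to_audio(query=fa, key=fv, value=fv)
        return audio_fused, video_fused


def audio_visual_contrastive_loss(audio_emb, visual_emb, temperature: float = 0.07):
    """InfoNCE-style loss that pulls matched audio/visual object embeddings
    together and pushes mismatched pairs apart (one positive per row)."""
    a = F.normalize(audio_emb, dim=-1)        # (N, C)
    v = F.normalize(visual_emb, dim=-1)       # (N, C)
    logits = a @ v.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric cross-entropy over both matching directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    fuse = BiFrequencyCrossFusion(dim=256)
    audio = torch.randn(2, 50, 256)   # e.g. audio expression tokens
    video = torch.randn(2, 300, 256)  # e.g. visual feature/query tokens
    a_f, v_f = fuse(audio, video)
    loss = audio_visual_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
    print(a_f.shape, v_f.shape, loss.item())
```

The sketch keeps only the fused frequency-domain outputs; the paper's Bi-FCFM additionally fuses in the spatiotemporal domain, and ACTL is applied between expressions and tracked object queries rather than random embeddings as in this toy example.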
Authors: Jiacheng Lin, Jiajun Chen, Kunyu Peng, Xuan He, Zhiyong Li, Rainer Stiefelhagen, Kailun Yang