
iKUN: Speak to Trackers without Retraining (2312.16245v2)

Published 25 Dec 2023 in cs.CV

Abstract: Referring multi-object tracking (RMOT) aims to track multiple objects based on input textual descriptions. Previous works realize it by simply integrating an extra textual module into the multi-object tracker. However, they typically need to retrain the entire framework and have difficulties in optimization. In this work, we propose an insertable Knowledge Unification Network, termed iKUN, to enable communication with off-the-shelf trackers in a plug-and-play manner. Concretely, a knowledge unification module (KUM) is designed to adaptively extract visual features based on textual guidance. Meanwhile, to improve the localization accuracy, we present a neural version of Kalman filter (NKF) to dynamically adjust process noise and observation noise based on the current motion status. Moreover, to address the problem of open-set long-tail distribution of textual descriptions, a test-time similarity calibration method is proposed to refine the confidence score with pseudo frequency. Extensive experiments on Refer-KITTI dataset verify the effectiveness of our framework. Finally, to speed up the development of RMOT, we also contribute a more challenging dataset, Refer-Dance, by extending public DanceTrack dataset with motion and dressing descriptions. The codes and dataset are available at https://github.com/dyhBUPT/iKUN.


Summary

  • The paper introduces iKUN, a plug-and-play knowledge unification module that uses textual guidance to adaptively extract visual features, enabling referring multi-object tracking with off-the-shelf trackers and without retraining.
  • The paper presents a neural Kalman filter that dynamically adjusts process and observation noise based on the current motion status, improving localization accuracy.
  • The paper proposes a test-time similarity calibration that refines confidence scores with pseudo frequency, reporting gains over prior methods on Refer-KITTI and contributing the more challenging Refer-Dance benchmark.

iKUN: Speak to Trackers without Retraining

The paper "iKUN: Speak to Trackers without Retraining" introduces a novel approach to referring multi-object tracking (RMOT) utilizing a new framework, iKUN (insertable Knowledge Unification Network), to enhance the process of tracking multiple objects based on textual descriptions. This paper addresses the significant challenges faced by traditional RMOT methods, which require retraining existing trackers and integrating textual modules into these frameworks, thereby leading to optimization difficulties and increased engineering costs.

Technical Contributions

The iKUN framework distinguishes itself by acting as a plug-and-play module compatible with off-the-shelf trackers. Its main contributions are as follows:

  1. Knowledge Unification Module (KUM): iKUN uses the KUM to adaptively extract visual features under textual guidance, in contrast to the fixed visual feature extraction of models such as CLIP. This dynamic approach supports matching a single trajectory against multiple descriptions, improving tracking flexibility and performance (a minimal sketch of the idea follows this list).
  2. Neural Kalman Filter (NKF): To improve localization accuracy, iKUN replaces the fixed noise parameters of a conventional Kalman filter with process and observation noise that are dynamically adjusted according to the current motion status, sharpening both the prediction and update steps (see the second sketch below).
  3. Test-time Similarity Calibration: To address the open-set long-tail distribution of textual descriptions, iKUN refines confidence scores at test time using pseudo-frequency estimates, counteracting the skewed frequency of descriptions encountered during testing (see the third sketch below).
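
The summary does not detail the exact KUM architecture, but the core idea of conditioning visual feature extraction on the description can be sketched with a small cross-attention block in PyTorch. Everything below (module name, dimensions, mean pooling, and the cosine-similarity matching) is an illustrative assumption rather than the authors' exact design:

```python
import torch
import torch.nn as nn

class TextGuidedVisualHead(nn.Module):
    """Illustrative sketch of text-guided visual feature extraction.

    Visual tokens (e.g., from a CLIP image encoder over a tracklet crop)
    attend to textual tokens (e.g., from a CLIP text encoder), so the
    pooled visual feature depends on the description being matched.
    """

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, visual_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, Nv, D) patch/frame features of one trajectory
        # text_tokens:   (B, Nt, D) token features of one description
        guided, _ = self.cross_attn(query=visual_tokens, key=text_tokens, value=text_tokens)
        guided = self.norm(visual_tokens + guided)   # residual + norm
        pooled = guided.mean(dim=1)                  # (B, D) trajectory-level feature
        return self.proj(pooled)

# A similarity score between the guided visual feature and a pooled text
# feature can then decide whether the trajectory matches the description.
head = TextGuidedVisualHead()
vis = torch.randn(2, 16, 512)   # e.g., 16 visual tokens per trajectory
txt = torch.randn(2, 8, 512)    # e.g., 8 text tokens per description
score = torch.cosine_similarity(head(vis, txt), txt.mean(dim=1))
```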
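
The neural Kalman filter can similarly be sketched as a standard constant-velocity Kalman filter whose process and observation noise variances are predicted by small MLPs from the recent motion status instead of being fixed by hand. The scalar state, the two-feature motion descriptor, and the network sizes are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class NeuralKalmanFilter1D(nn.Module):
    """Sketch of a Kalman filter whose noise terms are predicted by MLPs.

    For clarity this tracks a single scalar quantity with a constant-velocity
    model; a real tracker would apply this per box coordinate (or jointly).
    """

    def __init__(self, hidden: int = 32):
        super().__init__()
        # Predict log-variances from a simple motion descriptor (assumed here:
        # last velocity and last innovation), so variances remain positive.
        self.q_net = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.r_net = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def step(self, x, P, z, motion_status):
        # x: (2,) state [position, velocity]; P: (2, 2) covariance; z: scalar measurement
        F = torch.tensor([[1.0, 1.0], [0.0, 1.0]])   # constant-velocity transition
        H = torch.tensor([[1.0, 0.0]])               # we observe position only

        q = torch.exp(self.q_net(motion_status)).squeeze()   # process noise variance
        r = torch.exp(self.r_net(motion_status)).squeeze()   # observation noise variance
        Q = q * torch.eye(2)
        R = r.reshape(1, 1)

        # Predict
        x = F @ x
        P = F @ P @ F.T + Q
        # Update
        y = z - (H @ x)                              # innovation
        S = H @ P @ H.T + R
        K = P @ H.T @ torch.linalg.inv(S)            # Kalman gain
        x = x + (K @ y).squeeze(-1)
        P = (torch.eye(2) - K @ H) @ P
        return x, P

nkf = NeuralKalmanFilter1D()
x0, P0 = torch.tensor([0.0, 1.0]), torch.eye(2)
x1, P1 = nkf.step(x0, P0, torch.tensor(1.2), torch.tensor([1.0, 0.2]))
```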
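
The test-time similarity calibration can be read as re-weighting raw matching scores by an estimated ("pseudo") frequency of each description over the test set, so that frequent head descriptions do not drown out rare tail ones. The concrete formula below is a plausible illustration under that reading, not the paper's exact equation:

```python
import numpy as np

def calibrate_scores(sim_matrix: np.ndarray, alpha: float = 0.5, tau: float = 0.1) -> np.ndarray:
    """Illustrative pseudo-frequency calibration of similarity scores.

    sim_matrix: (num_trajectories, num_descriptions) raw similarity scores.
    A description's pseudo frequency is estimated from how strongly it
    matches trajectories overall; frequent descriptions are then
    down-weighted so rare descriptions are not suppressed.
    """
    # Soft assignment of each trajectory over descriptions
    probs = np.exp(sim_matrix / tau)
    probs /= probs.sum(axis=1, keepdims=True)

    # Pseudo frequency of each description, normalized over the test set
    pseudo_freq = probs.mean(axis=0)            # (num_descriptions,)
    pseudo_freq /= pseudo_freq.sum()

    # Penalize high-frequency descriptions; alpha controls the strength
    return sim_matrix - alpha * np.log(pseudo_freq + 1e-8)

# Example: 5 trajectories scored against 3 textual descriptions
scores = np.random.rand(5, 3)
print(calibrate_scores(scores))
```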

Experimental Results

The framework's effectiveness is demonstrated on the Refer-KITTI dataset, where iKUN outperforms previous methods such as TransRMOT by notable margins in HOTA, MOTA, and IDF1, supporting the modular and adaptive design. In addition, the authors contribute Refer-Dance, a more challenging benchmark built by extending the public DanceTrack dataset with motion and dressing descriptions, and use it to demonstrate iKUN's adaptability to diverse tracking scenarios.

Implications and Future Directions

From a practical standpoint, iKUN removes the need to retrain the underlying tracker, cutting training time and engineering cost while maintaining state-of-the-art referring tracking performance. Conceptually, the modular design could be reused across a range of tracking systems and encourages further exploration of text-guided, dynamic feature extraction.

Potential future developments include strengthening temporal modeling within iKUN so that motion-centric descriptions are handled more reliably, and extending the framework to other domains such as robotics or autonomous driving.

By emphasizing modularity and adaptability, iKUN offers a compelling direction for RMOT research and shows how natural language understanding can be coupled with off-the-shelf visual trackers. This line of work points toward more versatile and efficient tracking solutions that bridge language understanding and visual perception.
