iKUN: Speak to Trackers without Retraining (2312.16245v2)

Published 25 Dec 2023 in cs.CV

Abstract: Referring multi-object tracking (RMOT) aims to track multiple objects based on input textual descriptions. Previous works realize it by simply integrating an extra textual module into the multi-object tracker. However, they typically need to retrain the entire framework and have difficulties in optimization. In this work, we propose an insertable Knowledge Unification Network, termed iKUN, to enable communication with off-the-shelf trackers in a plug-and-play manner. Concretely, a knowledge unification module (KUM) is designed to adaptively extract visual features based on textual guidance. Meanwhile, to improve the localization accuracy, we present a neural version of Kalman filter (NKF) to dynamically adjust process noise and observation noise based on the current motion status. Moreover, to address the problem of open-set long-tail distribution of textual descriptions, a test-time similarity calibration method is proposed to refine the confidence score with pseudo frequency. Extensive experiments on Refer-KITTI dataset verify the effectiveness of our framework. Finally, to speed up the development of RMOT, we also contribute a more challenging dataset, Refer-Dance, by extending public DanceTrack dataset with motion and dressing descriptions. The codes and dataset are available at https://github.com/dyhBUPT/iKUN.

Summary

The paper introduces iKUN, a plug-and-play module whose knowledge unification module (KUM) extracts visual features under textual guidance, so off-the-shelf multi-object trackers can be used without retraining.
The paper presents a neural Kalman filter (NKF) that dynamically adjusts process noise and observation noise based on the current motion status to improve localization accuracy.
The paper proposes a test-time similarity calibration method that refines confidence scores with pseudo frequency, verifies the framework on Refer-KITTI, and contributes the more challenging Refer-Dance dataset.

The paper "iKUN: Speak to Trackers without Retraining" addresses referring multi-object tracking (RMOT), the task of tracking multiple objects based on textual descriptions. It proposes iKUN (insertable Knowledge Unification Network), a module that works with an existing tracker instead of replacing it. Previous RMOT methods integrate an extra textual module into the tracker itself, which typically requires retraining the entire framework and makes optimization difficult.

Technical Contributions

The iKUN framework acts as a plug-and-play module compatible with off-the-shelf trackers. Its main components are:

Knowledge Unification Module (KUM): iKUN uses the KUM to adaptively extract visual features under textual guidance, in contrast to models like CLIP, whose visual features are extracted independently of the text. Because the visual features depend on the description, a single trajectory can be matched against several different descriptions.
Neural Kalman Filter (NKF): To improve localization accuracy, iKUN uses a neural version of the Kalman filter in which process noise and observation noise are adjusted dynamically based on the current motion status. A conventional Kalman filter instead relies on fixed, manually chosen noise parameters for its prediction and update steps.
Test-time Similarity Calibration: Textual descriptions follow an open-set long-tail distribution, so some descriptions are far rarer than others and test descriptions may not appear in training. iKUN estimates a pseudo frequency for each test description and uses it to refine the confidence score at test time, counteracting the skewed frequency distribution.
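
To make the idea behind the NKF concrete, here is a minimal sketch of a Kalman filter whose noise variances are chosen at every step instead of being fixed. It is not the paper's implementation: the filter is one-dimensional with a random-walk motion model, and `adaptive_noise` is a hypothetical hand-written heuristic standing in for the learned network. It also adapts only the process noise, whereas the NKF adjusts both process and observation noise. All names and constants are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScalarKalman:
    """1-D Kalman filter with a random-walk motion model (illustrative only)."""
    x: float          # state estimate, e.g. one box-centre coordinate
    p: float = 1.0    # variance of the state estimate

    def step(self, z: float, q: float, r: float) -> float:
        """Run one predict + update cycle with the given noise variances."""
        # Predict: the random-walk model keeps x and inflates uncertainty by q.
        p_pred = self.p + q
        # Update: the Kalman gain balances the prediction against measurement z.
        gain = p_pred / (p_pred + r)
        self.x = self.x + gain * (z - self.x)
        self.p = (1.0 - gain) * p_pred
        return self.x


def adaptive_noise(innovation: float, base_q: float = 0.01, base_r: float = 1.0):
    """Hypothetical stand-in for a learned noise predictor (an assumption).

    A large innovation (measurement far from the prediction) is treated as a
    sign of a motion change, so the process noise q grows with it. The
    observation noise r is left at its base value in this sketch.
    """
    q = base_q * (1.0 + innovation * innovation)
    r = base_r
    return q, r


def track(measurements, adaptive=True):
    """Filter a sequence of measurements and return the list of estimates."""
    kf = ScalarKalman(x=measurements[0])
    estimates = []
    for z in measurements:
        innovation = z - kf.x
        # Fixed baseline uses the same base values without adaptation.
        q, r = adaptive_noise(innovation) if adaptive else (0.01, 1.0)
        estimates.append(kf.step(z, q, r))
    return estimates


if __name__ == "__main__":
    # The target sits at 0, then jumps to 10: a sudden change in motion.
    zs = [0.0, 0.0, 0.0, 0.0, 10.0, 10.0, 10.0, 10.0]
    print("fixed noise   :", round(track(zs, adaptive=False)[-1], 2))
    print("adaptive noise:", round(track(zs, adaptive=True)[-1], 2))
```

With fixed noise the filter trusts its prediction too much after the jump and lags behind; raising the process noise when the innovation is large lets the estimate catch up faster. The NKF replaces a hand-written rule of this kind with a network that predicts the noise terms from the current motion status.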

Experimental Results

The framework is evaluated on the Refer-KITTI dataset, where iKUN outperforms previous methods such as TransRMOT on the HOTA, MOTA, and IDF1 metrics. The paper also contributes Refer-Dance, a more challenging dataset built by extending the public DanceTrack dataset with motion and dressing descriptions.
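
For readers unfamiliar with the three metrics, their standard definitions from the multi-object tracking literature are summarized below. These are general definitions, not formulas quoted from the iKUN paper. Here FN, FP, and IDSW count missed detections, false positives, and identity switches; GT is the number of ground-truth boxes; IDTP, IDFP, and IDFN are identity-level true positives, false positives, and false negatives; DetA and AssA are detection and association accuracy at localization threshold \alpha.

```latex
\begin{align}
\mathrm{MOTA} &= 1 - \frac{\mathrm{FN} + \mathrm{FP} + \mathrm{IDSW}}{\mathrm{GT}} \\
\mathrm{IDF1} &= \frac{2\,\mathrm{IDTP}}{2\,\mathrm{IDTP} + \mathrm{IDFP} + \mathrm{IDFN}} \\
\mathrm{HOTA}_{\alpha} &= \sqrt{\mathrm{DetA}_{\alpha} \cdot \mathrm{AssA}_{\alpha}}
\end{align}
```

The reported HOTA score is the average of HOTA_\alpha over a range of localization thresholds. MOTA is dominated by detection errors and IDF1 by identity consistency, while HOTA balances the two.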

Implications and Future Directions

From a practical standpoint, iKUN is inserted into an off-the-shelf tracker, so the tracker itself does not need to be retrained, which reduces training time and cost. More broadly, the modular design could be applied to a variety of tracking systems, and it encourages further work on text-guided feature extraction.

Future work may strengthen temporal modeling within the iKUN framework, which would help with descriptions that depend on motion. Extending iKUN to other domains, such as robotics or autonomous vehicles, is another possible direction.

Overall, iKUN offers a modular direction for RMOT research: language understanding is added to an existing tracker instead of being built into it. This may lead to more versatile and efficient tracking solutions that connect language understanding with visual perception.