- The paper introduces RMOT, a new task for tracking multiple objects in video sequences using natural language inputs.
- It presents the Refer-KITTI dataset with 18 videos and 818 text expressions, averaging 10.7 objects per expression to simulate real-world tracking scenarios.
- TransRMOT, the proposed architecture, is a transformer tracker that fuses linguistic and visual features early and uses deformable attention with dual (track/detect) queries to improve tracking accuracy.
Overview of "Referring Multi-Object Tracking"
The paper introduces a new computer-vision task termed Referring Multi-Object Tracking (RMOT). Unlike existing referring tasks, which ground a single object from a textual cue, RMOT tracks multiple objects in video sequences guided by natural language expressions. This formulation broadens the traditional understanding of referring tasks: an arbitrary number of objects may match the expression, and all of them must be identified and tracked across frames in response to the linguistic prompt.
Key Contributions
- Novel Task Definition: RMOT is the first referring task to support multi-object tracking guided by language. Existing referring benchmarks assume each expression corresponds to a single object, missing the breadth of real-world scenarios where the same semantic cue applies to multiple targets.
- Benchmark Dataset – Refer-KITTI: Built on the existing KITTI dataset, Refer-KITTI contains 18 videos and 818 textual expressions, with each expression referring to an average of 10.7 objects. The dataset is distinctive for its flexibility in the number of referred objects and for temporal status changes: objects can start or stop matching an expression as the video unfolds, making it a more realistic testbed for tracking.
- TransRMOT Architecture: The paper introduces TransRMOT, a transformer-based architecture designed specifically for RMOT. TransRMOT utilizes a cross-modal encoder-decoder architecture where visual and linguistic features are integrated to understand and predict object tracks. It leverages deformable attention mechanisms for efficient cross-frame and cross-modal associations, setting it apart from other methods.
Methodology
TransRMOT enhances the core architecture of the Deformable DETR (Deformable Transformers for End-to-End Object Detection) by incorporating:
- Cross-modal Early Fusion: Visual and linguistic features are fused before the transformer encoder, so the language expression conditions the visual representation from the start while keeping per-frame computational overhead low.
- Dual Queries in the Decoder: Track queries propagate the identities of objects already being tracked from the previous frame, while detect queries are responsible for newly appearing instances. This split lets TransRMOT handle both the spatial (detection) and temporal (association) aspects of tracking within a single decoder.
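The two ideas above can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the shapes, the simple sigmoid-gated fusion, and the query counts are all assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256                                   # shared embedding dimension (assumed)
visual = rng.standard_normal((900, d))    # flattened image tokens
text = rng.standard_normal((8, d))        # word embeddings for the expression

# (1) Early fusion: pool the sentence and gate each visual token with it,
# so every token entering the encoder is already language-conditioned.
# (Illustrative gating; the paper's fusion module differs.)
sentence = text.mean(axis=0)                                   # (d,)
gate = 1.0 / (1.0 + np.exp(-(visual @ sentence) / np.sqrt(d))) # (900,)
fused = visual * gate[:, None]                                 # (900, d)

# (2) Dual queries: track queries carry identities over from the previous
# frame; detect queries are fresh slots for newly appearing referred objects.
track_queries = rng.standard_normal((5, d))   # 5 objects tracked so far
detect_queries = np.zeros((300, d))           # learnable slots (zeros here)
decoder_input = np.concatenate([track_queries, detect_queries], axis=0)
print(decoder_input.shape)                    # (305, 256)
```

Because the track queries are reused frame to frame, identity association falls out of the decoder itself rather than requiring a separate post-hoc matching step.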
Experimental Insights
On the Refer-KITTI benchmark, TransRMOT outperforms a range of CNN-based and traditional tracking models adapted to the referring setting. Evaluation centers on Higher Order Tracking Accuracy (HOTA), which jointly assesses detection and association quality, demonstrating the effectiveness of TransRMOT in realistic multi-object tracking scenarios.
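For context, HOTA (Luiten et al.) combines per-threshold detection accuracy (DetA) and association accuracy (AssA) as a geometric mean, averaged over a range of localization thresholds. A minimal sketch, with toy accuracy values standing in for real evaluation output:

```python
import numpy as np

def hota(det_a: np.ndarray, ass_a: np.ndarray) -> float:
    """Average the per-threshold geometric mean sqrt(DetA_a * AssA_a)."""
    return float(np.mean(np.sqrt(det_a * ass_a)))

alphas = np.arange(0.05, 1.0, 0.05)   # 19 localization thresholds
det_a = np.full_like(alphas, 0.64)    # toy per-threshold detection accuracy
ass_a = np.full_like(alphas, 0.49)    # toy per-threshold association accuracy
print(round(hota(det_a, ass_a), 3))   # sqrt(0.64 * 0.49) = 0.56
```

The geometric mean means a tracker cannot score well by excelling at detection while associating identities poorly, or vice versa, which is why the metric suits a task where both finding the referred objects and keeping their identities matter.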
Implications and Future Work
The introduction of RMOT and the accompanying Refer-KITTI dataset paves the way for significant advancements in cross-modal video understanding tasks. The framework facilitates various applications such as autonomous driving and surveillance, where understanding and tracking multiple objects as directed by human language can provide enhanced contextual awareness.
Looking ahead, integrating RMOT with larger datasets and more diverse application domains could further refine and broaden the applicability of this work. Beyond advancing the current research landscape in object tracking, the paper challenges future researchers to explore more dynamic, interactive models suited to complex real-world requirements.