- The paper introduces RMOT, a new task for tracking multiple objects in video sequences using natural language inputs.
- It presents the Refer-KITTI dataset with 18 videos and 818 text expressions, averaging 10.7 objects per expression to simulate real-world tracking scenarios.
- TransRMOT, the proposed architecture, is a transformer tracker that fuses linguistic and visual features early and uses deformable attention with dual (track/detect) queries to improve tracking accuracy.
Overview of "Referring Multi-Object Tracking"
The paper introduces a new computer-vision task termed Referring Multi-Object Tracking (RMOT). Unlike existing referring tasks, which ground a single object from a textual cue, RMOT tracks multiple objects in video sequences guided by natural language expressions. This formulation broadens the traditional understanding of referring tasks: an arbitrary number of objects may match the expression, and all of them must be identified and tracked across frames in response to the linguistic prompt.
Key Contributions
- Novel Task Definition: RMOT is the first referring task to support multi-object tracking guided by language. Existing referring benchmarks assume each expression corresponds to a single object, missing the breadth of real-world scenarios where the same semantic cue applies to multiple targets.
- Benchmark Dataset – Refer-KITTI: Built on the existing KITTI dataset, Refer-KITTI contains 18 videos and 818 textual expressions, with each expression referring to an average of 10.7 objects. The dataset is distinctive for its flexibility in the number of referred objects and for temporal status changes: objects can start or stop matching an expression as the video unfolds, making it a more realistic testbed for tracking.
- TransRMOT Architecture: The paper introduces TransRMOT, a transformer-based architecture designed specifically for RMOT. TransRMOT utilizes a cross-modal encoder-decoder architecture where visual and linguistic features are integrated to understand and predict object tracks. It leverages deformable attention mechanisms for efficient cross-frame and cross-modal associations, setting it apart from other methods.
Methodology
TransRMOT enhances the core architecture of the Deformable DETR (Deformable Transformers for End-to-End Object Detection) by incorporating:
- Cross-modal Early Fusion: Visual and linguistic features are fused before the transformer encoder, so the language expression conditions the visual representation from the start while keeping per-frame computational overhead low.
- Dual Queries in the Decoder: Track queries propagate the identities of objects already being tracked from the previous frame, while detect queries are responsible for newly appearing instances. This split lets TransRMOT handle both the spatial (detection) and temporal (association) aspects of tracking within a single decoder.
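The two ideas above can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the shapes, the simple sigmoid-gated fusion, and the query counts are all assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256                                   # shared embedding dimension (assumed)
visual = rng.standard_normal((900, d))    # flattened image tokens
text = rng.standard_normal((8, d))        # word embeddings for the expression

# (1) Early fusion: pool the sentence and gate each visual token with it,
# so every token entering the encoder is already language-conditioned.
# (Illustrative gating; the paper's fusion module differs.)
sentence = text.mean(axis=0)                                   # (d,)
gate = 1.0 / (1.0 + np.exp(-(visual @ sentence) / np.sqrt(d))) # (900,)
fused = visual * gate[:, None]                                 # (900, d)

# (2) Dual queries: track queries carry identities over from the previous
# frame; detect queries are fresh slots for newly appearing referred objects.
track_queries = rng.standard_normal((5, d))   # 5 objects tracked so far
detect_queries = np.zeros((300, d))           # learnable slots (zeros here)
decoder_input = np.concatenate([track_queries, detect_queries], axis=0)
print(decoder_input.shape)                    # (305, 256)
```

Because the track queries are reused frame to frame, identity association falls out of the decoder itself rather than requiring a separate post-hoc matching step.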
Experimental Insights
On the Refer-KITTI benchmark, TransRMOT outperforms a range of CNN-based and traditional tracking models adapted to the referring setting. Evaluation centers on Higher Order Tracking Accuracy (HOTA), which jointly assesses detection and association quality, demonstrating the effectiveness of TransRMOT in realistic multi-object tracking scenarios.
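For context, HOTA (Luiten et al.) combines per-threshold detection accuracy (DetA) and association accuracy (AssA) as a geometric mean, averaged over a range of localization thresholds. A minimal sketch, with toy accuracy values standing in for real evaluation output:

```python
import numpy as np

def hota(det_a: np.ndarray, ass_a: np.ndarray) -> float:
    """Average the per-threshold geometric mean sqrt(DetA_a * AssA_a)."""
    return float(np.mean(np.sqrt(det_a * ass_a)))

alphas = np.arange(0.05, 1.0, 0.05)   # 19 localization thresholds
det_a = np.full_like(alphas, 0.64)    # toy per-threshold detection accuracy
ass_a = np.full_like(alphas, 0.49)    # toy per-threshold association accuracy
print(round(hota(det_a, ass_a), 3))   # sqrt(0.64 * 0.49) = 0.56
```

The geometric mean means a tracker cannot score well by excelling at detection while associating identities poorly, or vice versa, which is why the metric suits a task where both finding the referred objects and keeping their identities matter.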
Implications and Future Work
The introduction of RMOT and the accompanying Refer-KITTI dataset paves the way for significant advancements in cross-modal video understanding tasks. The framework facilitates various applications such as autonomous driving and surveillance, where understanding and tracking multiple objects as directed by human language can provide enhanced contextual awareness.
Looking ahead, integrating RMOT with larger datasets and more diverse application domains could further refine and broaden the applicability of this work. Beyond advancing the current research landscape in object tracking, the paper challenges future researchers to explore more dynamic, interactive models suited to complex real-world requirements.