Z-GMOT: Zero-shot Generic Multiple Object Tracking (2305.17648v4)
Abstract: Despite significant recent progress, Multi-Object Tracking (MOT) still relies on prior knowledge and predefined categories, and it struggles with unseen objects. Generic Multiple Object Tracking (GMOT) has emerged as an alternative that requires less prior information. However, current GMOT methods often rely on initial bounding boxes and struggle to handle variations in viewpoint, lighting, occlusion, and scale, among other factors. Our first contribution is the \textit{Referring GMOT dataset}, a collection of videos, each accompanied by a detailed textual description of its attributes. We then propose $\mathtt{Z-GMOT}$, a tracking solution capable of tracking objects from \textit{never-seen categories} without the need for initial bounding boxes or predefined categories. Within the $\mathtt{Z-GMOT}$ framework, we introduce two novel components: (i) $\mathtt{iGLIP}$, an improved version of Grounded Language-Image Pre-training, for accurately detecting unseen objects with specific characteristics; and (ii) $\mathtt{MA-SORT}$, a novel object association approach that integrates motion-based and appearance-based matching strategies to track objects with high similarity. We benchmark our contributions through extensive experiments on the Referring GMOT dataset for the GMOT task. Additionally, to assess the generalizability of $\mathtt{Z-GMOT}$, we conduct ablation studies on the DanceTrack and MOT20 datasets for the MOT task. Our dataset, code, and models are released at: https://fsoft-aic.github.io/Z-GMOT.
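To make the idea of combining motion and appearance cues in association concrete, the following is a minimal sketch and not the authors' $\mathtt{MA-SORT}$ implementation. It assumes that motion agreement is measured by box IoU, that appearance agreement is measured by cosine similarity of embedding vectors, that the two costs are blended with a fixed weight, and that matching is greedy. The names `associate`, `weight`, and `max_cost` are hypothetical and chosen only for illustration.

```python
import math


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def cosine(u, v):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def associate(tracks, detections, weight=0.5, max_cost=0.7):
    """Match tracks to detections on a blended motion/appearance cost.

    Each track and detection is a (box, feature) pair. The fixed `weight`
    and the greedy matching are illustrative assumptions, not the paper's
    method. Returns a sorted list of (track_index, detection_index) pairs.
    """
    candidates = []
    for ti, (tbox, tfeat) in enumerate(tracks):
        for di, (dbox, dfeat) in enumerate(detections):
            motion_cost = 1.0 - iou(tbox, dbox)
            appearance_cost = 1.0 - cosine(tfeat, dfeat)
            cost = weight * motion_cost + (1.0 - weight) * appearance_cost
            if cost <= max_cost:  # discard implausible pairs
                candidates.append((cost, ti, di))

    # Greedy assignment: take the cheapest remaining pair first.
    candidates.sort()
    matches, used_tracks, used_dets = [], set(), set()
    for _, ti, di in candidates:
        if ti in used_tracks or di in used_dets:
            continue
        matches.append((ti, di))
        used_tracks.add(ti)
        used_dets.add(di)
    return sorted(matches)


tracks = [((0, 0, 10, 10), (1.0, 0.0)), ((20, 20, 30, 30), (0.0, 1.0))]
detections = [((21, 21, 31, 31), (0.0, 1.0)), ((1, 1, 11, 11), (1.0, 0.0))]
print(associate(tracks, detections))
```

When objects look nearly identical, the appearance term carries little information, so a smaller appearance share (a larger `weight` in this sketch) lets motion dominate; how the two cues are actually balanced in $\mathtt{MA-SORT}$ is described in the paper itself.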