Single-Model and Any-Modality for Video Object Tracking (2311.15851v3)

Published 27 Nov 2023 in cs.CV

Abstract: In the realm of video object tracking, auxiliary modalities such as depth, thermal, or event data have emerged as valuable assets to complement RGB trackers. In practice, most existing RGB trackers learn a single set of parameters to use across datasets and applications. However, a similar single-model unification for multi-modality tracking presents several challenges. These challenges stem from the inherent heterogeneity of inputs -- each with modality-specific representations, the scarcity of multi-modal datasets, and the fact that not all modalities are available at all times. In this work, we introduce Un-Track, a Unified Tracker with a single set of parameters for any modality. To handle any modality, our method learns their common latent space through low-rank factorization and reconstruction techniques. More importantly, we use only RGB-X pairs to learn the common latent space. This unique shared representation seamlessly binds all modalities together, enabling effective unification and accommodating any missing modality, all within a single transformer-based architecture. Our Un-Track achieves a +8.1 absolute F-score gain on the DepthTrack dataset, while introducing only +2.14 (over 21.50) GFLOPs and +6.6M (over 93M) parameters, through a simple yet efficient prompting strategy. Extensive comparisons on five benchmark datasets with different modalities show that Un-Track surpasses both SOTA unified trackers and modality-specific counterparts, validating our effectiveness and practicality. The source code is publicly available at https://github.com/Zongwei97/UnTrack.
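The abstract describes mapping heterogeneous auxiliary modalities (depth, thermal, event) into a common latent space via low-rank factorization, with a reconstruction objective keeping the shared code informative. A minimal sketch of that idea follows; all dimensions, the linear decoder, and the adapter layout are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not the paper's values).
d_x, rank, d_shared = 32, 4, 64  # auxiliary-modality dim, low rank, shared dim

def low_rank_project(feat, down, up):
    """Factorized projection feat @ down @ up, with rank << feature dims.

    Parameter cost is (d_x + d_shared) * rank instead of d_x * d_shared,
    which is why a low-rank adapter per modality stays lightweight.
    """
    return feat @ down @ up

# One low-rank adapter per auxiliary modality (depth / thermal / event),
# all mapping into the same shared latent space that the RGB features live in.
down = rng.normal(0.0, 0.02, size=(d_x, rank))
up = rng.normal(0.0, 0.02, size=(rank, d_shared))

x_feat = rng.normal(size=(10, d_x))          # 10 auxiliary-modality tokens
shared = low_rank_project(x_feat, down, up)  # -> shape (10, d_shared)

# Reconstruction objective: decode the shared code back toward the input so
# the latent space retains modality information. A plain linear decoder
# stands in here for the paper's reconstruction branch.
decoder = rng.normal(0.0, 0.02, size=(d_shared, d_x))
recon = shared @ decoder
recon_loss = np.mean((recon - x_feat) ** 2)  # minimized during training

print(shared.shape, down.size + up.size, d_x * d_shared)
```

With these toy sizes the factorized adapter uses 384 parameters versus 2048 for a full projection, illustrating how the method can bind extra modalities while adding only a small parameter and FLOP overhead.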

Authors (8)
  1. Zongwei Wu (41 papers)
  2. Jilai Zheng (7 papers)
  3. Xiangxuan Ren (4 papers)
  4. Florin-Alexandru Vasluianu (11 papers)
  5. Chao Ma (187 papers)
  6. Danda Pani Paudel (94 papers)
  7. Luc Van Gool (569 papers)
  8. Radu Timofte (299 papers)
Citations (16)