
Event Stream-based Visual Object Tracking: A High-Resolution Benchmark Dataset and A Novel Baseline (2309.14611v1)

Published 26 Sep 2023 in cs.CV and cs.NE

Abstract: Tracking using bio-inspired event cameras has drawn increasing attention in recent years. Existing works either utilize aligned RGB and event data for accurate tracking or directly learn an event-based tracker. The first category incurs higher inference cost, while the second is easily affected by noisy events and sparse spatial resolution. In this paper, we propose a novel hierarchical knowledge distillation framework that fully utilizes multi-modal / multi-view information during training to facilitate knowledge transfer, enabling high-speed, low-latency visual tracking at test time using only event signals. Specifically, a teacher Transformer-based multi-modal tracking framework is first trained by feeding the RGB frame and event stream simultaneously. Then, we design a new hierarchical knowledge distillation strategy, which includes pairwise similarity, feature representation, and response map-based knowledge distillation, to guide the learning of the student Transformer network. Moreover, since existing event-based tracking datasets are all low-resolution ($346 \times 260$), we propose the first large-scale high-resolution ($1280 \times 720$) dataset, named EventVOT. It contains 1141 videos covering a wide range of categories such as pedestrians, vehicles, UAVs, ping-pong balls, etc. Extensive experiments on both the low-resolution datasets (FE240hz, VisEvent, COESOT) and our newly proposed high-resolution EventVOT dataset fully validate the effectiveness of the proposed method. The dataset, evaluation toolkit, and source code are available at \url{https://github.com/Event-AHU/EventVOT_Benchmark}

Authors (7)
  1. Xiao Wang
  2. Shiao Wang
  3. Chuanming Tang
  4. Lin Zhu
  5. Bo Jiang
  6. Yonghong Tian
  7. Jin Tang

Summary

  • The paper introduces a Transformer-based tracking framework that transfers multi-modal RGB-event knowledge to an event-only student model via hierarchical distillation.
  • It employs a three-part distillation strategy—pairwise similarity, feature representation, and response map—to optimize tracking accuracy and speed.
  • The high-resolution EventVOT dataset, with 1141 videos at 1280×720 resolution, establishes a new benchmark for robust object tracking under challenging conditions.

Event Stream-based Visual Object Tracking: A High-Resolution Benchmark Dataset and A Novel Baseline

The research presented in the paper seeks to address the challenges of Visual Object Tracking (VOT) using event cameras. Traditional RGB-based object tracking often encounters difficulties in scenarios involving rapid motion, low illumination, background distractions, and objects moving out-of-frame. In contrast, event cameras, inspired by biological vision systems, offer asynchronous outputs with high temporal resolution, which make them suitable for fast motion tracking and challenging lighting conditions.
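Concretely, an event camera reports a sparse stream of tuples (x, y, t, p): pixel location, timestamp, and polarity. Before a standard vision backbone can consume such a stream, it is typically accumulated into a dense representation. The following is a minimal illustrative sketch of that common preprocessing step, not the paper's exact pipeline:

```python
# Illustrative only: accumulate an asynchronous event stream into a
# fixed-size 2-channel frame (one channel per polarity) so that a
# conventional vision backbone can consume it.
import numpy as np

def events_to_frame(events: np.ndarray, height: int, width: int) -> np.ndarray:
    """events: (N, 4) array of (x, y, t, p) with polarity p in {-1, +1}."""
    frame = np.zeros((2, height, width), dtype=np.float32)
    xs = events[:, 0].astype(int)
    ys = events[:, 1].astype(int)
    ps = (events[:, 3] > 0).astype(int)  # channel 0: negative, channel 1: positive
    np.add.at(frame, (ps, ys, xs), 1.0)  # count events per pixel and polarity
    return frame

# Example: 1000 random events on a 1280x720 sensor (EventVOT's resolution).
rng = np.random.default_rng(0)
ev = np.stack([rng.integers(0, 1280, 1000),    # x
               rng.integers(0, 720, 1000),     # y
               np.sort(rng.random(1000)),      # t
               rng.choice([-1, 1], 1000)], axis=1).astype(np.float32)
print(events_to_frame(ev, 720, 1280).shape)    # (2, 720, 1280)
```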

This paper introduces a novel event-based tracking framework that applies hierarchical knowledge distillation to multi-modal data. The proposed methodology is a Transformer-based tracking network organized as a teacher-student architecture. The teacher Transformer is first trained on synchronized RGB frames and event streams, fusing the two modalities to learn rich feature representations. A hierarchical knowledge distillation strategy then transfers the teacher's knowledge to a student model that operates solely on event data, enabling low-latency, high-speed tracking at inference time.
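The paper's exact architecture is not reproduced here, but the teacher-student pattern it describes can be sketched as follows; module sizes, names, and the patch-embedding details are placeholder assumptions:

```python
# Minimal sketch of the teacher-student pattern (assumed module names and
# sizes; the paper's actual tracker is a more elaborate Transformer).
import torch
import torch.nn as nn

class MultiModalTeacher(nn.Module):
    """Teacher: consumes both RGB frames and event representations."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.rgb_proj = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # patchify RGB
        self.evt_proj = nn.Conv2d(2, dim, kernel_size=16, stride=16)  # patchify events
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=4)

    def forward(self, rgb, evt):
        tokens = torch.cat([
            self.rgb_proj(rgb).flatten(2).transpose(1, 2),
            self.evt_proj(evt).flatten(2).transpose(1, 2)], dim=1)
        return self.encoder(tokens)  # fused multi-modal features

class EventStudent(nn.Module):
    """Student: event-only, so inference needs no RGB stream."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.evt_proj = nn.Conv2d(2, dim, kernel_size=16, stride=16)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=4)

    def forward(self, evt):
        return self.encoder(self.evt_proj(evt).flatten(2).transpose(1, 2))
```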

A prominent highlight of this research is the introduction of a high-resolution dataset named EventVOT. Unlike previous event-based datasets limited to low resolution, EventVOT delivers a substantial increase to 1280 × 720, encompassing 1141 videos across varied categories, from pedestrians to vehicles and UAVs. This benchmark aims to provide extensive data for training and evaluation, facilitating performance improvements and method validation in high-resolution settings.

The core methodological contribution is the hierarchical knowledge distillation technique, comprising three components: pairwise similarity, feature representation, and response map-based distillation. Together, these terms guide the student to mimic the high-capacity teacher at several levels of abstraction. Evaluation on both low-resolution benchmarks (FE240hz, VisEvent, COESOT) and the new high-resolution EventVOT demonstrates the efficacy of the distillation strategy: the results show substantial improvements in tracking accuracy and speed, with the event-only student model achieving competitive performance.
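As a rough illustration, the three distillation terms might look like the following in PyTorch. The exact formulations and loss weights are assumptions, not the paper's; the sketch also assumes teacher and student features have already been aligned to the same token shape (e.g., distilling only on the event tokens):

```python
# Illustrative versions of the three distillation terms named above.
import torch
import torch.nn.functional as F

def pairwise_similarity_loss(f_t, f_s):
    """Match the token-to-token similarity structure of teacher and student.
    f_t, f_s: (B, N, D) feature tokens, assumed shape-aligned."""
    g_t = F.normalize(f_t, dim=-1) @ F.normalize(f_t, dim=-1).transpose(1, 2)
    g_s = F.normalize(f_s, dim=-1) @ F.normalize(f_s, dim=-1).transpose(1, 2)
    return F.mse_loss(g_s, g_t)

def feature_loss(f_t, f_s):
    """Directly regress student features onto teacher features."""
    return F.mse_loss(f_s, f_t)

def response_map_loss(r_t, r_s, tau: float = 2.0):
    """KL divergence between softened teacher/student tracking response maps.
    r_t, r_s: (B, H*W) response-map logits; tau is a softening temperature."""
    p_t = F.softmax(r_t / tau, dim=-1)
    log_p_s = F.log_softmax(r_s / tau, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * tau ** 2

# Combined objective (weights a, b, c are placeholders):
# loss = task_loss + a * pairwise_similarity_loss(f_t, f_s) \
#        + b * feature_loss(f_t, f_s) + c * response_map_loss(r_t, r_s)
```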

Extensive experimental results validate the robustness of the proposed method. The design enables effective event-based tracking with significant gains in tracking accuracy, measured in standard metrics such as success rate (SR) and precision rate (PR), and in processing efficiency, supporting practical applications that require rapid processing with minimal latency. This is particularly pertinent for autonomous systems where computational resources may be constrained.
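For reference, SR and PR are conventionally computed from per-frame IoU and center-location error. A minimal sketch, using the community's usual 0.5 IoU and 20-pixel precision thresholds (single-threshold variants; benchmarks often also report curves over threshold sweeps):

```python
# Sketch of single-threshold SR/PR computation for one tracking sequence.
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x, y, w, h) format."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    return inter / (a[2] * a[3] + b[2] * b[3] - inter + 1e-9)

def sr_pr(preds, gts, iou_thr=0.5, dist_thr=20.0):
    """preds, gts: lists of (x, y, w, h) boxes, one pair per frame."""
    ious, dists = [], []
    for p, g in zip(preds, gts):
        ious.append(iou(p, g))
        pc = (p[0] + p[2] / 2, p[1] + p[3] / 2)  # predicted center
        gc = (g[0] + g[2] / 2, g[1] + g[3] / 2)  # ground-truth center
        dists.append(np.hypot(pc[0] - gc[0], pc[1] - gc[1]))
    sr = float(np.mean(np.asarray(ious) > iou_thr))    # frames with IoU above threshold
    pr = float(np.mean(np.asarray(dists) < dist_thr))  # frames with center error under 20 px
    return sr, pr
```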

The introduction of the high-resolution EventVOT dataset has substantial implications for the field of event-based vision. It establishes a new benchmark for evaluating tracking performance on event data, providing a comprehensive environment for comparison and fostering adoption of event-based strategies. The public availability of the dataset, evaluation toolkit, and source code encourages wide usage and further advances in the field.

Future directions for AI in visual tracking, prompted by this research, may involve advanced distillation strategies incorporating more complex inter-modality interactions and exploring self-supervised learning paradigms tailored for high-resolution event data. The interplay of artificial intelligence methodologies and bio-inspired sensor dynamics offers promising avenues for enhancing both theoretical understanding and practical implementations across disciplines, including surveillance, autonomous vehicles, and robotics.

Moreover, the proposed framework's implications extend to real-time tracking efficiency: event-based sensors offer considerable advantages over traditional frame-based methods, especially in dynamic and resource-constrained environments.
