Event Stream-based Visual Object Tracking: A High-Resolution Benchmark Dataset and A Novel Baseline (2309.14611v1)
Abstract: Tracking with bio-inspired event cameras has drawn increasing attention in recent years. Existing works either fuse aligned RGB and event data for accurate tracking or learn a tracker directly from event data. The first category incurs higher inference cost, while the second is easily affected by noisy events and the low spatial resolution of event sensors. In this paper, we propose a novel hierarchical knowledge distillation framework that fully exploits multi-modal / multi-view information during training to facilitate knowledge transfer, enabling high-speed, low-latency visual tracking at test time using event signals alone. Specifically, a teacher Transformer-based multi-modal tracking framework is first trained by feeding it the RGB frames and the event stream simultaneously. Then, we design a hierarchical knowledge distillation strategy, comprising pairwise-similarity, feature-representation, and response-map distillation, to guide the learning of the student Transformer network. Moreover, since existing event-based tracking datasets are all low-resolution ($346 \times 260$), we propose the first large-scale high-resolution ($1280 \times 720$) dataset, named EventVOT. It contains 1141 videos and covers a wide range of categories such as pedestrians, vehicles, UAVs, ping-pong balls, etc. Extensive experiments on both low-resolution datasets (FE240hz, VisEvent, COESOT) and our newly proposed high-resolution EventVOT dataset fully validate the effectiveness of the proposed method. The dataset, evaluation toolkit, and source code are available at \url{https://github.com/Event-AHU/EventVOT_Benchmark}
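As a rough illustration of the hierarchical distillation strategy described above, the sketch below combines three teacher-to-student losses: a pairwise-similarity term on normalized Gram matrices of the features, a direct feature-representation term, and a response-map term. The function names, loss forms (all MSE here), and equal weighting are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def pairwise_similarity_loss(f_teacher, f_student):
    """Match inter-sample similarity structure (N x N Gram matrices)."""
    g_t = f_teacher @ f_teacher.T
    g_s = f_student @ f_student.T
    # Row-normalize so scale differences between networks do not dominate.
    g_t = g_t / (np.linalg.norm(g_t, axis=1, keepdims=True) + 1e-8)
    g_s = g_s / (np.linalg.norm(g_s, axis=1, keepdims=True) + 1e-8)
    return float(np.mean((g_t - g_s) ** 2))

def feature_loss(f_teacher, f_student):
    """Directly align intermediate feature representations."""
    return float(np.mean((f_teacher - f_student) ** 2))

def response_loss(r_teacher, r_student):
    """Align the trackers' response (score) maps."""
    return float(np.mean((r_teacher - r_student) ** 2))

def hierarchical_distillation_loss(f_t, f_s, r_t, r_s, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three distillation terms (weights are a guess)."""
    w1, w2, w3 = weights
    return (w1 * pairwise_similarity_loss(f_t, f_s)
            + w2 * feature_loss(f_t, f_s)
            + w3 * response_loss(r_t, r_s))
```

In an actual training loop these terms would be added to the student's task loss, with the teacher's features and response maps detached from the gradient computation.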
Authors: Xiao Wang, Shiao Wang, Chuanming Tang, Lin Zhu, Bo Jiang, Yonghong Tian, Jin Tang