
Event Stream-based Visual Object Tracking: A High-Resolution Benchmark Dataset and A Novel Baseline (2309.14611v1)

Published 26 Sep 2023 in cs.CV and cs.NE

Abstract: Tracking using bio-inspired event cameras has drawn increasing attention in recent years. Existing works either utilize aligned RGB and event data for accurate tracking or directly learn an event-based tracker. The first category incurs higher inference cost, while the second is easily affected by noisy events and sparse spatial resolution. In this paper, we propose a novel hierarchical knowledge distillation framework that fully utilizes multi-modal / multi-view information during training to facilitate knowledge transfer, enabling high-speed, low-latency visual tracking at test time using only event signals. Specifically, a teacher Transformer-based multi-modal tracking framework is first trained by feeding the RGB frame and event stream simultaneously. Then, we design a new hierarchical knowledge distillation strategy, which includes pairwise similarity, feature representation, and response map-based knowledge distillation, to guide the learning of the student Transformer network. Moreover, since existing event-based tracking datasets are all low-resolution ($346 \times 260$), we propose the first large-scale high-resolution ($1280 \times 720$) dataset, named EventVOT. It contains 1141 videos covering a wide range of categories such as pedestrians, vehicles, UAVs, ping-pong balls, etc. Extensive experiments on both the low-resolution datasets (FE240hz, VisEvent, COESOT) and our newly proposed high-resolution EventVOT dataset fully validate the effectiveness of the proposed method. The dataset, evaluation toolkit, and source code are available at \url{https://github.com/Event-AHU/EventVOT_Benchmark}

Authors (7)
  1. Xiao Wang
  2. Shiao Wang
  3. Chuanming Tang
  4. Lin Zhu
  5. Bo Jiang
  6. Yonghong Tian
  7. Jin Tang

Summary

  • The paper introduces a Transformer-based tracking framework that transfers multi-modal RGB-event knowledge to an event-only student model via hierarchical distillation.
  • It employs a three-part distillation strategy—pairwise similarity, feature representation, and response map—to optimize tracking accuracy and speed.
  • The high-resolution EventVOT dataset, with 1141 videos at 1280×720 resolution, establishes a new benchmark for robust object tracking under challenging conditions.

Event Stream-based Visual Object Tracking: A High-Resolution Benchmark Dataset and A Novel Baseline

The research presented in the paper seeks to address the challenges of Visual Object Tracking (VOT) using event cameras. Traditional RGB-based object tracking often encounters difficulties in scenarios involving rapid motion, low illumination, background distractions, and objects moving out-of-frame. In contrast, event cameras, inspired by biological vision systems, offer asynchronous outputs with high temporal resolution, which make them suitable for fast motion tracking and challenging lighting conditions.
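Concretely, an event camera reports a sparse stream of tuples (x, y, t, p): pixel location, timestamp, and polarity. Before a standard vision backbone can consume such a stream, it is typically accumulated into a dense representation. The following is a minimal illustrative sketch of that common preprocessing step, not the paper's exact pipeline:

```python
# Illustrative only: accumulate an asynchronous event stream into a
# fixed-size 2-channel frame (one channel per polarity) so that a
# conventional vision backbone can consume it.
import numpy as np

def events_to_frame(events: np.ndarray, height: int, width: int) -> np.ndarray:
    """events: (N, 4) array of (x, y, t, p) with polarity p in {-1, +1}."""
    frame = np.zeros((2, height, width), dtype=np.float32)
    xs = events[:, 0].astype(int)
    ys = events[:, 1].astype(int)
    ps = (events[:, 3] > 0).astype(int)  # channel 0: negative, channel 1: positive
    np.add.at(frame, (ps, ys, xs), 1.0)  # count events per pixel and polarity
    return frame

# Example: 1000 random events on a 1280x720 sensor (EventVOT's resolution).
rng = np.random.default_rng(0)
ev = np.stack([rng.integers(0, 1280, 1000),    # x
               rng.integers(0, 720, 1000),     # y
               np.sort(rng.random(1000)),      # t
               rng.choice([-1, 1], 1000)], axis=1).astype(np.float32)
print(events_to_frame(ev, 720, 1280).shape)    # (2, 720, 1280)
```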

This paper introduces a novel event-based tracking framework that applies hierarchical knowledge distillation to multi-modal data. The proposed methodology is a Transformer-based tracking network organized as a teacher-student architecture. The teacher Transformer is first trained on synchronized RGB frames and event streams, fusing the two modalities to learn rich feature representations. A hierarchical knowledge distillation strategy then transfers the teacher's knowledge to a student model that operates solely on event data, enabling low-latency, high-speed tracking at inference time.
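The paper's exact architecture is not reproduced here, but the teacher-student pattern it describes can be sketched as follows; module sizes, names, and the patch-embedding details are placeholder assumptions:

```python
# Minimal sketch of the teacher-student pattern (assumed module names and
# sizes; the paper's actual tracker is a more elaborate Transformer).
import torch
import torch.nn as nn

class MultiModalTeacher(nn.Module):
    """Teacher: consumes both RGB frames and event representations."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.rgb_proj = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # patchify RGB
        self.evt_proj = nn.Conv2d(2, dim, kernel_size=16, stride=16)  # patchify events
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=4)

    def forward(self, rgb, evt):
        tokens = torch.cat([
            self.rgb_proj(rgb).flatten(2).transpose(1, 2),
            self.evt_proj(evt).flatten(2).transpose(1, 2)], dim=1)
        return self.encoder(tokens)  # fused multi-modal features

class EventStudent(nn.Module):
    """Student: event-only, so inference needs no RGB stream."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.evt_proj = nn.Conv2d(2, dim, kernel_size=16, stride=16)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=4)

    def forward(self, evt):
        return self.encoder(self.evt_proj(evt).flatten(2).transpose(1, 2))
```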

A prominent highlight of this research is the introduction of a high-resolution dataset named EventVOT. Unlike previous event-based datasets limited to low resolution, EventVOT delivers a substantial increase to 1280 × 720, encompassing 1141 videos across varied categories, from pedestrians to vehicles and UAVs. This benchmark aims to provide extensive data for training and evaluation, facilitating performance improvements and method validation in high-resolution settings.

The core methodological contribution is the hierarchical knowledge distillation technique, comprising three components: pairwise similarity, feature representation, and response map-based distillation. Together, these terms guide the student to mimic the high-capacity teacher at several levels of abstraction. Evaluation on both low-resolution benchmarks (FE240hz, VisEvent, COESOT) and the new high-resolution EventVOT demonstrates the efficacy of the distillation strategy: the results show substantial improvements in tracking accuracy and speed, with the event-only student model achieving competitive performance.
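As a rough illustration, the three distillation terms might look like the following in PyTorch. The exact formulations and loss weights are assumptions, not the paper's; the sketch also assumes teacher and student features have already been aligned to the same token shape (e.g., distilling only on the event tokens):

```python
# Illustrative versions of the three distillation terms named above.
import torch
import torch.nn.functional as F

def pairwise_similarity_loss(f_t, f_s):
    """Match the token-to-token similarity structure of teacher and student.
    f_t, f_s: (B, N, D) feature tokens, assumed shape-aligned."""
    g_t = F.normalize(f_t, dim=-1) @ F.normalize(f_t, dim=-1).transpose(1, 2)
    g_s = F.normalize(f_s, dim=-1) @ F.normalize(f_s, dim=-1).transpose(1, 2)
    return F.mse_loss(g_s, g_t)

def feature_loss(f_t, f_s):
    """Directly regress student features onto teacher features."""
    return F.mse_loss(f_s, f_t)

def response_map_loss(r_t, r_s, tau: float = 2.0):
    """KL divergence between softened teacher/student tracking response maps.
    r_t, r_s: (B, H*W) response-map logits; tau is a softening temperature."""
    p_t = F.softmax(r_t / tau, dim=-1)
    log_p_s = F.log_softmax(r_s / tau, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * tau ** 2

# Combined objective (weights a, b, c are placeholders):
# loss = task_loss + a * pairwise_similarity_loss(f_t, f_s) \
#        + b * feature_loss(f_t, f_s) + c * response_map_loss(r_t, r_s)
```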

Extensive experimental results validate the robustness of the proposed method. The design enables effective event-based tracking with significant gains in tracking accuracy, measured in standard metrics such as success rate (SR) and precision rate (PR), and in processing efficiency, supporting practical applications that require rapid processing with minimal latency. This is particularly pertinent for autonomous systems where computational resources may be constrained.
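For reference, SR and PR are conventionally computed from per-frame IoU and center-location error. A minimal sketch, using the community's usual 0.5 IoU and 20-pixel precision thresholds (single-threshold variants; benchmarks often also report curves over threshold sweeps):

```python
# Sketch of single-threshold SR/PR computation for one tracking sequence.
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x, y, w, h) format."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    return inter / (a[2] * a[3] + b[2] * b[3] - inter + 1e-9)

def sr_pr(preds, gts, iou_thr=0.5, dist_thr=20.0):
    """preds, gts: lists of (x, y, w, h) boxes, one pair per frame."""
    ious, dists = [], []
    for p, g in zip(preds, gts):
        ious.append(iou(p, g))
        pc = (p[0] + p[2] / 2, p[1] + p[3] / 2)  # predicted center
        gc = (g[0] + g[2] / 2, g[1] + g[3] / 2)  # ground-truth center
        dists.append(np.hypot(pc[0] - gc[0], pc[1] - gc[1]))
    sr = float(np.mean(np.asarray(ious) > iou_thr))    # frames with IoU above threshold
    pr = float(np.mean(np.asarray(dists) < dist_thr))  # frames with center error under 20 px
    return sr, pr
```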

The introduction of the high-resolution EventVOT dataset has substantial implications for the field of event-based vision. It establishes a new benchmark for evaluating tracking performance on event data, providing a comprehensive environment for comparison and fostering adoption of event-based strategies. The public availability of the dataset, evaluation toolkit, and source code encourages wide usage and further advances in the field.

Future directions for AI in visual tracking, prompted by this research, may involve advanced distillation strategies incorporating more complex inter-modality interactions and exploring self-supervised learning paradigms tailored for high-resolution event data. The interplay of artificial intelligence methodologies and bio-inspired sensor dynamics offers promising avenues for enhancing both theoretical understanding and practical implementations across disciplines, including surveillance, autonomous vehicles, and robotics.

Moreover, the proposed framework's implications extend to real-time tracking efficiency: event-based sensors offer considerable advantages over traditional frame-based methods, especially in dynamic and resource-constrained environments.
