- The paper presents MambaEVT, a framework that processes event streams with a state space model to achieve linear computational complexity and improved tracking efficiency.
- It employs a dynamic template update mechanism via the Memory Mamba network to maintain accuracy under rapid appearance changes.
- Experimental results on EventVOT, FE240hz, and VisEvent datasets demonstrate robust performance and superior efficiency compared to state-of-the-art methods.
MambaEVT: Event Stream based Visual Object Tracking using State Space Model
The paper "MambaEVT: Event Stream based Visual Object Tracking using State Space Model" introduces an advanced framework for visual object tracking utilizing event cameras. This method leverages a novel Mamba-based approach combined with a state space model, emphasizing efficient feature extraction and interaction. MambaEVT integrates a dynamic template update strategy, a sophisticated enhancement to address traditional challenges in event-based visual tracking.
Methodology Overview
The core innovation of this framework is its use of the Mamba model, whose computational complexity is linear, O(N), in sequence length, compared with the O(N²) complexity of self-attention in vision Transformers. This linear complexity makes Mamba an appealing choice for real-time or resource-constrained applications, such as on-device or embedded systems.
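To make the complexity difference concrete, here is a minimal sketch of a simplified diagonal state space recurrence. It is not the actual Mamba selective scan (which uses input-dependent parameters and a hardware-aware parallel scan); the dimensions and parameter names below are illustrative assumptions, but it shows why a recurrent state update costs O(N) over a length-N token sequence.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Run a simplified diagonal state space recurrence over a sequence.

    h_t = A * h_{t-1} + B * x_t
    y_t = C * h_t

    Each step touches the hidden state once, so total cost grows
    linearly with sequence length N (O(N)), unlike self-attention,
    which compares every token pair (O(N^2)).
    """
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                # one pass over the sequence: O(N)
        h = A * h + B * x_t      # elementwise update of the hidden state
        ys.append(C @ h)         # project state to an output value
    return np.array(ys)

# Toy usage: 1,000 scalar event tokens, a 16-dimensional hidden state.
rng = np.random.default_rng(0)
A = rng.uniform(0.5, 0.99, size=16)   # per-dimension decay
B = rng.normal(size=16)
C = rng.normal(size=16)
tokens = rng.normal(size=1000)
outputs = ssm_scan(tokens, A, B, C)
print(outputs.shape)  # (1000,)
```

Doubling the sequence length simply doubles the loop count, whereas attention over the same tokens would quadruple its pairwise-comparison cost.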
In MambaEVT, the initial template and search region are extracted from the event streams and projected into event tokens. These tokens are processed jointly by the Vision Mamba backbone, which performs feature extraction and template-search interaction simultaneously. The output tokens corresponding to the search region are then fed into a tracking head for target localization.
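The following PyTorch sketch illustrates that joint token pipeline under stated assumptions; it is not the released model. A linear-time GRU stands in for the Vision Mamba blocks (which are not part of standard PyTorch), and the patch size, embedding dimension, and per-token regression head are illustrative choices.

```python
import torch
import torch.nn as nn

class TrackerSketch(nn.Module):
    """Sketch of the described pipeline: template and search-region
    event frames are patch-embedded into tokens, concatenated, and
    processed jointly by a sequence backbone; the search-region tokens
    are then sliced out and passed to a lightweight head."""

    def __init__(self, patch=16, dim=256):
        super().__init__()
        # Patch embedding shared by template and search region.
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # Linear-time recurrent stand-in for the Vision Mamba blocks.
        self.backbone = nn.GRU(dim, dim // 2, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.head = nn.Linear(dim, 4)  # e.g., a per-token box regressor

    def forward(self, template, search):
        z = self.embed(template).flatten(2).transpose(1, 2)  # (B, Nz, dim)
        x = self.embed(search).flatten(2).transpose(1, 2)    # (B, Nx, dim)
        tokens = torch.cat([z, x], dim=1)          # one joint sequence
        tokens, _ = self.backbone(tokens)          # extraction + interaction
        search_tokens = tokens[:, z.shape[1]:]     # keep search-region tokens
        return self.head(search_tokens)            # localization output

model = TrackerSketch()
template = torch.randn(1, 3, 128, 128)   # event frame of the template
search = torch.randn(1, 3, 256, 256)     # event frame of the search region
out = model(template, search)
print(out.shape)  # torch.Size([1, 256, 4])
```

The key structural point the sketch preserves is that template and search tokens share one sequence, so interaction happens inside the backbone rather than in a separate correlation module.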
A notable contribution is the dynamic template update mechanism built on the Memory Mamba network. The mechanism maintains a template library and updates it based on the diversity of the accumulated samples. The dynamic template generated by the Memory Mamba network is combined with a static template to improve tracking, particularly in scenarios with significant appearance variation.
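A minimal sketch of such a diversity-gated template library follows. The Memory Mamba aggregation itself is replaced here with a simple feature average, and the capacity, threshold, and cosine-distance gate are illustrative assumptions rather than the paper's actual settings.

```python
import torch
import torch.nn.functional as F

class TemplateLibrary:
    """Diversity-gated template library: a new sample is stored only if
    it differs enough from existing entries, and the stored entries are
    aggregated into one dynamic template."""

    def __init__(self, capacity=5, diversity_thresh=0.2):
        self.capacity = capacity
        self.diversity_thresh = diversity_thresh
        self.templates = []  # list of (N, dim) token tensors

    def maybe_add(self, feats):
        """Store a candidate template only if it is sufficiently novel."""
        if self.templates:
            sims = torch.stack([
                F.cosine_similarity(feats.flatten(), t.flatten(), dim=0)
                for t in self.templates
            ])
            if 1.0 - sims.max().item() < self.diversity_thresh:
                return False  # too similar to an existing entry
        self.templates.append(feats)
        if len(self.templates) > self.capacity:
            self.templates.pop(0)  # drop the oldest entry
        return True

    def dynamic_template(self):
        """Aggregate the library into one dynamic template
        (a feature-average stand-in for the Memory Mamba)."""
        return torch.stack(self.templates).mean(dim=0)

# Usage: fuse the dynamic template with the static first-frame template.
lib = TemplateLibrary()
static_template = torch.randn(64, 256)   # first-frame template tokens
lib.maybe_add(static_template)
lib.maybe_add(torch.randn(64, 256))      # a later, sufficiently diverse sample
fused = torch.cat([static_template, lib.dynamic_template()], dim=0)
print(fused.shape)  # torch.Size([128, 256])
```

The gate is what keeps the library useful: storing every frame would fill it with near-duplicates, while storing only diverse samples lets the dynamic template cover distinct appearance states of the target.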
Experimental Results
MambaEVT was extensively evaluated against state-of-the-art (SOTA) trackers across several large-scale event-based datasets, including EventVOT, VisEvent, and FE240hz. The results demonstrated its efficacy and computational efficiency.
- EventVOT: MambaEVT achieved a success rate (SR) of 56.5, a precision rate (PR) of 56.7, and a normalized precision rate (NPR) of 65.5, surpassing several SOTA methods such as OSTrack while using significantly fewer parameters (29.3M vs. 92.1M for OSTrack).
- FE240hz: The tracker recorded an SR of 58.09 and a PR of 91.97, performing competitively with ARTrack and AQATrack. These results indicate that the model can handle the high-dynamic-range scenarios characteristic of the FE240hz dataset.
- VisEvent: On the VisEvent dataset, MambaEVT (and its variant MambaEVT-P) demonstrated robust performance under diverse tracking conditions, highlighting its adaptability and robustness.
Implications and Future Work
The integration of the Memory Mamba network into the tracking framework signifies a substantial advancement in handling dynamic object appearances. By dynamically updating templates based on observed changes, MambaEVT not only improves accuracy but also maintains computational efficiency.
From a theoretical perspective, the use of state space models in visual tracking could open new avenues for research, particularly in areas requiring low-latency and low-power consumption tracking solutions. Practically, this framework could be pivotal in advancing applications in large-scale intelligent surveillance, military operations, and aerospace, where tracking efficiency and reliability are paramount.
Future developments might focus on further reducing the computational footprint and energy consumption, potentially incorporating spiking neural networks (SNNs) to achieve even greater efficiency. Additionally, refining the current training process to fully leverage the Mamba model's capabilities could lead to enhanced model performance across various challenging environments.
Conclusion
In summary, the MambaEVT framework presents a sophisticated approach to event-based visual tracking, balancing accuracy and computational cost effectively. Its robust performance across multiple datasets and scenarios demonstrates the viability of integrating state space models with dynamic template updates in advanced tracking systems. This contribution not only propels the state of the art in event-based tracking but also lays the groundwork for future innovations in this domain.