Event-Adaptive State Transition and Gated Fusion for RGB-Event Object Tracking

Published 15 Apr 2026 in cs.CV and cs.AI | (2604.13426v1)

Abstract: Existing Vision Mamba-based RGB-Event (RGBE) tracking methods suffer from using static state transition matrices, which fail to adapt to variations in event sparsity. This rigidity leads to imbalanced modeling (underfitting sparse event streams and overfitting dense ones), thus degrading cross-modal fusion robustness. To address these limitations, we propose MambaTrack, a multimodal and efficient tracking framework built upon a Dynamic State Space Model (DSSM). Our contributions are twofold. First, we introduce an event-adaptive state transition mechanism that dynamically modulates the state transition matrix based on event stream density. A learnable scalar governs the state evolution rate, enabling differentiated modeling of sparse and dense event flows. Second, we develop a Gated Projection Fusion (GPF) module for robust cross-modal integration. This module projects RGB features into the event feature space and generates adaptive gates from event density and RGB confidence scores. These gates precisely control the fusion intensity, suppressing noise while preserving complementary information. Experiments show that MambaTrack achieves state-of-the-art performance on the FE108 and FELT datasets. Its lightweight design suggests potential for real-time embedded deployment.


Authors (3)

Summary

The paper introduces a dynamic state space model (DSSM) that adapts state transitions based on event density to improve temporal modeling in RGB-event tracking.
It presents a gated projection fusion (GPF) module that aligns and weighs RGB and event features to suppress noise and preserve complementary information.
Empirical results on FE108 and FELT datasets show significant improvements in success and precision rates, validating the model's robustness and efficiency.

MambaTrack: Event-Adaptive State Transition and Gated Fusion in Multimodal RGB-Event Object Tracking

Motivation and Problem Statement

Integrating event cameras with RGB sensors for visual tracking addresses limitations of conventional RGB-only systems, especially under motion blur, extreme lighting, and redundant sampling. Event cameras provide asynchronous data with high temporal resolution and wide dynamic range, both of which matter for accurate tracking in real-world scenes. Fusing event streams with RGB data is nevertheless difficult, because events are sparse and asynchronous and carry no absolute brightness information. Existing RGB-Event (RGBE) trackers fall short in two ways: Transformer-based trackers incur quadratic complexity, and Vision Mamba-based trackers rely on static state transition matrices. The result is suboptimal temporal modeling and less robust fusion when event density varies.

Methodology

Event-Adaptive Dynamic State Space Modeling

MambaTrack introduces a Dynamic State Space Model (DSSM) that adapts the state transition matrix to the density of the event stream. The event density $p_t$ is computed per unit area within a time window, then projected into latent space through a trainable matrix and normalized with a sigmoid, yielding a scaling factor $B$. A static prior matrix $A_{base}$ stabilizes the updates. The state transition matrix $A_t$ is given by

$A_t = \alpha B A_{base} + (1 - \alpha)A_{t-1}$

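where $\alpha$ is a learnable scalar that sets the state evolution rate. The recurrence is an exponential moving average: each step blends the density-scaled prior $B A_{base}$ with the previous matrix $A_{t-1}$. The final transition matrix adds the result to the original Vision Mamba transition matrix $A$, giving $A_{final} = A_t + A$. This keeps the original structure stable while making the temporal modeling responsive to variations in event density.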
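The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the state matrices are stored as diagonal vectors (as is common in Mamba-style models), the density projection is a single hypothetical weight vector `W_p`, and all numeric values are made up for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dssm_step(p_t, A_prev, A_base, A_static, W_p, alpha):
    """One event-adaptive transition update (illustrative sketch).

    p_t      : scalar event density for the current time window
    A_prev   : previous adaptive matrix A_{t-1}, stored as a diagonal (d,)
    A_base   : static prior matrix A_base, stored as a diagonal (d,)
    A_static : original Vision Mamba matrix A, stored as a diagonal (d,)
    W_p      : hypothetical trainable projection of the density, shape (d,)
    alpha    : scalar in [0, 1] (learnable in the paper, fixed here)
    """
    B = sigmoid(W_p * p_t)                             # scaling factor B
    A_t = alpha * B * A_base + (1.0 - alpha) * A_prev  # A_t recurrence
    A_final = A_t + A_static                           # A_final = A_t + A
    return A_t, A_final

# Toy values, chosen only for illustration.
d = 4
A_base = -np.ones(d)
A_static = -0.5 * np.ones(d)
W_p = np.full(d, 2.0)
alpha = 0.3

A_prev = A_base.copy()
for p_t in (0.05, 0.9):  # a sparse window followed by a dense window
    A_prev, A_final = dssm_step(p_t, A_prev, A_base, A_static, W_p, alpha)
    print(f"p_t={p_t:.2f}  A_final[0]={A_final[0]:.4f}")
```

In this sketch a larger $\alpha$ makes $A_t$ follow the current density more closely, while a smaller $\alpha$ keeps it closer to its previous value.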

Gated Projection Fusion (GPF) Module

The GPF module projects RGB features into the event domain using MLP-based alignment and generates adaptive gating coefficients from the event density and the RGB confidence, measured as the L2 norm of the RGB features. The gating coefficient $G$ is computed as

$G = \mathrm{Sigmoid}\left(W_g \left[p_t;\ \|F_{RGB}\|_2\right]\right)$

This coefficient regulates the fusion intensity, so that the weighted fusion suppresses noise while preserving complementary features. The fusion is applied symmetrically in both directions, and the fused features of the two modalities are concatenated and passed to the tracking head.
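A rough NumPy sketch of one fusion direction (RGB into the event branch) may help make the gating concrete. The gate follows the formula above. The two-layer MLP shape, the per-token norm, and the convex blend `G * projected_rgb + (1 - G) * F_evt` are assumptions made for illustration, since the summary does not spell out the exact fusion rule. All function and parameter names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_project(F_rgb, W1, b1, W2, b2):
    """Two-layer MLP mapping RGB tokens (N, d_rgb) to event space (N, d_evt)."""
    hidden = np.maximum(F_rgb @ W1 + b1, 0.0)  # ReLU
    return hidden @ W2 + b2

def gate(p_t, F_rgb, W_g):
    """G = Sigmoid(W_g [p_t; ||F_RGB||_2]), computed per token."""
    conf = np.linalg.norm(F_rgb, axis=-1)                       # (N,)
    feats = np.stack([np.full_like(conf, p_t), conf], axis=-1)  # (N, 2)
    return sigmoid(feats @ W_g)                                 # (N, d_evt)

def gated_fuse(F_evt, F_rgb, p_t, W1, b1, W2, b2, W_g):
    """Blend projected RGB tokens into event tokens (assumed convex blend)."""
    G = gate(p_t, F_rgb, W_g)
    projected = mlp_project(F_rgb, W1, b1, W2, b2)
    return G * projected + (1.0 - G) * F_evt, G

# Toy shapes and random weights, for illustration only.
rng = np.random.default_rng(0)
N, d_rgb, d_hid, d_evt = 3, 5, 6, 4
F_rgb = rng.normal(size=(N, d_rgb))
F_evt = rng.normal(size=(N, d_evt))
W1, b1 = rng.normal(size=(d_rgb, d_hid)), np.zeros(d_hid)
W2, b2 = rng.normal(size=(d_hid, d_evt)), np.zeros(d_evt)
W_g = rng.normal(size=(2, d_evt))

fused, G = gated_fuse(F_evt, F_rgb, 0.4, W1, b1, W2, b2, W_g)
print(fused.shape, float(G.min()), float(G.max()))
```

The symmetric direction (events into the RGB branch) would mirror this with its own projection and gate, after which the two fused outputs are concatenated.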

Input Representation, Backbone, and Learning Pipeline

Asynchronous event streams are temporally aligned to RGB frame timestamps via time-surface representation, achieving spatiotemporal consistency. Modality-specific branches (static SSM for RGB, DSSM for events) are instantiated within Vision Mamba's hierarchical architecture. The tracking head is adopted from OSTrack, trained with a composite loss involving focal, L1, and GIoU objectives.
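As a rough illustration of the alignment step, the sketch below builds an exponentially decayed time surface at a given RGB frame timestamp. This is one common formulation and not necessarily the exact variant used in the paper. The event layout `(x, y, t, polarity)`, the decay constant `tau`, and the two-channel polarity split are assumptions.

```python
import numpy as np

def time_surface(events, t_ref, height, width, tau=0.03):
    """Exponentially decayed time surface at reference time t_ref.

    events : array (N, 4) with columns x, y, t, polarity (polarity in {0, 1})
    t_ref  : RGB frame timestamp, in the same units as the event timestamps
    tau    : decay constant (assumed value)
    Returns an array (2, height, width) with values in [0, 1].
    """
    # Most recent event time per polarity and pixel; -inf means "no event yet".
    last_t = np.full((2, height, width), -np.inf)
    past = events[events[:, 2] <= t_ref]  # ignore events after the frame
    x = past[:, 0].astype(int)
    y = past[:, 1].astype(int)
    p = past[:, 3].astype(int)
    np.maximum.at(last_t, (p, y, x), past[:, 2])
    # Recent events map close to 1; pixels without events map to 0.
    return np.exp(-(t_ref - last_t) / tau)

# Four toy events; the last one occurs after the reference frame.
events = np.array([
    [1, 2, 0.10, 1],
    [1, 2, 0.07, 1],
    [0, 0, 0.04, 0],
    [3, 3, 0.20, 1],
])
surface = time_surface(events, t_ref=0.10, height=4, width=4)
print(surface.shape)
```

Evaluating the surface at each RGB timestamp yields a dense, frame-aligned tensor that an event branch can consume alongside the RGB frame.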

Experimental Validation

Quantitative Results and Ablation Analysis

MambaTrack demonstrates superior performance on FE108 and FELT datasets. On FELT, it attains a Success Rate (SR) of 42.5% and a Precision Rate (PR) of 54.0%. Compared to AFNet and ViPT, this is a notable improvement (+6.8% SR, +9.9% PR over ViPT), achieved with a more compact model. On FE108, the model yields 52.7% SR and 81.7% PR, besting DiMP and CMT-ATOM by 10.4 percentage points in precision.

Ablation studies show:

Multimodal fusion advantage: RGB-only (41.3% SR, 52.5% PR), Event-only (33.4% SR, 41.2% PR), RGB-Event fusion (42.5% SR, 54.0% PR). Fusion outperforms both single-modality variants, which indicates that the two modalities are complementary.
DSSM impact: Removing DSSM lowers SR from 42.5% to 42.1% and PR from 54.0% to 53.2%, indicating that adaptive state modeling helps in handling event density variations.
GPF contribution: Excluding GPF lowers SR from 42.5% to 41.9% and PR from 54.0% to 53.4%, indicating that adaptive cross-modal fusion also contributes to tracking accuracy.
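To put the ablation numbers side by side, the short script below computes how many percentage points each variant loses relative to the full model. The figures are the ones reported above. The variable names are only for this example.

```python
# Reported results in percent (SR, PR).
full = (42.5, 54.0)
variants = {
    "RGB-only": (41.3, 52.5),
    "Event-only": (33.4, 41.2),
    "w/o DSSM": (42.1, 53.2),
    "w/o GPF": (41.9, 53.4),
}

def drop_vs_full(variant, reference=full):
    """Percentage points lost relative to the full model, as (SR, PR)."""
    return tuple(round(r - v, 1) for r, v in zip(reference, variant))

drops = {name: drop_vs_full(scores) for name, scores in variants.items()}
for name, (d_sr, d_pr) in drops.items():
    print(f"{name:<11} SR -{d_sr:.1f}  PR -{d_pr:.1f}")
```

Read this way, removing either DSSM or GPF costs less than one point on both metrics, whereas dropping the RGB modality entirely costs about nine points of SR.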

Qualitative Evaluation

Visualizations validate robust performance under background interference, fast motion, partial occlusion, and small target conditions, reflecting the model’s temporal stability and multimodal adaptability.

Implications and Future Directions

MambaTrack’s event-adaptive DSSM and gated fusion framework offer an efficient approach to RGBE tracking with improved temporal and cross-modal adaptability. The lightweight, modular design suggests potential for real-time embedded deployment and for scaling to more complex environments. Practically, the model is well suited to autonomous systems, robotics, and surveillance, especially under challenging lighting and motion conditions.

From a theoretical perspective, adaptive state transition mechanisms in sequence modeling extend beyond vision tasks, suggesting broader applicability in multimodal signal processing. The fusion strategy balances complementary and redundant information, offering valuable insights into robust data integration in heterogeneous sensor networks.

Future work should explore specialized tracking heads optimized for event-based data and more sophisticated temporal-alignment strategies. Further research could investigate extending the event-adaptive modeling to other modalities and tasks, such as action recognition and simultaneous localization and mapping (SLAM), to drive robustness and generalization in multimodal AI systems.

Conclusion

MambaTrack advances RGB-Event object tracking by combining a DSSM, which dynamically modulates state transitions according to event density, with a GPF module, which regulates cross-modal fusion strength via event density and RGB confidence. Empirical results on the FE108 and FELT benchmarks show improved accuracy and robustness over the compared trackers. The modular, lightweight approach suggests potential for real-time deployment and may inform future methods for robust multimodal perception in dynamic and complex environments.
