- The paper introduces a dynamic state space model (DSSM) that adapts state transitions based on event density to improve temporal modeling in RGB-event tracking.
- It presents a gated projection fusion (GPF) module that aligns and weighs RGB and event features to suppress noise and preserve complementary information.
- Empirical results on FE108 and FELT datasets show significant improvements in success and precision rates, validating the model's robustness and efficiency.
MambaTrack: Event-Adaptive State Transition and Gated Fusion in Multimodal RGB-Event Object Tracking
Motivation and Problem Statement
The integration of event cameras with RGB sensors for visual tracking addresses significant limitations inherent in conventional RGB-based systems, especially under motion blur, extreme lighting, and redundant sampling conditions. Event cameras provide asynchronous, high temporal resolution data and wide dynamic range, which are crucial for accurate tracking in real-world scenarios. However, fusing event streams with RGB data poses substantial challenges due to the sparsity and asynchronicity of events and the lack of absolute brightness information. Existing Transformer-based RGB-Event (RGBE) trackers are constrained by static state transition models and quadratic complexity, yielding suboptimal temporal modeling and fusion robustness in the presence of varying event densities.
Methodology
Event-Adaptive Dynamic State Space Modeling
MambaTrack introduces a Dynamic State Space Model (DSSM) that adapts the state transition matrix to event stream density via a learnable scalar. Event density $p_t$ is computed per unit area within a time window, projected into latent space through a trainable matrix and sigmoid normalization, yielding a scaling factor $B$. A static prior matrix $A_{\text{base}}$ stabilizes updates. The state transition matrix $A_t$ is given by
$$A_t = \alpha B A_{\text{base}} + (1 - \alpha) A_{t-1}$$
where $\alpha$ is a learnable scalar. The final transition matrix integrates the original Vision Mamba model with $A_{\text{final}} = A_t + A$, thus maintaining structural stability while providing adaptive temporal modeling responsive to event density variations.
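The update rule can be made concrete with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: all matrices are assumed diagonal and stored as lists of diagonal entries, `A` is read as the original Vision Mamba transition matrix, and the names `event_density`, `dssm_step`, and `w_proj` (one projection weight per state dimension) are hypothetical. Plain Python is used to keep the example framework-free.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def event_density(num_events, height, width):
    """Events per unit area (here: per pixel) within one time window."""
    return num_events / float(height * width)

def dssm_step(p_t, a_prev, a_base, a_static, w_proj, alpha):
    """Return (A_t, A_final) for one time step.

    Assumption: every matrix is diagonal and given as a list of its
    diagonal entries; w_proj is a hypothetical per-entry projection weight.
    """
    # B = sigmoid(W * p_t): one scaling factor in (0, 1) per state dimension.
    b = [sigmoid(w * p_t) for w in w_proj]
    # A_t = alpha * B * A_base + (1 - alpha) * A_{t-1}
    a_t = [alpha * b_i * base_i + (1.0 - alpha) * prev_i
           for b_i, base_i, prev_i in zip(b, a_base, a_prev)]
    # A_final = A_t + A (A assumed to be the static Vision Mamba transition).
    a_final = [t_i + s_i for t_i, s_i in zip(a_t, a_static)]
    return a_t, a_final

# Example: 1200 events in one window on a 346x260 sensor (illustrative values).
p_t = event_density(num_events=1200, height=260, width=346)
a_t, a_final = dssm_step(
    p_t,
    a_prev=[-2.0, -2.0],
    a_base=[-1.0, -1.0],
    a_static=[-1.0, -1.0],
    w_proj=[0.0, 50.0],
    alpha=0.5,
)
```

With a zero projection weight the scaling factor stays at 0.5 regardless of density, whereas a positive weight makes the transition entry respond to how many events arrive in the window.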
Gated Projection Fusion (GPF) Module
The GPF module projects RGB features into the event domain using MLP-based alignment and generates adaptive gating coefficients from event density and RGB confidence (L2-norm). The gating coefficient G is computed as
$$G = \mathrm{Sigmoid}\left(W_g \left[\, p_t \,;\, \lVert F_{\text{RGB}} \rVert_2 \,\right]\right)$$
This coefficient regulates fusion intensity, ensuring bidirectional weighted fusion that suppresses noise and preserves complementary features. Symmetrical fusion is performed for both modalities, resulting in concatenated fused features for the tracking head. The adaptive gating mechanism is crucial for balancing cross-modal interaction and filtering noisy data.
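A minimal sketch of the gate computation follows. The gate formula mirrors the expression for G; everything else is an assumption made for illustration: the gate is taken to be a single scalar, `w_g` is a hypothetical pair of weights, and `gated_fuse` is one plausible convex blend, since the summary does not spell out the exact fusion equations or the MLP alignment.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def l2_norm(vec):
    """L2 norm of a feature vector, used here as the RGB confidence."""
    return math.sqrt(sum(v * v for v in vec))

def gate(p_t, f_rgb, w_g):
    """G = Sigmoid(W_g [p_t ; ||F_RGB||_2]) with a scalar output.

    w_g is a hypothetical (density weight, norm weight) pair.
    """
    return sigmoid(w_g[0] * p_t + w_g[1] * l2_norm(f_rgb))

def gated_fuse(f_rgb_aligned, f_event, g):
    """Illustrative blend only (assumption): g weighs the aligned RGB
    features and (1 - g) weighs the event features."""
    return [g * r + (1.0 - g) * e for r, e in zip(f_rgb_aligned, f_event)]

# Example with made-up feature values.
f_rgb = [3.0, 4.0]
f_event = [1.0, 0.0]
g = gate(p_t=0.0, f_rgb=f_rgb, w_g=[1.0, 0.2])
fused = gated_fuse(f_rgb, f_event, g)
```

Because the sigmoid keeps G strictly between 0 and 1, neither modality is ever discarded entirely in this sketch; the gate only shifts the balance between them.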
Asynchronous event streams are temporally aligned to RGB frame timestamps via a time-surface representation, which gives the two modalities spatiotemporal consistency. Modality-specific branches (a static SSM for RGB, the DSSM for events) are instantiated within Vision Mamba's hierarchical architecture. The tracking head is adopted from OSTrack and trained with a composite loss combining focal, L1, and GIoU objectives.
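The time-surface alignment can be sketched as follows. This uses the common exponentially decayed form of a time surface; the decay constant `tau`, the event tuple layout `(x, y, t, polarity)`, and the choice to ignore polarity are assumptions for illustration, as the summary does not give these details.

```python
import math

def time_surface(events, t_ref, height, width, tau=0.03):
    """Build a decayed time surface at the RGB frame timestamp t_ref.

    events: iterable of (x, y, t, polarity), timestamps in seconds.
    Only events with t <= t_ref contribute; polarity is ignored for brevity.
    """
    # Most recent event timestamp per pixel (None where no event occurred).
    last_t = [[None] * width for _ in range(height)]
    for x, y, t, _polarity in events:
        if t <= t_ref and (last_t[y][x] is None or t > last_t[y][x]):
            last_t[y][x] = t

    # Recent events map close to 1, older events decay toward 0.
    surface = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if last_t[y][x] is not None:
                surface[y][x] = math.exp(-(t_ref - last_t[y][x]) / tau)
    return surface

# Example: three events on a 2x2 sensor, aligned to a frame at t = 0.10 s.
events = [(0, 0, 0.10, 1), (1, 0, 0.07, -1), (1, 1, 0.20, 1)]
surface = time_surface(events, t_ref=0.10, height=2, width=2)
```

The result is a dense image on the same pixel grid as the RGB frame, so both modalities can be fed to their branches with matching timestamps.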
Experimental Validation
Quantitative Results and Ablation Analysis
MambaTrack demonstrates superior performance on FE108 and FELT datasets. On FELT, it attains a Success Rate (SR) of 42.5% and a Precision Rate (PR) of 54.0%. Compared to AFNet and ViPT, this is a notable improvement (+6.8% SR, +9.9% PR over ViPT), achieved with a more compact model. On FE108, the model yields 52.7% SR and 81.7% PR, besting DiMP and CMT-ATOM by 10.4 percentage points in precision.
Ablation studies show:
- Multimodal fusion advantage: RGB-only (41.3% SR, 52.5% PR), Event-only (33.4% SR, 41.2% PR), RGB-Event fusion (42.5% SR, 54.0% PR). Fusion gains 1.2 SR and 1.5 PR points over RGB-only and 9.1 SR and 12.8 PR points over Event-only, indicating that the two modalities are complementary.
- DSSM impact: Removing DSSM lowers SR from 42.5% to 42.1% and PR from 54.0% to 53.2%, a small drop suggesting that adaptive state modeling helps in handling event density variations.
- GPF contribution: Excluding GPF lowers SR from 42.5% to 41.9% and PR from 54.0% to 53.4%, suggesting that adaptive cross-modal fusion contributes to tracking accuracy.
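To put the ablation numbers side by side, the short script below computes each variant's drop relative to the full model, in percentage points. The figures are taken directly from the list above; the dictionary keys and the `drops` helper are just illustrative names.

```python
# Full model and ablated variants: Success Rate (SR) and Precision Rate (PR), in %.
full = {"SR": 42.5, "PR": 54.0}
ablations = {
    "RGB-only": {"SR": 41.3, "PR": 52.5},
    "Event-only": {"SR": 33.4, "PR": 41.2},
    "w/o DSSM": {"SR": 42.1, "PR": 53.2},
    "w/o GPF": {"SR": 41.9, "PR": 53.4},
}

def drops(reference, variant):
    """Percentage-point drop of a variant relative to the reference model."""
    return {k: round(reference[k] - variant[k], 1) for k in reference}

deltas = {name: drops(full, scores) for name, scores in ablations.items()}

for name, delta in deltas.items():
    print(f"{name:<11} SR -{delta['SR']:.1f}  PR -{delta['PR']:.1f}")
```

Seen this way, dropping the RGB branch accounts for by far the largest loss, while removing DSSM or GPF each costs less than one point on either metric.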
Qualitative Evaluation
Visualizations illustrate robust tracking under background interference, fast motion, partial occlusion, and small-target conditions, consistent with the model's temporal stability and multimodal adaptability.
Implications and Future Directions
MambaTrack’s event-adaptive DSSM and gated fusion framework establish an efficient paradigm for RGBE tracking with enhanced temporal and cross-modal adaptability. The lightweight, modular design facilitates real-time embedded deployment and scalability to complex environments. Practically, the model is well-suited for applications in autonomous systems, robotics, and surveillance, especially under challenging lighting and motion conditions.
From a theoretical perspective, adaptive state transition mechanisms in sequence modeling extend beyond vision tasks, suggesting broader applicability in multimodal signal processing. The fusion strategy balances complementary and redundant information, offering valuable insights into robust data integration in heterogeneous sensor networks.
Future work should explore specialized tracking heads optimized for event-based data and more sophisticated temporal-alignment strategies. Further research could investigate extending the event-adaptive modeling to other modalities and tasks, such as action recognition and simultaneous localization and mapping (SLAM), to drive robustness and generalization in multimodal AI systems.
Conclusion
MambaTrack advances RGB-Event object tracking by integrating a DSSM that dynamically modulates state transitions according to event density and a GPF module regulating cross-modal fusion strength via event density and RGB confidence. Empirical results demonstrate superior accuracy and robustness in long-term tracking benchmarks. The modular, lightweight approach promises real-time deployment and inspires future methods aimed at robust multimodal perception in dynamic and complex environments.