UBATrack: Unified Multi-Modal Tracking
- UBATrack is a unified multi-modal tracking framework that integrates efficient state space modeling and dynamic fusion to leverage RGB, thermal, depth, and event data.
- It employs a lightweight adapter-based tuning strategy that updates only about 12M parameters while keeping the ViT backbone frozen, ensuring parameter efficiency.
- Dynamic multi-modal feature mixing in UBATrack captures long-range spatio-temporal cues to yield robust tracking performance under challenging conditions such as occlusion and poor illumination.
UBATrack is a unified multi-modal object tracking framework designed to address the challenge of leveraging diverse sensing modalities (RGB, thermal, depth, event cameras) for robust tracking in video. By integrating a parameter-efficient Mamba state space model architecture with dynamic fusion and adapter-based parameter tuning, UBATrack handles RGB-T, RGB-D, and RGB-E tracking in a single model, capturing long-range spatio-temporal cues and cross-modal dependencies efficiently while minimizing training overhead (Liang et al., 21 Jan 2026).
1. Spatio-Temporal Multi-modal Tracking and Motivation
Multi-modal tracking extends conventional RGB-only object tracking by leveraging additional sensor streams: thermal infrared (T), depth (D), or event cameras (E). While RGB trackers (e.g., SiamFC, OSTrack) perform well in standard conditions, they often degrade under occlusion, poor illumination, or camouflage. Prior multi-modal trackers fall into three main categories:
- Modality-specific designs: TBSI (RGB-T), DeT (RGB-D), VisEvent (RGB-E), which lack generalizability for unseen modality combinations.
- Prompt-learning-based unified trackers: ViPT, ProTrack, which are more flexible but lack explicit spatio-temporal modeling.
- Dual-branch fully fine-tuned models: OneTracker, SDSTrack, which are parameter-heavy and computationally costly.
UBATrack addresses these limitations by (a) explicitly modeling long-range spatio-temporal and cross-modal dependencies using a Mamba-derived state space model (SSM), and (b) enabling parameter-efficient adaptation via lightweight adapters and a dynamic multi-modal fusion module. This approach maintains high modeling capacity while only updating ∼12M parameters and retaining a frozen ViT backbone.
2. Mathematical Core: State Space Model and STMA
2.1 Continuous-Discrete State Space Formulation
The foundation of UBATrack's temporal modeling is the structured state space model (SSM). In continuous time, an input sequence $x(t)$ evolves via

$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t),$$

where $h(t)$ is the hidden state, $A$ the transition matrix, $B$ the input projection, and $C$ the output projection. Discretizing with step $\Delta$ (zero-order hold):

$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\big(\exp(\Delta A) - I\big)\,\Delta B,$$
$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t.$$

Mamba extends this by making $B$, $C$, and $\Delta$ input-dependent via per-token parameterization, yielding a selective scan with linear complexity in sequence length.
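To make the recurrence concrete, here is a minimal NumPy sketch of a diagonal discretized SSM scan. It is illustrative only: it uses a fixed (non-selective) $A$, $B$, $C$, and $\Delta$ rather than Mamba's input-dependent parameterization and hardware-aware parallel scan.

```python
import numpy as np

def ssm_scan(x, A, B, C, delta):
    """Linear-time scan of a diagonal SSM discretized by zero-order hold.

    x: (T,) input sequence; A, B, C: (N,) diagonal SSM parameters; delta: step size.
    """
    A_bar = np.exp(delta * A)          # ZOH: A_bar = exp(delta * A) for diagonal A
    B_bar = (A_bar - 1.0) / A * B      # (exp(delta A) - I) A^{-1} B, elementwise
    h = np.zeros_like(A)
    ys = []
    for x_t in x:                      # recurrence: h_t = A_bar h_{t-1} + B_bar x_t
        h = A_bar * h + B_bar * x_t
        ys.append(float(C @ h))        # readout: y_t = C h_t
    return np.array(ys)

# Example: a stable 16-state SSM filtering a random length-100 signal.
y = ssm_scan(np.random.randn(100), A=-np.linspace(0.5, 2.0, 16),
             B=np.ones(16), C=np.ones(16) / 16, delta=0.1)
```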
2.2 Spatio-Temporal Mamba Adapter (STMA)
The Spatio-Temporal Mamba Adapter (STMA) jointly models cross-modal fusion and temporal sequence dependencies. For the fused template and search token sequences at each transformer layer, each STMA comprises:
- Mamba Block (MBA): temporal modeling via a Mamba update with normalization, dropout, and a residual connection, $X' = X + \mathrm{Dropout}(\mathrm{Mamba}(\mathrm{LN}(X)))$.
- MultiFFT Block: frequency-domain channel mixing that applies an FFT, complex Einstein matrix multiplication (EMM), nonlinearities, and an inverse FFT, $X'' = X' + \mathrm{IFFT}\big(\sigma(\mathrm{EMM}(\mathrm{FFT}(\mathrm{LN}(X'))))\big)$.
STMA modules are inserted at six of the 12 ViT layers, thus propagating spatio-temporal and cross-modal cues at all major network stages.
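Since the STMA composition is described operationally, a minimal PyTorch sketch may help. The module names, dropout rate, and the per-channel complex filter standing in for the EMM are assumptions, not the paper's exact design; the actual Mamba block could come from, e.g., the `mamba_ssm` package.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiFFTBlock(nn.Module):
    """FFT over tokens -> learnable complex per-channel filter (EMM stand-in)
    -> inverse FFT, with nonlinearity and residual connection."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.filt = nn.Parameter(0.02 * torch.randn(dim, dtype=torch.cfloat))

    def forward(self, x):                            # x: (B, L, D)
        h = torch.fft.rfft(self.norm(x), dim=1)      # complex (B, L//2+1, D)
        h = h * self.filt                            # channel-wise complex mixing
        h = torch.fft.irfft(h, n=x.shape[1], dim=1)  # back to the token domain
        return x + F.gelu(h)

class STMASketch(nn.Module):
    """Adapter at one ViT layer: residual Mamba update, then MultiFFT mixing."""
    def __init__(self, dim, mamba: nn.Module):
        super().__init__()
        self.norm, self.mamba = nn.LayerNorm(dim), mamba
        self.drop = nn.Dropout(0.1)                  # rate is an assumption
        self.fft_block = MultiFFTBlock(dim)

    def forward(self, x):                            # x: (B, L, D) fused tokens
        x = x + self.drop(self.mamba(self.norm(x)))  # MBA block
        return self.fft_block(x)                     # MultiFFT block
```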
3. Dynamic Multi-modal Feature Mixer (DMFM)
Following ViT+STMA encoding, the Dynamic Multi-modal Feature Mixer (DMFM) fuses the search tokens concatenated across modalities. The mixer implements:
- MultiMixer Block: simultaneous token and channel mixing by dynamic segment-wise mixing across the width, height, and channel dimensions (W, H, C).
- Channel-MLP Block: Channel-wise mixing for final discriminative fusion.
Global average pooling (GAP) finalizes the fused representation for downstream prediction.
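To make the token/channel mixing concrete, here is a Mixer-style sketch ending in GAP. It deliberately simplifies the paper's dynamic segment-wise (W, H, C) mixing into plain token and channel MLPs, and all names and sizes are assumptions.

```python
import torch
import torch.nn as nn

class DMFMSketch(nn.Module):
    """Token mixing across concatenated modality tokens, then channel mixing,
    then global average pooling (GAP)."""
    def __init__(self, num_tokens, dim, hidden=256):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(              # mixes along the token axis
            nn.Linear(num_tokens, hidden), nn.GELU(), nn.Linear(hidden, num_tokens))
        self.channel_mlp = nn.Sequential(            # mixes along the channel axis
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):                            # x: (B, L, D), L = concat tokens
        h = self.norm1(x).transpose(1, 2)            # (B, D, L) for token mixing
        x = x + self.token_mlp(h).transpose(1, 2)    # MultiMixer-like step
        x = x + self.channel_mlp(self.norm2(x))      # Channel-MLP step
        return x.mean(dim=1)                         # GAP over tokens -> (B, D)
```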
4. Architecture and Training Regimen
Architecture Overview:
| Module | Parameter Count | Details |
|---|---|---|
| ViT Backbone | frozen | 12 layers, no parameter updates |
| STMA | ≈0.018 M/block | Inserted at 6 of the 12 ViT layers |
| DMFM | ≈0.5 M | Mixer + Channel-MLP at the network head |
| Prediction Head | trained | 3 conv branches: score, bbox, and offset maps |
| Total (trainable) | ∼11.9 M (all variants) | vs. 14.8 M for SDSTrack |
Only adapter modules and mixer are trained; backbone parameters remain fixed.
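Concretely, parameter-efficient tuning reduces to freezing the backbone and optimizing only the adapter, mixer, and head parameters. A minimal sketch, assuming hypothetical module names (`backbone`, `stma`, `dmfm`, `head`) and illustrative hyperparameter values:

```python
import torch

def build_optimizer(model):
    # Freeze the ViT backbone entirely (no parameter updates).
    for p in model.backbone.parameters():
        p.requires_grad = False

    # Collect only adapter/mixer/head parameters; the name keys are illustrative.
    trainable = [p for n, p in model.named_parameters()
                 if any(k in n for k in ("stma", "dmfm", "head"))]
    print(f"{sum(p.numel() for p in trainable) / 1e6:.1f} M trainable params")

    # lr and weight_decay values are assumptions, not taken from the paper.
    return torch.optim.AdamW(trainable, lr=1e-4, weight_decay=1e-4)
```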
Training Protocol:
- Training datasets: LasHeR (RGB-T), DepthTrack (RGB-D), VisEvent (RGB-E), mixed by 1:1:1 sampling.
- Inputs: 3 reference (template) frames, 2 search frames per segment.
- UBATrack-256: 128×128 template, 256×256 search, batch=16.
- UBATrack-384: 192×192 template, 384×384 search, batch=8.
- Optimizer: AdamW with weight decay; the learning rate is decayed after epoch 10, for 15 epochs in total.
- Loss: $\mathcal{L} = \mathcal{L}_{\mathrm{focal}} + \lambda_{\ell_1}\mathcal{L}_{\ell_1} + \lambda_{\mathrm{GIoU}}\mathcal{L}_{\mathrm{GIoU}}$, with focal loss for the classification score map, plus $\ell_1$ box regression and GIoU overlap terms weighted by $\lambda_{\ell_1}$ and $\lambda_{\mathrm{GIoU}}$ (a minimal sketch follows this list).
- Templates are updated online by uniform-interval sampling, $t_i = \lfloor i \cdot t / N \rfloor$ for $i = 1, \dots, N$, for current frame $t$ over $N$ templates.
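The loss and template schedule above can be sketched as follows. The λ values and the exact sampling rule are assumptions (the paper elides the weights), and box tensors are assumed to be in (x1, y1, x2, y2) format.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss, generalized_box_iou

def tracking_loss(score_map, gt_map, pred_boxes, gt_boxes,
                  lam_l1=5.0, lam_giou=2.0):  # weights assumed, not from the paper
    """Focal classification loss plus weighted l1 and GIoU box terms."""
    cls = sigmoid_focal_loss(score_map, gt_map, reduction="mean")
    l1 = F.l1_loss(pred_boxes, gt_boxes)
    giou = 1.0 - torch.diag(generalized_box_iou(pred_boxes, gt_boxes)).mean()
    return cls + lam_l1 * l1 + lam_giou * giou

def template_indices(t, num_templates=3):
    """Uniform-interval sampling of template frames over [0, t]."""
    if t < num_templates:
        return [0] * (num_templates - t) + list(range(1, t + 1))
    return [round(i * t / num_templates) for i in range(1, num_templates + 1)]
```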
5. Comparative Performance and Ablation Analysis
Benchmark Results
On six benchmarks covering three modality pairings, UBATrack achieves SOTA performance with significant parameter and speed efficiency.
Table: UBATrack and SOTA Comparison
| Method | Params (M) | LasHeR SR (%) | DepthTrack F (%) | VisEvent MSR (%) | FPS (V100) |
|---|---|---|---|---|---|
| UBATrack-384 | 11.9 | 60.1 | 67.3 | 62.7 | 18.4 |
| UBATrack-256 | 11.9 | 58.2 | 63.5 | 60.4 | 32.5 |
| SDSTrack | 14.8 | 53.1 | 61.4 | 59.7 | 17.9 |
| OneTracker | – | 53.8 | 60.9 | – | – |
Ablation and Component Analysis
- Component-wise Gains: Adding DMFM or STMA alone each leads to gains over baseline; their combination yields the best performance.
- Alternative Designs: Compared to attention-based and Mamba-MLP baselines, STMA with MultiFFT achieves higher accuracy with orders-of-magnitude fewer parameters (0.018 M vs. 42.5 M).
- STMA Depth: Six STMA layers optimize accuracy/FPS trade-off; deeper insertion reduces FPS with negligible further gain.
- DMFM Fusion: DMFM outperforms conv, MLP, and attention fusions on all tracking modalities.
6. Strengths, Limitations, and Extensions
Strengths:
- Unified tracking across RGB-T, RGB-D, and RGB-E with a single model and training schedule.
- Adapter-based parameter-efficient tuning enables frozen backbone, reducing compute/memory costs.
- Linear-complexity SSM captures long-range spatio-temporal and cross-modal dependencies.
- Dynamic multi-modal fusion enhances discriminative tracking, especially under challenging sensor conditions.
- Achieves SOTA across six datasets and maintains real-time inference (18–32 FPS).
Limitations:
- Increased inference latency compared to pure RGB baselines due to SSM and fusion overhead.
- Dependence on synchronized, aligned modalities; robustness to heavy misalignment or missing streams is unexplored.
- Fixed layer STMA insertion; adaptive strategies might improve further.
Future Directions:
- Integrating additional sensors (e.g., LiDAR, multi-modal hybrids) through new adapters.
- Learning template update schedules within SSM.
- Multi-scale/hierarchical SSMs for temporal granularity.
- Embedding memory modules for long-term tracking.
- Extending spatio-temporal adapters to related tasks such as action recognition or video segmentation.
UBATrack establishes that a parameter-efficient spatio-temporal state-space adapter, augmented by dynamic multi-modal feature fusion, enables unified, high-performance tracking across diverse sensing modalities while minimizing overhead and adaptation cost (Liang et al., 21 Jan 2026).