UBATrack: Unified Multi-Modal Tracking
- UBATrack is a unified multi-modal tracking framework that integrates efficient state space modeling and dynamic fusion to leverage RGB, thermal, depth, and event data.
- It employs a lightweight adapter-based tuning strategy that updates only about 12M parameters while keeping the ViT backbone frozen, ensuring parameter efficiency.
- Dynamic multi-modal feature mixing in UBATrack captures long-range spatio-temporal cues to yield robust tracking performance under challenging conditions such as occlusion and poor illumination.
UBATrack is a unified multi-modal object tracking framework designed to address the challenge of leveraging diverse sensing modalities (RGB, thermal, depth, event cameras) for robust tracking in video. By integrating a parameter-efficient Mamba state space model architecture with dynamic fusion and adapter-based parameter tuning, UBATrack handles RGB-T, RGB-D, and RGB-E tracking in a single model, capturing long-range spatio-temporal cues and cross-modal dependencies efficiently while minimizing training overhead (Liang et al., 21 Jan 2026).
1. Spatio-Temporal Multi-modal Tracking and Motivation
Multi-modal tracking extends conventional RGB-only object tracking by leveraging additional sensor streams: thermal infrared (T), depth (D), or event cameras (E). While RGB trackers (e.g., SiamFC, OSTrack) perform well in standard conditions, they often degrade under occlusion, poor illumination, or camouflage. Prior multi-modal trackers fall into three main categories:
- Modality-specific designs: TBSI (RGB-T), DeT (RGB-D), VisEvent (RGB-E), which lack generalizability for unseen modality combinations.
- Prompt-learning-based unified trackers: ViPT, ProTrack, which are more flexible but lack explicit spatio-temporal modeling.
- Dual-branch fully fine-tuned models: OneTracker, SDSTrack, which are parameter-heavy and computationally costly.
UBATrack addresses these limitations by (a) explicitly modeling long-range spatio-temporal and cross-modal dependencies using a Mamba-derived state space model (SSM), and (b) enabling parameter-efficient adaptation via lightweight adapters and a dynamic multi-modal fusion module. This approach maintains high modeling capacity while only updating ∼12M parameters and retaining a frozen ViT backbone.
2. Mathematical Core: State Space Model and STMA
2.1 Continuous-Discrete State Space Formulation
The foundation of UBATrack's temporal modeling is the structured state space model (SSM). In continuous time, an input sequence $x(t)$ evolves via

$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t),$$

where $h(t)$ is the hidden state, $A$ the transition matrix, $B$ the input projection, and $C$ the output projection. Discretizing with step $\Delta$ (zero-order hold):

$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\big(\exp(\Delta A) - I\big)\,\Delta B,$$
$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t.$$

Mamba extends this by making $B$, $C$, and $\Delta$ input-dependent via per-token parameterization, yielding a selective scan with linear complexity in sequence length.
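To make the recurrence concrete, here is a minimal NumPy sketch of a diagonal discretized SSM scan. It is illustrative only: it uses a fixed (non-selective) $A$, $B$, $C$, and $\Delta$ rather than Mamba's input-dependent parameterization and hardware-aware parallel scan.

```python
import numpy as np

def ssm_scan(x, A, B, C, delta):
    """Linear-time scan of a diagonal SSM discretized by zero-order hold.

    x: (T,) input sequence; A, B, C: (N,) diagonal SSM parameters; delta: step size.
    """
    A_bar = np.exp(delta * A)          # ZOH: A_bar = exp(delta * A) for diagonal A
    B_bar = (A_bar - 1.0) / A * B      # (exp(delta A) - I) A^{-1} B, elementwise
    h = np.zeros_like(A)
    ys = []
    for x_t in x:                      # recurrence: h_t = A_bar h_{t-1} + B_bar x_t
        h = A_bar * h + B_bar * x_t
        ys.append(float(C @ h))        # readout: y_t = C h_t
    return np.array(ys)

# Example: a stable 16-state SSM filtering a random length-100 signal.
y = ssm_scan(np.random.randn(100), A=-np.linspace(0.5, 2.0, 16),
             B=np.ones(16), C=np.ones(16) / 16, delta=0.1)
```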
2.2 Spatio-Temporal Mamba Adapter (STMA)
The Spatio-Temporal Mamba Adapter (STMA) jointly models cross-modal fusion and temporal sequence dependencies. For the fused template and search token sequences at each transformer layer, each STMA comprises:
- Mamba Block (MBA): temporal modeling via a Mamba update with normalization, dropout, and a residual connection, $X' = X + \mathrm{Dropout}(\mathrm{Mamba}(\mathrm{LN}(X)))$.
- MultiFFT Block: frequency-domain channel mixing that applies an FFT, complex Einstein matrix multiplication (EMM), nonlinearities, and an inverse FFT, $X'' = X' + \mathrm{IFFT}\big(\sigma(\mathrm{EMM}(\mathrm{FFT}(\mathrm{LN}(X'))))\big)$.
STMA modules are inserted at six of the 12 ViT layers, thus propagating spatio-temporal and cross-modal cues at all major network stages.
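Since the STMA composition is described operationally, a minimal PyTorch sketch may help. The module names, dropout rate, and the per-channel complex filter standing in for the EMM are assumptions, not the paper's exact design; the actual Mamba block could come from, e.g., the `mamba_ssm` package.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiFFTBlock(nn.Module):
    """FFT over tokens -> learnable complex per-channel filter (EMM stand-in)
    -> inverse FFT, with nonlinearity and residual connection."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.filt = nn.Parameter(0.02 * torch.randn(dim, dtype=torch.cfloat))

    def forward(self, x):                            # x: (B, L, D)
        h = torch.fft.rfft(self.norm(x), dim=1)      # complex (B, L//2+1, D)
        h = h * self.filt                            # channel-wise complex mixing
        h = torch.fft.irfft(h, n=x.shape[1], dim=1)  # back to the token domain
        return x + F.gelu(h)

class STMASketch(nn.Module):
    """Adapter at one ViT layer: residual Mamba update, then MultiFFT mixing."""
    def __init__(self, dim, mamba: nn.Module):
        super().__init__()
        self.norm, self.mamba = nn.LayerNorm(dim), mamba
        self.drop = nn.Dropout(0.1)                  # rate is an assumption
        self.fft_block = MultiFFTBlock(dim)

    def forward(self, x):                            # x: (B, L, D) fused tokens
        x = x + self.drop(self.mamba(self.norm(x)))  # MBA block
        return self.fft_block(x)                     # MultiFFT block
```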
3. Dynamic Multi-modal Feature Mixer (DMFM)
Following ViT+STMA encoding, the Dynamic Multi-modal Feature Mixer (DMFM) fuses the search tokens concatenated across modalities. The mixer implements:
- MultiMixer Block: simultaneous token and channel mixing by dynamic segment-wise mixing across the width, height, and channel dimensions (W, H, C).
- Channel-MLP Block: Channel-wise mixing for final discriminative fusion.
Global average pooling (GAP) finalizes the fused representation for downstream prediction.
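To make the token/channel mixing concrete, here is a Mixer-style sketch ending in GAP. It deliberately simplifies the paper's dynamic segment-wise (W, H, C) mixing into plain token and channel MLPs, and all names and sizes are assumptions.

```python
import torch
import torch.nn as nn

class DMFMSketch(nn.Module):
    """Token mixing across concatenated modality tokens, then channel mixing,
    then global average pooling (GAP)."""
    def __init__(self, num_tokens, dim, hidden=256):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(              # mixes along the token axis
            nn.Linear(num_tokens, hidden), nn.GELU(), nn.Linear(hidden, num_tokens))
        self.channel_mlp = nn.Sequential(            # mixes along the channel axis
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):                            # x: (B, L, D), L = concat tokens
        h = self.norm1(x).transpose(1, 2)            # (B, D, L) for token mixing
        x = x + self.token_mlp(h).transpose(1, 2)    # MultiMixer-like step
        x = x + self.channel_mlp(self.norm2(x))      # Channel-MLP step
        return x.mean(dim=1)                         # GAP over tokens -> (B, D)
```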
4. Architecture and Training Regimen
Architecture Overview:
| Module | Parameter Count | Details |
|---|---|---|
| ViT Backbone | frozen | 12 layers, no parameter updates |
| STMA | ≈0.018 M/block | Inserted at 6 of the 12 ViT layers |
| DMFM | ≈0.5 M | Mixer + Channel-MLP at the network head |
| Prediction Head | trained | 3 conv branches: score, bbox, and offset maps |
| Total (trainable) | ∼11.9 M (all variants) | vs. 14.8 M for SDSTrack |
Only adapter modules and mixer are trained; backbone parameters remain fixed.
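Concretely, parameter-efficient tuning reduces to freezing the backbone and optimizing only the adapter, mixer, and head parameters. A minimal sketch, assuming hypothetical module names (`backbone`, `stma`, `dmfm`, `head`) and illustrative hyperparameter values:

```python
import torch

def build_optimizer(model):
    # Freeze the ViT backbone entirely (no parameter updates).
    for p in model.backbone.parameters():
        p.requires_grad = False

    # Collect only adapter/mixer/head parameters; the name keys are illustrative.
    trainable = [p for n, p in model.named_parameters()
                 if any(k in n for k in ("stma", "dmfm", "head"))]
    print(f"{sum(p.numel() for p in trainable) / 1e6:.1f} M trainable params")

    # lr and weight_decay values are assumptions, not taken from the paper.
    return torch.optim.AdamW(trainable, lr=1e-4, weight_decay=1e-4)
```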
Training Protocol:
- Training datasets: LasHeR (RGB-T), DepthTrack (RGB-D), VisEvent (RGB-E), mixed by 1:1:1 sampling.
- Inputs: 3 reference (template) frames, 2 search frames per segment.
- UBATrack-256: 128×128 template, 256×256 search, batch=16.
- UBATrack-384: 192×192 template, 384×384 search, batch=8.
- Optimizer: AdamW with weight decay; the learning rate is decayed after epoch 10, for 15 epochs in total.
- Loss: $\mathcal{L} = \mathcal{L}_{\mathrm{focal}} + \lambda_{\ell_1}\mathcal{L}_{\ell_1} + \lambda_{\mathrm{GIoU}}\mathcal{L}_{\mathrm{GIoU}}$, with focal loss for the classification score map, plus $\ell_1$ box regression and GIoU overlap terms weighted by $\lambda_{\ell_1}$ and $\lambda_{\mathrm{GIoU}}$ (a minimal sketch follows this list).
- Templates are updated online by uniform-interval sampling, $t_i = \lfloor i \cdot t / N \rfloor$ for $i = 1, \dots, N$, for current frame $t$ over $N$ templates.
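The loss and template schedule above can be sketched as follows. The λ values and the exact sampling rule are assumptions (the paper elides the weights), and box tensors are assumed to be in (x1, y1, x2, y2) format.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss, generalized_box_iou

def tracking_loss(score_map, gt_map, pred_boxes, gt_boxes,
                  lam_l1=5.0, lam_giou=2.0):  # weights assumed, not from the paper
    """Focal classification loss plus weighted l1 and GIoU box terms."""
    cls = sigmoid_focal_loss(score_map, gt_map, reduction="mean")
    l1 = F.l1_loss(pred_boxes, gt_boxes)
    giou = 1.0 - torch.diag(generalized_box_iou(pred_boxes, gt_boxes)).mean()
    return cls + lam_l1 * l1 + lam_giou * giou

def template_indices(t, num_templates=3):
    """Uniform-interval sampling of template frames over [0, t]."""
    if t < num_templates:
        return [0] * (num_templates - t) + list(range(1, t + 1))
    return [round(i * t / num_templates) for i in range(1, num_templates + 1)]
```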
5. Comparative Performance and Ablation Analysis
Benchmark Results
On six benchmarks covering three modality pairings, UBATrack achieves SOTA performance with significant parameter and speed efficiency.
Table: UBATrack and SOTA Comparison
| Method | Params (M) | LasHeR SR (%) | DepthTrack F (%) | VisEvent MSR (%) | FPS (V100) |
|---|---|---|---|---|---|
| UBATrack-384 | 11.9 | 60.1 | 67.3 | 62.7 | 18.4 |
| UBATrack-256 | 11.9 | 58.2 | 63.5 | 60.4 | 32.5 |
| SDSTrack | 14.8 | 53.1 | 61.4 | 59.7 | 17.9 |
| OneTracker | – | 53.8 | 60.9 | – | – |
Ablation and Component Analysis
- Component-wise Gains: Adding DMFM or STMA alone each leads to gains over baseline; their combination yields the best performance.
- Alternative Designs: Compared to attention-based and Mamba-MLP baselines, STMA with MultiFFT achieves higher accuracy with orders-of-magnitude fewer parameters (0.018 M vs. 42.5 M).
- STMA Depth: Six STMA layers optimize accuracy/FPS trade-off; deeper insertion reduces FPS with negligible further gain.
- DMFM Fusion: DMFM outperforms conv, MLP, and attention fusions on all tracking modalities.
6. Strengths, Limitations, and Extensions
Strengths:
- Unified tracking across RGB-T, RGB-D, and RGB-E with a single model and training schedule.
- Adapter-based parameter-efficient tuning enables frozen backbone, reducing compute/memory costs.
- Linear-complexity SSM captures long-range spatio-temporal and cross-modal dependencies.
- Dynamic multi-modal fusion enhances discriminative tracking, especially under challenging sensor conditions.
- Achieves SOTA across six datasets and maintains real-time inference (18–32 FPS).
Limitations:
- Increased inference latency compared to pure RGB baselines due to SSM and fusion overhead.
- Dependence on synchronized, aligned modalities; robustness to heavy misalignment or missing streams is unexplored.
- Fixed layer STMA insertion; adaptive strategies might improve further.
Future Directions:
- Integrating additional sensors (e.g., LiDAR, multi-modal hybrids) through new adapters.
- Learning template update schedules within SSM.
- Multi-scale/hierarchical SSMs for temporal granularity.
- Embedding memory modules for long-term tracking.
- Extending spatio-temporal adapters to related tasks such as action recognition or video segmentation.
UBATrack establishes that a parameter-efficient spatio-temporal state-space adapter, augmented by dynamic multi-modal feature fusion, enables unified, high-performance tracking across diverse sensing modalities while minimizing overhead and adaptation cost (Liang et al., 21 Jan 2026).