
UBATrack: Unified Multi-Modal Tracking

Updated 28 January 2026
  • UBATrack is a unified multi-modal tracking framework that integrates efficient state space modeling and dynamic fusion to leverage RGB, thermal, depth, and event data.
  • It employs a lightweight adapter-based tuning strategy that updates only about 12M parameters while keeping the ViT backbone frozen, ensuring parameter efficiency.
  • Dynamic multi-modal feature mixing in UBATrack captures long-range spatio-temporal cues to yield robust tracking performance under challenging conditions such as occlusion and poor illumination.

UBATrack is a unified multi-modal object tracking framework designed to address the challenges of leveraging diverse sensing modalities (RGB, thermal, depth, event cameras) for robust visual tracking in video. By integrating a parameter-efficient Mamba state space architecture with dynamic fusion and adapter-based parameter tuning, UBATrack handles RGB-T, RGB-D, and RGB-E tracking tasks in a single model, capturing long-range spatio-temporal cues and cross-modal dependencies efficiently while minimizing training overhead (Liang et al., 21 Jan 2026).

1. Spatio-Temporal Multi-modal Tracking and Motivation

Multi-modal tracking extends conventional RGB-only object tracking by leveraging additional sensor streams: thermal infrared (T), depth (D), or event cameras (E). While RGB trackers (e.g., SiamFC, OSTrack) perform well in standard conditions, they often degrade under occlusion, poor illumination, or camouflage. Prior multi-modal trackers fall into three main categories:

  • Modality-specific designs: TBSI (RGB-T), DeT (RGB-D), VisEvent (RGB-E), which lack generalizability for unseen modality combinations.
  • Prompt-learning-based unified trackers: ViPT, ProTrack, which are more flexible but lack explicit spatio-temporal modeling.
  • Dual-branch fully fine-tuned models: OneTracker, SDSTrack, which are parameter-heavy and computationally costly.

UBATrack addresses these limitations by (a) explicitly modeling long-range spatio-temporal and cross-modal dependencies using a Mamba-derived state space model (SSM), and (b) enabling parameter-efficient adaptation via lightweight adapters and a dynamic multi-modal fusion module. This approach maintains high modeling capacity while only updating ∼12M parameters and retaining a frozen ViT backbone.

2. Mathematical Core: State Space Model and STMA

2.1 Continuous-Discrete State Space Formulation

The foundation of UBATrack's temporal modeling is the structured state space model (SSM). In continuous time, an input signal x(t) evolves via:

h'(t) = A h(t) + B x(t), \quad y(t) = C h(t)

where h(t)\in\mathbb{R}^N is the hidden state, A\in\mathbb{R}^{N\times N} the transition matrix, B\in\mathbb{R}^{N\times 1} the input projection, and C\in\mathbb{R}^{1\times N} the output projection. Discretizing with step size \Delta under a zero-order hold:

\overline A = \exp(\Delta A), \quad \overline B = (\Delta A)^{-1}(\exp(\Delta A) - I)\,\Delta B

h_t = \overline A h_{t-1} + \overline B x_t, \quad y_t = C h_t

Mamba extends this by making B, C, and \Delta input-dependent via per-token parameterization.
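The discretized recurrence above can be sketched in a few lines of NumPy. This is an illustrative sequential scan, not the paper's hardware-aware Mamba implementation; following S4/Mamba convention, it assumes a diagonal transition matrix so the discretization is elementwise:

```python
import numpy as np

def discretize(a, b, delta):
    """Zero-order-hold discretization for a diagonal SSM.
    a: (N,) diagonal of A; b: (N,) input projection B; delta: step size."""
    a_bar = np.exp(delta * a)          # A_bar = exp(dA), elementwise for diagonal A
    b_bar = (a_bar - 1.0) / a * b      # (dA)^{-1}(exp(dA) - I) dB, elementwise
    return a_bar, b_bar

def ssm_scan(a_bar, b_bar, c, x):
    """Run h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t over a 1-D input sequence."""
    h = np.zeros_like(a_bar)
    ys = []
    for x_t in x:
        h = a_bar * h + b_bar * x_t
        ys.append(float(c @ h))
    return np.array(ys)

# One-state example: h' = -h + x with constant input x = 1 gives y(t) = 1 - exp(-t)
a_bar, b_bar = discretize(np.array([-1.0]), np.array([1.0]), delta=0.01)
y = ssm_scan(a_bar, b_bar, np.array([1.0]), np.ones(500))
```

Because the zero-order hold is exact for piecewise-constant inputs, the final output here matches 1 − e⁻⁵ to machine precision. Mamba's selectivity would additionally make `b_bar` and `delta` functions of each `x_t`, which is omitted for clarity.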

2.2 Spatio-Temporal Mamba Adapter (STMA)

The Spatio-Temporal Mamba Adapter (STMA) jointly models cross-modal fusion and temporal sequence dependencies. For fused token sequences x_i (template z, search s) at transformer layer i, each STMA comprises:

  1. Mamba Block (MBA): Temporal modeling via the Mamba update with normalization, dropout, and a residual connection:

\bar x_i = \mathrm{DropOut}(\mathcal F_i^{\mathrm{MBA}}(\mathrm{Norm}(x_i))) + x_i

  2. MultiFFT Block: Frequency-domain channel mixing via FFT, complex Einstein matrix multiplication (EMM), a nonlinearity, and IFFT:

(\bar x_{i,R}, \bar x_{i,I}) = \mathrm{FFT}(\bar x_i)

(\tilde x_{i,R}, \tilde x_{i,I}) = \mathrm{EMM}((\bar x_{i,R}, \bar x_{i,I}), W, B)

y_i = \sigma(\tilde x_{i,R}, \tilde x_{i,I})

(\hat x_{i,R}, \hat x_{i,I}) = \mathrm{EMM}(y_i, W, B)

\hat x_i = \mathrm{IFFT}(\hat x_{i,R}, \hat x_{i,I})

STMA is inserted at layers \{2, 4, 6, 8, 10, 12\} of a 12-layer ViT, propagating spatio-temporal and cross-modal cues at all major network stages.
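The MultiFFT pipeline (FFT → EMM → σ → EMM → IFFT) can be sketched with NumPy. The token/channel sizes, weight initialization, sharing one weight pair across both EMMs, omitting the bias term, and choosing σ = ReLU are all illustrative assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
L, D = 16, 8                               # token count and channel width (illustrative)
x_bar = rng.standard_normal((L, D))        # stand-in for the Mamba block output

# FFT along the token axis, split into real/imaginary parts
xf = np.fft.rfft(x_bar, axis=0)
xr, xi = xf.real, xf.imag

# Complex Einstein matrix multiplication (EMM) over channels;
# the paper uses separate learned (W, B) per EMM, shared here for brevity
Wr = rng.standard_normal((D, D)) * 0.1
Wi = rng.standard_normal((D, D)) * 0.1
yr = xr @ Wr - xi @ Wi                     # real part of the complex matmul
yi = xr @ Wi + xi @ Wr                     # imaginary part

# nonlinearity sigma applied to both parts (ReLU assumed)
yr, yi = np.maximum(yr, 0.0), np.maximum(yi, 0.0)

# second EMM, then inverse FFT back to the token domain
zr = yr @ Wr - yi @ Wi
zi = yr @ Wi + yi @ Wr
x_hat = np.fft.irfft(zr + 1j * zi, n=L, axis=0)   # (L, D) frequency-mixed features
```

Mixing in the frequency domain makes each output token depend on every input token at linear-in-channels parameter cost, which is the motivation for replacing attention here.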

3. Dynamic Multi-modal Feature Mixer (DMFM)

Following ViT+STMA encoding, the Dynamic Multi-modal Feature Mixer (DMFM) fuses the concatenated search tokens \hat x_N^s across modalities:

\hat b = \mathrm{GAP}(\mathcal F^{\mathrm{Mix}}(\hat x_N^s))

where \mathcal F^{\mathrm{Mix}} implements:

  • MultiMixer Block: Simultaneous token and channel mixing via dynamic segment-wise mixing across width, height, and channels (s_w, s_h, s_c):

s = \mathcal F^{\mathrm{MixB}}(\mathrm{Norm}(\hat x_N^s)) + \hat x_N^s,

s = \psi(s_w + s_h + s_c)

  • Channel-MLP Block: Channel-wise mixing for final discriminative fusion.

Global average pooling (GAP) finalizes the fused representation for downstream prediction.
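A toy version of the s = ψ(s_w + s_h + s_c) mixing followed by GAP can make the data flow concrete. Cyclic shifts stand in for the learned dynamic segment-wise mixing, and tanh stands in for ψ; both are assumptions for illustration only:

```python
import numpy as np

def dmfm_sketch(x):
    """x: (H, W, C) fused search-region features.
    Cyclic shifts emulate width/height/channel mixing; the real DMFM
    learns dynamic segment-wise mixing weights instead."""
    s_w = np.roll(x, shift=1, axis=1)       # mix along width
    s_h = np.roll(x, shift=1, axis=0)       # mix along height
    s_c = np.roll(x, shift=1, axis=2)       # mix along channels
    s = np.tanh(s_w + s_h + s_c)            # psi ~ tanh (assumption)
    return s.mean(axis=(0, 1))              # global average pooling -> (C,)

b_hat = dmfm_sketch(np.ones((8, 8, 16)))    # (16,) fused descriptor
```

The point of the structure, preserved here, is that every output channel aggregates evidence along all three axes before the pooled descriptor reaches the prediction head.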

4. Architecture and Training Regimen

Architecture Overview:

| Module | Parameters | Details |
|---|---|---|
| ViT Backbone | frozen | 12 layers, no parameter updates |
| STMA | ≈0.018 M per block | inserted at 6 ViT layers |
| DMFM | ≈0.5 M | Mixer + Channel-MLP at the network head |
| Prediction Head | 3 conv branches | score, bbox, and offset maps |
| Total trainable | ∼11.9 M (all variants) | cf. SDSTrack: 14.8 M |

Only adapter modules and mixer are trained; backbone parameters remain fixed.

Training Protocol:

  • Training datasets: LasHeR (RGB-T), DepthTrack (RGB-D), VisEvent (RGB-E), mixed by 1:1:1 sampling.
  • Inputs: 3 reference (template) frames, 2 search frames per segment.
  • UBATrack-256: 128×128 template, 256×256 search, batch=16.
  • UBATrack-384: 192×192 template, 384×384 search, batch=8.
  • Optimizer: AdamW, weight decay 1\times10^{-4}, initial learning rate 2\times10^{-4}, decayed by 0.1 after epoch 10; 15 epochs total.
  • Loss:

L = L_{cls} + \lambda_1 L_1 + \lambda_2 L_{GIoU}

with focal loss for L_{cls}, an L_1 box-regression term, and a GIoU overlap term; \lambda_1 = 5, \lambda_2 = 2.

  • Templates are updated online by uniform interval sampling:

\{0\} \cup \{\, i \cdot T + \lfloor T/2 \rfloor \mid i = 0, \ldots, M-1 \,\}, \quad T = \lfloor C_i / M \rfloor

for current frame index C_i and M templates.
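The uniform-interval schedule above translates directly into code; `template_indices` is a hypothetical helper name, not from the paper's codebase:

```python
def template_indices(current_frame, num_templates):
    """Frame indices {0} U {i*T + floor(T/2) | i = 0..M-1}, with T = floor(C_i / M)."""
    T = current_frame // num_templates
    if T == 0:                      # too early in the video: only the initial template
        return [0]
    return [0] + [i * T + T // 2 for i in range(num_templates)]
```

For frame 100 with M = 4 templates this yields [0, 12, 37, 62, 87]: the initial frame plus the midpoints of four equal intervals, so templates stay spread across the whole trajectory as the video lengthens.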

5. Comparative Performance and Ablation Analysis

Benchmark Results

On six benchmarks covering three modality pairings, UBATrack achieves SOTA performance with significant parameter and speed efficiency.

Table: UBATrack and SOTA Comparison

| Method | Params (M) | LasHeR SR (%) | DepthTrack F-score (%) | VisEvent MSR (%) | FPS (V100) |
|---|---|---|---|---|---|
| UBATrack-384 | 11.9 | 60.1 | 67.3 | 62.7 | 18.4 |
| UBATrack-256 | 11.9 | 58.2 | 63.5 | 60.4 | 32.5 |
| SDSTrack | 14.8 | 53.1 | 61.4 | 59.7 | 17.9 |
| OneTracker | | 53.8 | 60.9 | | |

Ablation and Component Analysis

  • Component-wise Gains: Adding DMFM or STMA alone each leads to gains over baseline; their combination yields the best performance.
  • Alternative Designs: Compared to attention-based and Mamba-MLP baselines, STMA with MultiFFT achieves higher accuracy with orders-of-magnitude fewer parameters (0.018 M vs. 42.5 M).
  • STMA Depth: Six STMA layers optimize accuracy/FPS trade-off; deeper insertion reduces FPS with negligible further gain.
  • DMFM Fusion: DMFM outperforms conv, MLP, and attention fusions on all tracking modalities.

6. Strengths, Limitations, and Extensions

Strengths:

  • Unified tracking across RGB-T, RGB-D, and RGB-E with a single model and training schedule.
  • Adapter-based parameter-efficient tuning enables frozen backbone, reducing compute/memory costs.
  • Linear-complexity SSM captures long-range spatio-temporal and cross-modal dependencies.
  • Dynamic multi-modal fusion enhances discriminative tracking, especially under challenging sensor conditions.
  • Achieves SOTA across six datasets and maintains real-time inference (18–32 FPS).

Limitations:

  • Increased inference latency compared to pure RGB baselines due to SSM and fusion overhead.
  • Dependence on synchronized, aligned modalities; robustness to heavy misalignment or missing streams is unexplored.
  • Fixed layer STMA insertion; adaptive strategies might improve further.

Future Directions:

  • Integrating additional sensors (e.g., LiDAR, multi-modal hybrids) through new adapters.
  • Learning template update schedules within SSM.
  • Multi-scale/hierarchical SSMs for temporal granularity.
  • Embedding memory modules for long-term tracking.
  • Extending spatio-temporal adapters to related tasks such as action recognition or video segmentation.

UBATrack establishes that a parameter-efficient spatio-temporal state-space adapter, augmented by dynamic multi-modal feature fusion, enables unified, high-performance tracking across diverse sensing modalities while minimizing overhead and adaptation cost (Liang et al., 21 Jan 2026).
