Multi-Modal Visual Object Tracking

Updated 3 July 2026

MMVOT is a multi-sensor fusion problem that integrates diverse modalities such as RGB, thermal, depth, and event to reliably track objects in dynamic environments.
It employs fusion strategies at early, middle, or late stages, leveraging complementary cues to overcome challenges like occlusion and low illumination.
Unified tracking architectures use adapter modules and continual learning to efficiently manage sensor drops and enhance robustness across conditions.

Multi-Modal Visual Object Tracking (MMVOT) formalizes the problem of estimating the state of arbitrary objects in a video stream by fusing data from two or more modalities, such as RGB, thermal infrared, event, depth, near-infrared, sonar, or language. The core motivation is to leverage the complementary characteristics of heterogeneous sensors (appearance cues, motion saliency, semantic priors, or physical structure) to track objects robustly in conditions where a single modality may fail due to occlusion, low illumination, background clutter, or dynamic scene changes. MMVOT thus encompasses a range of architectures, fusion strategies, and evaluation protocols, and is characterized by rapidly evolving benchmarks, unified model paradigms, and an active debate over the benefits and pitfalls of modality fusion (Tang et al., 18 Aug 2025, Liang et al., 21 Jan 2026, 2012.04176, Wang et al., 2024).

1. Definitions, Scope, and Problem Formulation

Let $M$ denote the number of sensor modalities, with modality $m$ capturing observation $x^{(m)}_t$ at video frame $t$. The MMVOT problem is to estimate the sequence of target states $\hat s_{1:T}$ (usually 2D/3D boxes or masks) that maximizes the posterior given all observations:

$\{\,\hat s_{1:T}\,\} = \arg\max_{s_{1:T}} p\bigl(s_{1:T} \mid \{x_{1:T}^{(m)}\}_{m=1}^M\bigr).$

Assuming a Markov temporal structure and conditional independence among modalities, this decomposes as:

$\hat s_t = \arg\max_{s} p(s \mid \hat s_{t-1}) \prod_{m=1}^M p\bigl(x_t^{(m)} \mid s\bigr).$
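
As a minimal sketch of this factorization, the snippet below scores a small discrete set of candidate states by adding a log motion prior to the per-modality log-likelihoods and taking the argmax. The 1-D state, the Gaussian motion prior, the function name `map_state`, and the likelihood values are hypothetical and chosen only for illustration.

```python
import math

def map_state(candidates, prev_state, log_likelihoods, sigma=2.0):
    """Return the candidate maximizing log p(s | s_prev) + sum_m log p(x_m | s).

    candidates:      candidate states (1-D positions here, an assumption)
    prev_state:      previous estimate s_hat at frame t-1
    log_likelihoods: one list per modality with log p(x_m | s) per candidate
    sigma:           std of a hypothetical Gaussian motion prior
    """
    best_state, best_score = None, -math.inf
    for i, s in enumerate(candidates):
        # Log of an unnormalized Gaussian motion prior p(s | s_prev).
        log_prior = -0.5 * ((s - prev_state) / sigma) ** 2
        # Conditional independence turns the product over modalities into a sum of logs.
        log_lik = sum(ll[i] for ll in log_likelihoods)
        score = log_prior + log_lik
        if score > best_score:
            best_state, best_score = s, score
    return best_state, best_score

# Illustrative values only: two modalities, three candidate positions.
candidates = [8.0, 10.0, 12.0]
ll_rgb = [math.log(0.2), math.log(0.3), math.log(0.5)]
ll_tir = [math.log(0.1), math.log(0.2), math.log(0.7)]
state, score = map_state(candidates, 10.0, [ll_rgb, ll_tir])
```

Working in log space avoids numerical underflow when many modalities or frames are multiplied together.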

In paradigm-specific instantiations—e.g., feature similarity frameworks—fusion typically appears as:

$\hat s_t = \arg\max_{s \in \Omega_t} \sum_{m=1}^M w_m \, \langle f^{(m)}(z), f^{(m)}(x_t(s)) \rangle,$

where $f^{(m)}$ is the feature extractor for modality $m$, $z$ is the reference template, $\Omega_t$ is the set of candidate states at frame $t$, and $w_m$ are fusion weights (Wang et al., 2024, Tang et al., 18 Aug 2025).
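
The weighted similarity rule can be sketched as follows, assuming features have already been extracted as plain vectors. The dictionary layout, the function names, and the feature values are hypothetical; real trackers compute these correlations over dense feature maps.

```python
def dot(a, b):
    """Inner product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))

def fused_argmax(template_feats, candidate_feats, weights):
    """Pick the candidate maximizing sum_m w_m * <f_m(z), f_m(x_t(s))>.

    template_feats:  {modality: template feature vector}
    candidate_feats: {modality: [feature vector for each candidate state]}
    weights:         {modality: fusion weight w_m}
    """
    n = len(next(iter(candidate_feats.values())))
    scores = [
        sum(weights[m] * dot(template_feats[m], candidate_feats[m][i]) for m in weights)
        for i in range(n)
    ]
    best = max(range(n), key=scores.__getitem__)
    return best, scores

# Illustrative values only: two modalities, two candidates.
template = {"rgb": [1.0, 0.0], "tir": [0.0, 1.0]}
cands = {"rgb": [[0.9, 0.1], [0.2, 0.8]], "tir": [[0.1, 0.9], [0.7, 0.3]]}
best, scores = fused_argmax(template, cands, {"rgb": 0.5, "tir": 0.5})
```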

MMVOT covers several modality pairings:

RGB-Thermal (RGBT)
RGB-Depth (RGBD)
RGB-Event (RGBE)
RGB-NIR, RGB-Language, RGB-LiDAR, RGB-Sonar

The output may be a 2D box, 3D pose, binary segmentation mask, or association for multi-object settings (Tang et al., 18 Aug 2025, 2012.04176, Wang et al., 2024).

2. Sensor Modalities, Data Collection, and Annotation

MMVOT datasets require careful hardware design for synchronized acquisition and spatial alignment of heterogeneous modalities:

Thermal Infrared (T): Captured by passive IR cameras, often used for low-illumination or night-time scenes.
Depth (D): Provided by structured light, ToF, or active stereo devices (Kinect, ZED). Depth maps capture geometric structure.
Event (E): High-frequency neuromorphic cameras output streams of $(x, y, t, p)$ events (pixel location, timestamp, polarity), capturing instantaneous motion cues with minimal latency (Sun et al., 2024).
Near-Infrared (NIR), Sonar (S), Language (L): NIR is used in low-light imaging; sonar provides geometric cues in underwater or obstacle-rich settings; language provides semantic guidance via textual or spoken descriptors (Tang et al., 18 Aug 2025, Wang et al., 2024, Li et al., 2024).

Annotation challenges for MMVOT include:

Precise spatial calibration and extrinsic alignment of reference frames.
Synchronized acquisition across modalities (triggered capture/electronic sync).
Annotating multi-modal bounding boxes or masks, which may require transformation between sensor spaces.
Long-tail bias in object categories for non-RGB modalities (people/vehicles dominate; animal classes are rare) (Tang et al., 18 Aug 2025, §4.2).
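
To illustrate the transformation between sensor spaces mentioned above, the sketch below transfers a box from one sensor's image plane to another using a 3x3 planar homography. The homography model is an assumption that holds only approximately (for example, for distant scenes or nearly co-located sensors), and the matrix `H` and function names are hypothetical.

```python
def apply_homography(H, x, y):
    """Map a pixel (x, y) through a 3x3 homography given as nested lists."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return u / w, v / w

def transfer_box(H, box):
    """Map (x_min, y_min, x_max, y_max) to the target sensor.

    All four corners are warped, then enclosed in an axis-aligned box,
    because a homography does not keep boxes axis-aligned in general.
    """
    x0, y0, x1, y1 = box
    corners = [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]
    pts = [apply_homography(H, x, y) for x, y in corners]
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return (min(xs), min(ys), max(xs), max(ys))

# Hypothetical thermal-to-RGB mapping: 2x scale plus a (10, 5) pixel offset.
H = [[2, 0, 10], [0, 2, 5], [0, 0, 1]]
rgb_box = transfer_box(H, (10, 20, 30, 40))
```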

Language, if present, may be annotated at the frame, segment, or sequence level, and requires harmonization with temporal object tracks (Tang et al., 18 Aug 2025, Li et al., 2024).

3. Fusion Strategies and Model Architectures

MMVOT algorithms are categorized by the stage and nature of cross-modal information integration:

Fusion Level Taxonomy

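Early fusion (input-level): Stack modalities along the channel axis and process them jointly with a single network $\Phi$, typically for closely correlated sensor pairs (e.g., RGB-NIR, RGB-Thermal):

$\hat s_t = \Phi\bigl(\mathrm{concat}\bigl(x_t^{(1)}, \dots, x_t^{(M)}\bigr)\bigr)$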

Early fusion can blur physical distinctions between modalities and is less effective on strongly heterogeneous data (Wang et al., 2024, Tang et al., 18 Aug 2025, 2012.04176).
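
A minimal sketch of input-level stacking is given below, with images represented as channel-first nested lists `[C][H][W]`. The representation and the function name `stack_channels` are assumptions for illustration; a deep-learning framework would use a tensor concatenation along the channel axis instead.

```python
def stack_channels(*modalities):
    """Concatenate channel-first images [C][H][W] along the channel axis."""
    h = len(modalities[0][0])
    w = len(modalities[0][0][0])
    stacked = []
    for img in modalities:
        for channel in img:
            # Early fusion presumes pixel-level alignment across sensors.
            if len(channel) != h or any(len(row) != w for row in channel):
                raise ValueError("modalities must be aligned to the same H x W")
            stacked.append(channel)
    return stacked

# Illustrative 2x2 inputs: a 3-channel RGB image and a 1-channel thermal image.
rgb = [[[0.1, 0.2], [0.3, 0.4]] for _ in range(3)]
tir = [[[0.9, 0.8], [0.7, 0.6]]]
fused_input = stack_channels(rgb, tir)  # 4 channels
```

The shape check makes the alignment requirement explicit: stacking is only meaningful when each pixel location refers to the same scene point in every modality.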

Middle fusion (feature-level): Separate backbones per modality extract features $f^{(m)}(x_t^{(m)})$, which are merged by a fusion module $\mathcal{G}$ (concatenation, gating, attention, or correlation):

$F_t = \mathcal{G}\bigl(f^{(1)}(x_t^{(1)}), \dots, f^{(M)}(x_t^{(M)})\bigr)$

Gating, attention-based, and frequency-aware modules are used to adaptively weight reliable cues in changing conditions (Liang et al., 21 Jan 2026, Xu et al., 30 Jun 2025, Hu et al., 10 Feb 2025, 2012.04176, Tang et al., 18 Aug 2025).
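
One simple form of adaptive weighting is a scalar sigmoid gate that blends two feature vectors, sketched below. In a trained model the gate logits would come from a small learned network; here they are passed in directly as hypothetical reliability scores, and the function names are illustrative rather than taken from any cited tracker.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gated_fusion(feat_rgb, feat_aux, logit_rgb, logit_aux):
    """Blend two feature vectors with a scalar gate g in (0, 1).

    g approaches 1 when the RGB logit dominates and 0 when the
    auxiliary modality is judged more reliable.
    """
    g = sigmoid(logit_rgb - logit_aux)
    fused = [g * a + (1.0 - g) * b for a, b in zip(feat_rgb, feat_aux)]
    return fused, g

# Equal reliability gives an even blend of the two feature vectors.
fused, g = gated_fusion([1.0, 3.0], [3.0, 1.0], 0.0, 0.0)

# A low RGB logit (for example, a dark scene) shifts the blend toward the auxiliary features.
fused_dark, g_dark = gated_fusion([1.0, 3.0], [3.0, 1.0], -4.0, 4.0)
```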

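Late fusion (decision-level): Each modality is processed independently to produce a hypothesis $\hat s_t^{(m)}$, and the final hypotheses are fused by a weighted sum with weights $w_m$ or by confidence re-ranking:

$\hat s_t = \sum_{m=1}^M w_m \, \hat s_t^{(m)}$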

Late fusion is used when cross-modal calibration is unreliable or sensor heterogeneity is extreme (Wang et al., 2024).
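
A decision-level weighted sum can be sketched as a confidence-weighted average of per-modality boxes. The `(x, y, w, h)` box format, the confidence values, and the function name `late_fuse` are assumptions for illustration.

```python
def late_fuse(boxes, confidences):
    """Confidence-weighted average of per-modality boxes.

    boxes:       {modality: (x, y, w, h)} from independent trackers
    confidences: {modality: non-negative confidence score}
    """
    total = sum(confidences[m] for m in boxes)
    if total <= 0:
        raise ValueError("at least one modality needs positive confidence")
    return tuple(
        sum(confidences[m] * boxes[m][k] for m in boxes) / total
        for k in range(4)
    )

# Illustrative values: the thermal tracker is trusted three times as much as RGB.
boxes = {"rgb": (100, 100, 40, 40), "tir": (108, 104, 40, 48)}
fused_box = late_fuse(boxes, {"rgb": 0.25, "tir": 0.75})
```

Setting a modality's confidence to zero removes it from the average, which is one simple way to handle a dropped sensor at the decision level.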

Architectural Patterns

One-stream vs. two-stream backbones: Some models share weights between the RGB branch and the auxiliary-modality (X) branch (“equal modeling,” APTrack (Hu et al., 10 Feb 2025); OneTrackerV2 (Hong et al., 5 May 2026)); others use specialized backbones per stream (MMHT, ANN/SNN duality (Sun et al., 2024); spiking networks for events (Tang et al., 18 Aug 2025)).
Adapters, Prompts, and Lightweight Fusion Modules: State-of-the-art models now emphasize parameter-efficient adapters or prompt-tuning modules placed within frozen vision foundation models (ViT, OSTrack), e.g.:
- Symmetric adapters and self-distillation (SDSTrack (Hou et al., 2024))
- Dual visual+memory adapters (VMDA (Xu et al., 30 Jun 2025))
- Adapter-tuned spatio-temporal state space modules (UBATrack (Liang et al., 21 Jan 2026))
- Adaptive modality interaction using learnable tokens (APTrack (Hu et al., 10 Feb 2025))
- Mixture-of-Experts for modality/temporal separation (OneTrackerV2 (Hong et al., 5 May 2026))
Sequence-to-Sequence and Prompt-based Unification: Modern “unified trackers” such as SeqTrackv2 (Chen et al., 2023) and UBATrack (Liang et al., 21 Jan 2026) use a single model/prompt interface to handle all modality/task pairs with parameter-sharing.
Language Cue Integration: Multi-granularity language guidance via distillation or prompt tokens is used to improve association and robustness in both single- and multi-object tracking, as in LG-MOT (Li et al., 2024) and SeqTrackv2 (Chen et al., 2023).
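
To make the adapter idea concrete, the sketch below shows a generic bottleneck adapter: a down-projection, a ReLU, an up-projection, and a residual connection around a frozen feature. It is a generic pattern, not the specific design of SDSTrack, VMDA, or UBATrack; the dimensions, weights, and zero-initialized up-projection are assumptions for illustration.

```python
def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def adapter(h, W_down, W_up):
    """Bottleneck adapter applied to a frozen-backbone feature h.

    Only W_down (r x d) and W_up (d x r) would be trained; with r << d
    the number of trainable parameters stays small.
    """
    z = [max(0.0, a) for a in matvec(W_down, h)]  # down-project, then ReLU
    delta = matvec(W_up, z)                       # up-project back to d dims
    return [x + d for x, d in zip(h, delta)]      # residual connection

# Hypothetical sizes: feature dimension d = 4, bottleneck r = 2.
h = [1.0, -2.0, 3.0, 4.0]
W_down = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
W_up_zero = [[0.0, 0.0] for _ in range(4)]
out = adapter(h, W_down, W_up_zero)  # equals h: a zero up-projection leaves h unchanged
```

Initializing the up-projection to zero is a common choice because the adapted model then starts from the behavior of the frozen backbone.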

4. Benchmark Datasets and Protocols

Major MMVOT datasets capture various sensor pairings, sequence lengths, alignments, and object categories. A selective listing, with details from (Tang et al., 18 Aug 2025, Wang et al., 2024, 2012.04176, Tang et al., 14 Aug 2025, Xu et al., 30 Jun 2025):

Dataset	Modalities	#Seq	#Frames	Alignment	Annotation
PTB	RGB, Depth	100	21.5K	✓	Box
DepthTrack	RGB, Depth	200	294.6K	✓	Box
LasHeR	RGB, Thermal	979	220.7K	✓	Box
RGBT234	RGB, Thermal	234	116.6K	✓	Box
VisEvent	RGB, Event	820	371.1K	✓	Box
TNL2K	RGB, Language	2000	1.2M	–	Box+Lang
COESOT	Frame, Event	827	527	✓	Box
UniBench300	RGB, T/D/E	300	368.1K	✓	Box

Evaluation protocols are generally inherited from classical tracking:

Precision Rate (PR): fraction of frames with center error below threshold.
Normalized Precision Rate (NPR): as PR, but center error normalized by diagonal.
Success Rate (SR): fraction of frames with IoU above a threshold (typically 0.5).
EAO, Accuracy, Robustness: VOT protocols, especially for short-term and long-term re-initialization tasks.
F-score, Recall: used on DepthTrack and multi-object scenarios.
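
The first three metrics can be sketched as follows for boxes in `(x, y, w, h)` format. The 20-pixel precision threshold is the conventional default in classical tracking benchmarks; the 0.2 normalized threshold and the use of the ground-truth box diagonal for normalization are assumptions made here for illustration.

```python
import math

def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def center_error(a, b):
    """Euclidean distance between box centers, in pixels."""
    return math.hypot((a[0] + a[2] / 2) - (b[0] + b[2] / 2),
                      (a[1] + a[3] / 2) - (b[1] + b[3] / 2))

def precision_rate(preds, gts, threshold=20.0):
    """Fraction of frames whose center error is below the pixel threshold."""
    return sum(center_error(p, g) < threshold for p, g in zip(preds, gts)) / len(gts)

def normalized_precision_rate(preds, gts, threshold=0.2):
    """As precision_rate, with the error divided by the ground-truth box diagonal."""
    hits = sum(center_error(p, g) / math.hypot(g[2], g[3]) < threshold
               for p, g in zip(preds, gts))
    return hits / len(gts)

def success_rate(preds, gts, threshold=0.5):
    """Fraction of frames whose IoU exceeds the overlap threshold."""
    return sum(iou(p, g) > threshold for p, g in zip(preds, gts)) / len(gts)

# Illustrative two-frame sequence: one perfect prediction, one complete miss.
gts = [(0, 0, 10, 10), (0, 0, 10, 10)]
preds = [(0, 0, 10, 10), (30, 0, 10, 10)]
pr, npr, sr = precision_rate(preds, gts), normalized_precision_rate(preds, gts), success_rate(preds, gts)
```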

Many benchmarks are modality-specific; UniBench300 (Tang et al., 14 Aug 2025) combines RGBT, RGBD, and RGBE sequences under a harmonized evaluation.

5. Unified Models and Continual Unification

Recent advances focus on unified models that generalize across tasks and modalities without retraining or hand-tuning:

Unified Prompt and Adapter Models: Models like SeqTrackv2 (Chen et al., 2023), UBATrack (Liang et al., 21 Jan 2026), OneTrackerV2 (Hong et al., 5 May 2026), and APTrack (Hu et al., 10 Feb 2025) present architectures where all modalities are merged via token-level, adapter, or prompt interaction, and tracked with a single set of parameters.
Meta Merger and Mixture-of-Experts: OneTrackerV2 uses a meta-embedding to absorb all modality features, and dual MoE blocks to segregate temporal and cross-modal information, further regularizing via router clustering losses (Hong et al., 5 May 2026).
Continual Unification and Knowledge Replay: Instead of parallel (all-at-once) training over mixed modalities, serial continual learning mitigates catastrophic forgetting and performance degradation in unified MMVOT models. Experiments on the UniBench300 benchmark show that continual unification with replay/distillation better preserves task-level accuracy, especially when modality heterogeneity is high (e.g., RGBT vs. RGBE) (Tang et al., 14 Aug 2025).
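
The replay idea can be sketched as a batch sampler that mixes samples of the task currently being learned with samples stored from earlier tasks. This is a generic rehearsal scheme under stated assumptions, not the exact procedure of Tang et al.; the buffer contents, the replay ratio, and the function name are hypothetical, and the distillation term is omitted.

```python
import random

def replay_batches(current_data, replay_buffer, batch_size, replay_ratio, rng):
    """Yield batches mixing current-task samples with replayed earlier-task samples.

    replay_ratio is the target fraction of each batch drawn from the buffer.
    """
    if not 0.0 <= replay_ratio < 1.0:
        raise ValueError("replay_ratio must be in [0, 1)")
    n_replay = int(round(batch_size * replay_ratio)) if replay_buffer else 0
    n_current = batch_size - n_replay
    if n_current < 1:
        raise ValueError("batch must contain at least one current-task sample")
    for start in range(0, len(current_data), n_current):
        batch = list(current_data[start:start + n_current])
        # Rehearse earlier tasks so their performance is less likely to be forgotten.
        batch += rng.sample(replay_buffer, min(n_replay, len(replay_buffer)))
        yield batch

# Hypothetical serial schedule: RGBT was learned first, RGBE is the current task.
buffer = [("RGBT", i) for i in range(5)]
current = [("RGBE", i) for i in range(12)]
batches = list(replay_batches(current, buffer, 8, 0.25, random.Random(0)))
```

With a batch size of 8 and a replay ratio of 0.25, each full batch holds six current-task samples and two replayed ones.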

The following table compares state-of-the-art (SOTA) unified models:

Model	Fusion Mechanism	SOTA Domains	# Trainable Params	FPS
UBATrack (Liang et al., 21 Jan 2026)	STMA Adapter + DMFM	RGB-T, RGB-D, RGB-E	11.9M	18–32
APTrack (Hu et al., 10 Feb 2025)	Equal modeling + AMI	RGBT, RGBD, RGBE	–	50.5
OneTrackerV2 (Hong et al., 5 May 2026)	Meta Merger + DMoE	RGB, RGBT, RGBD, RGBE, RGBN	80.2M / 40M (cmp)	72/159
SDSTrack (Hou et al., 2024)	Symmetric adapters + SD loss	RGB-T, RGB-D, RGB-E	14.8M	20.9
SeqTrackv2 (Chen et al., 2023)	Task-prompted transformer	All (incl. RGB-L)	–	5–40

6. Strengths, Limitations, and Critical Analysis

Strengths

Robustness to Adverse Conditions: Multi-modal fusion consistently outperforms single-modality baselines in low light (thermal dominant), fast-motion (event dominant), occlusion, and sensor degradation scenarios (Tang et al., 18 Aug 2025, Xu et al., 30 Jun 2025, Sun et al., 2024).
Parameter Efficiency: Adapter- and prompt-tuned models train only 10–15% of the parameters updated by full fine-tuning, allowing rapid adaptation and deployment (Hou et al., 2024, Liang et al., 21 Jan 2026, Xu et al., 30 Jun 2025).
Modality Dropout Resilience: Unified token or adapter designs degrade gracefully when one modality is corrupted or missing, maintaining >90% accuracy relative to full-modality input (Hong et al., 5 May 2026, Hou et al., 2024).

Limitations

Modality Quality Sensitivity: Poor-quality or misaligned auxiliary modalities can degrade performance below unimodal baselines, especially with naive fusion strategies (Tang et al., 18 Aug 2025).
Benchmark and Dataset Gaps: Public benchmarks lack diversity in object categories, especially for rare or non-human classes; large-scale aligned datasets for language and LiDAR fusion remain limited (Tang et al., 18 Aug 2025, Wang et al., 2024).
Resource Constraints: Multi-modal models are heavier than their RGB-only counterparts; efficient hardware-aware fusion remains an open area (Tang et al., 18 Aug 2025, Wang et al., 2024).
Domain Shift: Generalization across environments, sensors, and annotation protocols remains a key research agenda, motivating data-driven and continual unification approaches (Tang et al., 14 Aug 2025).

A plausible implication is that the full potential of MMVOT may be realized by large-scale unified models that combine visual, temporal, and semantic priors in a modular way while remaining robust to sensor failure and occlusion.

7. Future Directions and Open Challenges

Research frontiers in MMVOT focus on:

Physics-Induced and Modality-Aware Architectures: Embedding sensor physical models (e.g., heat flow, photon statistics, event sparsity) directly into backbones for improved modality-specific robustness (Tang et al., 18 Aug 2025).
Real-Time, Edge Deployment: Designing lightweight (<1 ms latency) fusion blocks for hardware-constrained platforms such as UAV or robotics (Tang et al., 18 Aug 2025, Wang et al., 2024).
Continual Learning and Lifelong Unification: Serial continual frameworks with replay/distillation tackle the forgetting problem in multi-task, multi-modal domains (Tang et al., 14 Aug 2025).
Open-World and Language-Conditioned Tracking: Integrating multi-modal LLMs as flexible, real-time semantic fusion partners for context-aware, zero-shot tracking (Chen et al., 2023, Li et al., 2024, Tang et al., 18 Aug 2025).
Comprehensive, Multi-Modal Datasets: Collecting larger, well-annotated, long-tail datasets in underrepresented domains (animal, underwater, multi-sensor) (Tang et al., 18 Aug 2025, Wang et al., 2024).
Adaptive and Selective Fusion: Quality-aware gating, physics-based fusion, and online learning to dynamically select the most reliable modalities at inference (Tang et al., 18 Aug 2025, Liang et al., 21 Jan 2026, Tang et al., 14 Aug 2025).

In conclusion, MMVOT is defined by its pursuit of robust, scalable, and generalizable tracking through principled multi-modal data fusion, unified model paradigms, and a rapidly evolving landscape of benchmarks and tasks (Tang et al., 18 Aug 2025, 2012.04176, Wang et al., 2024, Hong et al., 5 May 2026, Liang et al., 21 Jan 2026, Tang et al., 14 Aug 2025, Hu et al., 10 Feb 2025).