Multi-Modal Visual Object Tracking

Updated 3 July 2026
  • MMVOT is a multi-sensor fusion problem that integrates diverse modalities such as RGB, thermal, depth, and event to reliably track objects in dynamic environments.
  • It employs fusion strategies at early, middle, or late stages, leveraging complementary cues to overcome challenges like occlusion and low illumination.
  • Unified tracking architectures use adapter modules and continual learning to efficiently manage sensor drops and enhance robustness across conditions.

Multi-Modal Visual Object Tracking (MMVOT) formalizes the problem of estimating the state of arbitrary objects in a video stream by fusing data from two or more modalities, such as RGB, thermal infrared, event, depth, near-infrared, sonar, or language. The core motivation is to leverage the complementary characteristics of heterogeneous sensors (appearance cues, motion saliency, semantic priors, or physical structure) to track objects robustly in conditions where a single modality may fail due to occlusion, low illumination, background clutter, or dynamic scene changes. MMVOT thus encompasses a range of architectures, fusion strategies, and evaluation protocols, and is characterized by rapidly evolving benchmarks, unified model paradigms, and an active debate over the benefits and pitfalls of modality fusion (Tang et al., 18 Aug 2025, Liang et al., 21 Jan 2026, 2012.04176, Wang et al., 2024).

1. Definitions, Scope, and Problem Formulation

Let $M$ denote the number of sensor modalities capturing observations $x^{(m)}_t$ at video frame $t$. The MMVOT problem is to estimate the sequence of target states $\hat s_{1:T}$ (usually 2D/3D boxes or masks) maximizing the posterior over all observations:

$$\{\,\hat s_{1:T}\,\} = \arg\max_{s_{1:T}} p\bigl(s_{1:T} \mid \{x_{1:T}^{(m)}\}_{m=1}^M\bigr).$$

Assuming a Markov temporal structure and conditional independence among modalities given the state, this reduces to the per-frame estimate:

$$\hat s_t = \arg\max_{s} \; p(s \mid \hat s_{t-1}) \prod_{m=1}^M p\bigl(x_t^{(m)} \mid s\bigr).$$
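The per-frame update above can be sketched numerically over a small discrete set of candidate states. This is only an illustration under stated assumptions, not the method of any cited tracker: the 1D candidate grid, the Gaussian motion prior, and the likelihood values are made up, and `map_update` is a hypothetical helper name.

```python
import numpy as np

def map_update(log_prior, log_likelihoods):
    """Greedy MAP step: argmax_s [log p(s | s_prev) + sum_m log p(x^(m) | s)].

    log_prior:       shape (K,), log p(s_k | s_prev) for K candidate states.
    log_likelihoods: shape (M, K), log p(x_t^(m) | s_k) for each modality m.
    """
    # The product over modalities becomes a sum in log space.
    log_post = log_prior + log_likelihoods.sum(axis=0)
    return int(np.argmax(log_post)), log_post

# Hypothetical 1D candidate positions and a Gaussian motion prior around s_prev.
candidates = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
s_prev = 1.0
log_prior = -0.5 * (candidates - s_prev) ** 2

# Made-up likelihoods: RGB is uninformative (e.g. low light), thermal peaks at 2.
lik_rgb = np.array([0.2, 0.2, 0.2, 0.2, 0.2])
lik_thermal = np.array([0.05, 0.15, 0.6, 0.15, 0.05])
log_lik = np.log(np.stack([lik_rgb, lik_thermal]))

best, log_post = map_update(log_prior, log_lik)
print("estimated state:", candidates[best])
```

In this toy setting the flat RGB likelihood contributes only a constant, so the thermal evidence moves the estimate away from the prior mode at 1.0 to the candidate at 2.0.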

In paradigm-specific instantiations (e.g., feature similarity frameworks), fusion typically appears as:

$$\hat s_t = \arg\max_{s \in \Omega_t} \sum_{m=1}^M w_m \, \bigl\langle f^{(m)}(z),\, f^{(m)}(x_t(s)) \bigr\rangle,$$

where $f^{(m)}$ is the feature extractor for modality $m$, $z$ is the reference template, $\Omega_t$ is the set of candidate states at frame $t$, $x_t(s)$ is the candidate region at state $s$, and $w_m$ are fusion weights (Wang et al., 2024, Tang et al., 18 Aug 2025).
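As a rough sketch of the weighted similarity score, the snippet below evaluates the sum of per-modality inner products for a handful of candidates. The feature vectors and weights are hand-picked for illustration, and `fused_similarity` is a hypothetical name; real trackers obtain these features from learned backbones.

```python
import numpy as np

def fused_similarity(template_feats, candidate_feats, weights):
    """Score each candidate by sum_m w_m * <f^(m)(z), f^(m)(x_t(s))>.

    template_feats:  list of M arrays, each of shape (D_m,)
    candidate_feats: list of M arrays, each of shape (K, D_m) for K candidates
    weights:         sequence of M fusion weights
    """
    scores = np.zeros(candidate_feats[0].shape[0])
    for w, z, x in zip(weights, template_feats, candidate_feats):
        scores += w * (x @ z)  # inner product of every candidate with the template
    return scores

# Hand-picked toy features for two modalities (RGB, thermal) and three candidates.
z_rgb = np.array([1.0, 0.0])
z_thermal = np.array([0.0, 1.0])
x_rgb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
x_thermal = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

scores = fused_similarity([z_rgb, z_thermal], [x_rgb, x_thermal], [0.5, 0.5])
best = int(np.argmax(scores))
print(scores, best)
```

Candidate 1 wins here because it matches the template reasonably well in both modalities, whereas candidates 0 and 2 each match in only one.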

MMVOT encompasses various data tuples:

  • RGB-Thermal (RGBT)
  • RGB-Depth (RGBD)
  • RGB-Event (RGBE)
  • RGB-NIR, RGB-Language, RGB-LiDAR, RGB-Sonar

The output may be a 2D box, 3D pose, binary segmentation mask, or association for multi-object settings (Tang et al., 18 Aug 2025, 2012.04176, Wang et al., 2024).

2. Sensor Modalities, Data Collection, and Annotation

MMVOT datasets require careful hardware design for synchronized acquisition and spatial alignment of heterogeneous modalities:

  • Thermal Infrared (T): Captured by passive IR cameras, often used for low-illumination or night-time scenes.
  • Depth (D): Provided by structured light, ToF, or active stereo devices (Kinect, ZED). Depth maps capture geometric structure.
  • Event (E): High-frequency neuromorphic cameras output asynchronous streams of $(x, y, t, p)$ events (pixel location, timestamp, polarity), capturing instantaneous motion cues with minimal latency (Sun et al., 2024).
  • Near-Infrared (NIR), Sonar (S), Language (L): NIR is used in low-light imaging; sonar provides geometric cues in underwater or obstacle-rich settings; language provides semantic guidance via textual or spoken descriptors (Tang et al., 18 Aug 2025, Wang et al., 2024, Li et al., 2024).

Annotation challenges for MMVOT include:

  • Precise spatial calibration and extrinsic alignment of reference frames.
  • Synchronized acquisition across modalities (triggered capture/electronic sync).
  • Annotating multi-modal bounding boxes or masks, which may require transformation between sensor spaces.
  • Long-tail bias in object categories for non-RGB modalities (people/vehicles dominate; animal classes are rare) (Tang et al., 18 Aug 2025, §4.2).
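The transformation between sensor spaces mentioned above can be illustrated with a planar homography that carries a box annotated in one image into another. This is a sketch under strong assumptions: a single 3x3 homography is only valid for roughly planar scenes or nearly co-located cameras, the matrix values are invented, and `warp_box` is a hypothetical helper.

```python
import numpy as np

def warp_box(box_xyxy, H):
    """Map an axis-aligned box (x1, y1, x2, y2) through a 3x3 homography.

    Returns the axis-aligned box enclosing the four warped corners.
    """
    x1, y1, x2, y2 = box_xyxy
    corners = np.array([
        [x1, y1, 1.0],
        [x2, y1, 1.0],
        [x2, y2, 1.0],
        [x1, y2, 1.0],
    ]).T                               # shape (3, 4), homogeneous coordinates
    warped = H @ corners
    warped = warped[:2] / warped[2]    # back to Cartesian coordinates
    return np.array([warped[0].min(), warped[1].min(),
                     warped[0].max(), warped[1].max()])

# Invented thermal-to-RGB mapping: 2x scale plus a translation of (10, 5) pixels.
H_thermal_to_rgb = np.array([[2.0, 0.0, 10.0],
                             [0.0, 2.0, 5.0],
                             [0.0, 0.0, 1.0]])

box_thermal = [20.0, 30.0, 60.0, 80.0]
box_rgb = warp_box(box_thermal, H_thermal_to_rgb)
print(box_rgb)
```

In practice the mapping would come from a calibration procedure, and scenes with significant depth variation need per-pixel registration rather than one global matrix.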

Language, if present, may be annotated at the frame, segment, or sequence level, and requires harmonization with temporal object tracks (Tang et al., 18 Aug 2025, Li et al., 2024).

3. Fusion Strategies and Model Architectures

MMVOT algorithms are categorized by the stage and nature of cross-modal information integration:

Fusion Level Taxonomy

  • Early fusion (input-level): Stack modalities along the channel axis and process them jointly with a single backbone $f$, typically for closely correlated sensor pairs (e.g., RGB-NIR, RGB-Thermal):

$$F_t = f\bigl([\,x_t^{(1)};\, \dots;\, x_t^{(M)}\,]\bigr)$$

Early fusion can blur physical distinctions between modalities and is less effective on strongly heterogeneous data (Wang et al., 2024, Tang et al., 18 Aug 2025, 2012.04176).

  • Middle fusion (feature-level): Separate backbones per modality extract features $f^{(m)}(x_t^{(m)})$, which are merged by concatenation, gating, attention, or correlation modules:

$$F_t = \mathrm{Fuse}\bigl(f^{(1)}(x_t^{(1)}),\, \dots,\, f^{(M)}(x_t^{(M)})\bigr)$$

Gating, attention-based, and frequency-aware modules are used to adaptively weight reliable cues in changing conditions (Liang et al., 21 Jan 2026, Xu et al., 30 Jun 2025, Hu et al., 10 Feb 2025, 2012.04176, Tang et al., 18 Aug 2025).

  • Late fusion (decision-level): Each modality is processed independently, and the final hypotheses $\hat s_t^{(m)}$ are fused by weighted sum or confidence re-ranking:

$$\hat s_t = \sum_{m=1}^M w_m \, \hat s_t^{(m)}$$

Late fusion is used when modality calibration is unreliable or the sensors are extremely heterogeneous (Wang et al., 2024).
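The three fusion levels can be contrasted in a few lines of NumPy. Everything below is a toy stand-in chosen for illustration: the random frames, the `backbone` function (global average pooling followed by a fixed random linear map), the element-wise sigmoid gate, and the box and confidence values are assumptions, not components of any cited tracker.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 8, 8
rgb = rng.random((3, H, W))       # 3-channel RGB frame (random stand-in)
thermal = rng.random((1, H, W))   # 1-channel thermal frame (random stand-in)

def backbone(x, out_dim, seed):
    """Toy feature extractor: global average pool, then a fixed random linear map."""
    w = np.random.default_rng(seed).standard_normal((out_dim, x.shape[0]))
    return w @ x.mean(axis=(1, 2))

# Early fusion: stack along the channel axis and use one shared backbone.
early_input = np.concatenate([rgb, thermal], axis=0)   # shape (4, H, W)
early_feat = backbone(early_input, out_dim=16, seed=1)

# Middle fusion: separate backbones, then a gated blend of the two feature vectors.
f_rgb = backbone(rgb, out_dim=16, seed=2)
f_thermal = backbone(thermal, out_dim=16, seed=3)
gate = 1.0 / (1.0 + np.exp(-(f_rgb - f_thermal)))      # toy gate with values in (0, 1)
mid_feat = gate * f_rgb + (1.0 - gate) * f_thermal

# Late fusion: each modality proposes its own box; blend them by confidence.
box_rgb, conf_rgb = np.array([10.0, 12.0, 50.0, 60.0]), 0.2
box_thermal, conf_thermal = np.array([14.0, 16.0, 54.0, 64.0]), 0.8
w = np.array([conf_rgb, conf_thermal]) / (conf_rgb + conf_thermal)
late_box = w[0] * box_rgb + w[1] * box_thermal

print(early_input.shape, mid_feat.shape, late_box)
```

The design trade-off is visible even here: early fusion commits to a single input tensor, middle fusion keeps modality-specific extractors and decides per feature how much to trust each, and late fusion only needs the final outputs to live in a common coordinate frame.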

Architectural Patterns

4. Benchmark Datasets and Protocols

Major MMVOT datasets capture various sensor pairings, sequence lengths, alignments, and object categories. A selective listing, with details from (Tang et al., 18 Aug 2025, Wang et al., 2024, 2012.04176, Tang et al., 14 Aug 2025, Xu et al., 30 Jun 2025):

| Dataset | Modalities | #Seq | #Frames | Alignment | Annotation |
|---|---|---|---|---|---|
| PTB | RGB, Depth | 100 | 21.5K | | Box |
| DepthTrack | RGB, Depth | 200 | 294.6K | | Box |
| LasHeR | RGB, Thermal | 979 | 220.7K | | Box |
| RGBT234 | RGB, Thermal | 234 | 116.6K | | Box |
| VisEvent | RGB, Event | 820 | 371.1K | | Box |
| TNL2K | RGB, Language | 2000 | 1.2M | | Box+Lang |
| COESOT | Frame, Event | 827 | 527 | | Box |
| UniBench300 | RGB, T/D/E | 300 | 368.1K | | Box |

Evaluation protocols are generally inherited from classical tracking:

  • Precision Rate (PR): fraction of frames whose center location error is below a pixel threshold.
  • Normalized Precision Rate (NPR): as PR, but with the center error normalized by the box diagonal.
  • Success Rate (SR): fraction of frames whose IoU with the ground truth exceeds a threshold (typically 0.5).
  • EAO, Accuracy, Robustness: VOT protocols, especially for short-term and long-term re-initialization tasks.
  • F-score, Recall: used on DepthTrack and multi-object scenarios.
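The first three metrics in the list above can be computed directly from predicted and ground-truth boxes. The sketch below follows the descriptions given here; the `(x, y, w, h)` box format, the 20-pixel and 0.2 thresholds, normalization by the ground-truth diagonal, and the toy boxes are assumptions for illustration, and individual benchmarks may define the details differently.

```python
import numpy as np

def iou_xywh(a, b):
    """IoU for boxes in (x, y, w, h) format; a and b have shape (N, 4)."""
    x1 = np.maximum(a[:, 0], b[:, 0])
    y1 = np.maximum(a[:, 1], b[:, 1])
    x2 = np.minimum(a[:, 0] + a[:, 2], b[:, 0] + b[:, 2])
    y2 = np.minimum(a[:, 1] + a[:, 3], b[:, 1] + b[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = a[:, 2] * a[:, 3] + b[:, 2] * b[:, 3] - inter
    return inter / union

def center_error(a, b):
    """Euclidean distance between box centers, per frame."""
    ca = a[:, :2] + a[:, 2:] / 2
    cb = b[:, :2] + b[:, 2:] / 2
    return np.linalg.norm(ca - cb, axis=1)

def precision_rate(pred, gt, thresh=20.0):
    return float(np.mean(center_error(pred, gt) < thresh))

def normalized_precision_rate(pred, gt, thresh=0.2):
    diag = np.linalg.norm(gt[:, 2:], axis=1)   # ground-truth box diagonal
    return float(np.mean(center_error(pred, gt) / diag < thresh))

def success_rate(pred, gt, thresh=0.5):
    return float(np.mean(iou_xywh(pred, gt) > thresh))

# Three toy frames: a perfect hit, a partial overlap, and a complete miss.
gt = np.array([[0, 0, 10, 10], [0, 0, 10, 10], [0, 0, 10, 10]], dtype=float)
pred = np.array([[0, 0, 10, 10], [5, 0, 10, 10], [30, 0, 10, 10]], dtype=float)

pr = precision_rate(pred, gt)
npr = normalized_precision_rate(pred, gt)
sr = success_rate(pred, gt)
print(pr, npr, sr)
```

The partial-overlap frame counts as a hit for PR (5-pixel error) but not for NPR or SR, which shows why the metrics are usually reported together.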

Many benchmarks are modality-specific; UniBench300 (Tang et al., 14 Aug 2025) provides an amalgam of RGBT, RGBD, and RGBE with harmonized evaluation.

5. Unified Multi-Modal Tracking Paradigms

Recent advances focus on unified models that generalize across tasks and modalities without retraining or hand-tuning:

  • Unified Prompt and Adapter Models: Models like SeqTrackv2 (Chen et al., 2023), UBATrack (Liang et al., 21 Jan 2026), OneTrackerV2 (Hong et al., 5 May 2026), and APTrack (Hu et al., 10 Feb 2025) present architectures where all modalities are merged via token-level, adapter, or prompt interaction, and tracked with a single set of parameters.
  • Meta Merger and Mixture-of-Experts: OneTrackerV2 uses a meta-embedding to absorb all modality features, and dual MoE blocks to segregate temporal and cross-modal information, further regularizing via router clustering losses (Hong et al., 5 May 2026).
  • Continual Unification and Knowledge Replay: Instead of parallel (all-at-once) training over mixed modalities, serial continual learning mitigates catastrophic forgetting and performance degradation in unified MMVOT models. Experiments on the UniBench300 benchmark show that continual unification with replay/distillation better preserves task-level accuracy, especially when modality heterogeneity is high (e.g., RGBT vs. RGBE) (Tang et al., 14 Aug 2025).
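To make the adapter idea concrete, the sketch below shows a generic residual bottleneck adapter that injects auxiliary-modality tokens into frozen RGB tokens. It is not the architecture of UBATrack, SDSTrack, or any other cited model: the class name, the dimensions, the zero-initialized up-projection, and the pass-through behavior when the auxiliary sensor is missing are all illustrative assumptions.

```python
import numpy as np

class BottleneckAdapter:
    """Generic residual adapter: out = rgb + ReLU(aux @ W_down) @ W_up."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = np.random.default_rng(seed)
        self.w_down = rng.standard_normal((dim, bottleneck)) * 0.02
        # Zero-initialized up-projection: the adapter starts as an identity
        # mapping, so the frozen RGB features are unchanged before training.
        self.w_up = np.zeros((bottleneck, dim))

    def __call__(self, rgb_tokens, aux_tokens=None):
        if aux_tokens is None:
            # Auxiliary sensor dropped: fall back to the RGB tokens alone.
            return rgb_tokens
        hidden = np.maximum(aux_tokens @ self.w_down, 0.0)   # ReLU bottleneck
        return rgb_tokens + hidden @ self.w_up

    def num_params(self):
        return self.w_down.size + self.w_up.size

rng = np.random.default_rng(1)
rgb_tokens = rng.standard_normal((4, 768))   # 4 tokens from a frozen RGB backbone
aux_tokens = rng.standard_normal((4, 768))   # matching thermal/depth/event tokens

adapter = BottleneckAdapter(dim=768, bottleneck=16)
fused = adapter(rgb_tokens, aux_tokens)
rgb_only = adapter(rgb_tokens)               # simulated sensor drop
print(fused.shape, adapter.num_params())
```

Only the two small projection matrices would be trained in such a setup, which is why adapter-based designs can keep the trainable parameter count far below that of the backbone.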

A comparative table highlighting SOTA unified models:

| Model | Fusion Mechanism | SOTA Domains | # Trainable Params | FPS |
|---|---|---|---|---|
| UBATrack (Liang et al., 21 Jan 2026) | STMA Adapter + DMFM | RGB-T, RGB-D, RGB-E | 11.9M | 18–32 |
| APTrack (Hu et al., 10 Feb 2025) | Equal modeling + AMI | RGBT, RGBD, RGBE | | 50.5 |
| OneTrackerV2 (Hong et al., 5 May 2026) | Meta Merger + DMoE | RGB, RGBT, RGBD, RGBE, RGBN | 80.2M / 40M (cmp) | 72/159 |
| SDSTrack (Hou et al., 2024) | Symmetric adapters + SD loss | RGB-T, RGB-D, RGB-E | 14.8M | 20.9 |
| SeqTrackv2 (Chen et al., 2023) | Task-prompted transformer | All (incl. RGB-L) | | 5–40 |

6. Strengths, Limitations, and Critical Analysis

Strengths

Limitations

  • Modality Quality Sensitivity: Poor-quality or misaligned auxiliary modalities can degrade performance below unimodal baselines, especially with naive fusion strategies (Tang et al., 18 Aug 2025).
  • Benchmark and Dataset Gaps: Public benchmarks lack diversity in object categories, especially for rare or non-human classes; large-scale aligned datasets for language and LiDAR fusion remain limited (Tang et al., 18 Aug 2025, Wang et al., 2024).
  • Resource Constraints: Multi-modal models are heavier than their RGB-only counterparts; efficient hardware-aware fusion remains an open area (Tang et al., 18 Aug 2025, Wang et al., 2024).
  • Domain Shift: Generalization across environments, sensors, and annotation protocols remains a key research agenda, motivating data-driven and continual unification approaches (Tang et al., 14 Aug 2025).

A plausible implication is that the full potential of MMVOT will be realized by large-scale, unified models that combine visual, temporal, and semantic priors in a modular way and remain robust to sensor failure and occlusion.

7. Future Directions and Open Challenges

Research frontiers in MMVOT focus on:

In conclusion, MMVOT is defined by its pursuit of robust, scalable, and generalizable tracking through principled multi-modal data fusion, unified model paradigms, and a rapidly evolving landscape of benchmarks and tasks (Tang et al., 18 Aug 2025, 2012.04176, Wang et al., 2024, Hong et al., 5 May 2026, Liang et al., 21 Jan 2026, Tang et al., 14 Aug 2025, Hu et al., 10 Feb 2025).
