Multi-Modal IMM: Fusion for Tracking & Forecasting

Updated 2 May 2026

Multi-Modal IMM is an estimation framework that fuses multiple state-space models, multiple sensor or data modalities, or both, to handle switching dynamics and irregularly sampled data.
It runs parallel model-conditioned filters and combines their outputs with likelihood-based or learned weights; reported applications include visual SLAM, multi-object tracking, and time series forecasting.
Reported gains include mean squared error reductions averaging 6.71% in irregular multimodal forecasting (Chang et al., 12 Jun 2025) and improved pose and tracking precision during abrupt motion-pattern transitions (Tian et al., 2024).

A Multi-Modal Interacting Multiple Model (IMM) algorithm refers to any estimation or forecasting architecture that extends the IMM principle to handle multiple physical, statistical, or sensory models (as in classical IMM), multiple data modalities (e.g., numeric, text, visual), or both. Multi-modal IMMs have emerged in diverse domains, including object tracking, visual SLAM, multimodal time series forecasting, cooperative localization with probabilistic sensors, and trajectory prediction. The essential tenet is to run parallel model-conditioned filters, each tailored to a behavioral, measurement, or data-modality hypothesis, and to fuse their outputs probabilistically using mixture theory grounded in Bayesian statistics or adaptive data-driven strategies.

1. Mathematical Foundations of the IMM Algorithm

At its core, the IMM method targets systems whose governing dynamics, noise profiles, or measurement models are best described by a family of $M$ parallel state-space models indexed by mode $j$. Each mode $j$ is a linear-Gaussian state-space model, and switching between modes follows a discrete-time Markov chain:

$x_{k+1}^{(j)} = F^{(j)} x_k^{(j)} + G^{(j)} w_k^{(j)},\qquad z_k = H^{(j)} x_k^{(j)} + v_k^{(j)}$

with process and measurement noise $w_k^{(j)} \sim \mathcal{N}(0, Q^{(j)})$ and $v_k^{(j)} \sim \mathcal{N}(0, R^{(j)})$, and with the mode transition matrix $\Pi = [\pi_{ij}]$, where $\pi_{ij}$ is the probability of switching from mode $i$ to mode $j$. Bayesian filtering proceeds by (1) mixing: blending previous model-conditioned states using the Markov chain, (2) model-conditioned prediction and update, (3) mode probability update via measurement likelihoods, and (4) output fusion. With $\mu_k^{(j)}$ denoting the posterior probability of mode $j$ at step $k$, the general fusion equations are

$x_k = \sum_{j=1}^M \mu_k^{(j)} x_{k|k}^{(j)},\qquad P_k = \sum_{j=1}^M \mu_k^{(j)} \left[ P_{k|k}^{(j)} + (x_{k|k}^{(j)} - x_k)(x_{k|k}^{(j)} - x_k)^\top \right]$
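The four steps above can be sketched in NumPy as follows. This is a minimal illustration for linear-Gaussian modes, not a reference implementation: the function names `imm_step` and `gaussian_likelihood`, the dictionary layout of `models`, and the two-mode example values are assumptions made for this sketch, and the noise input matrix $G^{(j)}$ is assumed to be folded into $Q^{(j)}$.

```python
import numpy as np

def gaussian_likelihood(residual, S):
    """Density of a zero-mean Gaussian with covariance S evaluated at `residual`."""
    d = residual.shape[0]
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(S))
    return float(np.exp(-0.5 * residual @ np.linalg.solve(S, residual)) / norm)

def imm_step(xs, Ps, mu, Pi, models, z):
    """One IMM cycle (hypothetical helper).

    xs, Ps : per-mode posterior means and covariances from the previous step
    mu     : previous mode probabilities, shape (M,)
    Pi     : Pi[i, j] = probability of switching from mode i to mode j
    models : list of dicts with keys "F", "Q", "H", "R" (G folded into Q)
    z      : current measurement
    """
    M = len(models)
    # (1) Mixing: predicted mode probabilities c[j] and weights w[i, j] = P(mode i | mode j).
    c = Pi.T @ mu
    w = (Pi * mu[:, None]) / c[None, :]
    xs_new, Ps_new, lik = [], [], np.zeros(M)
    for j, m in enumerate(models):
        x0 = sum(w[i, j] * xs[i] for i in range(M))
        P0 = sum(w[i, j] * (Ps[i] + np.outer(xs[i] - x0, xs[i] - x0)) for i in range(M))
        # (2) Model-conditioned Kalman prediction and update.
        x_pred = m["F"] @ x0
        P_pred = m["F"] @ P0 @ m["F"].T + m["Q"]
        r = z - m["H"] @ x_pred
        S = m["H"] @ P_pred @ m["H"].T + m["R"]
        K = P_pred @ m["H"].T @ np.linalg.inv(S)
        xs_new.append(x_pred + K @ r)
        Ps_new.append((np.eye(len(x0)) - K @ m["H"]) @ P_pred)
        lik[j] = gaussian_likelihood(r, S)
    # (3) Mode probability update from the innovation likelihoods.
    mu_new = c * lik
    mu_new /= mu_new.sum()
    # (4) Output fusion: moment-matched mean and covariance of the mixture.
    x = sum(mu_new[j] * xs_new[j] for j in range(M))
    P = sum(mu_new[j] * (Ps_new[j] + np.outer(xs_new[j] - x, xs_new[j] - x)) for j in range(M))
    return xs_new, Ps_new, mu_new, x, P

# Illustrative two-mode setup: same constant-velocity dynamics, low vs. high process noise.
F = np.array([[1.0, 1.0], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
R = np.array([[0.25]])
models = [
    {"F": F, "Q": 0.01 * np.eye(2), "H": H, "R": R},
    {"F": F, "Q": 1.00 * np.eye(2), "H": H, "R": R},
]
Pi = np.array([[0.95, 0.05], [0.05, 0.95]])
xs = [np.zeros(2), np.zeros(2)]
Ps = [np.eye(2), np.eye(2)]
mu = np.array([0.5, 0.5])
xs, Ps, mu, x_fused, P_fused = imm_step(xs, Ps, mu, Pi, models, np.array([1.2]))
```

The spread-of-means term `np.outer(...)` in step (4) is what keeps the fused covariance honest when the mode-conditioned estimates disagree; dropping it would understate uncertainty during mode transitions.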

Extensions cover nonlinear models (IMM-EKF, IMM-UKF), multiple sensors, model-adaptive transition matrices, and higher-order Markov processes (Dingler, 2022).

2. Multi-Modal IMM in Tracking and SLAM

Multi-modal IMMs are extensively used in multi-object tracking (MOT), vision-based SLAM, and joint state estimation tasks requiring both model and data-modality fusion:

Visual SLAMMOT with IMM-EKF: The IMM is embedded as a bank of EKFs per object, each corresponding to a distinct motion model (e.g., constant position, constant velocity, constant turn-rate and velocity). Model probabilities are recomputed at every step based on measurement likelihoods, and their outputs are fused in a weighted factor graph. All variables (ego pose, map points, object states) are jointly optimized, with residuals weighted by IMM model probabilities. This approach yields improved pose and tracking precision, particularly during abrupt motion-pattern transitions (Tian et al., 2024).
IMM-JHSE Joint Homography and Multi-Modal Data Association: This class of algorithms fuses dynamic/static camera homography models (applied as an IMM over homography subspaces) with parallel image-plane (bounding box) predictors. IMM-mixed state vectors are 12-dimensional, jointly updating object ground-plane position, homography, and velocity. The IMM architecture extends beyond simple state fusion: in data association, matching scores are computed by mixing Mahalanobis distances (from the ground-plane branch) and buffered IoU scores (from the image-plane branch) via IMM-style weights, allowing the system to switch its primary association metric dynamically based on which modality is more reliable for the current scene (Claasen et al., 2024).
Decoupled Multi-Hierarchy Kalman IMM (DIMM): Here, “multi-modal” refers both to model and spatial anisotropy. DIMM splits the 3D tracking state into independent axes ($x$, $y$, $z$), applies a bank of multiple-order KFs (constant velocity, CV; constant acceleration, CA; constant jerk, CJ) per axis, and then fuses their outputs across a high-dimensional “hypercube” of weights (three simplices, one per axis). Fusion weights are determined by a learned attention-based RL network rather than by mode likelihoods, allowing axis-specific, data-driven mixing and significant improvement in challenging tracking environments. The expanded solution space, from a simplex in classical IMM to a hypercube in DIMM, enables better representation of complex object maneuvers with axis-dependent motion (Zha et al., 18 May 2025).
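The difference between a single weight simplex and one simplex per axis can be made concrete with a small sketch. It is illustrative only: `fuse_per_axis` and the `logits` array are hypothetical, with a plain softmax standing in for the learned attention-based RL network described above, and the numbers are made up.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_per_axis(estimates, logits):
    """Fuse per-axis filter outputs with axis-specific weights.

    estimates[a, m] : position estimate for axis a (x, y, z) from filter m (CV, CA, CJ)
    logits[a, m]    : unnormalized scores; a hypothetical stand-in for the
                      output of a learned weighting network
    """
    weights = softmax(logits, axis=1)           # each row lies on its own simplex
    fused = (weights * estimates).sum(axis=1)   # one fused value per axis
    return fused, weights

# Made-up example: columns are CV, CA, CJ estimates for each axis.
estimates = np.array([[10.0, 10.4, 10.9],
                      [ 5.0,  5.1,  5.3],
                      [ 1.0,  1.0,  1.2]])
logits = np.array([[2.0, 0.0, 0.0],    # x axis: favor CV
                   [0.0, 0.0, 0.0],    # y axis: equal weights
                   [0.0, 0.0, 3.0]])   # z axis: favor CJ
fused, weights = fuse_per_axis(estimates, logits)
```

A classical IMM would apply one weight vector of length 3 to all axes at once; here each axis gets its own, so a target can be treated as near-constant-velocity in $x$ while maneuvering in $z$.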

3. Multi-Modal IMM for Irregular Time Series Forecasting

The Time-IMM and IMM-TSF frameworks generalize IMM-style fusion to the domain of irregular, multimodal time series forecasting:

Data Taxonomy and Irregularity: Time-IMM comprises nine real-world datasets, each capturing a unique cause of time series irregularity (trigger-based, constraint-based, artifact-based) across both numerical and textual channels. The raw streams are asynchronous and have highly variable sampling rates and missingness; no imposed alignment exists between modalities.
Architecture: IMM-TSF employs modular fusion similar in spirit to IMM: (a) a backbone model for numeric streams, (b) a frozen LLM-based text encoder, (c) timestamp-to-text fusion (TTF), either recency-weighted or cross-attention, which aligns text features to numeric timestamp queries, and (d) multimodality fusion (MMF), which fuses the numeric and aligned text predictions. All fusion operations maintain asynchrony and recency-awareness; no aligned regular grid is forced.
Performance: Across the nine datasets and twelve baselines, explicitly modeling multimodality via IMM-TSF results in mean squared error reductions averaging 6.71% and up to 38.38% for text-informative datasets. Empirically, learnable gating (GRU-gated residual addition) outperforms attention-based fusion, and the specific backbone text encoder choice is not critical under significant irregularity—temporal alignment governs utility more than semantic model capacity (Chang et al., 12 Jun 2025).
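The recency-weighted variant of timestamp-to-text fusion can be illustrated with a short sketch. The exponential decay kernel, the time constant `tau`, and the function name `recency_weighted_text` are assumptions chosen for illustration; the source does not specify the weighting function, so this shows the general idea of aligning asynchronous text to numeric query times without resampling onto a grid.

```python
import numpy as np

def recency_weighted_text(query_times, text_times, text_emb, tau=1.0):
    """Align text embeddings to numeric timestamps (hypothetical helper).

    For each query time, average the embeddings of text observed at or before
    it, weighted by exp(-(t_query - t_text) / tau). Queries with no earlier
    text receive a zero vector.
    """
    query_times = np.asarray(query_times, dtype=float)
    text_times = np.asarray(text_times, dtype=float)
    age = query_times[:, None] - text_times[None, :]          # shape (Q, T)
    visible = age >= 0.0                                      # no look-ahead into future text
    w = np.where(visible, np.exp(-np.maximum(age, 0.0) / tau), 0.0)
    total = w.sum(axis=1, keepdims=True)
    w = np.divide(w, total, out=np.zeros_like(w), where=total > 0)
    return w @ text_emb                                       # shape (Q, D)

# Two text observations with 2-dimensional embeddings, four irregular query times.
text_times = [0.0, 2.0]
text_emb = np.array([[1.0, 0.0],
                     [0.0, 1.0]])
query_times = [-1.0, 0.5, 2.0, 5.0]
aligned = recency_weighted_text(query_times, text_times, text_emb, tau=1.0)
```

In this example the first query precedes all text and gets a zero vector, the second sees only the first text item, and the later queries lean toward the more recent item.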

4. IMM with Sensor-Specific Modalities: Cooperative Localization

In cooperative localization with UWB ranging, sensor-specific measurement models (e.g., line-of-sight vs. non-line-of-sight) are handled via a multi-modal IMM:

Structure: Each agent models its state evolution with a base INS process. When a UWB range measurement is available, two measurement models are considered: unbiased for LoS, biased for NLoS. The prior probability of each mode (LoS/NLoS) is given by a discriminator. Each branch (LoS/NLoS) updates in parallel, and model weights are formed by the product of discriminator prior and measurement likelihood.
Mixing: Here, mixing is simplified: the discriminator outputs are memoryless (no explicit Markov transition), so mode priors replace transition-matrix-based mixing.
Fusion: IMM-weighted sums of state and covariance are computed after each parallel update. The approach is robust to NLoS misclassifications, with the IMM automatically down-weighting the NLoS correction when the innovation likelihood is low.
Extendability: Additional modes (e.g., multi-path, specific obstructions) can be incorporated by adding parallel filters and expanding the mixing/fusion steps accordingly (Zhu et al., 2020).
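The structure, mixing, and fusion steps above can be sketched for a single linearized range measurement. This is a simplified illustration, not the cited method's implementation: the helper names, the scalar NLoS bias, the noise values, and the linear measurement matrix `H` are all assumptions, and a real UWB range model would be nonlinear in position.

```python
import numpy as np

def kalman_update(x, P, z, H, R, bias=0.0):
    """Kalman update for z = H x + bias + noise; also returns the innovation likelihood."""
    r = z - (H @ x + bias)
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x_new = x + K @ r
    P_new = (np.eye(len(x)) - K @ H) @ P
    lik = float(np.exp(-0.5 * r @ np.linalg.solve(S, r))
                / np.sqrt(np.linalg.det(2.0 * np.pi * S)))
    return x_new, P_new, lik

def los_nlos_update(x, P, z, H, p_los, R_los, R_nlos, nlos_bias):
    """Two-branch update with memoryless mode priors (hypothetical helper).

    p_los is the discriminator's LoS probability; it replaces the
    transition-matrix mixing step of a classical IMM.
    """
    branches = [
        kalman_update(x, P, z, H, R_los),                   # LoS: unbiased model
        kalman_update(x, P, z, H, R_nlos, bias=nlos_bias),  # NLoS: biased model
    ]
    prior = np.array([p_los, 1.0 - p_los])
    w = prior * np.array([b[2] for b in branches])          # prior times likelihood
    w /= w.sum()
    x_f = sum(wi * b[0] for wi, b in zip(w, branches))
    P_f = sum(wi * (b[1] + np.outer(b[0] - x_f, b[0] - x_f))
              for wi, b in zip(w, branches))
    return x_f, P_f, w

# Made-up example: state is [position, velocity] along a line, range observes position.
x = np.array([10.0, 0.0])
P = 0.04 * np.eye(2)
H = np.array([[1.0, 0.0]])
z = np.array([10.1])
x_f, P_f, w = los_nlos_update(x, P, z, H, p_los=0.5,
                              R_los=np.array([[0.01]]),
                              R_nlos=np.array([[0.09]]),
                              nlos_bias=0.5)
```

With an uninformative discriminator (`p_los=0.5`) and a measurement close to the predicted range, the LoS branch receives the larger weight, which is the down-weighting behavior described in the fusion step.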

5. Deep Learning and Surrogate IMM Structures

Neural architectures may serve as multi-modal IMM surrogates:

RNN-based IMM Surrogates: These models replace per-mode Kalman filtering and mode probability evolution with a recurrent encoder-decoder that directly emits a multi-modal trajectory distribution, conditioned on inferred mode probabilities. The RNN is trained to jointly filter the current (noisy) state, infer maneuver (mode) probabilities, and output mixture distributions for future trajectories, matching the Bayesian mixture structure of classical IMM but learned end-to-end. The architecture yields sharply improved short-term prediction accuracy in maneuver-critical regimes (Becker et al., 2019).

6. Algorithmic Innovations and Comparative Performance

Recent works have proposed innovations to the multi-modal IMM framework:

Extension/Variant	Key Mechanism	Performance/Impact
Visual SLAMMOT IMM	Model-weighted factor graph optimization, per-object IMM-EKF bank	Lower Absolute Pose Error and MOTP, robust to motion transitions
IMM-JHSE	Joint homography+image-plane filter, IMM-mixed association costs	Outperforms multi-object tracking baselines on DanceTrack and KITTI
DIMM	3D axis-decoupled IMM, RL-based adaptive fusion	31.61–99.23% improvement in position MSE over best baselines
IMM-TSF	Multimodal fusion for TSF via recency-aware and cross-attention modules	MSE reductions averaging 6.71% and up to 38.38% on irregular multimodal datasets
RNN-IMM Surrogate	RNN encoder/decoder, learned mode probabilities	Final displacement error (FDE) improvements over hand-crafted IMM, especially for abrupt maneuvers

These innovations expand the representational and adaptational capacity of the standard IMM, enabling robust estimation in high-dimensional, multimodal, and noisy regimes.

7. Challenges and Open Issues

Critical issues for research and application in multi-modal IMM include:

Combinatorial Explosion: As the number of sensor/feature modalities and internal models grows, the number of parallel filters and parameters can increase rapidly.
Adaptive Weighting: Replacing likelihood-based mixing with data-driven (RL, attention) fusers can improve accuracy under complex, non-Gaussian or intermittent data, but raises questions of interpretability and stability.
Measurement Synchrony: Directly accommodating asynchrony (in time or modality space) is non-trivial; hybrid fusion modules (IMM-TSF) or association heuristics (IMM-JHSE) are necessary.
Real-time Performance: Maintaining computational tractability is essential; approaches such as memoryless mode priors or decoupling spatial axes can help.
Generalization to Arbitrary Modality Sets: Modular architectures (such as those in IMM-TSF and DIMM) that allow pluggable sensors, models, or data types are especially valuable for deployment but require careful design of mixing/fusion logic per new domain.

Multi-modal IMM research is highly active, with continuing advances in statistical filtering, multimodal fusion, and neural surrogate models across applications ranging from autonomous navigation to real-world forecasting and cooperative multi-agent tracking (Dingler, 2022, Zha et al., 18 May 2025, Chang et al., 12 Jun 2025, Tian et al., 2024, Claasen et al., 2024, Zhu et al., 2020, Becker et al., 2019).