Multi-Object Tracking Overview

Updated 11 May 2026

Multi-object tracking is a process that estimates object locations and unique identities in video sequences using detection, data association, and feature fusion techniques.
It employs key paradigms like tracking-by-detection and joint detection-association, utilizing methods such as Kalman filters, transformer models, and probabilistic assignment.
Recent advances incorporate spatiotemporal modeling and multi-modal fusion (e.g., LiDAR-camera integration), enhancing robustness in diverse real-world scenarios.

Multi-object tracking (MOT) is the computational task of estimating both the localization—typically as bounding boxes or pixel-wise masks—and the unique identities of multiple objects as they move through a sequence of video frames. This process is essential for a wide range of real-time and offline applications in intelligent video analytics, automotive perception, robotics, surveillance, biological imaging, and more. The recent literature on arXiv demonstrates a rapidly evolving landscape, with advances in tracking paradigms (tracking-by-detection, joint detection-and-association, tracking-by-re-detection), data association strategies, spatiotemporal modeling, feature learning, and application domains ranging from urban driving to aerial and low-light surveillance.

1. Paradigms and Core Methodologies

The field is broadly structured along two key paradigms:

Tracking-By-Detection (TBD): This remains the dominant approach, wherein objects are first localized per-frame by an external detector and then associated temporally through frame-to-frame data association. Notable high-accuracy, high-speed instances include ByteTrack (MOTA 80.3, IDF1 77.3, HOTA 63.1 on MOT17 at 30 FPS, V100) (Luo et al., 2022). TBD approaches typically employ Kalman filters or similar motion models for prediction and solve framewise assignment using metrics like IoU, Mahalanobis distance, or learned deep feature similarities.
Joint Detection and Association: These methods, including Transformer-based architectures and diffusion models, treat detection and association as a unified, often end-to-end trainable task. Examples include query-based association transformers (Cao et al., 2024), diffusion-based denoising frameworks (Luo et al., 2023), and temporally conditioned convolutions or attention modules.
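
The prediction and overlap steps of a tracking-by-detection loop can be illustrated with a minimal sketch. It assumes a constant-velocity Kalman filter applied independently to one box coordinate, with isotropic process noise `q` and measurement noise `r`; the class and function names (`Track1D`, `predict`, `update`, `iou`) are hypothetical and not taken from any cited tracker.

```python
from dataclasses import dataclass


@dataclass
class Track1D:
    """Constant-velocity Kalman state for a single box coordinate."""
    pos: float
    vel: float = 0.0
    # Entries of the symmetric 2x2 covariance [[p_pp, p_pv], [p_pv, p_vv]].
    p_pp: float = 10.0
    p_pv: float = 0.0
    p_vv: float = 10.0


def predict(t: Track1D, dt: float = 1.0, q: float = 0.01) -> None:
    """Propagate one frame: x <- F x, P <- F P F^T + q*I, with F = [[1, dt], [0, 1]]."""
    t.pos += dt * t.vel
    p_pp = t.p_pp + 2.0 * dt * t.p_pv + dt * dt * t.p_vv
    p_pv = t.p_pv + dt * t.p_vv
    t.p_pp, t.p_pv, t.p_vv = p_pp + q, p_pv, t.p_vv + q


def update(t: Track1D, z: float, r: float = 1.0) -> None:
    """Correct the state with a position measurement z (H = [1, 0])."""
    s = t.p_pp + r                          # innovation variance
    k_pos, k_vel = t.p_pp / s, t.p_pv / s   # Kalman gain
    resid = z - t.pos
    t.pos += k_pos * resid
    t.vel += k_vel * resid
    # P <- (I - K H) P, written out entry by entry.
    p_pp = (1.0 - k_pos) * t.p_pp
    p_pv = (1.0 - k_pos) * t.p_pv
    p_vv = t.p_vv - k_vel * t.p_pv
    t.p_pp, t.p_pv, t.p_vv = p_pp, p_pv, p_vv


def iou(a, b) -> float:
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0
```

In a full tracker, one such filter (or a joint multi-dimensional one) would predict each track's box, and `iou` between predicted boxes and new detections would populate the assignment cost matrix.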

Several enhanced TBD pipelines have emerged:

Motion-Aware Architectures: MOT pipelines such as MAT explicitly decouple and fuse rigid camera motion (via ECC-based affine registration) and nonrigid object motion (Kalman filtering), improving performance under camera ego-motion, fast-object motion, and occlusion. MAT integrates a dynamic reconnection window and a 3D integral-image structure for association pruning, yielding reductions in false negatives and identity switches (Han et al., 2020).
Deterministic-Stochastic Hybrid Models: GenTrack augments deterministic data association with stochastic particle swarm optimization to handle non-linear, non-Gaussian target motion and dynamically manage ID consistency, particularly in the presence of dense occlusion and variable target counts (Nguyen et al., 28 Oct 2025).
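
The idea of compensating rigid camera motion before association can be sketched as follows. The sketch assumes a 2x3 affine matrix `A` has already been estimated between consecutive frames (for example by an ECC-style registration, which is not shown), and simply warps a track's predicted box into the current frame. The function name `warp_box` is hypothetical, and this is an illustration rather than MAT's implementation.

```python
def warp_box(box, A):
    """Map an (x1, y1, x2, y2) box through a 2x3 affine matrix A.

    Returns the axis-aligned bounding box of the four warped corners,
    so the result stays a valid box even under rotation or shear.
    """
    x1, y1, x2, y2 = box
    corners = [(x1, y1), (x2, y1), (x2, y2), (x1, y2)]
    xs = [A[0][0] * x + A[0][1] * y + A[0][2] for x, y in corners]
    ys = [A[1][0] * x + A[1][1] * y + A[1][2] for x, y in corners]
    return (min(xs), min(ys), max(xs), max(ys))


# Example: the camera panned so that image content shifted by (+5, -3) pixels.
pan = [[1.0, 0.0, 5.0],
       [0.0, 1.0, -3.0]]
compensated = warp_box((0.0, 0.0, 10.0, 10.0), pan)
```

Applying the warp to predicted track boxes before computing overlaps keeps the object motion model responsible only for the nonrigid, per-object component.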

2. Data Association and Track Management

The association of per-frame detections to evolving track hypotheses is a central challenge, with solutions varying from classic combinatorial optimization to fully learned schemes:

Cost Matrix Construction: Approaches fuse geometric (IoU, GIoU, Mahalanobis), motion, and multi-modal appearance affinities. Learned fusion networks combine 2D/3D visual embeddings and spatiotemporal cues (Chiu et al., 2020). UTrack demonstrates that representing detection uncertainty as per-coordinate covariance, extracted from the suppressed box ensemble at NMS time, and propagating it through the association pipeline, significantly increases robustness to detector noise (Solano-Carrillo et al., 2024).
Assignment Algorithms: Hungarian and greedy assignment dominate, with adaptive gating to control the hypothesis space, e.g., velocity-dependent thresholds for robust tracking in rugged or high-dynamic environments (Huang et al., 2023). Probabilistic association, using log-likelihood distances that incorporate detection and track covariances, yields improved match quality in noisy domains such as imaging radar (Palmer et al., 2024).
Long-range and Occlusion Handling: Bi-directional forward/backward motion matching, stranded (lost) track buffers, and cyclic pseudo-observation interpolation allow recovery from both short occlusions and extended target disappearances (Luo et al., 2023; Han et al., 2020). Re-activation modules leveraging re-identification embeddings, memory banks, or deep long-term feature aggregation (multi-shot feature learning) can relink interrupted tracklets and suppress false ID switches (Li et al., 2023; Shuai et al., 2020).
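
A gated, covariance-aware association step of the kind described above might look like the following sketch. It assumes 2D positions with diagonal covariances, uses a Gaussian negative log-likelihood as the cost, and resolves matches greedily. The names `nll_distance` and `greedy_assign` and the gate value are illustrative assumptions, not the formulation of any cited paper.

```python
import math


def nll_distance(track_xy, track_var, det_xy, det_var):
    """Negative log-likelihood of a detection under a Gaussian centred on the track.

    Track and detection covariances are assumed diagonal and are summed
    per axis to form the innovation variance.
    """
    total = 0.0
    for t, tv, z, zv in zip(track_xy, track_var, det_xy, det_var):
        s = tv + zv
        total += (z - t) ** 2 / s + math.log(2.0 * math.pi * s)
    return 0.5 * total


def greedy_assign(cost, gate):
    """One-to-one matching: repeatedly accept the cheapest remaining pair below the gate."""
    candidates = sorted(
        (c, i, j)
        for i, row in enumerate(cost)
        for j, c in enumerate(row)
        if c <= gate
    )
    used_tracks, used_dets, matches = set(), set(), []
    for _, i, j in candidates:
        if i not in used_tracks and j not in used_dets:
            matches.append((i, j))
            used_tracks.add(i)
            used_dets.add(j)
    return matches


tracks = [(0.0, 0.0), (10.0, 10.0)]
dets = [(10.5, 10.0), (0.2, 0.1), (50.0, 50.0)]
unit_var = (1.0, 1.0)
cost = [[nll_distance(t, unit_var, d, unit_var) for d in dets] for t in tracks]
matches = greedy_assign(cost, gate=10.0)  # (track index, detection index) pairs
```

Greedy matching is fast but not globally optimal; an optimal assignment over the same gated cost matrix can be obtained with a Hungarian-style solver such as `scipy.optimize.linear_sum_assignment`. Detections left unmatched (the third one here) would typically seed new tracks.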

3. Representation Learning and Feature Fusion

Modern MOT systems rely on learned representations at multiple levels:

Appearance Models: Hierarchical feature learning (compositional/semantic/contextual) (Cao et al., 2024) and transformer/self-attention-based pixel/region encodings (Li et al., 2023) supplement or replace conventional re-identification embeddings. VisualTracker, for example, ensembles single-shot (adjacent frame) and multi-shot (tracklet) feature learning to robustly associate under occlusion and distractors (Li et al., 2023).
Motion Models: Interaction-aware modules leverage attention or graph convolutions to learn short- and long-term motion patterns and explicitly model multi-target influences, addressing nonlinear and group dynamics (Qin et al., 2023; Nguyen et al., 28 Oct 2025). Deep extended Kalman filters with latent-space LSTM priors and visual-attention measurement updates allow the integration of nonlinear kinematics and appearance cues for robust tracking in challenging aerial scenarios (Xie et al., 2021).
Modality Fusion: Probabilistic, multi-modal trackers learn to fuse 2D/3D geometric and visual features (e.g., Mask R-CNN + voxelized LiDAR features) and learn adaptive metric combinations for association (Chiu et al., 2020).
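
As a rough illustration of fusing appearance and geometric cues, the sketch below combines a cosine similarity between embeddings with an IoU score using a fixed weight, and maintains a per-track embedding with an exponential moving average. The fixed weight `w_app` and momentum are assumptions for illustration; the learned approaches cited above would predict such combinations rather than hard-code them.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def fused_affinity(iou_score, track_emb, det_emb, w_app=0.5):
    """Weighted sum of IoU and appearance similarity, both mapped to [0, 1]."""
    appearance = 0.5 * (cosine_similarity(track_emb, det_emb) + 1.0)
    return (1.0 - w_app) * iou_score + w_app * appearance


def ema_update(track_emb, det_emb, momentum=0.9):
    """Blend the matched detection's embedding into the track's running embedding."""
    return [momentum * t + (1.0 - momentum) * d for t, d in zip(track_emb, det_emb)]
```

A higher `w_app` favours appearance when boxes overlap poorly (for example after occlusion), while a lower value trusts geometry when embeddings are unreliable.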

4. Specialized Domains: 3D, Adverse Conditions, and New Modalities

MOT has been extended to and challenged by numerous practical settings and input modalities:

3D and Multi-Modal Tracking: Systems such as those integrating LiDAR-camera fusion, dynamic SLAM integration, and memory-augmented neural trajectory predictors track objects robustly in 3D space, even on rugged terrain, supporting both object-based and map-based downstream tasks (Huang et al., 2023). Imaging radar tracking requires not only robust association under extreme point sparsity/noise but also advanced probabilistic gating for consistency (Palmer et al., 2024).
Low-Light and Adverse Weather: Low-light MOT (LTrack) addresses the sensor-noise dominated regime with a dual-camera capture/annotation protocol (LMOT dataset) and specialized image-to-feature modules that adaptively low-pass filter and enforce feature invariance under noise, providing consistent improvements in both detection and tracking association metrics (Wang et al., 2024).
Multi-Object Tracking and Segmentation (MOTS): Extending MOT to pixel-level masks strengthens association cues and segmentation quality. MOTS metrics (sMOTSA, MOTSA, MOTSP) are defined on segmented tracklets, and unified architectures such as TrackR-CNN jointly learn detection, tracking, and segmentation, often with temporal 3D-conv layers and association heads that enforce embedding consistency across frames (Voigtlaender et al., 2019).
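
The MOTS metrics mentioned above can be computed from per-sequence counts, as in the following sketch. It assumes the commonly stated definitions: MOTSA = (|TP| - |FP| - |IDS|) / |M|, sMOTSA replaces |TP| with the sum of mask IoUs over true positives, and MOTSP is the mean mask IoU of true positives, where |M| is the number of ground-truth masks. The function name and argument layout are hypothetical.

```python
def mots_metrics(tp_ious, num_fp, num_ids, num_gt):
    """Compute MOTSA, sMOTSA, and MOTSP from matched-mask IoUs and error counts.

    tp_ious: mask IoU of each true-positive match (one entry per TP)
    num_fp:  number of false-positive masks
    num_ids: number of identity switches
    num_gt:  number of ground-truth masks (must be > 0)
    """
    num_tp = len(tp_ious)
    soft_tp = sum(tp_ious)
    return {
        "MOTSA": (num_tp - num_fp - num_ids) / num_gt,
        "sMOTSA": (soft_tp - num_fp - num_ids) / num_gt,
        "MOTSP": soft_tp / num_tp if num_tp else 0.0,
    }


scores = mots_metrics(tp_ious=[0.9, 0.8, 0.7], num_fp=1, num_ids=0, num_gt=4)
```

Because sMOTSA weights each true positive by its mask IoU, it penalises coarse segmentations that MOTSA would count as fully correct.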

5. Efficiency, Real-Time Constraints, and System Integration

Designing MOT systems with constrained latency is critical for real-world deployment:

Detection Efficiency: Systems leveraging single-shot detectors (SSD, YOLOX) in tracking-by-re-detection configurations keep per-frame detection cost roughly constant, i.e., O(1) with respect to the number of tracked objects. Detector ensembles, scheduled at different frequencies with smart box-level fusion, enable trade-offs between speed and accuracy (Li et al., 2020; Cobos et al., 2019).
Association Acceleration: Specialized data structures (e.g., 3D integral images) and search region pruning limit candidate associations (Han et al., 2020). Query-based or attention-based association can match large hypothesis sets rapidly by contrasting multi-stage features (Cao et al., 2024; Li et al., 2023).
Unified Models: Architectures integrating detection, association, motion, and re-identification into a single backbone (e.g., Siamese Track-RCNN with multi-task heads) reduce FLOPs and inference time, while preserving or exceeding state-of-the-art accuracy on benchmark datasets (Shuai et al., 2020).
Flexible Inference: DiffusionTrack introduces a denoising-diffusion model that can dynamically adjust speed/accuracy trade-offs at test-time by varying the number of inference steps and proposal samples. This framework exhibits strong robustness to detection perturbations and can scale its computational budget per frame as necessary (Luo et al., 2023).

6. Performance Benchmarks and Comparative Analysis

Quantitative evaluation on standard datasets (MOT17, MOT20, DanceTrack, KITTI, NuScenes) is reported using MOTA and IDF1 alongside more association-sensitive metrics such as HOTA:

Method	HOTA ↑	MOTA ↑	IDF1 ↑	FPS	Benchmark
ByteTrack (Luo et al., 2022)	63.1	80.3	77.3	30	MOT17
MotionTrack (Qin et al., 2023)	65.1	81.1	80.1	20-30	MOT17
MAT (Han et al., 2020)	63.1	69.5	63.1	9	MOT17
GenTrack PSO-Social (Nguyen et al., 28 Oct 2025)	68.6	85.1	93.6	>10	MOT17-04/MooTrack360
LTrack (Wang et al., 2024)	29.4	-	35.2	27-32	LMOT-dual, LMOT-real
CSC-Tracker (Cao et al., 2024)	60.8	75.4	75.7	21.3	MOT17
VisualTracker (Li et al., 2023)	64.5	80.6	79.6	8-12	MOT17
DiffusionTrack (Luo et al., 2023)	60.8	77.9	73.8	10-21	MOT17
UTrack (Solano-Carrillo et al., 2024)	55.8	89.7	56.4	27	DanceTrack
Probabilistic 3D MMOT (Chiu et al., 2020)	68.7 (AMOTA)	93.9	-	-	NuScenes/KITTI

These results highlight continual advances in both state estimation and ID assignment accuracy across increasingly challenging scenarios. Systems differ by their methodology, level of supervision, and real-time or near-real-time feasibility. There is a substantive trend toward hybrid models capable of leveraging uncertainty, deep graph/message passing, and hierarchical visual semantics.
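
For reference, the headline metrics in the table can be computed from aggregate counts as sketched below. The sketch assumes the standard definitions MOTA = 1 - (FN + FP + IDSW) / GT and IDF1 = 2·IDTP / (2·IDTP + IDFP + IDFN). For HOTA it shows only the single-threshold form, the geometric mean of detection and association accuracy; the full metric averages this over localization thresholds, which is not reproduced here. Function names are hypothetical.

```python
import math


def mota(num_fn, num_fp, num_idsw, num_gt):
    """Multi-object tracking accuracy from error counts and ground-truth box count."""
    return 1.0 - (num_fn + num_fp + num_idsw) / num_gt


def idf1(idtp, idfp, idfn):
    """Identity F1 from identity-level true positives, false positives, false negatives."""
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom else 0.0


def hota_at_threshold(det_a, ass_a):
    """HOTA at one localization threshold: geometric mean of DetA and AssA."""
    return math.sqrt(det_a * ass_a)
```

MOTA is dominated by detection errors, while IDF1 and the association term of HOTA are more sensitive to identity preservation, which is why methods can rank differently across the three columns.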

7. Open Problems and Future Directions

The literature emphasizes several persistent research directions:

Robustness to Uncertainty and Partial Observability: Explicit uncertainty propagation from detection through association, as realized in UTrack, and probabilistic, memory-augmented trackers, e.g., in 3D/SLAM settings, remain active lines of work (Solano-Carrillo et al., 2024; Huang et al., 2023).
Global Reasoning and Multi-frame Context: Graph-based models with rolling or retained memory windows are increasingly effective for long-term reasoning, error correction, and occlusion recovery (Rangesh et al., 2021). Joint detection-association diffusion and attention-based architectures allow holistic, spatiotemporal consistency (Luo et al., 2023; Cao et al., 2024).
Domain Adaptation and New Modalities: There is strong demand for continued extension into domains with adverse conditions (low light, weather), new sensor modalities (event, radar, multi-spectral), and new output representations (segmentations, trajectories, dynamic maps) (Wang et al., 2024; Palmer et al., 2024; Voigtlaender et al., 2019).
Scaling and Ontology-Awareness: Future work must address category-agnostic and open-world tracking, scaling memory/compute requirements for massive or long video streams, and hierarchical scene understanding.
Benchmarking and Standardization: Continued development of specialized tracking datasets (e.g., LMOT for low-light, MOTS for segmentation) and adoption of advanced association-centric metrics (HOTA, sMOTSA, AMOTA) provide a more granular view of performance and failure modes (Wang et al., 2024; Voigtlaender et al., 2019; Chiu et al., 2020).