AR-MOT: Autoregressive Multi-Object Tracking
- AR-MOT is an autoregressive multi-object tracking framework that models tracking as a token-based sequence generation task, enabling flexible outputs.
- It integrates object tokenization, region-aware alignment, and temporal memory fusion to achieve competitive performance on benchmarks like MOT17 and DanceTrack.
- The method extends naturally to multimodal scenarios, including auditory and aerial tracking, thereby supporting instruction-driven and real-time applications.
AR-MOT (Autoregressive Multi-Object Tracking) refers to a class of methods and frameworks for multi-object tracking (MOT) that leverage autoregressive, sequence-based modeling paradigms—primarily inspired by large language models (LLMs)—to produce structured tracking outputs. Distinct from rigid, task-specific pipelines, AR-MOT formulates MOT as a sequence generation task in which the model predicts object identities and track associations autoregressively, with significant implications for extensibility, multi-modality, and instruction-driven tracking (Jia et al., 5 Jan 2026, Lin et al., 2024, Chen et al., 26 Nov 2025).
1. Autoregressive Paradigm in Multi-Object Tracking
AR-MOT reconceptualizes classic MOT as an autoregressive token prediction problem. Each video frame $I_t$ yields detections $\{b_t^j\}_{j=1}^{N_t}$, whose track IDs $\{y_t^j\}$ are generated in sequence, conditioned on a contextual history. The probability of the predicted IDs is factorized as

$$P(y_t^1, \dots, y_t^{N_t} \mid C_t) = \prod_{j=1}^{N_t} P\big(y_t^j \mid C_t,\, y_t^{<j}\big),$$

where the context $C_t$ concatenates recent frame token histories, current image tokens, object tokens, and preceding ID tokens. This approach, analogous to next-token prediction in LLMs, dispenses with task-specific output heads and instead generates flexible, structured outputs by sequence modeling (Jia et al., 5 Jan 2026).
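As a concrete illustration of this factorization, the minimal sketch below decodes one ID token per object and accumulates the joint log-probability as a sum of per-step conditionals. The `llm` and `id_embed` interfaces are hypothetical stand-ins, not the authors' implementation.

```python
import torch

def decode_track_ids(llm, id_embed, context_tokens, object_tokens):
    """Greedy autoregressive ID decoding: one ID token per object token.

    Hypothetical interfaces: llm(seq) returns next-token logits over the ID
    vocabulary; id_embed(track_id) returns that ID token's embedding.
    """
    ids, joint_logprob = [], 0.0
    seq = context_tokens                               # C_t: history + image + object tokens
    for obj_tok in object_tokens:                      # one decoding step per detection
        seq = torch.cat([seq, obj_tok[None, :]], dim=0)
        logprobs = torch.log_softmax(llm(seq), dim=-1)
        next_id = int(torch.argmax(logprobs))          # greedy track-ID choice
        joint_logprob += float(logprobs[next_id])      # product of conditionals, in log space
        ids.append(next_id)
        # Condition later objects on the IDs already emitted.
        seq = torch.cat([seq, id_embed(next_id)[None, :]], dim=0)
    return ids, joint_logprob
```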
This paradigm facilitates the integration of complex and diverse tracking formulations, such as referring/instruction-driven tracking and multi-modal AR-MOT.
2. Core Methodological Components
2.1 Object Tokenization and Representation
The Object Tokenizer maps detector backbone outputs (e.g., Deformable-DETR with a ResNet-50 backbone) into the model's language space. Each detection query $q_j$ is projected into the language embedding space via a learned linear adapter, $e_j = W q_j + b$. Bounding boxes are discretized into coordinate tokens (as in Pix2Seq), with coordinates quantized into a fixed number of bins, adding object geometry to the token vocabulary (Jia et al., 5 Jan 2026).
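A minimal sketch of the two tokenization steps, assuming a single linear adapter and uniform coordinate binning; the dimensions, bin count, and class name are illustrative rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ObjectTokenizer(nn.Module):
    """Project detector queries into the language space and quantize boxes
    into discrete coordinate tokens (Pix2Seq-style). Dimensions are illustrative."""

    def __init__(self, query_dim=256, lm_dim=4096, num_bins=1000):
        super().__init__()
        self.adapter = nn.Linear(query_dim, lm_dim)   # e_j = W q_j + b
        self.num_bins = num_bins

    def forward(self, queries, boxes_xyxy, img_w, img_h):
        obj_tokens = self.adapter(queries)            # [N, lm_dim]
        # Normalize box corners to [0, 1], then quantize into integer bins.
        scale = torch.tensor([img_w, img_h, img_w, img_h], dtype=boxes_xyxy.dtype)
        norm = (boxes_xyxy / scale).clamp(0.0, 1.0)
        box_tokens = (norm * (self.num_bins - 1)).round().long()   # [N, 4] token ids
        return obj_tokens, box_tokens
```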
2.2 Region-Aware Alignment (RAA)
RAA mitigates the misalignment between global image tokens and localized object tokens. For each detection, the corresponding region on the patch grid is identified, and the image tokens it covers are averaged to yield a region descriptor $r_j$. Object and region features are concatenated and linearly transformed, $\tilde{e}_j = W_a\,[e_j;\, r_j]$. Optionally, a regularization loss may be used to enforce alignment (Jia et al., 5 Jan 2026).
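The sketch below illustrates the RAA step under stated assumptions: image tokens laid out on a regular patch grid, boxes given in normalized coordinates, and a single linear fusion layer. Names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class RegionAwareAlignment(nn.Module):
    """Average the image tokens whose patches fall inside each detection box,
    then fuse the region descriptor with the object token."""

    def __init__(self, dim=4096):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)   # e_tilde_j = W_a [e_j ; r_j]

    def forward(self, image_tokens, obj_tokens, boxes_norm, grid_h, grid_w):
        # image_tokens: [grid_h * grid_w, dim]; boxes_norm: [N, 4] in [0, 1] (x1, y1, x2, y2)
        grid = image_tokens.view(grid_h, grid_w, -1)
        fused = []
        for e_j, (x1, y1, x2, y2) in zip(obj_tokens, boxes_norm):
            # Map the normalized box onto the patch grid (covering at least one patch).
            c1 = min(int(x1 * grid_w), grid_w - 1)
            r1 = min(int(y1 * grid_h), grid_h - 1)
            c2 = max(min(int(x2 * grid_w) + 1, grid_w), c1 + 1)
            r2 = max(min(int(y2 * grid_h) + 1, grid_h), r1 + 1)
            r_j = grid[r1:r2, c1:c2].reshape(-1, grid.shape[-1]).mean(dim=0)  # region descriptor
            fused.append(self.fuse(torch.cat([e_j, r_j], dim=-1)))
        return torch.stack(fused)
```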
2.3 Temporal Memory Fusion (TMF)
TMF addresses long-term association by maintaining a compressed history (memory tokens) per object across frames. The memory token $m_{t-1}^j$ from the previous timestep and the current frame's hidden state $h_t^j$ are fused via multi-head attention, $m_t^j = \mathrm{MHA}(m_{t-1}^j, h_t^j)$, where $m_{t-1}^j$ is the aggregate memory from previous frames and $h_t^j$ is the current hidden state (Jia et al., 5 Jan 2026).
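A minimal fusion sketch, assuming standard multi-head attention with the previous memory token as the query and the stacked memory/hidden states as keys and values; the paper's exact formulation may differ.

```python
import torch
import torch.nn as nn

class TemporalMemoryFusion(nn.Module):
    """Fuse each object's memory token with its current hidden state."""

    def __init__(self, dim=4096, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, memory_prev, hidden_curr):
        # memory_prev, hidden_curr: [N, dim] (one token per tracked object)
        query = memory_prev.unsqueeze(1)                      # [N, 1, dim]
        kv = torch.stack([memory_prev, hidden_curr], dim=1)   # [N, 2, dim]
        fused, _ = self.attn(query, kv, kv)                   # attend over past + present
        return fused.squeeze(1)                               # updated memory m_t
```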
2.4 Training and Loss
The full training objective combines standard object detection terms (classification, $\ell_1$ box regression, GIoU) with a cross-entropy loss over the ID token sequence, and optionally alignment and memory regularization terms:

$$\mathcal{L} = \mathcal{L}_{\text{cls}} + \lambda_{\ell_1}\mathcal{L}_{\ell_1} + \lambda_{\text{GIoU}}\mathcal{L}_{\text{GIoU}} + \lambda_{\text{ID}}\mathcal{L}_{\text{ID}} \;\big(+\; \lambda_{\text{align}}\mathcal{L}_{\text{align}} + \lambda_{\text{mem}}\mathcal{L}_{\text{mem}}\big).$$

This joint loss supports precise object localization and robust identity association (Jia et al., 5 Jan 2026).
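A sketch of how such a joint objective can be assembled; the weights are illustrative placeholders rather than the paper's values, and the GIoU term is assumed to be precomputed by the detection head.

```python
import torch.nn.functional as F

def ar_mot_loss(det_out, det_tgt, id_logits, id_targets,
                align_reg=None, mem_reg=None,
                w_cls=1.0, w_l1=5.0, w_giou=2.0, w_id=1.0, w_aux=0.1):
    """Detection terms + autoregressive ID cross-entropy (+ optional regularizers).
    Weights are illustrative, not the paper's."""
    loss = (w_cls * F.cross_entropy(det_out["cls_logits"], det_tgt["labels"])
            + w_l1 * F.l1_loss(det_out["boxes"], det_tgt["boxes"])
            + w_giou * det_out["giou_loss"]                    # assumed precomputed GIoU term
            + w_id * F.cross_entropy(id_logits, id_targets))   # next-ID-token prediction
    if align_reg is not None:
        loss = loss + w_aux * align_reg    # optional RAA alignment regularizer
    if mem_reg is not None:
        loss = loss + w_aux * mem_reg      # optional TMF memory regularizer
    return loss
```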
2.5 Inference Workflow
AR-MOT’s inference comprises detection and tokenization, context/history construction, autoregressive ID decoding (each object token yields a track ID through sequence modeling), and temporal context management. The process supports both sliding window (“history”) and memory-based (TMF) temporal aggregation. Below is an implementation sketch (without TMF) (Jia et al., 5 Jan 2026):
```python
# Inference sketch (sliding-window history; TMF omitted).
active_tracks = {}   # track_id -> list of boxes
TCM = {}             # track_id -> most recent object embedding (temporal context)

for I_i in video_frames:
    B_list, Q_list = DETR.detect(I_i)       # per-detection boxes and decoder queries
    E_list = object_adapter(Q_list)         # project queries into the LLM token space
    history = last_T_frames_tokens + list(TCM.values())
    image_tokens = image_tokenizer(I_i)
    input_seq = concat(history, image_tokens, E_list)

    for B_i_j, E_i_j in zip(B_list, E_list):
        # One autoregressive decoding step per object token.
        id_token = LLM.generate_next(concat(input_seq, E_i_j))
        if id_token == NEW_TRACK_TOKEN:     # the "<new>" token starts a fresh track
            track_id = assign_new_track_id()
        else:
            track_id = id_token
        active_tracks.setdefault(track_id, []).append(B_i_j)
        TCM[track_id] = E_i_j               # keep the last embedding per active track
```
3. Applications and Use Cases
3.1 Standard Multi-Object Tracking
Empirical evaluation on MOT17 yields competitive results: MOTA 66.1, IDF1 60.7, HOTA 50.1. On DanceTrack, AR-MOT achieves HOTA 48.1, DetA 77.2, MOTA 86.3, AssA 30.2, IDF1 43.9, outperforming several track-by-detection (TBD) and track-by-query (TBQ) baselines (Jia et al., 5 Jan 2026). This demonstrates that the autoregressive approach, though architecturally flexible, is competitive on classic benchmarks.
3.2 Multimodal and Referring MOT
The AR-MOT paradigm enables natural extensions to referring or instruction-driven MOT, as well as auditory referring MOT (also abbreviated AR-MOT in that literature). In EchoTrack, objects are tracked according to external audio signals, fusing dual-stream visual/audio representations via bidirectional frequency-domain cross-attention (Bi-FCFM). Learning is supervised by an audio-visual contrastive tracking loss (ACTL); a generic sketch of such a loss follows below. EchoTrack achieves state-of-the-art results on audio-referred MOT datasets, with HOTA 37.14 and IDF1 44.30 on Echo-KITTI (Lin et al., 2024).
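As a rough illustration, the sketch below shows a generic InfoNCE-style contrastive loss over matched audio and visual track embeddings; it is an assumption-laden stand-in for the general idea of ACTL, not EchoTrack's exact formulation.

```python
import torch
import torch.nn.functional as F

def audio_visual_contrastive_loss(audio_feats, visual_feats, temperature=0.07):
    """Generic InfoNCE-style loss over paired audio/visual track embeddings.
    A sketch only; EchoTrack's ACTL may differ in detail."""
    a = F.normalize(audio_feats, dim=-1)                 # [N, d]
    v = F.normalize(visual_feats, dim=-1)                # [N, d]
    logits = a @ v.t() / temperature                     # similarity of every audio-visual pair
    targets = torch.arange(a.size(0), device=a.device)   # i-th audio matches i-th visual track
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```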
3.3 UAV/Aerial Multi-Object Tracking
In aerial MOT, large-scale stream-processing architectures (Kafka, Spark, YOLOv8/v10 with trackers such as BoT-SORT/ByteTrack) realize real-time AR-MOT for UAV video; a minimal streaming sketch follows below. These systems achieve HOTA ≈48.1 and MOTA ≈43.5 at ~28 FPS per GPU on VisDrone2019-MOT (Do et al., 6 Feb 2025). For referring MOT in UAV contexts, the AerialMind benchmark uses the COALA labeling pipeline and the HETrack model, which integrates bidirectional fusion and scale-adaptive contextual refinement for vision-language representations. HETrack yields HOTA 31.5 on VisDrone and 31.6 on UAVDT (Chen et al., 26 Nov 2025).
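For orientation, here is a minimal consumer sketch using kafka-python, OpenCV, and the Ultralytics tracking API; the topic name and model weights are placeholders, and RAMOTS's actual Spark-based architecture is considerably more involved.

```python
import cv2
import numpy as np
from kafka import KafkaConsumer
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # placeholder weights; RAMOTS also evaluates YOLOv10
consumer = KafkaConsumer("uav-frames", bootstrap_servers="localhost:9092")  # hypothetical topic

for message in consumer:
    # Each message is assumed to carry one JPEG-encoded frame.
    frame = cv2.imdecode(np.frombuffer(message.value, dtype=np.uint8), cv2.IMREAD_COLOR)
    # persist=True keeps tracker state across frames; ByteTrack is one supported tracker.
    results = model.track(frame, persist=True, tracker="bytetrack.yaml", verbose=False)
    for box in results[0].boxes:
        if box.id is not None:
            track_id = int(box.id.item())
            x1, y1, x2, y2 = box.xyxy[0].tolist()
            print(f"track {track_id}: ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
```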
4. Extensibility and Multimodal Generalization
Because AR-MOT treats all inputs and outputs as composable sequences of tokens, extensibility is inherent. New modalities or output targets are included by tokenization (e.g., pose, depth, language instructions), with no core architectural modification. For instance, adding 3D-pose prediction simply appends a <pose> token followed by the corresponding quantized pose tokens to each object's output sequence. The underlying autoregressive sequence model remains unchanged, highlighting plug-and-play extensibility and the paradigm's suitability for rapidly evolving, multi-modal tracking scenarios (Jia et al., 5 Jan 2026).
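As a concrete, hypothetical illustration of this plug-and-play property, extending the per-object output sequence with pose tokens only changes the vocabulary and the target-sequence builder, not the decoder itself.

```python
def build_object_target_sequence(track_id_token, box_tokens, pose_tokens=None):
    """Compose the per-object target token sequence.
    The <pose> marker and quantized pose tokens are a hypothetical extension;
    the base AR-MOT sequence contains only the ID (and box) tokens."""
    seq = [track_id_token] + list(box_tokens)
    if pose_tokens is not None:
        seq += ["<pose>"] + list(pose_tokens)   # new modality = new tokens, same decoder
    return seq
```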
A plausible implication is that AR-MOT modeling subsumes text-based, audio-based, and spatially/contextually conditioned tracking, supporting instruction-driven and perception-driven robotics.
5. Benchmarks, Datasets, and Quantitative Comparison
AR-MOT and related paradigms have led to new public benchmarks:
- Standard benchmarks: MOT17, DanceTrack—AR-MOT is competitive on these (Jia et al., 5 Jan 2026).
- Audio-referred MOT: Echo-KITTI, Echo-KITTI+, Echo-BDD—AR-MOT/EchoTrack show improved HOTA and IDF1 (Lin et al., 2024).
- Aerial/Referring MOT: AerialMind—HETrack sets new UAV baselines for HOTA, DetA, AssA, with robust cross-domain generalization (Chen et al., 26 Nov 2025).
A summary table of AR-MOT performance highlights is below:
| Dataset | Method | HOTA | MOTA | IDF1 | FPS |
|---|---|---|---|---|---|
| MOT17 | AR-MOT | 50.1 | 66.1 | 60.7 | N/A |
| DanceTrack | AR-MOT | 48.1 | 86.3 | 43.9 | N/A |
| Echo-KITTI | EchoTrack | 37.1 | 13.4 | 44.3 | N/A |
| VisDrone2019-MOT | RAMOTS | 48.1 | 43.5 | 61.5 | 28 |
| AerialMind (UAV) | HETrack | 31.5 | N/A | N/A | 15.6 |
6. Limitations and Future Directions
Challenges include managing the computational cost of large autoregressive models for real-time deployment, especially in embedded or resource-limited UAV hardware, and robustly handling drastic appearance shifts and complex linguistic semantics in aerial or multimodal environments (Chen et al., 26 Nov 2025, Do et al., 6 Feb 2025). Research directions identified in the literature include:
- Lightweight, efficient AR-MOT variants for onboard or edge inference (Chen et al., 26 Nov 2025).
- Onboard or inference-time LLM reasoning to handle out-of-distribution or zero-shot referring expressions.
- Unsupervised/few-shot domain adaptation for new environments.
- End-to-end integration with flight-path planning for embodied platforms.
- Joint perception, reasoning, and tracking using modular token-based architectures.
Limitations of the current state of the art also include inherited annotation errors in large video datasets, unfavorable computational scaling, and incomplete exploitation of deep LLM reasoning at inference time (Chen et al., 26 Nov 2025).
7. Summary and Significance
AR-MOT represents a shift in multi-object tracking: recasting MOT as a sequence modeling task enables unprecedented extensibility, multi-modality, and composability in tracking architectures. By leveraging tokenization, autoregressive generation, and modular fusion mechanisms, AR-MOT frameworks connect advances in LLMs to core challenges in visual perception and robotics. Public benchmarks and empirical results demonstrate that AR-MOT is competitive on classic metrics and unlocks new capabilities in referring, auditory, and aerial multi-object tracking (Jia et al., 5 Jan 2026, Lin et al., 2024, Chen et al., 26 Nov 2025).
Key references:
- "AR-MOT: Autoregressive Multi-object Tracking" (Jia et al., 5 Jan 2026)
- "EchoTrack: Auditory Referring Multi-Object Tracking for Autonomous Driving" (Lin et al., 2024)
- "AerialMind: Towards Referring Multi-Object Tracking in UAV Scenarios" (Chen et al., 26 Nov 2025)
- "RAMOTS: A Real-Time System for Aerial Multi-Object Tracking based on Deep Learning and Big Data Technology" (Do et al., 6 Feb 2025)