- The paper presents a Joint Detection and Embedding (JDE) model that combines detection and appearance embedding in a single forward pass to reduce computational redundancy.
- It achieves real-time efficiency by processing at 22-40 FPS while maintaining competitive tracking accuracy, with 64.4% MOTA on the MOT-16 benchmark.
- The approach leverages an FPN architecture, multi-task learning with automatic loss balancing, and an online association method using the Hungarian algorithm for robust tracking.
Towards Real-Time Multi-Object Tracking
This paper addresses the computational challenges inherent in modern multiple object tracking (MOT) systems, specifically those following the tracking-by-detection paradigm. Current systems predominantly run separate detection and embedding (re-identification) models, so inference time is essentially the sum of both steps. The novel contribution of this work is a Joint Detection and Embedding model (JDE), designed to output detections and corresponding embeddings in a single forward pass. This work represents one of the first (near) real-time MOT systems, with frame rates ranging from 22 to 40 frames per second (FPS) depending on the input resolution, while achieving competitive tracking accuracy (64.4% MOTA on the MOT-16 challenge).
Methodology
The methodology revolves around several key innovations, summarized as follows:
- Joint Detection and Embedding (JDE) Model: The authors propose integrating the detection and appearance embedding models into a single-shot deep network. This integration allows both tasks to share low-level features, minimizing redundant computations and improving system efficiency.
- Training Data and Architecture: A large-scale unified dataset was constructed from six publicly available pedestrian detection and person search datasets. The network employs a Feature Pyramid Network (FPN) architecture, utilizing skip connections and multi-scale predictions to handle varying target scales effectively.
- Learning Objectives and Optimization: The detection branch uses a cross-entropy loss for classification and a smooth-L1 loss for box regression, whereas the embedding branch uses a cross-entropy loss, which outperformed the triplet loss and its upper bound in the authors' experiments. Automatic loss balancing based on task-dependent uncertainty weights the individual losses during multi-task training, replacing hand-tuned loss weights.
- Association Strategy: An online association method based on appearance and motion affinity is adopted. Tracklets are maintained in a pool, and associations are updated using a linear assignment problem solved by the Hungarian algorithm.
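The automatic loss balancing mentioned in the list above can be sketched as follows. This summary does not state the exact formula, so the block assumes the commonly used uncertainty-weighting form, in which each task loss L_j is scaled by exp(-s_j) and s_j itself is added as a regularizer. The function name `uncertainty_weighted_loss` and the numeric loss values are illustrative assumptions, not taken from the paper; in a real training loop the s_j would be learnable parameters updated by the optimizer.

```python
import math


def uncertainty_weighted_loss(task_losses, log_uncertainties):
    """Combine task losses as sum_j 0.5 * (exp(-s_j) * L_j + s_j).

    task_losses: per-task loss values L_j (floats).
    log_uncertainties: per-task parameters s_j (assumed learnable in practice).
    """
    if len(task_losses) != len(log_uncertainties):
        raise ValueError("one uncertainty parameter is needed per task loss")
    total = 0.0
    for loss, s in zip(task_losses, log_uncertainties):
        # A larger s_j shrinks the task's weight, while the "+ s_j" term
        # penalizes growing s_j without bound.
        total += 0.5 * (math.exp(-s) * loss + s)
    return total


# Hypothetical loss values for classification, box regression, and embedding.
losses = [0.8, 0.3, 2.5]
equal = uncertainty_weighted_loss(losses, [0.0, 0.0, 0.0])
downweighted = uncertainty_weighted_loss(losses, [0.0, 0.0, 1.0])
print(equal, downweighted)
```

With all s_j at zero, every task gets the same weight of 0.5. Raising s_j for the embedding task lowers that task's contribution, which is how the balance between tasks can shift during training without manual tuning.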
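The association step in the list above can be illustrated with a minimal sketch, under several simplifying assumptions. Only appearance affinity (cosine distance between embeddings) is used, and the motion term is omitted. The minimum-cost assignment is found by exhaustive search, which gives the same optimum as the Hungarian algorithm on small inputs but does not scale; a real system would call a Hungarian solver such as `scipy.optimize.linear_sum_assignment`. The names `associate`, `cosine_distance`, and the `max_cost` threshold are hypothetical and not taken from the paper.

```python
import itertools
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def associate(track_embeddings, det_embeddings, max_cost=0.5):
    """Return (matches, unmatched_tracks, unmatched_detections)."""
    n_t, n_d = len(track_embeddings), len(det_embeddings)
    cost = [[cosine_distance(t, d) for d in det_embeddings]
            for t in track_embeddings]

    # Exhaustive minimum-cost assignment (stand-in for a Hungarian solver).
    small, large = min(n_t, n_d), max(n_t, n_d)
    best_pairs, best_total = [], math.inf
    for perm in itertools.permutations(range(large), small):
        if n_t <= n_d:
            pairs = list(zip(range(small), perm))   # (track, detection)
        else:
            pairs = list(zip(perm, range(small)))   # (track, detection)
        total = sum(cost[i][j] for i, j in pairs)
        if total < best_total:
            best_pairs, best_total = pairs, total

    # Gate out pairs whose appearance distance is too large to trust.
    matches = [(i, j) for i, j in best_pairs if cost[i][j] <= max_cost]
    matched_t = {i for i, _ in matches}
    matched_d = {j for _, j in matches}
    unmatched_tracks = [i for i in range(n_t) if i not in matched_t]
    unmatched_dets = [j for j in range(n_d) if j not in matched_d]
    return matches, unmatched_tracks, unmatched_dets


# Toy 2-D embeddings: two existing tracklets, three new detections.
tracks = [[1.0, 0.0], [0.0, 1.0]]
dets = [[0.1, 0.9], [0.9, 0.1], [-1.0, -1.0]]
matches, lost_tracks, new_dets = associate(tracks, dets)
print(matches, lost_tracks, new_dets)
```

In this toy case each tracklet is paired with the detection whose embedding points in nearly the same direction, and the third detection is left unmatched, so a tracker would typically start a new tracklet for it.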
Results
The reported results show that JDE combines high efficiency with competitive accuracy:
- Tracking Accuracy: JDE achieves 64.4% MOTA on the MOT-16 test set, approaching the performance of state-of-the-art separate detection and embedding (SDE) methods (e.g., 66.1% MOTA).
- Frame Rate: JDE runs at about 22 FPS at the higher input resolution (1088×608 pixels) and at 30.3 FPS at a lower resolution (864×480 pixels).
Comparative Analysis
Experiments comparing JDE with various SDE combinations reveal that JDE maintains a favorable balance between accuracy and speed. Unlike SDE models that experience a significant speed drop under crowded conditions due to increased computational demands for embedding extraction, JDE remains relatively unaffected. This stability in performance highlights the advantage of the integrated approach, particularly in scenarios with high pedestrian density.
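The scaling argument above can be made concrete with a deliberately simplified cost model. It assumes that an SDE pipeline pays a fixed detection cost plus a per-target embedding cost, while JDE pays one fixed cost per frame. The SDE timings (`detect_ms`, `embed_ms_per_target`) are illustrative placeholders, not measurements from the paper; only the JDE default corresponds to the roughly 22 FPS reported for the higher resolution. Association cost, which also grows with the number of targets, is ignored.

```python
def sde_fps(num_targets, detect_ms=40.0, embed_ms_per_target=2.0):
    """Frame rate of a two-stage pipeline under the assumed linear cost model."""
    frame_ms = detect_ms + num_targets * embed_ms_per_target
    return 1000.0 / frame_ms


def jde_fps(num_targets, joint_ms=1000.0 / 22.0):
    """Frame rate of a single-pass model; num_targets is unused by design,
    because embeddings come out of the same forward pass as detections."""
    return 1000.0 / joint_ms


for n in (5, 20, 80):
    print(f"{n:3d} targets: SDE {sde_fps(n):5.1f} FPS, JDE {jde_fps(n):5.1f} FPS")
```

Under these assumed numbers the two-stage pipeline slows down steadily as the scene gets more crowded, while the single-pass model stays flat, which is the qualitative behavior the comparison describes.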
Further comparisons with existing MOT systems under the 'private data' protocol of the MOT-16 benchmark underscore JDE's efficiency. It runs significantly faster (up to 30.3 FPS) than existing methods, which typically require multiple models for detection and embedding, yielding at best ~15 FPS.
Implications and Future Work
The proposed JDE model highlights the practical benefits of a single-shot MOT system, particularly for real-time applications in autonomous driving, smart surveillance, and other areas requiring efficient handling of video data. While the tracking accuracy is already competitive, the primary area for future improvement lies in better handling of significant pedestrian overlaps, which currently cause detection inaccuracies and subsequent ID switches.
In conclusion, this work shows that a single network can substantially reduce the computational overhead of tracking-by-detection while staying close to the accuracy of two-stage systems (64.4% versus 66.1% MOTA on MOT-16). Future research could focus on refining the JDE model's robustness in high-density scenarios and exploring further optimizations in the multi-task learning framework.