Towards Real-Time Multi-Object Tracking (1909.12605v2)

Published 27 Sep 2019 in cs.CV

Abstract: Modern multiple object tracking (MOT) systems usually follow the tracking-by-detection paradigm. It has 1) a detection model for target localization and 2) an appearance embedding model for data association. Having the two models separately executed might lead to efficiency problems, as the running time is simply a sum of the two steps without investigating potential structures that can be shared between them. Existing research efforts on real-time MOT usually focus on the association step, so they are essentially real-time association methods but not real-time MOT system. In this paper, we propose an MOT system that allows target detection and appearance embedding to be learned in a shared model. Specifically, we incorporate the appearance embedding model into a single-shot detector, such that the model can simultaneously output detections and the corresponding embeddings. We further propose a simple and fast association method that works in conjunction with the joint model. In both components the computation cost is significantly reduced compared with former MOT systems, resulting in a neat and fast baseline for future follow-ups on real-time MOT algorithm design. To our knowledge, this work reports the first (near) real-time MOT system, with a running speed of 22 to 40 FPS depending on the input resolution. Meanwhile, its tracking accuracy is comparable to the state-of-the-art trackers embodying separate detection and embedding (SDE) learning (64.4% MOTA vs. 66.1% MOTA on MOT-16 challenge). Code and models are available at https://github.com/Zhongdao/Towards-Realtime-MOT.

Summary

The paper presents a Joint Detection and Embedding (JDE) model that combines detection and appearance embedding in a single forward pass to reduce computational redundancy.
It achieves real-time efficiency by processing at 22-40 FPS while maintaining competitive tracking accuracy, with 64.4% MOTA on the MOT-16 benchmark.
The approach leverages an FPN architecture, multi-task learning with automatic loss balancing, and an online association method using the Hungarian algorithm for robust tracking.

Towards Real-Time Multi-Object Tracking

This paper addresses the computational cost of modern multiple object tracking (MOT) systems, specifically those following the tracking-by-detection paradigm. Current systems predominantly run separate detection and embedding (re-identification) models, so the inference time is essentially the sum of both steps. The main contribution of this work is a Joint Detection and Embedding (JDE) model, designed to output detections and their corresponding embeddings in a single forward pass. The authors report it as the first (near) real-time MOT system, with frame rates of 22 to 40 frames per second (FPS) depending on the input resolution, and it achieves competitive tracking accuracy (64.4% MOTA on the MOT-16 challenge).

Methodology

The methodology revolves around several key innovations, summarized as follows:

Joint Detection and Embedding (JDE) Model: The authors propose integrating the detection and appearance embedding models into a single-shot deep network. This integration allows both tasks to share low-level features, minimizing redundant computations and improving system efficiency.
Training Data and Architecture: A large-scale unified dataset was constructed from six publicly available pedestrian detection and person search datasets. The network employs a Feature Pyramid Network (FPN) architecture, utilizing skip connections and multi-scale predictions to handle varying target scales effectively.
Learning Objectives and Optimization: The detection branch uses cross-entropy and smooth-L1 losses, whereas the embedding branch uses a cross-entropy loss, which the experiments found superior to the triplet loss and to an upper-bound variant of it. Automatic loss balancing based on task-dependent uncertainty removes the need to hand-tune the weights of the individual loss terms.
Association Strategy: An online association method based on appearance and motion affinity is adopted. Tracklets are maintained in a pool, and associations are updated using a linear assignment problem solved by the Hungarian algorithm.
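The uncertainty-based loss balancing mentioned above can be sketched as follows. This is a minimal illustration in the style of homoscedastic-uncertainty weighting, where each task loss L is paired with a learnable log-variance s and contributes 0.5 * (exp(-s) * L + s) to the total. The function name, the plain-float inputs, and the example values are assumptions made for illustration; in a real training loop the s values would be trainable parameters updated by the optimizer, and the exact form used by the authors should be checked against the paper.

```python
import math

def balanced_loss(losses, log_vars):
    """Combine task losses with uncertainty weights (illustrative sketch).

    losses   -- per-task loss values, e.g. [classification, box, embedding]
    log_vars -- one log-variance s per task (learnable in a real model;
                plain floats here, which is an assumption of this sketch)
    """
    if len(losses) != len(log_vars):
        raise ValueError("need exactly one log-variance per loss")
    # exp(-s) down-weights a noisy task; the additive s term penalises
    # making s arbitrarily large just to shrink the weighted loss.
    return sum(0.5 * (math.exp(-s) * loss + s)
               for loss, s in zip(losses, log_vars))

# With s = 0 every task has weight 1, so the total is half the plain sum.
uniform = balanced_loss([2.0, 4.0], [0.0, 0.0])

# For a fixed loss value L, the term is minimised at s = ln(L), so larger
# losses end up with larger s and therefore smaller weights.
tuned = balanced_loss([2.0, 4.0], [math.log(2.0), math.log(4.0)])
```

Because each term is minimised at s = ln(L), a task whose loss stays large is automatically given less weight, which is the balancing effect the summary refers to.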
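The association step can be sketched as a linear assignment between existing tracklets and new detections. The sketch below is deliberately simplified and rests on several assumptions: it uses appearance (cosine distance between embeddings) only and omits the motion affinity, the gating threshold max_dist is a hypothetical value, embeddings are assumed to be non-zero vectors, and an exhaustive search over permutations stands in for the Hungarian algorithm. The exhaustive search returns the same minimum-cost assignment but only scales to a handful of targets; a practical implementation would use a polynomial-time solver such as scipy.optimize.linear_sum_assignment.

```python
import math
from itertools import permutations

def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def associate(track_embs, det_embs, max_dist=0.7):
    """Match tracklets to detections by minimum total appearance cost.

    Returns (matches, unmatched_tracks, unmatched_dets), where matches is
    a sorted list of (track_index, detection_index) pairs.
    max_dist is a hypothetical gate: costlier pairs are rejected.
    """
    n, m = len(track_embs), len(det_embs)
    cost = [[cosine_distance(t, d) for d in det_embs] for t in track_embs]

    # Enumerate every one-to-one assignment of the smaller side to the
    # larger side (brute-force stand-in for the Hungarian algorithm).
    if n <= m:
        candidates = (list(zip(range(n), perm))
                      for perm in permutations(range(m), n))
    else:
        candidates = (list(zip(perm, range(m)))
                      for perm in permutations(range(n), m))

    best_pairs, best_total = [], math.inf
    for pairs in candidates:
        total = sum(cost[i][j] for i, j in pairs)
        if total < best_total:
            best_pairs, best_total = pairs, total

    # Gate the optimal assignment: pairs that look too different are dropped.
    matches = sorted((i, j) for i, j in best_pairs if cost[i][j] <= max_dist)
    matched_tracks = {i for i, _ in matches}
    matched_dets = {j for _, j in matches}
    unmatched_tracks = [i for i in range(n) if i not in matched_tracks]
    unmatched_dets = [j for j in range(m) if j not in matched_dets]
    return matches, unmatched_tracks, unmatched_dets

# Two tracklets, three detections; the third detection resembles neither.
tracks = [[1.0, 0.0], [0.0, 1.0]]
dets = [[0.1, 0.9], [0.9, 0.1], [-1.0, -1.0]]
matches, lost_tracks, new_dets = associate(tracks, dets)
```

In a tracker built this way, unmatched detections would typically start new tracklets and tracklets that stay unmatched for too long would be removed from the pool; those policies are not shown here.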

Results

In terms of numerical results, the JDE model demonstrates high efficiency and competitive accuracy:

Tracking Accuracy: JDE achieves 64.4% MOTA on the MOT-16 test set, approaching the performance of state-of-the-art separate detection and embedding (SDE) methods (e.g., 66.1% MOTA).
Frame Rate: JDE runs at about 22 FPS at the highest input resolution (1088×608 pixels) and at 30.3 FPS at a lower one (864×480 pixels); the abstract gives an overall range of 22 to 40 FPS depending on input resolution.
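To relate these frame rates to a real-time budget, it helps to convert FPS into time available per frame. The helper below is a trivial sketch (the function name is an assumption); the three rates are the ones quoted in this summary and in the abstract.

```python
def frame_budget_ms(fps):
    """Milliseconds available per frame at a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps

# Frame rates quoted in the text: 22, 30.3 and 40 FPS.
for rate in (22.0, 30.3, 40.0):
    print(f"{rate:5.1f} FPS -> {frame_budget_ms(rate):.1f} ms per frame")
```

A 30 FPS video stream leaves roughly 33.3 ms per frame, so a tracker has to sustain about 30 FPS end to end, including association, to keep up with it.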

Comparative Analysis

Experiments comparing JDE with various SDE combinations reveal that JDE maintains a favorable balance between accuracy and speed. Unlike SDE models that experience a significant speed drop under crowded conditions due to increased computational demands for embedding extraction, JDE remains relatively unaffected. This stability in performance highlights the advantage of the integrated approach, particularly in scenarios with high pedestrian density.
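The scaling argument can be made concrete with a toy cost model. In an SDE pipeline the embedding network runs once per detected target, so per-frame time grows with the number of targets; in JDE the embeddings come out of the same forward pass as the detections, so per-frame time is roughly constant. All timings below are hypothetical placeholders chosen for illustration, not measurements from the paper, and association cost is ignored for both systems.

```python
def sde_frame_time_ms(num_targets, t_detect_ms, t_embed_ms):
    """Separate models: one detector pass plus one embedding pass per target."""
    return t_detect_ms + num_targets * t_embed_ms

def jde_frame_time_ms(num_targets, t_joint_ms):
    """Joint model: a single forward pass, independent of target count."""
    return t_joint_ms

def to_fps(frame_time_ms):
    return 1000.0 / frame_time_ms

# Hypothetical timings (assumptions, not measured values).
T_DETECT_MS, T_EMBED_MS, T_JOINT_MS = 40.0, 2.0, 50.0

for n in (5, 20, 50):
    sde = sde_frame_time_ms(n, T_DETECT_MS, T_EMBED_MS)
    jde = jde_frame_time_ms(n, T_JOINT_MS)
    print(f"{n:3d} targets: SDE {to_fps(sde):5.1f} FPS, JDE {to_fps(jde):5.1f} FPS")
```

Under these assumed numbers the SDE pipeline is faster for sparse scenes but falls behind once there are more than five targets, which mirrors the qualitative behaviour described above.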

Further comparisons with existing MOT systems under the 'private data' protocol of the MOT-16 benchmark underscore JDE's efficiency. It runs significantly faster (up to 30.3 FPS) than existing methods, which typically require separate models for detection and embedding and reach at best ~15 FPS.

Implications and Future Work

The proposed JDE model highlights the practical benefits of a single-shot MOT system, particularly for real-time applications in autonomous driving, smart surveillance, and other areas requiring efficient handling of video data. While the tracking accuracy is already competitive, the primary area for future improvement lies in better handling of significant pedestrian overlaps, which currently cause detection inaccuracies and subsequent ID switches.

In conclusion, this work provides a fast baseline for real-time MOT systems, substantially reducing computational overhead at a small cost in accuracy (64.4% vs. 66.1% MOTA on MOT-16). Future research could focus on making the JDE model more robust in high-density scenarios and on further optimizing the multi-task learning framework.
