
Chained-Tracker: Chaining Paired Attentive Regression Results for End-to-End Joint Multiple-Object Detection and Tracking (2007.14557v1)

Published 29 Jul 2020 in cs.CV

Abstract: Existing Multiple-Object Tracking (MOT) methods either follow the tracking-by-detection paradigm to conduct object detection, feature extraction and data association separately, or have two of the three subtasks integrated to form a partially end-to-end solution. Going beyond these sub-optimal frameworks, we propose a simple online model named Chained-Tracker (CTracker), which naturally integrates all the three subtasks into an end-to-end solution (the first as far as we know). It chains paired bounding boxes regression results estimated from overlapping nodes, of which each node covers two adjacent frames. The paired regression is made attentive by object-attention (brought by a detection module) and identity-attention (ensured by an ID verification module). The two major novelties: chained structure and paired attentive regression, make CTracker simple, fast and effective, setting new MOTA records on MOT16 and MOT17 challenge datasets (67.6 and 66.6, respectively), without relying on any extra training data. The source code of CTracker can be found at: github.com/pjl1995/CTracker.

Citations (291)

Summary

  • The paper proposes an integrated end-to-end framework that jointly detects and tracks multiple objects using a novel chained regression approach.
  • It leverages a joint attention module combining object detection and ID verification to simplify cross-frame association and improve tracking accuracy.
  • Empirical results on MOT16 and MOT17 datasets achieve MOTA scores of 67.6 and 66.6, demonstrating significant performance gains over traditional methods.

Chained-Tracker: Unified Detection and Tracking Framework

The paper introduces Chained-Tracker (CTracker), a unified framework for end-to-end joint detection and tracking in multiple-object tracking (MOT), addressing limitations of traditional MOT methods. Unlike previous approaches that treat object detection, feature extraction, and data association as isolated tasks, CTracker integrates them into a single network, enabling global optimization.

At its core, CTracker treats MOT as a paired detection problem, regressing bounding-box pairs from adjacent frames. The regression is made attentive by object-attention, introduced via a detection module, and identity-attention, ensured by an ID verification module. The two primary innovations, a chained structure operating over overlapping frame pairs and paired attentive regression, enable simultaneous detection and tracking. The framework sets new MOTA records on the MOT16 and MOT17 benchmarks (67.6 and 66.6, respectively) without leveraging additional training data.

Methodological Insights

The framework constructs a chain by taking two adjacent frames as input and simultaneously predicting bounding-box pairs for targets visible in both frames; each such frame pair is termed a chain node. The regression of these box pairs is refined by a joint attention module that combines object classification, which focuses the network on relevant image regions, with ID verification, which enforces consistency of the tracked object's identity across the pair.
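The joint attention idea can be illustrated with a minimal NumPy sketch. This is a hedged approximation, not the paper's implementation: the object- and identity-attention branches are reduced to sigmoid-activated confidence maps that element-wise reweight a shared feature map, and the `(H, W, C)` layout is an assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def apply_joint_attention(features, obj_logits, id_logits):
    """Sketch of joint attention: modulate shared features with
    object-attention (from a detection/classification branch) and
    identity-attention (from an ID verification branch).

    features:   (H, W, C) feature map shared by the frame pair
    obj_logits: (H, W) logits, "is there an object here?"
    id_logits:  (H, W) logits, "is it the same identity across frames?"
    """
    obj_attn = sigmoid(obj_logits)           # soft object confidence map
    id_attn = sigmoid(id_logits)             # soft identity-consistency map
    attn = (obj_attn * id_attn)[..., None]   # combine, broadcast over channels
    return features * attn                   # attended features for regression
```

In the actual model these maps come from learned convolutional branches; the point here is only the multiplicative gating of the regression features.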

The method translates the complex cross-frame association into a less complex paired-object detection problem, considerably simplifying the problem space. The model’s architecture employs ResNet-50 with Feature Pyramid Networks (FPN) to handle multi-scale object representation. The shared features of frame pairs are utilized to predict paired boxes, leading to improved tracking accuracy.
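Because each chain node already outputs a box pair, linking trajectories reduces to matching boxes in the frame shared by two consecutive nodes. The sketch below uses greedy IoU matching for that step; the exact matching strategy and threshold are assumptions for illustration, not the paper's precise procedure.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def chain_nodes(node_t, node_t1, thresh=0.5):
    """Link box pairs of node t to node t+1 through their shared frame.

    node_t:  list of (box_in_frame_t,  box_in_frame_t+1) pairs
    node_t1: list of (box_in_frame_t+1, box_in_frame_t+2) pairs
    Returns (i, j) index pairs linking node_t[i] to node_t1[j],
    matched greedily by descending IoU in the shared frame t+1.
    """
    candidates = []
    for i, (_, b_shared) in enumerate(node_t):
        for j, (b_next, _) in enumerate(node_t1):
            s = iou(b_shared, b_next)
            if s >= thresh:
                candidates.append((s, i, j))
    candidates.sort(reverse=True)
    used_i, used_j, links = set(), set(), []
    for _, i, j in candidates:
        if i not in used_i and j not in used_j:
            links.append((i, j))
            used_i.add(i)
            used_j.add(j)
    return links
```

Chaining all overlapping nodes in this way stitches per-pair detections into full trajectories without a separate appearance-based association stage.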

Empirical Results

CTracker achieves significant improvements, notably in MOTA, on the MOT16 and MOT17 benchmarks without supplementary training data. The results demonstrate a favorable balance between accuracy and inference speed. The joint attention module considerably enhances box-regression accuracy and identity consistency, both critical for maintaining accurate trajectories in cluttered scenes.

Moreover, the integration approach tackles typical challenges of tracking-by-detection paradigms, reducing computational overhead and complexity. While retaining competitive performance in benchmark evaluations, the end-to-end trainable model represents progress towards real-time application requirements in video understanding and human behavior analysis, especially in complex environments with high object density and interaction.

Future Developments

This research positions CTracker as a viable path forward in MOT, emphasizing integrated frameworks that minimize decoupling in data association. Future enhancements could include adaptive learning mechanisms to improve robustness against occlusions, and complementary modalities to extend the approach to more diverse environments.

Continued development in this domain may also bring more advanced temporal dynamics and application to object categories beyond pedestrian-focused datasets. Extending the framework to more complex motion patterns and a broader range of occlusion scenarios, using stronger temporal correlation techniques, could further improve tracking fidelity.

The exploration of differentiable linking processes could refine the end-to-end nature of CTracker, providing insights into comprehensive feature extraction and association approaches. Such advancements would contribute to broader-reaching impacts in autonomous navigation, surveillance, and interactive environments.