Self-Supervised Multi-Object Tracking For Autonomous Driving From Consistency Across Timescales

Published 25 Apr 2023 in cs.CV | (2304.13147v2)

Abstract: Self-supervised multi-object trackers have tremendous potential as they enable learning from raw domain-specific data. However, their re-identification accuracy still falls short compared to their supervised counterparts. We hypothesize that this drawback results from formulating self-supervised objectives that are limited to single frames or frame pairs. Such formulations do not capture sufficient visual appearance variations to facilitate learning consistent re-identification features for autonomous driving when the frame rate is low or object dynamics are high. In this work, we propose a training objective that enables self-supervised learning of re-identification features from multiple sequential frames by enforcing consistent association scores across short and long timescales. We perform extensive evaluations demonstrating that re-identification features trained from longer sequences significantly reduce ID switches on standard autonomous driving datasets compared to existing self-supervised learning methods, which are limited to training on frame pairs. Using our proposed SubCo loss function, we set the new state-of-the-art among self-supervised methods and even perform on par with fully supervised learning methods.

Summary

The paper introduces SubCo, a self-supervised objective that leverages temporal consistency across timescales to enhance multi-object tracking in autonomous driving.
By utilizing consecutive frame sequences and enforcing consistent association scores, the model significantly reduces identity switches and improves re-identification accuracy.
Experimental results on datasets like BDD100k and MOT17 demonstrate that the SubCo-trained framework rivals state-of-the-art supervised approaches.

Self-Supervised Multi-Object Tracking For Autonomous Driving

The paper "Self-Supervised Multi-Object Tracking For Autonomous Driving From Consistency Across Timescales" proposes a self-supervised learning framework for improving multi-object tracking accuracy in autonomous driving systems. Its core contribution is a new training objective, SubCo, which enforces temporal consistency across short and long timescales to learn robust re-identification features.

Introduction

In autonomous vehicle systems, accurately tracking the dynamics of surrounding objects is critical for safe navigation. Multi-object tracking (MOT) methods often require extensive labeled data, which limits their adaptability to changing environments or sensor configurations. Self-supervised learning (SSL) approaches circumvent this limitation by learning directly from raw sensor data, eliminating the need for expensive annotations. The paper hypothesizes that existing SSL formulations fall short in re-identification accuracy because their objectives are limited to single frames or frame pairs, which do not capture enough variation in visual appearance to learn consistent features, particularly when the frame rate is low or object dynamics are high.

Methodology

To address the limitations of existing self-supervised methods, the authors introduce a self-supervised tracking mechanism utilizing the SubCo training objective, which improves feature learning from multiple consecutive frames. The model enforces consistent association scores across both short and long timescales, thereby enhancing the robustness of re-identification features across occlusion scenarios and variable frame rates.

Figure 1: We propose the self-supervised tracking objective SubCo that learns re-identification features for tracking object instances along a sequence by enforcing consistent association scores when tracking at short timescales (dotted lines) and long timescales (solid lines).

The SubCo loss is computed by propagating association scores through a sequence of frames, allowing the model to handle both the full and partial occlusions that occur naturally in training data. The pipeline uses an existing YOLO-X object detector to produce bounding boxes, and re-identification features are then extracted by a ResNet or Vision Transformer (ViT) model trained with the SubCo loss.
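
The idea of comparing association scores across timescales can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that association scores are a softmax over cosine similarities between re-identification features, that short-timescale scores are propagated by multiplying the frame-to-frame matrices, and that consistency is measured with a cross-entropy term. The function names, the `temperature` parameter, and the choice of cross-entropy are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def association(feats_a, feats_b, temperature=0.1):
    """Row-stochastic matrix: row i softly assigns detection i of one
    frame to the detections of another frame (assumed formulation)."""
    rows = []
    for u in feats_a:
        logits = [cosine(u, v) / temperature for v in feats_b]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

def matmul(p, q):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]

def consistency_loss(frames, temperature=0.1, eps=1e-9):
    """Cross-entropy between the chained short-timescale association
    (first frame to last frame via every intermediate frame) and the
    direct long-timescale association (first frame to last frame)."""
    chained = association(frames[0], frames[1], temperature)
    for t in range(1, len(frames) - 1):
        step = association(frames[t], frames[t + 1], temperature)
        chained = matmul(chained, step)
    direct = association(frames[0], frames[-1], temperature)
    loss = 0.0
    for row_c, row_d in zip(chained, direct):
        loss -= sum(c * math.log(d + eps) for c, d in zip(row_c, row_d))
    return loss / len(chained)

# Toy example: two objects with stable features over three frames.
stable = [[[1.0, 0.0], [0.0, 1.0]] for _ in range(3)]
# Same sequence, but the middle frame has indistinguishable features.
ambiguous = [stable[0], [[1.0, 1.0], [1.0, 1.0]], stable[2]]
print(consistency_loss(stable) < consistency_loss(ambiguous))
```

In this toy setup, the sequence with stable features yields a lower loss than the one whose middle frame is ambiguous, which is the kind of signal a consistency objective would use to push features toward being discriminative along the whole sequence.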

Figure 2: Illustration of the proposed training pipeline. Bounding boxes detected by a YOLO-X object detector are encoded into re-identification feature vectors by a dedicated feature extractor.

Experimental Evaluation

Comprehensive evaluations are conducted on standard autonomous driving datasets such as BDD100k and MOT17. The results demonstrate that the SubCo-trained models significantly outperform existing self-supervised methods and are competitive with state-of-the-art supervised tracking methods in reducing identity switches and enhancing tracking accuracy.

The experiments validate the hypothesis that longer frame sequences in training improve tracking consistency, thereby suggesting that capturing a richer temporal span of object features aids in developing discriminative re-identification capabilities. A detailed ablation study further explores varying sequence lengths and different training components, reinforcing the efficacy of the SubCo loss in achieving lower ID switches and higher tracking accuracy.
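
Since ID switches are the central metric here, a simplified sketch of how they can be counted may help. It assumes that predicted tracks have already been matched to ground-truth objects in every frame, and it counts a switch whenever an object's matched track ID differs from the last one it was matched to. The function name and the input format are hypothetical, and standard MOT evaluation toolkits apply additional matching rules that are omitted here.

```python
def count_id_switches(assignments):
    """Count identity switches.

    assignments: list of per-frame dicts mapping a ground-truth object
    id to the predicted track id it was matched to in that frame.
    Objects missing from a frame (e.g. occluded) are simply skipped.
    """
    last_seen = {}  # ground-truth id -> most recent matched track id
    switches = 0
    for frame in assignments:
        for gt_id, track_id in frame.items():
            if gt_id in last_seen and last_seen[gt_id] != track_id:
                switches += 1
            last_seen[gt_id] = track_id
    return switches

frames = [
    {"car": 1, "pedestrian": 2},
    {"car": 1},                   # pedestrian occluded in this frame
    {"car": 3, "pedestrian": 2},  # car re-identified under a new track id
]
print(count_id_switches(frames))  # prints 1
```

In the example, the pedestrian keeps its track ID after the occlusion and causes no switch, while the car changes from track 1 to track 3 and is counted once.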

A discussion of prior self-supervised MOT approaches highlights the difficulty of deriving robust correspondence signals between object detections. Existing strategies rely on pseudo-labels, identity-free pretext tasks, or heavy temporal augmentation, and they often underperform because they do not adequately capture real-world appearance variation. The SubCo loss addresses this by incorporating temporal dependencies across a sequence, so that individual object instances can be re-identified under varied appearances and occlusion conditions.

Conclusion

The paper advances self-supervised multi-object tracking by addressing a major limitation of current SSL strategies, namely their restriction to single frames or frame pairs, through consistency across timescales. The SubCo loss function opens avenues for future work on end-to-end tracking architectures and training on longer sequences, with promising applications in autonomous navigation systems.

Empirical results indicate that self-supervised methods can match the performance of supervised systems on real-world datasets such as BDD100k, setting a new state of the art among self-supervised methods for object re-identification. Future work could extend this framework to broader domains and integrate it into joint detection-tracking models for end-to-end optimization.
