- The paper introduces a novel self-supervised approach that uses temporal appearance graphs to learn object associations with significantly fewer annotations.
- It employs a multi-positive contrastive learning strategy to optimize random walks, ensuring robust appearance-based tracking across frames.
- Experiments on benchmarks like DanceTrack and MOT17 show Walker achieves state-of-the-art tracking accuracy with up to 400x fewer annotations.
Walker: Self-supervised Multiple Object Tracking by Walking on Temporal Appearance Graphs
In the field of computer vision, multiple object tracking (MOT) is pivotal for various real-world applications, including autonomous driving, video surveillance, and augmented reality. Traditional approaches to MOT rely heavily on tracking-by-detection paradigms that require significant annotation effort to provide bounding boxes and instance identities for every frame of every video. The paper "Walker: Self-supervised Multiple Object Tracking by Walking on Temporal Appearance Graphs" presents an innovative self-supervised approach to MOT that minimizes annotation requirements, learning instead from sparsely annotated videos without any tracking labels.
Methodology
Core Concept
Walker introduces a novel self-supervised methodology for MOT, featuring a quasi-dense temporal object appearance graph (TOAG). This graph enables the reinforcement of appearance consistency across frames by optimizing random walks and leveraging motion constraints. The key innovation lies in a multi-positive contrastive objective, designed to optimize these random walks on the TOAG, allowing the model to learn instance similarities effectively.
Graph Construction
Nodes in the TOAG represent object-level regions of interest (RoIs), and edges are weighted by the cosine similarity between the nodes' appearance embeddings. Nodes are adaptively connected across frames, capturing the temporal consistency of object appearances. This graph-centric approach removes the need for per-frame instance identity labels, which are traditionally the most burdensome part of the annotation effort.
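The edge construction described above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it takes RoI appearance embeddings from two frames, L2-normalizes them, computes cosine similarities as edge weights, and converts each row into a transition distribution with a softmax (the temperature value is an assumed hyperparameter):

```python
import numpy as np

def transition_matrix(emb_a, emb_b, temperature=0.1):
    """Edges of the appearance graph: cosine similarities between
    L2-normalized RoI embeddings of two frames, converted into a
    row-stochastic transition matrix via a temperature softmax."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sim = a @ b.T                                # (Na, Nb) cosine similarities
    logits = sim / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)      # each row sums to 1

# toy example: 3 RoIs in frame t, 3 RoIs in frame t+1, 8-dim embeddings
rng = np.random.default_rng(0)
P = transition_matrix(rng.normal(size=(3, 8)), rng.normal(size=(3, 8)))
```

Row-normalizing the similarities is what turns the appearance graph into something a random walk can traverse: entry `P[i, j]` is the probability of stepping from RoI `i` in one frame to RoI `j` in the next.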
Training Objective
Walker operates through a two-fold training objective:
- Multi-Positive Contrastive Learning: A contrastive loss over cyclic random walks in the TOAG encourages nodes to cluster by appearance similarity. The loss is distributed over several positive examples per node, rather than a single positive-negative pair.
- Mutually-Exclusive Connectivity Enforcement: This further refines the graph structure to enforce mutually-exclusive assignments, facilitating accurate instance differentiation. The TOAG is optimized to form robust transition paths between nodes representing the same object across multiple frames.
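The cyclic-walk idea above can be sketched as follows. This is a simplified single-cycle (t → t+1 → t) variant in which each node's own diagonal entry stands in as the positive target; the paper's multi-positive formulation spreads the target mass over all quasi-dense RoIs of the same instance, which this toy version does not model:

```python
import numpy as np

def softmax_rows(logits):
    logits = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

def cyclic_walk_loss(emb_t, emb_t1, tau=0.1):
    """Cycle-consistency contrastive loss on a t -> t+1 -> t walk.
    A node in frame t should walk out and return to itself, so the
    diagonal of the round-trip transition matrix carries the positive
    probability mass that the loss maximizes."""
    a = emb_t / np.linalg.norm(emb_t, axis=1, keepdims=True)
    b = emb_t1 / np.linalg.norm(emb_t1, axis=1, keepdims=True)
    sim = a @ b.T
    P_fwd = softmax_rows(sim / tau)    # walk: frame t  -> frame t+1
    P_bwd = softmax_rows(sim.T / tau)  # walk: frame t+1 -> frame t
    P_cycle = P_fwd @ P_bwd            # round-trip transition matrix
    return -np.mean(np.log(np.diag(P_cycle) + 1e-8))
```

When embeddings of the same object match across frames, the round-trip probability concentrates on the diagonal and the loss approaches zero; when all embeddings look alike, the walk diffuses and the loss grows.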
Inference
During inference, Walker establishes object associations from the maximum-likelihood transition states of motion-constrained bidirectional walks on the TOAG. Combining appearance and motion cues in this way improves tracking accuracy and robustness, significantly outperforming previous self-supervised MOT methods despite drastically reduced annotation requirements.
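A minimal sketch of this association step, under assumptions not taken from the paper: `motion_mask` is a hypothetical boolean gate (e.g. 1 where a motion-predicted box overlaps a candidate detection) standing in for the paper's motion constraints, and matching is done greedily rather than with the paper's exact assignment procedure:

```python
import numpy as np

def associate(P_fwd, P_bwd, motion_mask):
    """Greedy association by max-likelihood bidirectional transitions.
    P_fwd: (Nt, Nt1) walk probabilities, frame t -> frame t+1
    P_bwd: (Nt1, Nt) walk probabilities, frame t+1 -> frame t
    motion_mask: (Nt, Nt1) boolean gate from motion constraints."""
    score = P_fwd * P_bwd.T                    # agreement of both walk directions
    score = np.where(motion_mask, score, 0.0)  # veto motion-implausible pairs
    matches = {}
    for i in np.argsort(-score.max(axis=1)):   # strongest candidates first
        j = int(score[i].argmax())
        if score[i, j] > 0 and j not in matches.values():
            matches[int(i)] = j                # each detection used at most once
    return matches
```

Multiplying the forward and backward walk probabilities rewards pairs that agree in both directions, which is the intuition behind the bidirectional walks; the greedy loop then enforces one-to-one assignments.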
Experimental Results
Walker was evaluated on multiple MOT benchmarks, namely MOT17, DanceTrack, and BDD100K, where it is competitive with state-of-the-art supervised MOT methods. Notably, on sparsely annotated data Walker is exceptionally efficient, achieving high detection and association accuracy with up to 400x fewer annotations than supervised methods.
On DanceTrack, Walker achieved 45.9 HOTA, significantly surpassing the previous self-supervised methods (e.g., QDTrack-S with 29.2 HOTA). Similarly, Walker delivered robust performance on densely annotated datasets such as MOT17, where it recorded 63.6 HOTA and approached but did not surpass the best supervised methods in terms of IDF1 and MOTA scores.
Practical and Theoretical Implications
Practical Implications
- Reduced Annotation Effort: By operating effectively under sparse annotations, Walker offers a significant reduction in the manual labor associated with dataset preparation. This efficiency is crucial for scaling tracking applications across diverse and expansive real-world scenarios where exhaustive annotations are impractical.
- Robustness to Appearance Changes: The self-supervised approach's capacity to learn robust appearance features from temporal information enables better handling of dynamic scenes with high variability in object appearance.
Theoretical Implications
- Advancements in Self-Supervision: The introduction of multi-positive contrastive learning combined with mutually-exclusive assignment optimization represents a notable advancement in self-supervised learning paradigms. This could inspire further research into improving the granularity and efficiency of self-supervised models across other domains in computer vision.
- Graph-based Learning: The paper reinforces the utility of graph-based representations for handling temporal sequences in computer vision, with potential applications extending to other forms of sequential data beyond video tracking.
Future Developments
Future work may explore further optimization of the TOAG framework, potentially integrating advanced temporal modeling techniques such as transformers. Additionally, expanding Walker's applicability to fully unsupervised settings or domain adaptation scenarios could make it even more versatile. The methodology could also potentially integrate multimodal data to enhance action recognition and tracking in complex environments.
In conclusion, Walker presents a sophisticated self-supervised method for MOT that significantly reduces annotation burdens while maintaining high accuracy. Its innovative use of TOAGs and multi-positive contrastive learning sets a new benchmark for self-supervised tracking systems, demonstrating both practical efficacy and theoretical promise.