
Learning to Track with Object Permanence (2103.14258v2)

Published 26 Mar 2021 in cs.CV

Abstract: Tracking by detection, the dominant approach for online multi-object tracking, alternates between localization and association steps. As a result, it strongly depends on the quality of instantaneous observations, often failing when objects are not fully visible. In contrast, tracking in humans is underlined by the notion of object permanence: once an object is recognized, we are aware of its physical existence and can approximately localize it even under full occlusions. In this work, we introduce an end-to-end trainable approach for joint object detection and tracking that is capable of such reasoning. We build on top of the recent CenterTrack architecture, which takes pairs of frames as input, and extend it to videos of arbitrary length. To this end, we augment the model with a spatio-temporal, recurrent memory module, allowing it to reason about object locations and identities in the current frame using all the previous history. It is, however, not obvious how to train such an approach. We study this question on a new, large-scale, synthetic dataset for multi-object tracking, which provides ground truth annotations for invisible objects, and propose several approaches for supervising tracking behind occlusions. Our model, trained jointly on synthetic and real data, outperforms the state of the art on KITTI and MOT17 datasets thanks to its robustness to occlusions.

Authors (4)
  1. Pavel Tokmakov (32 papers)
  2. Jie Li (553 papers)
  3. Wolfram Burgard (149 papers)
  4. Adrien Gaidon (84 papers)
Citations (192)

Summary

Learning to Track with Object Permanence

The paper "Learning to Track with Object Permanence" by Tokmakov et al. introduces a novel approach to multi-object tracking by integrating the concept of object permanence into the model architecture. Traditional online multi-object tracking relies heavily on tracking-by-detection methods, which alternate between localization and association steps and therefore depend on immediate, visible observations of objects in the scene. This reliance typically leads to tracking failures when objects become occluded or only partially visible, as sketched below.
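
To make the failure mode concrete, here is a minimal sketch of a generic tracking-by-detection loop with greedy IoU association. The names and thresholds are illustrative, not from the paper's code; the point is that once a detection is missed under occlusion, the corresponding track simply disappears.

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def track_by_detection(frames, detector, iou_thresh=0.5):
    """Generic tracking-by-detection: localize, then greedily associate."""
    tracks, next_id = {}, 0           # track id -> last known box
    for frame in frames:
        detections = detector(frame)  # localization step
        assigned = {}
        for det in detections:        # association step (greedy IoU)
            best_id, best_iou = None, iou_thresh
            for tid, box in tracks.items():
                if tid not in assigned and iou(det, box) > best_iou:
                    best_id, best_iou = tid, iou(det, box)
            if best_id is None:       # unmatched detection starts a new track
                best_id, next_id = next_id, next_id + 1
            assigned[best_id] = det
        tracks = assigned             # occluded (undetected) objects vanish here
        yield dict(tracks)
```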

Addressing this limitation, the authors propose an end-to-end trainable model that extends the existing CenterTrack framework to video sequences of arbitrary length. CenterTrack was originally designed to handle short-term occlusions by taking pairs of frames as input, but it lacked the capability to maintain object identities through longer periods of complete occlusion. The key innovation in this work is the addition of a spatio-temporal, recurrent memory module, specifically a convolutional gated recurrent unit (ConvGRU), to the tracking architecture. This module enables the model to leverage the entire history of object observations, thereby introducing the notion of object permanence: the understanding, acquired in infancy, that objects continue to exist even when they are not visible.
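
For intuition, a ConvGRU cell can be sketched as below: it is a standard GRU whose dense matrix multiplies are replaced by 2D convolutions, so the recurrent memory keeps a spatial layout. The channel counts, kernel size, and unrolling shown here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU cell: GRU gates computed with 2D convolutions
    so the hidden state is a spatial feature map (C, H, W)."""

    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        pad = k // 2
        # update (z) and reset (r) gates, computed jointly
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=pad)
        # candidate hidden state
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=pad)

    def forward(self, x, h):
        zr = torch.sigmoid(self.gates(torch.cat([x, h], dim=1)))
        z, r = zr.chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde

# Unrolled over a video: each frame's feature map updates the spatial
# memory, which then feeds the detection/tracking heads.
cell = ConvGRUCell(in_ch=64, hid_ch=64)
h = torch.zeros(1, 64, 96, 320)                    # initial memory
for feat in (torch.randn(1, 64, 96, 320) for _ in range(4)):
    h = cell(feat, h)                              # h summarizes all history
```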

The paper introduces a synthetic dataset generated via Parallel Domain's simulation platform to effectively train the model for scenarios involving full occlusions. This dataset provides ground-truth annotations for invisible objects, which are crucial for learning to track beyond immediate observations. The proposed model, trained on both synthetic and real-world data, demonstrated superior robustness to occlusions, outperforming state-of-the-art methods on established benchmarks such as KITTI and MOT17.

Key findings include significant improvements in tracking accuracy, particularly under full occlusion, despite the real datasets providing no annotations for such cases. The model's ability to "hallucinate" the trajectories of objects that become completely occluded is a crucial advance. This is achieved through pseudo-ground-truth generation during training, in which an object's last observed velocity in the 3D scene is used to extrapolate its path while occluded.
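
The extrapolation idea can be illustrated with a short constant-velocity sketch. This is a simplification: the actual pipeline operates on 3D object centers and projects back to the image plane via camera calibration, which is omitted here.

```python
import numpy as np

def pseudo_labels_for_occlusion(last_pos, last_vel, n_frames, dt=1.0):
    """Constant-velocity extrapolation of pseudo-ground-truth 3D centers
    for an object while it is fully occluded, starting from its last
    observed position and velocity. Simplified sketch; the real pipeline
    also projects these centers into the image using camera calibration."""
    last_pos = np.asarray(last_pos, dtype=float)
    last_vel = np.asarray(last_vel, dtype=float)
    steps = np.arange(1, n_frames + 1)[:, None]
    return last_pos + steps * dt * last_vel        # (n_frames, 3) centers

# e.g. a car at (2 m, 0 m, 20 m) moving 10 m/s along z, occluded for 3 frames
print(pseudo_labels_for_occlusion([2.0, 0.0, 20.0], [0.0, 0.0, 10.0], 3, dt=0.1))
```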

Supervising this behavior presented notable challenges, as standard tracking datasets lack large-scale annotations for occluded objects. Synthetic data bridges this gap, enabling robust training while circumventing the costly alternative of manually annotating real-world datasets. The paper employs a joint training strategy on mixed synthetic and real data to mitigate the domain gap, ensuring that behaviors learned in synthetic environments generalize effectively to real-world tracking.
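
One common way to realize such mixed training is to reweight samples so each domain contributes a fixed fraction of every batch. The sketch below uses PyTorch's WeightedRandomSampler with hypothetical stand-in datasets and a 50/50 ratio; these are assumptions for illustration, not the paper's exact recipe.

```python
import torch
from torch.utils.data import (ConcatDataset, DataLoader,
                              TensorDataset, WeightedRandomSampler)

# Hypothetical stand-ins for the real and synthetic tracking datasets.
real = TensorDataset(torch.randn(100, 3, 64, 64))
synthetic = TensorDataset(torch.randn(400, 3, 64, 64))

mixed = ConcatDataset([real, synthetic])
# Weight samples so each domain contributes ~50% per batch,
# regardless of the relative dataset sizes.
weights = [0.5 / len(real)] * len(real) + [0.5 / len(synthetic)] * len(synthetic)
sampler = WeightedRandomSampler(weights, num_samples=len(mixed))
loader = DataLoader(mixed, batch_size=16, sampler=sampler)
```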

The implications of this research are profound for both practical applications and theoretical advancements in AI. Practically, the approach enhances tracking systems' reliability in cluttered and dynamic environments, such as autonomous driving scenarios, where temporary occlusions are frequent. Theoretically, it underscores the potential for integrating cognitive science concepts, such as object permanence, into machine learning frameworks to achieve more human-like perception abilities.

Future research directions could extend the methodology by exploring even richer representations of object permanence and augmenting interpretability measures within tracking models to better understand the decision-making process under occlusions. Integrating object permanence with real-world, multi-modal data—where visual, auditory, and possibly other sensory inputs converge—could significantly broaden the approach's applicability and robustness in real-world settings.

Overall, the paper demonstrates that incorporating cognitive principles like object permanence into computer vision models can substantially elevate their performance, particularly in scenarios fraught with occlusions, thereby advancing the frontier of online multi-object tracking methodologies.
