Learning to Detect and Track Visible and Occluded Body Joints in a Virtual World

Published 22 Mar 2018 in cs.CV | (1803.08319v3)

Abstract: Multi-People Tracking in an open-world setting requires a special effort in precise detection. Moreover, temporal continuity in the detection phase gains more importance when scene cluttering introduces the challenging problems of occluded targets. For the purpose, we propose a deep network architecture that jointly extracts people body parts and associates them across short temporal spans. Our model explicitly deals with occluded body parts, by hallucinating plausible solutions of not visible joints. We propose a new end-to-end architecture composed by four branches (visible heatmaps, occluded heatmaps, part affinity fields and temporal affinity fields) fed by a time linker feature extractor. To overcome the lack of surveillance data with tracking, body part and occlusion annotations we created the vastest Computer Graphics dataset for people tracking in urban scenarios by exploiting a photorealistic videogame. It is up to now the vastest dataset (about 500.000 frames, almost 10 million body poses) of human body parts for people tracking in urban scenarios. Our architecture trained on virtual data exhibits good generalization capabilities also on public real tracking benchmarks, when image resolution and sharpness are high enough, producing reliable tracklets useful for further batch data association or re-id modules.


Authors (6)

Citations (162)


Summary

The paper introduces THOPA-net, a unified deep learning architecture that simultaneously detects and tracks visible and occluded body joints, significantly improving detection accuracy.
It introduces the JTA dataset, with about 500K frames and nearly 10M labeled poses, whose automatically generated occlusion and tracking annotations address the lack of such labels in real surveillance data.
The integration of temporal affinity fields (TAFs) ensures robust short-term tracking in cluttered environments, making the approach viable for real-world urban surveillance applications.

Detection and Tracking of Occluded Body Joints in Virtual Environments

The paper addresses Multi-People Tracking (MPT) in complex, cluttered environments with an end-to-end deep learning architecture, referred to as THOPA-net, which jointly performs detection and short-term tracking of both visible and occluded body joints. The network is trained on synthetic data generated from a photorealistic video game, Grand Theft Auto V.

The core of the research lies in the ability of THOPA-net to perform integrated detection and tracking, which traditionally have been treated as separate processes. The network comprises four distinct branches: visible heatmaps, occluded heatmaps, part affinity fields (PAFs), and temporal affinity fields (TAFs). This structure allows the system to address the intrinsic challenges posed by occlusions, which are frequent in urban environments.
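To make the role of the two affinity branches more concrete, here is a minimal sketch of how a candidate link can be scored against a 2D vector field, in the style of the part affinity field scoring of Cao et al. (2017). It assumes that a TAF can be scored the same way as a PAF, with the segment running from a joint in one frame to the same joint in the next frame; the summary above does not confirm this detail. The names `affinity_score` and `toy_field` are hypothetical, and the field is a plain Python callable rather than a network output.

```python
import math

def affinity_score(field, start, end, num_samples=10):
    """Average alignment between a 2D vector field and the segment start->end.

    `field` is a hypothetical callable (x, y) -> (vx, vy) standing in for a
    predicted affinity map. For a PAF, start and end are two different joints
    of one person; for a TAF (assumption), they are the same joint in two
    consecutive frames.
    """
    if num_samples < 2:
        raise ValueError("num_samples must be at least 2")
    dx, dy = end[0] - start[0], end[1] - start[1]
    length = math.hypot(dx, dy)
    if length == 0:
        return 0.0  # degenerate segment: no direction to compare against
    ux, uy = dx / length, dy / length  # unit vector along the candidate link
    total = 0.0
    for i in range(num_samples):
        t = i / (num_samples - 1)  # sample evenly, endpoints included
        x, y = start[0] + t * dx, start[1] + t * dy
        vx, vy = field(x, y)
        total += vx * ux + vy * uy  # dot product with the link direction
    return total / num_samples

def toy_field(x, y):
    """Toy field that points right everywhere (illustration only)."""
    return (1.0, 0.0)

aligned = affinity_score(toy_field, (0, 0), (5, 0))     # link follows the field
orthogonal = affinity_score(toy_field, (0, 0), (0, 5))  # link crosses the field
reversed_link = affinity_score(toy_field, (5, 0), (0, 0))
```

A link that follows the field scores high, one that crosses it scores near zero, and one that opposes it scores negative, which is what lets a matching step prefer consistent limb or temporal associations.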

One of the significant contributions is the introduction of the Joint Track Auto (JTA) dataset. It contains approximately 500,000 frames and nearly 10 million labeled body poses, which the authors describe as the largest dataset of human body parts for people tracking in urban scenarios. The data covers a diverse range of body poses and conditions and comes with annotations that include both 2D and 3D information. Because the occlusion and tracking annotations are derived automatically from the game engine, the dataset avoids a key limitation of existing real-world datasets, which lack precise labels of this kind.
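As an illustration of the kind of annotation described here, the following sketch defines a per-joint record carrying 2D and 3D coordinates plus an occlusion flag, and splits joints into the two groups that would supervise the visible and occluded heatmap branches. The `JointAnnotation` fields and the `split_by_visibility` helper are hypothetical and are not the dataset's actual file format.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class JointAnnotation:
    """Hypothetical per-joint record; not the real JTA schema."""
    frame: int
    person_id: int                    # identity, constant along a track
    joint_type: int                   # index of the body joint
    xy: Tuple[float, float]           # 2D image coordinates
    xyz: Tuple[float, float, float]   # 3D coordinates
    occluded: bool                    # True if the joint is not visible

def split_by_visibility(
    joints: List[JointAnnotation],
) -> Tuple[List[JointAnnotation], List[JointAnnotation]]:
    """Separate joints into (visible, occluded) supervision targets."""
    visible = [j for j in joints if not j.occluded]
    occluded = [j for j in joints if j.occluded]
    return visible, occluded

# Made-up values, for illustration only.
sample = [
    JointAnnotation(0, 7, 0, (120.0, 80.0), (1.2, 0.4, 9.5), False),
    JointAnnotation(0, 7, 1, (122.0, 95.0), (1.2, 0.2, 9.5), True),
    JointAnnotation(0, 8, 0, (300.0, 60.0), (4.0, 0.5, 15.0), False),
]
visible_joints, occluded_joints = split_by_visibility(sample)
```

Keeping the person identifier on every joint is what makes the same records usable for tracking supervision as well as detection.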

In the evaluation, explicitly modeling occluded parts improves detection accuracy. Mean average precision (mAP) for joint detection increases when both visible and occluded joints are taken into account, with a marked gain over state-of-the-art methods such as those based on the part affinity field approach of Cao et al. (2017).

Another key aspect is the use of TAFs, which bring the temporal dimension into the network itself and thereby improve tracking continuity even in highly dynamic scenes. Experiments on the JTA dataset indicate that the model remains robust in crowded scenes where occlusions are common. When applied to real-world benchmarks such as MOT-16, despite being trained primarily on synthetic data, the model transfers well and maintains competitive Multiple Object Tracking Accuracy (MOTA) scores, provided that image resolution and sharpness are high enough.
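For readers unfamiliar with the metric, MOTA is the standard CLEAR MOT score: one minus the ratio of all errors (missed targets, false positives, and identity switches) to the number of ground-truth objects, summed over all frames. The sketch below computes it from aggregate counts; the function name and the example counts are illustrative and are not results from the paper.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """Multiple Object Tracking Accuracy from counts summed over all frames.

    MOTA = 1 - (FN + FP + IDSW) / GT. It can be negative when the total
    number of errors exceeds the number of ground-truth objects.
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_gt

# Made-up counts, for illustration only.
example_score = mota(false_negatives=20, false_positives=10,
                     id_switches=5, num_gt=100)
```

Because identity switches enter the numerator directly, short-term association quality affects MOTA alongside raw detection quality.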

The implications of this research are notable. Practically, the methodology provides a viable solution for urban surveillance systems where both detection accuracy and short-term tracking reliability are paramount under occlusion-heavy conditions. Theoretically, the clear synergy between detection and tracking in THOPA-net suggests further exploration into integrated network architectures in other domains of computer vision could yield similar performance gains.

Future developments may include enhancing the network's ability to generalize across varying levels of image quality and resolution. Furthermore, there is potential for integrating a re-identification module to reinforce long-term tracking through more extended temporal association of tracklets, addressing current limitations in ID-switches and trajectory fragmentations.

In summary, the paper introduces a framework for MPT in occluded and cluttered environments. By training on synthetic data, the authors improve the model's handling of occlusions and show that a network trained in a virtual world can transfer to real urban people-tracking benchmarks.
