
ArtTrack: Articulated Multi-person Tracking in the Wild (1612.01465v3)

Published 5 Dec 2016 in cs.CV

Abstract: In this paper we propose an approach for articulated tracking of multiple people in unconstrained videos. Our starting point is a model that resembles existing architectures for single-frame pose estimation but is substantially faster. We achieve this in two ways: (1) by simplifying and sparsifying the body-part relationship graph and leveraging recent methods for faster inference, and (2) by offloading a substantial share of computation onto a feed-forward convolutional architecture that is able to detect and associate body joints of the same person even in clutter. We use this model to generate proposals for body joint locations and formulate articulated tracking as spatio-temporal grouping of such proposals. This allows to jointly solve the association problem for all people in the scene by propagating evidence from strong detections through time and enforcing constraints that each proposal can be assigned to one person only. We report results on a public MPII Human Pose benchmark and on a new MPII Video Pose dataset of image sequences with multiple people. We demonstrate that our model achieves state-of-the-art results while using only a fraction of time and is able to leverage temporal information to improve state-of-the-art for crowded scenes.

Citations (260)

Summary

  • The paper presents a novel model that integrates single-frame pose estimation with temporal tracking to improve multi-person pose detection in dynamic video environments.
  • It utilizes a simplified body-part graph and CNN-based detection to reduce computational complexity while maintaining robust accuracy against occlusions and clutter.
  • Empirical results on MPII datasets confirm state-of-the-art performance, offering a scalable and efficient solution for real-time multi-person tracking applications.

An Analysis of ArtTrack: Articulated Multi-person Tracking in the Wild

The paper "ArtTrack: Articulated Multi-person Tracking in the Wild" presents a sophisticated approach to addressing the challenges of tracking multiple human poses in unconstrained video environments. This work contributes significantly to the field by proposing a model that consolidates single-frame pose estimation and spatio-temporal tracking, thereby enhancing efficiency and robustness in multi-person scenarios.

ArtTrack begins with a model architecture built on existing frameworks for single-frame pose estimation, optimizing it for speed through two primary modifications. First, it simplifies and sparsifies the body-part relationship graph, reducing computational overhead. Second, it offloads much of the computation onto a feed-forward convolutional neural network (CNN) that detects and associates body joints, even within complex and cluttered scenes. Together, these modifications substantially reduce the computational cost of tracking with only a modest trade-off in accuracy.
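To make the sparsification idea concrete, the sketch below contrasts a fully connected body-part graph with a sparse, roughly tree-shaped one. The joint names and the particular edge set are illustrative assumptions, not the paper's exact graph; the point is how quickly the number of pairwise terms shrinks.

```python
# Illustrative sketch of graph sparsification for body-part models.
# FULL_JOINTS and SPARSE_EDGES are assumed for illustration; the paper's
# actual graph over 14 body joints differs in its exact edge set.

FULL_JOINTS = ["head", "neck", "l_shoulder", "r_shoulder",
               "l_elbow", "r_elbow", "l_wrist", "r_wrist"]

# Sparse edges: roughly a kinematic tree instead of a fully connected graph.
SPARSE_EDGES = [
    ("head", "neck"),
    ("neck", "l_shoulder"), ("neck", "r_shoulder"),
    ("l_shoulder", "l_elbow"), ("r_shoulder", "r_elbow"),
    ("l_elbow", "l_wrist"), ("r_elbow", "r_wrist"),
]

def edge_counts(n_joints, sparse_edges):
    """Count pairwise terms in a fully connected vs. a sparse graph."""
    fully_connected = n_joints * (n_joints - 1) // 2
    return fully_connected, len(sparse_edges)

full, sparse = edge_counts(len(FULL_JOINTS), SPARSE_EDGES)
print(full, sparse)  # → 28 7
```

Even for this toy 8-joint skeleton, the sparse graph carries a quarter of the pairwise terms; with the full joint set used in pose estimation the savings grow accordingly.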

The joint modeling approach is compelling: it allows the model to interpret and track articulated human poses effectively over time. The paper casts tracking as a graph partitioning problem and leverages recent combinatorial optimization techniques to accelerate inference, a notable departure from conventional pedestrian tracking methods. Unlike existing systems, this model integrates temporal information, improving predictions in scenes with heavy person overlap and challenging articulation.
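The grouping idea can be sketched as follows. The paper solves the association problem with combinatorial optimization over a spatio-temporal graph; the greedy stand-in below merely illustrates the core constraint, that each person cluster may contain at most one proposal per body joint. The cost function and the sample proposals are made-up assumptions for illustration.

```python
# Hedged sketch: partition joint proposals into person clusters.
# Greedy agglomerative merging is a simplification of the paper's
# combinatorial optimization; costs here are plain Euclidean distances.
import math

def cost(p, q):
    """Pairwise association cost between two proposals (illustrative)."""
    return math.hypot(p["x"] - q["x"], p["y"] - q["y"])

def greedy_group(proposals, max_cost=50.0):
    """Merge the cheapest compatible cluster pair until none remain."""
    clusters = [[p] for p in proposals]
    merged = True
    while merged:
        merged = False
        best = (max_cost, None, None)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Constraint: one proposal per joint per person.
                joints_i = {p["joint"] for p in clusters[i]}
                joints_j = {p["joint"] for p in clusters[j]}
                if joints_i & joints_j:
                    continue
                c = min(cost(p, q)
                        for p in clusters[i] for q in clusters[j])
                if c < best[0]:
                    best = (c, i, j)
        if best[1] is not None:
            _, i, j = best
            clusters[i].extend(clusters.pop(j))
            merged = True
    return clusters

# Two people far apart: proposals group into two clusters.
people = [
    {"joint": "head", "x": 0, "y": 0}, {"joint": "neck", "x": 0, "y": 10},
    {"joint": "head", "x": 200, "y": 0}, {"joint": "neck", "x": 200, "y": 10},
]
print(len(greedy_group(people)))  # → 2
```

The one-proposal-per-joint exclusion mirrors the paper's constraint that each proposal can be assigned to only one person, which is what makes the association a partitioning rather than a simple matching problem.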

Empirical evaluation on the MPII Human Pose benchmark and the new MPII Video Pose dataset demonstrates the enhanced capability of the proposed model. These evaluations underscore its state-of-the-art performance, rivalling or surpassing contemporary methods while operating at a significantly reduced time cost, with substantial inference speed-ups over earlier methods such as DeeperCut.

The introduction of temporal edges and the application of a sparse graph configuration stand out as pivotal innovations. Temporal reasoning is further strengthened by spatial propagation, a technique that leverages intermediate supervision layers to improve accuracy on joints far from the body center, particularly the extremities.
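A temporal edge can be sketched as a cost on pairs of same-joint proposals in adjacent frames. The Gaussian-style falloff and the `sigma` parameter below are illustrative assumptions, not the paper's learned potentials; they simply show how small displacements translate into low association costs.

```python
# Illustrative temporal edge cost (not the paper's learned potentials):
# score whether two same-joint proposals in consecutive frames belong
# to the same person, based on spatial displacement.
import math

def temporal_edge_cost(p_t, p_t1, sigma=20.0):
    """Lower cost means more likely the same person across frames."""
    if p_t["joint"] != p_t1["joint"]:
        raise ValueError("temporal edges connect same-joint proposals")
    d = math.hypot(p_t["x"] - p_t1["x"], p_t["y"] - p_t1["y"])
    # Gaussian-style falloff: cost 0 at zero displacement, near 1 far away.
    return 1.0 - math.exp(-(d * d) / (2 * sigma * sigma))

near = temporal_edge_cost({"joint": "head", "x": 100, "y": 100},
                          {"joint": "head", "x": 105, "y": 100})
far = temporal_edge_cost({"joint": "head", "x": 100, "y": 100},
                         {"joint": "head", "x": 300, "y": 100})
print(near < 0.1, far > 0.9)  # → True True
```

In the full model, such temporal terms are what let strong detections in one frame propagate evidence to weaker detections in neighboring frames.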

This paper does not merely address crowded scenes; it also proposes a scalable model that accommodates an unknown number of subjects, opening opportunities for real-time applications. The approach improves processing time through both bottom-up and combined top-down/bottom-up reasoning, which could benefit video-based applications in crowded environments such as sports analytics, surveillance, and interactive systems.

Looking ahead, these findings might inspire future research to explore sparsity concepts and the use of CNNs for feature extraction in multi-object tracking tasks. Exploring other forms of temporal connectivity could help handle long-term occlusions and improve tracking persistence. Additionally, incorporating more sophisticated deep learning techniques might further exploit the temporal and spatial dimensions, pushing the boundaries of articulated tracking capabilities.

In conclusion, the paper makes a substantial contribution to articulated pose tracking, presenting an efficient, innovative model that assigns body parts to distinct individuals while offering a roadmap for potential developments in AI and computer vision. Future work will likely extend beyond temporal modeling, prioritizing better generalization of human pose tracking to even more dynamic and complex video streams.
