PoseTrack: Joint Multi-Person Pose Estimation and Tracking (1611.07727v3)

Published 23 Nov 2016 in cs.CV

Abstract: In this work, we introduce the challenging problem of joint multi-person pose estimation and tracking of an unknown number of persons in unconstrained videos. Existing methods for multi-person pose estimation in images cannot be applied directly to this problem, since it also requires solving the problem of person association over time in addition to pose estimation for each person. We therefore propose a novel method that jointly models multi-person pose estimation and tracking in a single formulation. To this end, we represent body joint detections in a video by a spatio-temporal graph and solve an integer linear program to partition the graph into sub-graphs that correspond to plausible body pose trajectories for each person. The proposed approach implicitly handles occlusion and truncation of persons. Since the problem has not been addressed quantitatively in the literature, we introduce a challenging "Multi-Person PoseTrack" dataset, and also propose a completely unconstrained evaluation protocol that does not make any assumptions about the scale, size, location or the number of persons. Finally, we evaluate the proposed approach and several baseline methods on our new dataset.

Citations (206)

Summary

  • The paper introduces a joint framework for multi-person pose estimation and tracking using a spatio-temporal graph optimized via integer linear programming.
  • The method employs robust spatial and temporal constraints to mitigate occlusion and truncation challenges in dynamic video sequences.
  • The newly introduced Multi-Person PoseTrack dataset provides a comprehensive benchmark, on which the proposed approach outperforms baselines that treat pose estimation and tracking in isolation.

Evaluating Simultaneous Pose Estimation and Tracking with PoseTrack

The paper "PoseTrack: Joint Multi-Person Pose Estimation and Tracking" tackles the simultaneous estimation and tracking of multiple human poses in unconstrained video, a problem that had not been studied quantitatively before this work. The authors, Umar Iqbal, Anton Milan, and Juergen Gall, unify multi-person pose estimation and tracking in a single spatio-temporal graph formulation optimized with integer linear programming (ILP). The paper not only proposes a method that is robust to occlusion and truncation but also introduces the Multi-Person PoseTrack dataset, providing a new benchmark for this joint task.

Methodological Advances

The approach constructs a graph in which body joint detections form the nodes, connected by spatial edges within each frame and temporal edges across frames. The graph therefore captures both the spatial relationships inside a frame and the temporal continuity between frames. Solving an ILP partitions the graph into sub-graphs, each corresponding to a plausible body-pose trajectory of one person, which yields a coherent solution and avoids ambiguities such as assigning multiple detections to the same body part.
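To make the graph structure concrete, the sketch below builds the two edge types from a flat list of joint detections. It is an illustrative reconstruction, not the authors' code: the detection fields, the one-frame temporal window, and the function name are assumptions made for exposition.

```python
# Illustrative sketch, not the authors' code: building the spatio-temporal
# graph over body-joint detections. The detection fields, the one-frame
# temporal window, and all names here are assumptions for exposition.
from itertools import combinations

def build_graph(detections, max_temporal_gap=1):
    """detections: list of dicts with keys 'frame' (int),
    'joint' (e.g. 'head', 'wrist'), 'pos' ((x, y)), 'score' (float)."""
    nodes = list(range(len(detections)))
    spatial_edges, temporal_edges = [], []

    for u, v in combinations(nodes, 2):
        du, dv = detections[u], detections[v]
        if du['frame'] == dv['frame']:
            # Spatial edge: two detections in the same frame that may belong
            # to the same person (or be duplicate detections of one joint).
            spatial_edges.append((u, v))
        elif (du['joint'] == dv['joint']
              and abs(du['frame'] - dv['frame']) <= max_temporal_gap):
            # Temporal edge: the same joint type in nearby frames, a candidate
            # link within one person's trajectory.
            temporal_edges.append((u, v))

    return nodes, spatial_edges, temporal_edges

# The ILP then labels each node (keep or discard) and each edge (same person
# or not), which partitions the kept detections into per-person sub-graphs.
```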

The method's strength lies in its structural constraints, which support both pose estimation and tracking accuracy: spatial transitivity constraints enforce consistency within each frame, while temporal and spatio-temporal constraints enforce coherence over time. This unified model sets the approach apart from traditional methods that treat pose estimation and tracking in isolation or rely heavily on pre-processed inputs such as bounding-box detections.
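As a schematic example of such a constraint (a generic form used in graph-partitioning formulations of this kind, not necessarily the exact inequality from the paper), transitivity on the binary same-person edge labels y can be written as

y_{uv} + y_{vw} - y_{uw} \le 1 \quad \text{for all triples of detections } u, v, w,

so that whenever u and v, and v and w, are assigned to the same person, u and w must be assigned to that person as well.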

Dataset and Evaluation Protocol

Acknowledging the lack of a suitable dataset for such an integrated task, the authors contribute the Multi-Person PoseTrack dataset, notable for its diversity and complexity. It spans various scenes with multiple, interacting individuals, exhibiting wide-ranging pose variations and dynamic occlusions—conditions reflective of real-world scenarios.

The dataset supports a comprehensive evaluation protocol designed to separately assess joint localization and identity association. The metrics also address a shortcoming of existing evaluation standards by explicitly accounting for occluded joints, rewarding methods that handle occlusion rather than ignoring it.

Empirical Evaluation

In empirical evaluations on the new dataset, the proposed method outperforms the baseline approaches, in particular models that rely on decoupled, detector-based pipelines. It achieves higher Multiple Object Tracking Accuracy (MOTA) and mean Average Precision (mAP), demonstrating gains in both tracking precision and pose estimation accuracy.
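For reference when reading the reported numbers, MOTA follows the standard CLEAR MOT definition (adapted here to body-joint trajectories), penalizing false negatives, false positives, and identity switches relative to the number of ground-truth annotations:

\text{MOTA} = 1 - \frac{\sum_t \left(\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t\right)}{\sum_t \mathrm{GT}_t}

A score of 1 indicates perfect tracking; missed joints, spurious detections, and identity switches all push the score down.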

Theoretical and Practical Implications

This paper lays theoretical groundwork for a more holistic understanding of pose estimation and tracking, suggesting that joint modeling of these tasks can lead to improved results compared to handling them separately. Practically, the method can potentially enhance applications in video surveillance, sports analytics, and human-computer interaction where real-time insights into human motion are crucial.

Future Directions

Looking ahead, the research opens several avenues for exploration. Integrating more sophisticated deep learning models could further refine joint detection and edge probabilities. In addition, incorporating context or scene understanding could improve robustness in difficult cases such as severe occlusion or body parts that re-appear after being occluded.

Conclusion

"PoseTrack: Joint Multi-Person Pose Estimation and Tracking" embodies a significant stride towards resolving the complexities of simultaneous pose estimation and tracking. Through its novel approach and the introduction of the Multi-Person PoseTrack dataset, it sets a substantial precedent for future research in dynamic scene analysis, providing both a methodological framework and a benchmark for evaluating subsequent innovations in the field.