- The paper introduces cross-linked Temporal Affinity Fields (TAFs) that improve the temporal coherence of pose tracking across video frames, even under small body motion.
- It employs a recurrent framework that reuses heatmaps from previous frames, reducing computational redundancy and enabling real-time processing.
- The model achieves competitive mAP and MOTA scores on the PoseTrack 2017 validation set while remaining scalable thanks to its bottom-up design.
Efficient Online Multi-Person 2D Pose Tracking with Recurrent Spatio-Temporal Affinity Fields
This paper presents a method for multi-person 2D pose estimation and tracking in video through Recurrent Spatio-Temporal Affinity Fields (STAF). The approach builds on Part Affinity Fields (PAFs), originally designed for static images, and extends them to handle the additional complexities of time-sequenced data.
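Although the paper defines these fields within its own network architecture, the underlying affinity-field idea can be illustrated with a short sketch: a candidate connection between two keypoints is scored by integrating a 2-channel vector field along the segment joining them, as in the original Part Affinity Fields. The function name, field layout, and sample count below are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def limb_affinity_score(field, p_a, p_b, n_samples=10):
    """Score a candidate connection between keypoint candidates p_a and p_b.

    `field` is assumed to be an (H, W, 2) affinity field holding a unit vector
    per pixel. The score is the average dot product between the field and the
    unit vector pointing from p_a to p_b, sampled along the segment.
    """
    p_a, p_b = np.asarray(p_a, dtype=float), np.asarray(p_b, dtype=float)
    segment = p_b - p_a
    length = np.linalg.norm(segment)
    if length < 1e-6:
        return 0.0
    unit = segment / length
    score = 0.0
    for t in np.linspace(0.0, 1.0, n_samples):
        x, y = p_a + t * segment                     # sample point on the segment
        fx, fy = field[int(round(y)), int(round(x))]  # nearest-pixel field vector
        score += fx * unit[0] + fy * unit[1]
    return score / n_samples
```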
Key Contributions
- Cross-Linked Temporal Affinity Fields (TAFs): The authors introduce a cross-linked temporal topology that improves temporal coherence in pose tracking and copes with a wide range of body-motion magnitudes across video frames. In particular, the model maintains accurate limb associations even when motion is minimal, a regime in which simpler same-keypoint temporal connections tend to degenerate (an illustrative construction of such a topology appears after this list).
- Recurrent Framework: Exploiting the sequential nature of video, the network reuses STAF heatmaps computed on previous frames to refine the current frame's estimates. This recurrence avoids redundant computation and keeps inference fast across scenes of varying complexity (a minimal version of such a loop is sketched after this list).
- Real-time Performance: The model runs at approximately 30 frames per second on a single GPU while retaining competitive accuracy. Because the approach is bottom-up, its runtime stays essentially constant as the number of people in the scene changes.
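To give a concrete, though simplified, picture of what a cross-linked temporal topology might look like, the sketch below pairs each spatial limb's endpoints diagonally across consecutive frames, so the temporal connection keeps a meaningful direction even when a person barely moves. The pairing rule and function name are assumptions for illustration, not the paper's exact definition.

```python
def cross_linked_temporal_edges(limbs):
    """Build an illustrative cross-linked temporal topology.

    `limbs` is a list of (parent, child) keypoint-index pairs defining the
    spatial skeleton. Each spatial limb contributes two temporal edges that
    cross frames diagonally: previous-frame parent -> current-frame child,
    and previous-frame child -> current-frame parent.
    """
    edges = []
    for parent, child in limbs:
        edges.append((("prev", parent), ("cur", child)))
        edges.append((("prev", child), ("cur", parent)))
    return edges

# Example: a tiny skeleton with neck(0)-nose(1) and neck(0)-shoulder(2) limbs.
print(cross_linked_temporal_edges([(0, 1), (0, 2)]))
```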
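The recurrent pipeline can likewise be sketched as an online loop that feeds the previous frame's maps back into the network and propagates track identities using temporal-affinity scores. The names `staf_network`, `taf_score`, and `new_id` below are assumed placeholders, not the authors' API.

```python
def track_video(frames, staf_network, taf_score, new_id):
    """Minimal online tracking loop (illustrative only).

    `staf_network(frame, prev_maps)` is assumed to return (poses, maps), where
    `poses` are per-person keypoint sets for the current frame and `maps` are
    the heatmaps/affinity fields fed back on the next call. `taf_score` is an
    assumed scorer built on the temporal affinity fields; `new_id` mints fresh
    track IDs.
    """
    prev_maps, prev_tracks = None, []          # recurrent state across frames
    for frame in frames:
        poses, maps = staf_network(frame, prev_maps)   # reuse previous maps

        # Greedily match current poses to existing tracks by temporal affinity.
        tracks, used = [], set()
        for pose in poses:
            best_id, best_score = None, 0.0
            for tid, prev_pose in prev_tracks:
                if tid in used:
                    continue
                score = taf_score(prev_pose, pose, maps)
                if score > best_score:
                    best_id, best_score = tid, score
            tid = best_id if best_id is not None else new_id()
            used.add(tid)
            tracks.append((tid, pose))

        yield frame, tracks
        prev_maps, prev_tracks = maps, tracks
```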
Numerical Results and Claims
The authors report that the model achieves a mean Average Precision (mAP) of 64.6% and a Multiple Object Tracking Accuracy (MOTA) of 58.4% with single-scale input at 30 FPS on the PoseTrack 2017 validation set; in a multi-scale setting at 7 FPS, these figures improve to 71.5% mAP and 61.3% MOTA. At the time of evaluation, the approach ranked second for accuracy and third for tracking on the PoseTrack 2017 challenge.
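For context, MOTA here is the standard CLEAR MOT metric rather than something specific to this paper: it penalizes misses, false positives, and identity switches relative to the total number of ground-truth objects, as in this small sketch.

```python
def mota(misses, false_positives, id_switches, num_ground_truth):
    """Multiple Object Tracking Accuracy (standard CLEAR MOT definition).

    All counts are totals accumulated over the whole sequence; the value can
    be negative when the total error count exceeds the ground-truth count.
    """
    return 1.0 - (misses + false_positives + id_switches) / num_ground_truth

# Example: 300 misses, 150 false positives, 20 ID switches over 1000 GT poses.
print(mota(300, 150, 20, 1000))  # -> 0.53
```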
Implications for AI Development
The practical utility of this work lies in its potential application across industries that require real-time video processing, such as autonomous driving and augmented reality. The model's flexibility and speed make it well suited to environments where low latency is critical. Academically, the work paves the way for further exploration of recurrent frameworks for video analysis, encouraging models that leverage temporal information more directly.
Theoretic and Practical Future Directions
Theoretically, the cross-linked limb topology suggests avenues for extending spatial relations across the temporal dimension. Further work could investigate adapting such models dynamically to varying frame rates and resolutions, improving scalability and robustness. Practically, making the model more robust to shot changes and integrating it with shot-detection algorithms could improve tracking persistence and reduce errors caused by abrupt scene changes.
Overall, this paper significantly contributes to advancing real-time multi-person pose tracking methodologies, promoting further innovation in both academic research and practical applications.