WorldPose: A World Cup Dataset for Global 3D Human Pose Estimation (2501.02771v2)

Published 6 Jan 2025 in cs.CV

Abstract: We present WorldPose, a novel dataset for advancing research in multi-person global pose estimation in the wild, featuring footage from the 2022 FIFA World Cup. While previous datasets have primarily focused on local poses, often limited to a single person or in constrained, indoor settings, the infrastructure deployed for this sporting event allows access to multiple fixed and moving cameras in different stadiums. We exploit the static multi-view setup of HD cameras to recover the 3D player poses and motions with unprecedented accuracy given capture areas of more than 1.75 acres. We then leverage the captured players' motions and field markings to calibrate a moving broadcasting camera. The resulting dataset comprises more than 80 sequences with approx 2.5 million 3D poses and a total traveling distance of over 120 km. Subsequently, we conduct an in-depth analysis of the SOTA methods for global pose estimation. Our experiments demonstrate that WorldPose challenges existing multi-person techniques, supporting the potential for new research in this area and others, such as sports analysis. All pose annotations (in SMPL format), broadcasting camera parameters and footage will be released for academic research purposes.

Summary

The paper introduces WorldPose, a large-scale dataset featuring over 2.5M 3D human poses captured during the 2022 FIFA World Cup.
It combines static multi-view camera calibration, broadcast camera calibration, and SMPL-based pose annotation, with a reported average error of about 8 cm per joint.
The dataset challenges existing multi-person pose estimation methods in expansive, real-world outdoor environments and supports applications such as sports analysis.

An Overview of "WorldPose: A World Cup Dataset for Global 3D Human Pose Estimation"

The paper introduces "WorldPose," a dataset for multi-person global 3D human pose estimation. It is built from footage of the 2022 FIFA World Cup, whose camera infrastructure made it possible to capture realistic, in-the-wild data at a scale that earlier datasets, often limited to a single person or to constrained indoor settings, do not offer.

Dataset Composition and Significance

WorldPose is distinguished by its scale and detail. The footage comes from fixed and moving cameras installed at multiple stadiums. The dataset comprises more than 80 sequences with approximately 2.5 million annotated 3D poses, and the players cover a total distance of over 120 km. This is a significant step up from existing datasets, particularly for multi-view, multi-person data captured in expansive, unconstrained outdoor environments. Poses are annotated in SMPL format, so the dataset provides both body shape and pose, and it proves challenging for existing pose estimation methods.

Methodological Insights

The dataset is built from the static multi-view cameras, whose calibration is reliable once combined with careful manual refinement. Key components of the methodology include:

Static Camera Calibration: The soccer pitch is first treated as a planar surface; the calibration is then refined with a non-linear optimization that accounts for the field crown (the pitch is not perfectly flat) and lens distortion.
3D Human Pose and Shape Estimation: After calibration, player bounding boxes are detected and 2D keypoints are estimated with refined state-of-the-art models. The keypoints are then triangulated into 3D joint positions, which are integrated into the SMPL model framework. The result is a temporally continuous record of player motion across the whole pitch.
Broadcasting Camera Calibration: The paper addresses challenges associated with moving cameras, incorporating a semi-automated calibration approach augmented with constraints from player poses and field markings for smoother tracking.
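
The triangulation step in the list above can be illustrated with a minimal sketch of linear multi-view triangulation. This is not the authors' pipeline: the function names (project, solve3, triangulate), the two 3x4 camera matrices, and the test point are all hypothetical, lens distortion is ignored, and a real system would add outlier handling and non-linear refinement.

```python
def project(P, X):
    """Project a 3D point X = (x, y, z) to pixels with a 3x4 camera matrix P."""
    xh = (X[0], X[1], X[2], 1.0)
    u, v, w = (sum(p * x for p, x in zip(row, xh)) for row in P)
    return (u / w, v / w)


def solve3(M, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    A = [list(M[i]) + [b[i]] for i in range(3)]
    for c in range(3):
        piv = max(range(c, 3), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(c + 1, 3):
            f = A[r][c] / A[c][c]
            for k in range(c, 4):
                A[r][k] -= f * A[c][k]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        s = sum(A[i][j] * x[j] for j in range(i + 1, 3))
        x[i] = (A[i][3] - s) / A[i][i]
    return x


def triangulate(cameras, pixels):
    """Least-squares 3D point from two or more views.

    Each view (P, (u, v)) contributes two linear constraints on (x, y, z, 1):
    (u * P[2] - P[0]) . X = 0 and (v * P[2] - P[1]) . X = 0.
    """
    rows = []
    for P, (u, v) in zip(cameras, pixels):
        rows.append([u * P[2][k] - P[0][k] for k in range(4)])
        rows.append([v * P[2][k] - P[1][k] for k in range(4)])
    # Normal equations of the over-determined system in (x, y, z).
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    Atb = [-sum(r[i] * r[3] for r in rows) for i in range(3)]
    return solve3(AtA, Atb)


# Hypothetical setup: two identical pinhole cameras, 1 m apart along x.
P1 = [[1000.0, 0.0, 640.0, 0.0], [0.0, 1000.0, 360.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
P2 = [[1000.0, 0.0, 640.0, -1000.0], [0.0, 1000.0, 360.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
joint = (0.5, 0.2, 10.0)  # assumed ground-truth joint position in metres
pixels = [project(P1, joint), project(P2, joint)]
estimate = triangulate([P1, P2], pixels)
print(estimate)
```

With noise-free observations the estimate matches the original point up to floating-point error. With noisy 2D keypoints, adding more views tightens the least-squares solution, which is one reason a static multi-view setup is attractive for annotation.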

Implications and Future Directions

The paper evaluates the annotation accuracy against Vicon motion-capture data and reports an average error of 8 cm per joint. Evaluations of state-of-the-art methods such as GLAMR and SLAHMR on WorldPose expose their weaknesses, in particular the difficulty of estimating correct relative positions across multiple players.
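
A per-joint error figure like the one above is commonly computed as the mean per-joint position error (MPJPE): the Euclidean distance between each estimated joint and its reference position, averaged over joints. The sketch below assumes that metric; whether the paper's evaluation protocol matches it exactly (for example, in alignment or joint set) is an assumption, and the function name and sample coordinates are made up for illustration.

```python
import math


def mpjpe(predicted, reference):
    """Mean Euclidean distance between matching 3D joints (same units as input)."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("joint lists must be non-empty and of equal length")
    return sum(math.dist(p, r) for p, r in zip(predicted, reference)) / len(predicted)


# Illustrative values in metres, not data from the paper.
reference = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
predicted = [(0.08, 0.0, 0.0), (0.0, 1.06, 0.08)]
error_m = mpjpe(predicted, reference)
print(f"MPJPE: {error_m * 100:.1f} cm")
```

Here the two joints are off by 8 cm and 10 cm respectively, so the mean error is 9 cm.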

Beyond standard pose estimation, WorldPose opens new avenues in sports analytics, such as analyzing team dynamics and strategy and assessing individual performance. Because it consists of real-world footage, it is also suited to training and evaluating deep learning models under realistic and challenging conditions.

Moreover, the insights gleaned from the dataset's creation methodology and evaluations suggest avenues for improvement in SLAM algorithms and pose estimation networks, particularly regarding robustness in unconstrained environments with significant inter-player interactions.

In conclusion, WorldPose sets a new benchmark for multi-person 3D pose estimation in computer vision. Looking ahead, expanding the dataset to more diverse activities and events could broaden its applicability and support the development of pose estimation models for real-world use.