Multi-Camera Self-Calibration in Sports Motion Capture: Leveraging Human and Stick Poses

Published 19 Apr 2026 in cs.CV and eess.IV | (2604.17567v1)

Abstract: Multi-camera systems are widely employed in sports to capture the 3D motion of athletes and equipment, yet calibrating their extrinsic parameters remains costly and labor-intensive. We introduce an efficient, tool-free method for multi-camera extrinsic calibration tailored to sports involving stick-like implements (e.g., golf clubs, bats, hockey sticks). Our approach jointly exploits two complementary cues from synchronized multi-camera videos: (i) human body keypoints with unknown metric scale and (ii) a rigid stick-like implement of known length. We formulate a three-stage optimization pipeline that refines camera extrinsics, reconstructs human and stick trajectories, and resolves global scale via the stick-length constraint. Our method achieves accurate extrinsic calibration without dedicated calibration tools. To benchmark this task, we present the first dataset for multi-camera self-calibration in stick-based sports, consisting of synthetic sequences across four sports categories with 3 to 10 cameras. Comprehensive experiments demonstrate that our method delivers SOTA performance, achieving low rotation and translation errors. Our project page: https://fandulu.github.io/sport_stick_multi_cam_calib/.


Authors (4)

Summary

The paper introduces a novel self-calibration pipeline that fuses human motion and fixed stick-length constraints to resolve geometric and metric ambiguities.
The methodology combines unscaled bundle adjustment with stick-based scale recovery and a scale-aware refinement stage, ensuring precise metric calibration.
Results demonstrate state-of-the-art rotation and translation accuracy with consistent performance across varying camera counts and detection noise levels.

Problem Formulation and Motivation

Extrinsic calibration of multi-camera systems is a critical yet operationally expensive prerequisite for accurate 3D motion capture in sports analytics. Traditional solutions require dedicated calibration objects (e.g., checkerboards, wands), which are unwieldy for dynamic and frequently changing setups, particularly on fields or in arenas where rapid (re)deployment is necessary. Recent tool-free approaches either rely on natural human motion—leading to scale ambiguity—or demand auxiliary sensors (LiDAR, IMUs), raising deployment cost and complexity. Learning-based methods have similarly struggled with poor inductive biases regarding metric scale.

This work addresses these limitations by introducing a tool-free, RGB-only self-calibration pipeline that leverages two sources of geometric evidence available in many sports: articulated human motion and the rigid-body constraint of a stick-like implement (e.g., bat, club, or stick) of known length. By combining staged bundle adjustment (BA) with a newly curated multi-sport synthetic dataset for evaluation, the paper reports accurate, metric extrinsic calibration suitable for downstream biomechanical and sports analytics, without specialized hardware or training.

Figure 1: Overview—tool-free multi-camera extrinsic calibration jointly leveraging stick and human pose information.

Methodological Contributions

The approach is characterized by a three-stage optimization pipeline which fuses human and stick cues to resolve both geometric and metric ambiguities:

Unscaled Bundle Adjustment (BA): A standard non-linear BA jointly optimizes camera extrinsics and 3D trajectories of the athlete and stick endpoints, based on multi-view 2D detections. This stage ensures geometric consistency but does not address unknown metric scale.
Scale Recovery via Stick Length: Given the regulation-compliant length of the sports implement, the pipeline computes the average reconstructed stick length after BA and rescales the translation and structure estimates to recover real-world metric scale.
Scale-Aware BA Refinement: The initial solution is then refined with a combined loss integrating (i) robust, visibility-masked reprojection error, (ii) a hard constraint on the stick’s length, and (iii) temporal smoothness priors on 3D joint/endpoint trajectories, which mitigate the impact of 2D pose detection noise and encourage biomechanical plausibility.
Figure 2: Three-stage pipeline—unscaled BA, stick-based scale resolution, scale-constrained BA.
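
The scale-recovery stage described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names `recover_metric_scale` and `apply_scale`, the array layouts, and the world-to-camera convention `x_cam = R @ X + t` are assumptions made for the example. Only the idea of dividing the known length by the average reconstructed length comes from the summary.

```python
import numpy as np

def recover_metric_scale(stick_a, stick_b, known_length):
    """Return the factor that maps unscaled BA units to metres.

    stick_a, stick_b: (T, 3) reconstructed stick endpoints per frame
    (hypothetical layout). known_length: real stick length in metres.
    """
    lengths = np.linalg.norm(stick_a - stick_b, axis=1)
    # Skip frames where the stick was not reconstructed (NaN) or is degenerate.
    valid = np.isfinite(lengths) & (lengths > 0)
    return known_length / lengths[valid].mean()

def apply_scale(scale, cam_translations, points_3d):
    """Rescale translations and structure; rotations are left untouched.

    Assuming x_cam = R @ X + t, scaling both X and t by the same factor
    scales x_cam uniformly, so the 2D reprojections do not change.
    """
    return scale * cam_translations, scale * points_3d
```

Because reprojection error is invariant to this rescaling, the step fixes the metric scale without disturbing the geometric fit obtained in the first stage.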

The implementation is efficient, with a sparse problem representation facilitating practical application to scenarios with up to 10 cameras in challenging sports configurations.
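
One plausible way to write the scale-aware refinement objective is given below. The summary does not state the paper's exact formulation, so the notation is an assumption: $c$ indexes cameras, $f$ frames, and $j$ tracked points (body joints and the two stick endpoints $A_f$, $B_f$); $v_{c,f,j} \in \{0,1\}$ is the visibility mask, $\rho$ a robust kernel such as Huber, $\pi$ the perspective projection, $K_c$ the known intrinsics, $L$ the known stick length, and $\lambda_{\mathrm{len}}$, $\lambda_{\mathrm{smooth}}$ hypothetical weights. The second-difference smoothness term is likewise only one common choice.

$$
\begin{aligned}
E &= \sum_{c,f,j} v_{c,f,j}\,\rho\!\left(\left\lVert \pi\!\left(K_c (R_c X_{f,j} + \mathbf{t}_c)\right) - x_{c,f,j} \right\rVert\right) \\
&\quad + \lambda_{\mathrm{len}} \sum_{f} \left(\lVert A_f - B_f \rVert - L\right)^2 \\
&\quad + \lambda_{\mathrm{smooth}} \sum_{f,j} \left\lVert X_{f+1,j} - 2X_{f,j} + X_{f-1,j} \right\rVert^2
\end{aligned}
$$

In this penalty form, a hard length constraint corresponds to enforcing $\lVert A_f - B_f \rVert = L$ exactly, or to the limit of a very large $\lambda_{\mathrm{len}}$. Each residual touches only one camera and a few neighbouring 3D points, which is what makes the Jacobian sparse.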

Dataset and Benchmark Protocol

To permit rigorous and reproducible evaluation, the authors introduce Sports-Stick-Syn, the first synthetic dataset for multi-camera self-calibration tailored to stick-based sports. The dataset comprises four sports (golf, baseball, hockey, kendo), each with standardized implement lengths, realistic biomechanical motion primitives, and a wide variety of camera counts (3–10), topologies, and simulated 2D detection noise profiles. This annotated dataset enables ground-truth evaluation of extrinsic calibration both in rotation and translation, using protocols that are standard in the community (e.g., Kabsch-Umeyama alignment, reporting mean/variance with respect to angular and Euclidean camera-center errors).
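
The evaluation protocol mentioned above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the benchmark's official script: the function names are hypothetical, rotations are taken to be world-to-camera matrices, camera centers are aligned with a similarity transform by default (the summary does not say whether scale is included in the alignment), and rotation error is measured as the geodesic angle.

```python
import numpy as np

def umeyama_alignment(src, dst, with_scale=True):
    """Least-squares s, R, t such that dst ~= s * R @ src + t (Umeyama, 1991).

    src, dst: (N, 3) arrays of corresponding points.
    """
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / len(src)           # 3x3 cross-covariance
    U, S, Vt = np.linalg.svd(cov)
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0                         # avoid returning a reflection
    R = U @ D @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(S) @ D) / var_src if with_scale else 1.0
    t = mu_dst - s * R @ mu_src
    return s, R, t

def rotation_error_deg(R_est, R_gt):
    """Geodesic angle between two rotation matrices, in degrees."""
    cos = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def evaluate_extrinsics(R_est, C_est, R_gt, C_gt, with_scale=True):
    """Per-camera rotation (deg) and camera-center (same unit as C_gt) errors.

    R_*: (N, 3, 3) world-to-camera rotations; C_*: (N, 3) camera centers.
    """
    s, R_a, t_a = umeyama_alignment(C_est, C_gt, with_scale)
    C_aligned = s * C_est @ R_a.T + t_a
    trans_err = np.linalg.norm(C_aligned - C_gt, axis=1)
    # Changing the world frame by R_a turns R_est into R_est @ R_a.T.
    rot_err = np.array([rotation_error_deg(Re @ R_a.T, Rg)
                        for Re, Rg in zip(R_est, R_gt)])
    return rot_err, trans_err
```

Setting `with_scale=False` restricts the alignment to a rigid transform, which is the stricter choice when the method under test is expected to recover metric scale on its own.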

Figure 3: Dataset—sports simulated, camera statistics, and visualization; diverse, noise-aware, and publicly available.

Quantitative and Qualitative Results

Comprehensive experiments demonstrate robust generalization and state-of-the-art accuracy compared to both traditional tool-based baselines (e.g., checkerboard/ArUco techniques) and learning-based or human-only markerless calibration pipelines.

Numerical Highlights:

The proposed method achieves an average rotation error of $0.020^\circ$ and translation error of 0.001 m, outperforming all baselines including ArucoCalib ($0.087^\circ$ / 0.007 m), CalibPerson ($0.268^\circ$ / 0.072 m), and Kineo ($0.075^\circ$ / 0.098 m).
The approach is robust to changes in camera count, spatial layout, and increasing levels of 2D noise, as shown by consistently low median error and low variance across all experiments.
Ablation studies confirm:
- Including both stick and human cues offers superior rotation and translation performance to using either cue in isolation.
- The metric stick constraint is essential; its removal increases translation error by over an order of magnitude.
- Temporal smoothness regularization further stabilizes solution quality, especially under degraded input conditions.

Figure 4: Runtime. Scales to increasing camera numbers with low variance; median and worst-case resource consumption remain operationally efficient.

Figure 5: Per-camera error. Human+stick yields tighter distributions and smaller variance in rotation and translation than human-only.

Figure 6: Per-noise-level error. Strong resilience to increasing noise levels, outperforming human-only even under challenging conditions.

Practical Implications and Downstream Applications

An end-to-end, markerless, RGB-only calibration workflow tailored to sports is directly useful for in situ biomechanical analysis, sports gesture mining, and athlete tracking, particularly outside controlled indoor labs or where ground markers are absent (e.g., fairways, courts, or arenas). The approach consumes only 2D keypoint detections, so it is compatible with common pose estimation frameworks (e.g., YOLOv11 or MMPose) and does not require retraining or additional annotation.

The method’s ability to reconstruct metric 3D joint and implement trajectories enables analytic tools not previously accessible in rapid-deployment, unconstrained scenarios—such as athlete swing phase segmentation, posture and kinematic descriptor extraction, and detailed cross-trial performance analysis.

Figure 7: End-to-end impact—tool-free, scale-aware 3D pose reconstruction powers new analytics for gesture/movement study in sports settings.

Theoretical Impacts and Future Directions

This work demonstrates that the combination of rigid-length constraints with articulated, biomechanically structured motion is sufficient to resolve both geometric and metric ambiguities in multi-view self-calibration, obviating the need for specialized artifacts. The results suggest that leveraging semantically meaningful rigid bodies (which are common in many task domains) can generalize this class of extrinsic self-calibration methods to broader, more complex motions and system geometries.

Potential future research directions include:

Extension to non-rigid implements or compound objects with partially known structure.
Integration with low-confidence, unsynchronized views, relaxing current synchronization assumptions.
Further coupling with end-to-end learning-based multi-task networks for unified detection, calibration, and tracking in the wild.

Conclusion

This paper introduces a robust, practical, and markerless calibration pipeline for multi-camera systems in sports environments, realized through the joint exploitation of human pose and stick-length constraints. The proposed approach achieves strong, consistent, and metric-accurate extrinsic calibration across a variety of sports, camera layouts, and noise conditions, while maintaining operational efficiency and simplicity. Both the empirical and the theoretical analyses underscore its suitability as an accessible foundation for large-scale deployment of high-fidelity sports analytics and motion capture.