CalTennis: Large Multi-View Tennis Video Dataset and Benchmark of Monocular-to-3D Pose Estimation
Abstract: The Caltech Tennis Dataset (CalTennis) is a large-scale video benchmark for evaluating monocular-to-3D pose estimation in the wild. CalTennis comprises over 11 million frames (51 hours) of tennis practice and match play from 40 players, captured with 2-6 synchronized cameras at 60 Hz. It is 10 times larger than existing in-the-wild human motion video datasets and 3 times larger than existing MOCAP-ground-truthed datasets, and it is the first large-scale benchmark to provide synchronized multi-view recordings of expert athletic motion. The multi-view setup enables inexpensive, label-free evaluation of monocular-to-3D pose estimation algorithms. We describe a simple, standardized protocol that enables data collection without specialized equipment or expertise, along with fully automated video calibration and synchronization. Benchmarking state-of-the-art monocular-to-3D pose methods on CalTennis, we find that while 3D joint angle recovery is now quite accurate, all models struggle to estimate depth and foot contact consistently. We further propose two novel performance metrics, footwork and stability, as well as qualitatively study body shape inconsistency. These metrics expose previously underexplored failure modes and point to concrete opportunities for improvement in pose estimation and action analysis.
Explain it Like I'm 14
What is this paper about?
This paper introduces CalTennis, a giant collection of tennis videos recorded from several phones at the same time. The goal is to help computers figure out how people move in 3D (where every joint is in space) using just normal video. The authors also show new, simple ways to judge how well these computer programs really understand movement, especially in fast, skilled sports like tennis.
What questions were the researchers trying to answer?
They focused on a few simple questions:
- Can we build a big, real-life video dataset of athletes that’s easy and cheap to collect with regular phones?
- Can we check how accurate 3D pose estimates are without using expensive motion-capture suits or markers?
- Where do today’s best pose-estimation programs still make mistakes, especially in sports?
- Can we design new tests that catch the kinds of errors coaches and clinicians actually care about (like foot contact with the ground and balance)?
How did they do the study?
To make this clear, here’s what they built and how it works in everyday terms.
Building the dataset
- They recorded 51 hours of tennis (over 11 million video frames) from 40 players using 2–6 iPhones at once, all shooting at 60 frames per second.
- Cameras were placed on tripods around the court so the same action is seen from different angles at the same time.
- They blurred faces and got consent to protect privacy.
Why tennis? Courts are standardized (every court has the same lines), actions are fast and varied, and players are rarely blocked from view.
Making the cameras “agree” on space and time
- Space: The white lines on a tennis court are like a ruler printed on the ground. By matching where those line intersections appear in each camera, the team can figure out exactly where each camera is in 3D space. This is called “calibration.”
- Time: Different phones don’t start at the exact same millisecond. The team slides the timelines to line up the same moments (like matching the peak of a serve across views). This is “synchronization.”
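The time-alignment step can be illustrated with a brute-force offset search. This is a minimal sketch, not the authors' implementation: it assumes each camera yields a one-dimensional per-frame signal (for example, wrist height from a 2D pose detector, which is a hypothetical choice) and tries every shift within a window, keeping the one with the highest normalized correlation. The function name `best_offset_ms` and its parameters are illustrative.

```python
import numpy as np

def best_offset_ms(sig_a, sig_b, fps=60.0, max_shift_ms=1000.0):
    """Return the shift s (in ms) such that sig_b[t + s] best matches sig_a[t].

    sig_a, sig_b: 1-D per-frame signals from two cameras (assumed inputs).
    """
    a = np.asarray(sig_a, dtype=float)
    b = np.asarray(sig_b, dtype=float)
    # Standardize so the score is a correlation, not a raw dot product.
    a = (a - a.mean()) / (a.std() + 1e-9)
    b = (b - b.mean()) / (b.std() + 1e-9)

    max_shift = int(round(max_shift_ms * fps / 1000.0))
    best_shift, best_score = 0, -np.inf
    for s in range(-max_shift, max_shift + 1):
        # Pair a[t] with b[t + s] over the frames where both exist.
        if s >= 0:
            x, y = a[: len(a) - s], b[s:]
        else:
            x, y = a[-s:], b[: len(b) + s]
        n = min(len(x), len(y))
        if n < 2:
            continue
        score = float(np.dot(x[:n], y[:n]) / n)
        if score > best_score:
            best_shift, best_score = s, score
    return best_shift * 1000.0 / fps
```

A single global shift like this does not model clock drift between devices, so it only suits clips where one constant offset is a reasonable approximation.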
Once space and time are aligned, any person’s 3D position estimated from one camera should match the estimate from another camera. If they don’t match, we know the program made an error—no expensive lab equipment needed.
What is “3D pose estimation” anyway?
- Imagine a stick-figure inside a person: hips, knees, ankles, shoulders, etc. 3D pose estimation means guessing where all those joints are in 3D space using video.
- “Monocular” means using a single camera. It’s hard to get depth (how far away someone is) from one view, like trying to guess how far a car is from a single photo. Multiple views make it easier to check if the single-camera guess makes sense.
New ways to measure quality
The authors didn’t just measure the usual “how far are the joints off.” They also added tests that matter for sports:
- Translation (position) error: Do different views agree on where the player is on the court?
- Pose error: Ignoring where the player is on the court, do the joint positions match across views?
- Footwork errors:
  - Foot velocity mismatch: Do different views agree on how the feet are moving? If not, the model may make feet “slide” or “float.”
  - Foot height mismatch: Do views agree about whether a foot is on the ground?
- Stability: Is the body’s center of mass over the feet? If different views disagree, the model may misunderstand balance.
- Body shape consistency: Does the model give the same person the same body proportions from different camera angles? If not, it’s inconsistent.
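The first two checks in the list above can be sketched in a few lines. This is an illustrative example under stated assumptions, not the paper's exact protocol: it assumes each view's joints are given in that camera's frame with known extrinsics following the convention x_cam = R @ x_world + t, and it removes player position by subtracting a root joint (the paper may use a different alignment). The names `to_world` and `cross_view_errors` are hypothetical.

```python
import numpy as np

def to_world(joints_cam, R, t):
    """Map (J, 3) camera-frame joints to the world frame.

    Assumes x_cam = R @ x_world + t, so x_world = R.T @ (x_cam - t).
    """
    joints_cam = np.asarray(joints_cam, dtype=float)
    # Row-vector form of R.T @ (x - t).
    return (joints_cam - np.asarray(t, dtype=float)) @ np.asarray(R, dtype=float)

def cross_view_errors(joints_a, joints_b, root=0):
    """Compare two world-frame estimates of the same person at the same instant.

    Returns (translation_error, pose_error), in the units of the inputs.
    """
    joints_a = np.asarray(joints_a, dtype=float)
    joints_b = np.asarray(joints_b, dtype=float)
    # Translation error: do the views agree on where the player is?
    translation_error = float(np.linalg.norm(joints_a[root] - joints_b[root]))
    # Pose error: subtract the root so only the body configuration is compared.
    rel_a = joints_a - joints_a[root]
    rel_b = joints_b - joints_b[root]
    pose_error = float(np.linalg.norm(rel_a - rel_b, axis=1).mean())
    return translation_error, pose_error
```

Two views that place the player half a meter apart but agree on every limb would score a translation error of 0.5 and a pose error of 0, which is the pattern the summary describes for current models.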
Testing today’s top programs
They ran five state-of-the-art programs on CalTennis and checked how well each program’s estimates from different camera views agreed with each other. If the views disagree, that reveals errors.
What did they find, and why is it important?
Main findings:
- Joint angles are often good. Programs usually capture the overall form of a movement (like how bent the knees and elbows are) fairly well.
- Depth and position are shaky. Programs struggle to keep the person in the right place in 3D. The same player can “jump” forward and backward in space from frame to frame, which is unrealistic.
- Foot contact is inconsistent. Programs often disagree about whether a foot is on the ground or how it moves, leading to “foot skating” (feet sliding when they should be planted).
- Body shape isn’t consistent. The same person can look taller or have different limb lengths depending on the camera view or program, which should not happen.
Why this matters:
- For coaching or clinical analysis, small errors in depth, ground contact, or body proportions can lead to wrong conclusions about balance, forces, and technique.
- The new footwork and stability tests reveal problems that older benchmarks missed. That’s important because these details matter most for sports, health, and safety.
What’s the impact of this research?
- A practical way to evaluate accuracy without expensive motion capture: Just use multiple phones, a known court layout, and compare across views.
- A much bigger, more realistic dataset for training and testing pose estimation in sports. This can push the field forward toward real-world use.
- Clear directions for improvement:
  - Make depth and position steadier and more realistic.
  - Get better at spotting foot-ground contact.
  - Keep a person’s body shape consistent over time and across views.
In short, if you want to analyze technique or recognize actions, current tools are getting close. But if you need exact distances, balance, or force estimates (like in detailed coaching, medical rehab, or forensic measurements), today’s systems still aren’t reliable enough. CalTennis shows exactly where they fall short and offers better ways to measure progress.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, concrete list of what remains missing, uncertain, or unexplored in the paper, phrased to be actionable for future research.
- Dataset scope and diversity: The benchmark is limited to a single sport (tennis), a single institution, similar climates, and one surface type (likely hard court); it lacks data from clay/grass, indoor courts, varied lighting/weather conditions, and broader geographic/demographic diversity (age, body types, attire), raising questions about generalization.
- Replicability of the capture recipe: The “easy and inexpensive” data-collection protocol has not been independently replicated by external teams; the field lacks a multi-site replication study validating feasibility, consistency, and data quality across different collectors and devices.
- Camera hardware variability: All capture used iPhones (model 14+), fixed tripod height, and a particular lens; it is unknown how results change with Android devices, different focal lengths, camera heights/angles, rolling-shutter characteristics, or handheld capture.
- Calibration robustness and uncertainty: The automatic extrinsic calibration from court lines is not evaluated for accuracy or failure modes (faded/occluded lines, non-standard courts, lens distortion, shadows, night lighting); the paper does not quantify calibration uncertainty or propagate it into confidence intervals for the consistency metrics.
- Temporal synchronization limits: The method uses a global offset search in ±1000 ms but does not address device clock drift, per-camera clock offsets, rolling-shutter timing, or non-linear timing misalignments; it is unclear how residual mis-sync biases cross-view consistency metrics and how sensitive results are to sync errors.
- Person association across views: The paper does not detail the multi-person, multi-view identity matching strategy (especially during doubles play or occlusions); evaluation may conflate reconstruction error with association errors, and the impact of association failures on the metrics is unquantified.
- Pair and geometry selection: Consistency is computed from overlapping views, but the paper does not specify how view pairs are chosen or weighted; the effect of camera baseline length, view angle, and number of views (2 vs 4 vs 6) on measured errors is not analyzed.
- Lower-bound limitation of label-free evaluation: Multi-view agreement only lower-bounds error and may miss correlated cross-view biases (e.g., systematic shape or depth errors that agree across views); the dataset lacks any absolute ground truth (e.g., small MOCAP or IMU subsets) to quantify the gap between the lower bound and true error.
- Missing ablation with known cameras: Models estimated their own camera parameters; the benchmark does not evaluate how providing known extrinsics/intrinsics (from the court-based calibration) affects results, making it hard to separate camera-estimation error from pose-estimation error.
- Metric definitions and implementation details: New metrics (footwork, stability) depend on ground-plane estimation, foot contact inference, and CoM computation from SMPL-X; their sensitivity to calibration, model priors, and thresholding is not examined; the shape-consistency metric is not formally defined in the main text.
- Foot contact and ground-truth validation: Footwork metrics rely on cross-view agreement rather than true contact labels; there is no validation against annotated foot-strike events, pressure insoles, or force plates even on a small subset to calibrate the metric’s absolute correctness.
- Depth and translation instability analysis: While all models show “pose drift” and large depth/translation inconsistencies, the causes (e.g., SLAM failure, intrinsic/extrinsic inaccuracies, court-plane constraints unused by models) are not dissected; ablations to isolate factors are missing.
- Temporal smoothness and jitter: The paper notes instability but does not quantify temporal jitter spectra (e.g., power spectral density of pelvis/CoM) or evaluate temporal consistency metrics beyond frame-to-frame foot velocities; this limits insight into time-dependent artifacts.
- Action semantics and labels: Despite focusing on “meaningful, repeated actions,” the dataset lacks semantic annotations (e.g., stroke type, swing phases, foot-contact events, rally segments); this constrains downstream behavioral or technique analysis and prevents per-action diagnostic evaluations.
- Ball and racket signals: The benchmark does not include ball or racket trajectories/annotations, even though they offer strong cues for synchronization, action segmentation, and physical plausibility checks (e.g., timing of impacts).
- Player identity and shape ground truth: There is no ground-truth height/limb-length data; shape consistency is only assessed via model agreement, not against measured anthropometrics, preventing validation of shape and scale estimates.
- Effect of face blurring: Mandatory face blurring may affect head/neck pose, identity tracking, or detector performance; the impact of blurring on pose and shape estimates is not quantified.
- Standardized data splits and leakage prevention: The paper does not define fixed train/val/test splits by player/session/court configuration for fair comparisons and to prevent identity/session leakage; current evaluation uses a subset (first 5M frames) and notes “full results forthcoming.”
- Full-dataset evaluation and variance: Reported results cover only ~5M of ~11M frames; performance variability across sessions, camera configurations, times of day, or players is not reported, limiting understanding of generalization within the dataset.
- Cross-sport generalization: While the approach is claimed to be portable to other sports, no evidence is given; whether court-line-based calibration and the label-free lower-bound evaluation transfer to sports without standardized markings remains open.
- Consensus reconstructions as benchmarks: The dataset purposely avoids triangulating pseudo ground truth; however, the community lacks a study comparing label-free lower bounds to multi-view consensus/triangulated meshes (with uncertainty) as an intermediate reference.
- Physics priors and constraints: The metrics identify foot skating and stability discrepancies, but the benchmark does not test methods that explicitly enforce physical constraints (e.g., ground-plane sticking, zero-moment point bounds) or quantify resulting improvements.
- Multi-view fusion upper bounds: Only monocular methods are benchmarked; the dataset does not provide reference baselines for multi-view fusion methods on the same videos to estimate an achievable upper bound under the same conditions.
- Sensitivity to court-plane estimation: Stability and foot height metrics assume an accurately estimated court plane; the sensitivity of these metrics to plane estimation error is not quantified or bounded.
- Association of metric errors with downstream tasks: The paper argues certain errors harm biomechanics and coaching but does not empirically correlate metric deviations (e.g., foot-contact inconsistency) with downstream task performance (e.g., GRF estimation error, misclassification of technique faults).
- Demographic and skill-level representativeness: Players are primarily collegiate/recreational from a single university; generalization to youth, elite professionals, or para-athletes is untested.
- Licensing and data-use constraints: The paper does not detail licensing terms, usage restrictions, or whether redistribution of derived labels (e.g., future community annotations) is permitted, which may affect community contribution and extensibility.
- Robustness under occlusion and crowding: Tennis is relatively unoccluded; the dataset offers limited coverage of strong occlusions, crowds, or complex multi-person interactions, leaving generalization to more congested scenes uncertain.
- Uncertainty quantification: Models’ predictive uncertainties are not reported or used; the evaluation lacks a framework to propagate model and calibration uncertainties into confidence intervals on the new metrics.
- Automated failure detection: No methodology is proposed to detect when cross-view agreement is spuriously high due to correlated errors (e.g., shared priors), nor to flag frames where calibration/sync likely failed.
- Open evaluation protocols: Details such as person matching, view-pair selection, and parameter choices for metrics are not fully specified or standardized in the main text; a more explicit protocol is needed to ensure reproducibility and fair comparison across future methods.
Practical Applications
Immediate Applications
Below are concrete, deployable uses that can be implemented today, leveraging the dataset, evaluation protocol, and metrics as-is. Each item names the sector(s), the application, potential tools/workflows, and key assumptions/dependencies.
- Sector: Sports analytics, Software
- Application: Low-cost multi-view capture kit for tennis clubs and teams
- Tools/workflows: Two to six iPhones on $40 tripods; automatic court-based camera calibration and video synchronization; packaged “CalTennis Capture” scripts; face-blurring pipeline
- Assumptions/dependencies: Visible court lines, overlapping fields of view, basic placement protocol adherence, participant consent and privacy controls
- Sector: ML/AI tooling (pose vendors, startups), Academia
- Application: Label-free QA/benchmarking harness for monocular 3D pose estimators using multi-view consistency
- Tools/workflows: CI pipelines that run cross-view MPJPE/PA-MPJPE, translation error, footwork (skating/height), and stability metrics; failure dashboards; regression tests across CalTennis splits
- Assumptions/dependencies: Multi-view footage of the same event; reliable court-based calibration; models export poses in a compatible format (e.g., SMPL/SMPL-X)
- Sector: Sports coaching (tennis), Education
- Application: “Technique-only” coaching feedback from consumer video
- Tools/workflows: On-device or cloud app that extracts joint-angle trajectories, relative limb configurations, temporal phase segmentation of strokes, and qualitative comparisons to reference libraries
- Assumptions/dependencies: Current models are accurate for joint-angle/pose but not for absolute depth, foot contacts, or body shape; feedback must exclude balance/force metrics and absolute distances
- Sector: Media/Animation, Gaming
- Application: Automated foot-skating and stability QA for character animation
- Tools/workflows: Plug-ins for DCC tools (Blender, Maya, Unreal) that compute CalTennis “footwork” and “stability” metrics on animated sequences to flag sliding/imbalance; batch tests for mocap retargeting
- Assumptions/dependencies: Access to joint trajectories or mesh data; tolerance thresholds adapted to animation scale
- Sector: Academia (CV, robotics, HCI), Education
- Application: Course labs and reproducible assignments on multi-view calibration, synchronization, and label-free evaluation
- Tools/workflows: Classroom kits (2–4 phones + tripods), notebook-based implementations of court-line calibration and temporal offset search, side-by-side model comparisons on a provided subset
- Assumptions/dependencies: Access to a court or any field with known markings; student consent processes; basic compute for running baseline models
- Sector: ML research (perception, graphics)
- Application: Model development focusing on depth/contact/shape failure modes highlighted by new metrics
- Tools/workflows: Training/evaluation loops that report foot-skating, stability, and shape consistency; ablations on gravity-view/world-grounded designs; data curation emphasizing far-depth and high-speed actions
- Assumptions/dependencies: Access to CalTennis or similar multi-view sports footage; integration of metric reporting into experiments
- Sector: Sports organizations, Policy/Compliance
- Application: Replicable, low-cost data collection programs for federations and schools
- Tools/workflows: Written protocols for camera placement, consent, face blurring, and data sharing; “capture days” for team practices; standardized metadata templates
- Assumptions/dependencies: IRB/ethics approval as needed; facility permission; adherence to privacy practices (mandatory anonymization)
- Sector: Software (markerless biomechanics)
- Application: Confidence-weighted multi-view consensus reconstruction to produce higher-quality pseudo-labels
- Tools/workflows: MLE fusion with depth-elongated covariances (as described in the appendix) to aggregate multiple monocular estimates into a world frame; used for training supervision or annotation acceleration
- Assumptions/dependencies: At least two overlapping views and successful calibration; consensus is still a lower bound on error without absolute ground truth
- Sector: Daily life, Amateur sports
- Application: DIY self-analysis for tennis form
- Tools/workflows: Two phones recording a hitting session; mobile app computes joint-angle trends and flags large pose inconsistencies between views (as a proxy for low confidence)
- Assumptions/dependencies: Users accept that depth, foot contacts, and body shape are unreliable; daylight/lighting variability handled by model robustness
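The consensus-reconstruction item above mentions MLE fusion with depth-elongated covariances. The appendix details are not reproduced here, so the following is only a sketch of the standard Gaussian maximum-likelihood (inverse-covariance weighted) fusion that the phrase suggests. The covariance construction, the function names, and the sigma values are all assumptions chosen for illustration.

```python
import numpy as np

def depth_elongated_cov(ray, sigma_lateral=0.02, sigma_depth=0.30):
    """3x3 covariance with large variance along the viewing ray.

    ray: direction from the camera to the point (world frame).
    sigma_lateral, sigma_depth: hypothetical standard deviations in meters.
    """
    d = np.asarray(ray, dtype=float)
    d = d / np.linalg.norm(d)
    along = np.outer(d, d)            # projector onto the ray
    across = np.eye(3) - along        # projector onto the image-plane directions
    return sigma_depth**2 * along + sigma_lateral**2 * across

def fuse(estimates, covs):
    """Gaussian MLE fusion: inverse-covariance weighted mean of 3D estimates."""
    infos = [np.linalg.inv(np.asarray(C, dtype=float)) for C in covs]
    info_sum = sum(infos)
    weighted = sum(I @ np.asarray(x, dtype=float) for I, x in zip(infos, estimates))
    fused_cov = np.linalg.inv(info_sum)
    return fused_cov @ weighted, fused_cov
```

With two cameras viewing a point from perpendicular directions, each camera's depth error falls along an axis the other camera measures well, so the fused estimate lands close to the true point even when each monocular depth is off by tens of centimeters. As the item itself notes, such a consensus still only lower-bounds error without absolute ground truth.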
Long-Term Applications
Below are applications that require further research and engineering, larger-scale deployments, or model advances—especially in absolute depth, body shape, and ground-contact reliability.
- Sector: Sports biomechanics, Healthcare
- Application: Video-only estimates of ground reaction forces, balance, and weight transfer
- Tools/workflows: Physically grounded pose models with reliable foot contact, CoM tracking, and subject-specific shape; cross-view stabilization and force-inference models validated against force plates
- Assumptions/dependencies: Significant improvements in depth/contact/shape; clinical validation; standardized protocols across surfaces and footwear
- Sector: Officiating/Rule enforcement (tennis)
- Application: Automated foot-fault detection and line-proximal contact analysis from commodity cameras
- Tools/workflows: Real-time contact detection fused across multiple sideline cameras; calibrated line geometry; explainable alerts for umpires
- Assumptions/dependencies: High-precision foot contact and metric depth; robust handling of occlusions; broadcast-grade latency and reliability
- Sector: Broadcast, Fan engagement, AR
- Application: Live 3D athlete trajectories, shot-to-shot speed/balance overlays, and tactical summaries
- Tools/workflows: Edge inference across fixed venue cameras; persistent multi-view tracking with depth-stable world coordinates; on-air graphics pipelines using “stability” and “footwork” KPIs
- Assumptions/dependencies: Venue infrastructure; model stability at broadcast distances; low-latency calibration maintenance over long events
- Sector: Injury prevention and return-to-play (sports medicine)
- Application: At-home or on-field assessments of risky mechanics (e.g., valgus moments, landing control) from smartphone video
- Tools/workflows: Risk scoring models combining joint-angle dynamics with accurate foot contact and CoM metrics; longitudinal monitoring dashboards
- Assumptions/dependencies: Clinically validated accuracy thresholds; robust body-shape estimation across clothing and cameras; strict privacy and consent workflows
- Sector: Robotics/Embodied AI
- Application: Learning stable, contact-rich locomotion and manipulation from human demonstrations in the wild
- Tools/workflows: Datasets emphasizing precise foot/hand contacts and balance; contact-consistent motion priors; teleoperation/retargeting pipelines to humanoids
- Assumptions/dependencies: Reliable contact inference and metric depth; sim-to-real alignment; safety and evaluation protocols
- Sector: ML/Benchmarking Standards, Policy
- Application: Cross-sport, label-free benchmarks (basketball, soccer, volleyball) standardized by field geometry and privacy practices
- Tools/workflows: Capture recipes adapted to other sports’ known markings; metric suites (translation, pose, footwork, stability, shape) as required benchmark reporting; governance on consent/anonymization
- Assumptions/dependencies: Field/court line visibility; portability of calibration methods; community adoption and oversight
- Sector: Urban mobility, Intelligent transportation
- Application: Label-free evaluation of pedestrian pose and intent models at intersections using multi-camera city infrastructure
- Tools/workflows: Road marking–based calibration or fiducial placement; cross-view consistency as deployment QA for AV perception stacks
- Assumptions/dependencies: Access to multi-view city cameras; robust calibration under occlusions and lighting; applicable privacy frameworks
- Sector: Security/Forensics
- Application: More reliable gait-based identification incorporating pose dynamics and improved metric depth/shape
- Tools/workflows: Evidence-grade pipelines that quantify multi-view uncertainty and exclude low-confidence frames; court/scene geometry calibration where possible
- Assumptions/dependencies: Legal/ethical safeguards; advances in depth/shape consistency; domain adaptation beyond sports attire and contexts
- Sector: Wearables + Vision fusion
- Application: Reducing or replacing sensors via vision with sparse IMU assistance for contact and depth disambiguation
- Tools/workflows: Hybrid pipelines that use minimal IMUs to anchor contact events and scale, with vision driving kinematics and global trajectories
- Assumptions/dependencies: Tight sensor-video synchronization; comfort and usability; validation against full mocap/force-plate setups
- Sector: Facility operations, Smart venues
- Application: Permanently installed multi-view smartphone-class systems for continuous training analytics
- Tools/workflows: Automatic recalibration, drift detection, and scheduled health checks; privacy-preserving on-premise processing; athlete dashboards with progress over time
- Assumptions/dependencies: Reliable power/networking; robust outdoor calibration across seasons; buy-in on data governance and retention
Notes on feasibility and dependencies across applications
- Model limitations today: Absolute depth, foot contact, and body shape estimates are the bottlenecks; applications relying on these must be deferred or explicitly caveated.
- Environment assumptions: Methods depend on recognizable court/field markings for automatic calibration and sufficiently overlapping camera views.
- Privacy and ethics: Broad deployment requires standardized informed consent, face blurring/anonymization by default, and clear data retention policies.
- Compute and ops: Multi-view processing at 60 Hz and multi-hour sessions require storage, batching, and potential edge/cloud hybrid inference.
- Generalization: CalTennis is tennis-specific; transferring to other sports/domains needs adapted calibration cues and revalidation of metrics and thresholds.
Glossary
- 3D joint angle recovery: Estimating the angles of the human body's joints in 3D from visual data. "we find that while 3D joint angle recovery is now quite accurate"
- camera calibration: Estimating camera intrinsics and extrinsics so 3D points project correctly into the image. "most approaches run SLAM or camera calibration steps and learn body priors."
- camera coordinates: A coordinate frame centered on and oriented with the camera, used to express 3D positions relative to the camera. "before predicting 3D poses in camera coordinates."
- center of mass: The average position of a body’s mass; used here to assess balance relative to foot support. "we define per-view stability as the L2 distance from the projected center of mass to the convex hull of grounded foot joints"
- convex hull: The smallest convex polygon containing a set of points; for stability, the support polygon of grounded feet. "we define per-view stability as the L2 distance from the projected center of mass to the convex hull of grounded foot joints"
- diffusion model: A generative model that learns to reverse a noise diffusion process to produce realistic samples. "GENMO~\cite{yuan2025genmo} is a video-conditioned diffusion model;"
- extrinsics: Camera parameters that define its pose in space (rotation and translation) relative to the world. "intrinsics and extrinsics ."
- foot contact: The event or state of a foot being in contact with the ground, critical for gait and sports analysis. "foot-contact detection is inconsistent across frames and views."
- foot skating: An artifact where feet slide unrealistically on the ground when they should be stationary. "stability, foot skating, and body shape, that expose failure modes invisible to standard benchmarks."
- gravity-view coordinates: An intermediate coordinate system aligned with gravity used to stabilize reconstruction. "GVHMR~\cite{xiaowei2024gvhmr} uses intermediate gravity-view coordinates;"
- IMUs: Inertial Measurement Units; wearable sensors measuring acceleration and angular velocity. "3DPW~\cite{black20183dpw} attaches IMUs"
- in-the-wild: Captured in natural, uncontrolled environments rather than labs or studios. "evaluating monocular-to-3D pose estimation in the wild."
- intrinsics: Internal camera parameters (e.g., focal length, principal point) that affect projection. "intrinsics and extrinsics ."
- label-free: Evaluation or learning without manual ground-truth annotations, often using indirect signals. "The multi-view setup enables inexpensive, label-free evaluation of monocular-to-3D pose estimation algorithms."
- markerless: Motion capture or pose estimation without physical markers attached to subjects. "two smartphones with a markerless pipeline recover clinically useful biomechanics."
- monocular-to-3D pose estimation: Recovering 3D human pose from a single (monocular) camera input. "a large-scale video benchmark for evaluating monocular-to-3D pose estimation in the wild."
- MOCAP: Motion capture; high-accuracy systems (often hardware-based) that record 3D human motion. "Modern MOCAP remains the gold standard for accuracy"
- MPJPE: Mean Per-Joint Position Error; average Euclidean distance between predicted and reference 3D joint positions. "In addition to the standard metrics of MPJPE, PA-MPJPE, and PVE"
- multi-view consistency: Agreement of reconstructions of the same motion across different camera viewpoints. "Overlapping views enable multi-view consistency evaluation."
- PA-MPJPE: Procrustes-Aligned MPJPE; MPJPE after rigid alignment to remove scale/rotation/translation differences. "In addition to the standard metrics of MPJPE, PA-MPJPE, and PVE"
- per-joint articulation: A dataset-level measure of motion variability at each joint across its anatomical range. "Per-joint articulation is the entropy of each joint's angular distribution divided by its anatomical range of motion"
- pose drifting: Undesired temporal oscillation or drift of estimated position/pose over time. "this results in a 'pose drifting' effect"
- pose space coverage: A measure of how uniformly a dataset spans the space of possible human poses. "Pose space coverage is the Shannon entropy of frame-to-cluster assignments (over PCA clusters of the shared pose-joint space), normalized so 100% indicates uniform coverage."
- PVE: Per-Vertex Error; average distance between predicted and reference mesh vertices. "In addition to the standard metrics of MPJPE, PA-MPJPE, and PVE"
- reprojection error: Difference between observed image points and projections of corresponding 3D points under a camera model. "we minimize reprojection error to get extrinsics:"
- SLAM: Simultaneous Localization and Mapping; estimating camera trajectory and a map of the environment from sensor data. "most approaches run SLAM or camera calibration steps and learn body priors."
- SMPL: Skinned Multi-Person Linear model; a parametric 3D human body model for pose and shape. "Reconstructing human poses from images and videos is typically formulated as estimating SMPL \cite{black2015smpl} or SMPL-X \cite{pavlakos2019smplx} parameters"
- SMPL-X: An extension of SMPL that jointly models body, hands, and face. "Reconstructing human poses from images and videos is typically formulated as estimating SMPL \cite{black2015smpl} or SMPL-X \cite{pavlakos2019smplx} parameters"
- SO(3): The mathematical group of 3D rotations (3×3 orthonormal matrices with determinant 1). "extrinsics ."
- stability: A physical-consistency metric indicating balance based on center-of-mass relative to the support polygon. "We further propose two novel performance metrics -- footwork and stability --"
- time-of-flight sensors: Depth sensors that measure distance via the travel time of emitted signals. "with synchronized cameras and time-of-flight sensors."
- triangulating pseudo-ground-truth: Estimating 3D labels from multiple views as a substitute for true ground-truth measurements. "we leverage multi-view disagreement as a label-free error metric instead of triangulating pseudo-ground-truth"
- world coordinates: A global reference frame shared across cameras into which reconstructions are lifted. "Current video-based methods reconstruct human motion in global world coordinates"
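The glossary's definition of per-view stability (the L2 distance from the projected center of mass to the convex hull of grounded foot joints) can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes the center of mass and the grounded foot joints have already been projected to 2D ground-plane coordinates, and it leaves the choice of which joints count as grounded to the caller. All function names are hypothetical.

```python
import numpy as np

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(map(tuple, np.asarray(points, dtype=float))))
    if len(pts) <= 2:
        return np.array(pts)

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return np.array(lower[:-1] + upper[:-1])

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b."""
    ab = b - a
    denom = float(ab @ ab)
    t = 0.0 if denom == 0.0 else float(np.clip((p - a) @ ab / denom, 0.0, 1.0))
    return float(np.linalg.norm(p - (a + t * ab)))

def stability(com_xy, foot_xy):
    """Distance from the projected center of mass to the support polygon.

    Returns 0.0 when the center of mass lies inside or on the polygon.
    """
    hull = convex_hull(foot_xy)
    p = np.asarray(com_xy, dtype=float)
    n = len(hull)
    if n == 0:
        return float("nan")
    if n == 1:
        return float(np.linalg.norm(p - hull[0]))
    edges = [(hull[i], hull[(i + 1) % n]) for i in range(n)]
    if n >= 3:
        # Inside a counter-clockwise polygon means left of (or on) every edge.
        inside = all(
            (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0
            for a, b in edges
        )
        if inside:
            return 0.0
    return min(point_segment_distance(p, a, b) for a, b in edges)
```

Computing this value separately for each camera view and comparing the results is one way to read the cross-view stability check described earlier; the exact comparison used in the paper is not specified in this summary.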