- The paper presents a novel 5D regression framework that integrates long- and short-term temporal features for robust 3D human pose and trajectory estimation.
- It leverages a ConvGRU module with deformable convolutions and comprehensive loss functions, including MPJPE and focal loss, to enhance accuracy under dynamic camera conditions.
- Empirical evaluations on multiple datasets demonstrate significant error reductions, such as improved PAMPJPE scores, and strong performance in challenging tracking scenarios.
TRACE: 5D Temporal Regression of Avatars with Dynamic Cameras in 3D Environments
The paper "TRACE: 5D Temporal Regression of Avatars with Dynamic Cameras in 3D Environments" presents a novel approach that jointly estimates 3D human motion and global trajectory from video, with a focus on scenarios involving dynamic cameras. The proposed TRACE framework introduces a robust methodology for handling both the spatial and temporal complexities inherent in such data, leveraging advanced loss functions and temporal feature propagation techniques.
Technical Contributions
The TRACE framework is designed to address the limitations of existing methods by effectively integrating long-term and short-term motion features. The core component of the framework is a temporal feature propagation module that combines a ConvGRU module with a residual connection and deformable convolution layers. This allows it to efficiently capture long-term dependencies and short-term dynamics simultaneously, facilitating more accurate 3D pose estimation even under dynamic camera conditions.
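The combination described above can be illustrated with a minimal PyTorch sketch. This is not the paper's implementation: the class names (`ConvGRUCell`, `TemporalPropagation`), layer sizes, and the use of a plain convolution as a stand-in for the deformable-convolution refinement are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU cell: the gates are computed with 2D convolutions,
    so the hidden state keeps its spatial layout across time steps."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # update (z) and reset (r) gates from the concatenated [input, hidden]
        self.gates = nn.Conv2d(2 * channels, 2 * channels, kernel_size, padding=pad)
        # candidate hidden state
        self.cand = nn.Conv2d(2 * channels, channels, kernel_size, padding=pad)

    def forward(self, x, h):
        z, r = torch.chunk(torch.sigmoid(self.gates(torch.cat([x, h], dim=1))), 2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde

class TemporalPropagation(nn.Module):
    """Propagate per-frame features through time with a ConvGRU and add the
    refined result back to the input (residual connection)."""
    def __init__(self, channels):
        super().__init__()
        self.gru = ConvGRUCell(channels)
        # stand-in for the paper's deformable-convolution refinement
        self.refine = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, frames):  # frames: (T, B, C, H, W)
        h = torch.zeros_like(frames[0])
        outputs = []
        for x in frames:
            h = self.gru(x, h)          # long-term memory over frames
            outputs.append(x + self.refine(h))  # residual connection
        return torch.stack(outputs)
```

The residual connection lets each frame's own features pass through unchanged while the ConvGRU contributes only the temporal correction, which tends to stabilize training.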
Loss Functions
Key to TRACE's performance is its comprehensive loss function design. It goes beyond standard image losses by incorporating focal loss and a set of SMPL parameter losses, which are crucial for supervising both 2D and 3D map estimations. The integration of additional 3D body keypoint losses, such as MPJPE and its Procrustes-aligned variant PAMPJPE, keeps the predicted poses consistent with the ground truth both in absolute coordinates and after rigid alignment.
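Two of the losses named above have compact standard definitions, sketched here in NumPy. The function names and the CenterNet-style penalty-reduced form of the focal loss (with hyperparameters `alpha`, `beta`) are assumptions for illustration, not code from the paper.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per Joint Position Error: mean Euclidean distance over joints.
    pred, gt: (J, 3) arrays of 3D keypoints."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """Penalty-reduced focal loss for heatmaps in [0, 1] (CenterNet-style,
    assumed here). Cells with gt == 1 are positives; all other cells are
    down-weighted by (1 - gt)**beta."""
    pred = np.clip(pred, eps, 1 - eps)
    pos = gt == 1
    pos_loss = -(((1 - pred) ** alpha) * np.log(pred))[pos].sum()
    neg_loss = -(((1 - gt) ** beta) * (pred ** alpha) * np.log(1 - pred))[~pos].sum()
    n_pos = max(pos.sum(), 1)  # normalize by the number of positives
    return (pos_loss + neg_loss) / n_pos
```

The focal loss concentrates gradient on hard, misclassified heatmap cells, which matters when person centers occupy only a few pixels of the map.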
Temporal Feature Propagation
The temporal feature propagation mechanism is notable for its use of both ConvGRU for maintaining memory states and a deformable convolution approach to refine feature maps based on dynamic spatial locations. This combination enables TRACE to leverage past and near-term frames to enhance 3D trajectory prediction, preserving consistent position estimates for temporarily occluded subjects.
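The core operation inside a deformable convolution is sampling a feature map at learned, non-integer offset locations with bilinear interpolation. The following NumPy sketch isolates that sampling step for a single-channel map; the function name `deform_sample` and the per-pixel offset layout are assumptions for illustration.

```python
import numpy as np

def deform_sample(feat, offsets):
    """Sample a feature map at offset locations with bilinear interpolation,
    the building block of deformable convolution.
    feat: (H, W) feature map; offsets: (H, W, 2) per-pixel (dy, dx) offsets."""
    H, W = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # fractional sampling positions, clamped to the map borders
    py = np.clip(ys + offsets[..., 0], 0, H - 1)
    px = np.clip(xs + offsets[..., 1], 0, W - 1)
    y0, x0 = np.floor(py).astype(int), np.floor(px).astype(int)
    y1, x1 = np.minimum(y0 + 1, H - 1), np.minimum(x0 + 1, W - 1)
    wy, wx = py - y0, px - x0
    # bilinear blend of the four neighboring cells
    return ((1 - wy) * (1 - wx) * feat[y0, x0]
            + (1 - wy) * wx * feat[y0, x1]
            + wy * (1 - wx) * feat[y1, x0]
            + wy * wx * feat[y1, x1])
```

Because the offsets are produced by the network itself, the receptive field can follow a moving subject across frames instead of staying on a fixed grid.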
Empirical Evaluation
TRACE is empirically validated across multiple datasets, including Human3.6M, 3DMPB, CMU Panoptic, and a custom Dyna3DPW subset, demonstrating competitive performance in both static and dynamic camera settings. Crucially, TRACE significantly reduces the Procrustes-Aligned Mean Per Joint Position Error (PAMPJPE) relative to previous methods, especially on dynamic datasets, reaching 42.0 mm on Human3.6M, a notable improvement over existing baselines.
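The PAMPJPE metric used in these comparisons is MPJPE computed after a similarity (Procrustes) alignment of the prediction to the ground truth, which removes global rotation, scale, and translation before measuring error. A minimal NumPy sketch, assuming the standard orthogonal-Procrustes/Umeyama solution:

```python
import numpy as np

def pampjpe(pred, gt):
    """Procrustes-Aligned MPJPE: rigidly align pred to gt with the optimal
    similarity transform (rotation, scale, translation), then take the mean
    per-joint Euclidean error. pred, gt: (J, 3) arrays."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    # optimal rotation from the SVD of the cross-covariance matrix
    U, S, Vt = np.linalg.svd(p.T @ g)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:  # guard against reflections
        Vt[-1] *= -1
        S[-1] *= -1
        R = Vt.T @ U.T
    scale = S.sum() / (p ** 2).sum()
    aligned = scale * p @ R.T + mu_g
    return np.linalg.norm(aligned - gt, axis=-1).mean()
```

Because the alignment discards the global transform, PAMPJPE isolates articulated-pose accuracy, while unaligned MPJPE also penalizes trajectory error.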
Further evaluations of TRACE's efficacy in tracking scenarios, particularly in handling occlusion and tracking ID switches, corroborate the utility of its memory module in mitigating common challenges faced in dynamic tracking tasks.
Limitations and Future Work
Despite its strengths, TRACE's dependence on specific assumptions, such as a fixed camera field of view and limited body-shape diversity in the training data, hints at areas for future exploration. Addressing these through richer datasets and refined camera pose estimation methodologies could broaden its applicability. The prospect of extending TRACE to track multiple subjects without predefined inputs, possibly incorporating real-world coordinate transformation, represents a promising direction for future work.
Conclusion
TRACE sets a new benchmark in the domain of 3D pose and trajectory estimation under dynamic conditions. Through its innovative architecture and sophisticated integration of temporal features, TRACE makes substantial advancements in bridging the gap between static and dynamic 3D human modeling. As the field progresses, the lessons learned from TRACE could inform subsequent efforts in developing more generalized and adaptable human pose estimation systems.