- The paper presents a Transformer model that leverages independent tokens to decouple joint rotations, body shape, and camera parameters for direct sequence-to-sequence SMPL predictions.
- It employs specialized temporal modeling to capture consistent joint rotational patterns, significantly reducing jitter and ensuring smooth motion in video outputs.
- Empirical results on the 3DPW and Human3.6M datasets show improved accuracy, including a PA-MPJPE of 42.0 mm on 3DPW.
Analysis of 3D Human Pose and Shape Estimation with Independent Tokens
The paper introduces a Transformer-based model for estimating 3D human pose and shape from monocular videos. The approach combines independent token-based representations with temporal modeling to tackle challenges inherent in the task, such as depth ambiguity and temporal inconsistency.
Methodology Overview
The proposed model architecture introduces novel independent tokens which serve to encode different aspects of human body modeling:
- Joint Rotation Tokens: These tokens capture the 3D rotational data of each human skeletal joint, including global rotations.
- Shape Token: A single token encodes the overall body shape, allowing the model to predict shape parameters.
- Camera Token: Encodes information related to the camera parameters, including scale and translation for 2D projection.
The approach avoids conventional parameter-regression pipelines, which initialize from a prior mean pose and refine through iterative error feedback. Instead, the tokens progressively interact with image features inside the Transformer, allowing the model to learn from large-scale training data without iterative refinement and to predict SMPL parameters directly as a sequence-to-sequence task.
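The token–feature interaction described above can be sketched as a single cross-attention step in which the learned tokens act as queries over image patch features. The dimensions, projection matrices, and token counts below are illustrative assumptions for a minimal NumPy sketch, not the paper's exact architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def token_cross_attention(tokens, feats, d=64, seed=0):
    """Learned tokens (queries) attend over image patch features (keys/values)."""
    rng = np.random.default_rng(seed)
    # Random projections stand in for trained weight matrices.
    Wq = rng.standard_normal((tokens.shape[-1], d)) / np.sqrt(tokens.shape[-1])
    Wk = rng.standard_normal((feats.shape[-1], d)) / np.sqrt(feats.shape[-1])
    Wv = rng.standard_normal((feats.shape[-1], d)) / np.sqrt(feats.shape[-1])
    Q, K, V = tokens @ Wq, feats @ Wk, feats @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d))  # (num_tokens, num_patches)
    return attn @ V                       # each token gathers image evidence

rng = np.random.default_rng(1)
num_joints = 24  # SMPL uses 24 joint rotations, including the global root
tokens = rng.standard_normal((num_joints + 2, 64))  # + shape token + camera token
feats = rng.standard_normal((196, 64))  # e.g. a 14x14 grid of patch features
out = token_cross_attention(tokens, feats)
print(out.shape)  # (26, 64)
```

After several such interaction layers, each token's output would be decoded into its respective SMPL rotation, shape, or camera parameters.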
Temporal Modeling
To maintain temporal coherence in videos, the paper implements a specialized temporal Transformer model focusing on joint rotational sequences. By modeling each joint independently over time, the model attends to specific rotational patterns of motion, which empirically improves joint stability and reduces jitter in video outputs. This is observed in the superior perceptual smoothness and fidelity of motion in qualitative evaluations.
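The per-joint temporal modeling can be illustrated by running self-attention along the time axis separately for each joint, so that a joint's features are mixed only with that same joint's features in other frames. This is a minimal NumPy sketch under assumed shapes, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def per_joint_temporal_attention(seq):
    """Self-attention along time, run independently for each joint.

    seq: (T, J, D) array of per-frame, per-joint features.
    """
    T, J, D = seq.shape
    out = np.empty_like(seq)
    for j in range(J):                        # each joint sees only its own trajectory
        X = seq[:, j, :]                      # (T, D) trajectory of joint j
        attn = softmax(X @ X.T / np.sqrt(D))  # (T, T) temporal attention weights
        out[:, j, :] = attn @ X               # temporally mixed features
    return out

seq = np.random.default_rng(0).standard_normal((16, 24, 32))  # 16 frames, 24 joints
smoothed = per_joint_temporal_attention(seq)
print(smoothed.shape)  # (16, 24, 32)
```

Restricting attention to each joint's own trajectory is what lets the model specialize in per-joint rotational patterns rather than averaging over the whole body.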
Experimental Results
The proposed model shows significant performance improvements over existing state-of-the-art methods. On the 3DPW dataset, it achieves a PA-MPJPE of 42.0 mm, a notable improvement that underscores the efficacy of independent token representations in capturing both spatial and temporal structure. It also performs competitively on Human3.6M, demonstrating robustness across datasets. The benefit of the temporal design is particularly visible in the reduced local joint jitter in video results.
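For reference, PA-MPJPE is the mean per-joint position error computed after rigidly aligning the prediction to the ground truth with a similarity (Procrustes) transform, so it measures pose accuracy independent of global rotation, translation, and scale. A minimal NumPy sketch of the standard definition (the paper's exact evaluation code may differ):

```python
import numpy as np

def pa_mpjpe(pred, gt):
    """PA-MPJPE: Procrustes-align pred to gt, then mean per-joint error.

    pred, gt: (J, 3) arrays of 3D joint positions.
    """
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    P, G = pred - mu_p, gt - mu_g          # center both point sets
    U, S, Vt = np.linalg.svd(G.T @ P)      # optimal rotation via SVD (Kabsch)
    if np.linalg.det(U @ Vt) < 0:          # avoid reflections
        U[:, -1] *= -1
        S[-1] *= -1
    R = U @ Vt
    scale = S.sum() / (P ** 2).sum()       # optimal isotropic scale
    aligned = scale * P @ R.T + mu_g       # similarity-aligned prediction
    return np.linalg.norm(aligned - gt, axis=1).mean()

# A prediction that differs from ground truth only by a similarity
# transform should score (near) zero after alignment.
rng = np.random.default_rng(0)
gt = rng.standard_normal((24, 3))
theta = 0.3
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
pred = 2.0 * gt @ Rz.T + np.array([0.1, -0.2, 0.3])
print(round(pa_mpjpe(pred, gt), 6))  # 0.0
```

In benchmark reporting the joint coordinates are in millimeters, so the returned value is directly comparable to figures like 42.0 mm.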
Theoretical and Practical Implications
This token-based approach presents several theoretical and practical implications:
- Decoupled Representation: Because the tokens are learned embeddings rather than functions of the input image, the model demonstrates that pose, shape, and camera representations can be partially detached from direct image-feature dependency.
- Temporal Robustness: The model’s ability to learn and utilize rotational temporal patterns suggests a path forward for more stable motion representations in computer vision tasks.
- Generalization Across Contexts: Despite differences in dataset characteristics, the approach maintains strong performance metrics, indicating robust generalizability.
Future Directions
Future research could explore:
- Extension to Multi-modal Inputs: Incorporating additional input channels such as depth or infrared could be explored using the token-based framework.
- Expanded Token Designs: Introducing additional or hierarchical tokens might provide more granular control over specific features or body regions.
- Real-time Applications: Optimizing the model further for computational efficiency could enable practical real-time applications in AR/VR and body dynamics analysis.
The paper establishes strong results for 3D human pose and shape estimation in monocular video settings, showing that Transformer-based independent tokens can improve both precision and temporal consistency. This represents a meaningful step forward in human motion modeling within computer vision, with potential applications in digital content creation, fitness analysis, and interactive systems.