
PoseVocab: Learning Joint-structured Pose Embeddings for Human Avatar Modeling (2304.13006v2)

Published 25 Apr 2023 in cs.CV

Abstract: Creating pose-driven human avatars is about modeling the mapping from the low-frequency driving pose to high-frequency dynamic human appearances, so an effective pose encoding method that can encode high-fidelity human details is essential to human avatar modeling. To this end, we present PoseVocab, a novel pose encoding method that encourages the network to discover the optimal pose embeddings for learning the dynamic human appearance. Given multi-view RGB videos of a character, PoseVocab constructs key poses and latent embeddings based on the training poses. To achieve pose generalization and temporal consistency, we sample key rotations in $so(3)$ of each joint rather than the global pose vectors, and assign a pose embedding to each sampled key rotation. These joint-structured pose embeddings not only encode the dynamic appearances under different key poses, but also factorize the global pose embedding into joint-structured ones to better learn the appearance variation related to the motion of each joint. To improve the representation ability of the pose embedding while maintaining memory efficiency, we introduce feature lines, a compact yet effective 3D representation, to model more fine-grained details of human appearances. Furthermore, given a query pose and a spatial position, a hierarchical query strategy is introduced to interpolate pose embeddings and acquire the conditional pose feature for dynamic human synthesis. Overall, PoseVocab effectively encodes the dynamic details of human appearance and enables realistic and generalized animation under novel poses. Experiments show that our method outperforms other state-of-the-art baselines both qualitatively and quantitatively in terms of synthesis quality. Code is available at https://github.com/lizhe00/PoseVocab.

Citations (35)

Summary

  • The paper introduces PoseVocab, a joint-structured pose encoding method that decomposes global pose information into individual joint components for improved dynamic appearance modeling.
  • It employs feature line representations and a hierarchical query strategy to synthesize nuanced, temporally consistent animations, surpassing existing techniques.
  • Empirical results demonstrate significant gains in PSNR and SSIM, capturing detailed garment wrinkles and dynamic textures for practical digital avatar applications.

Overview of "PoseVocab: Learning Joint-structured Pose Embeddings for Human Avatar Modeling"

The paper presents PoseVocab, a novel methodology for encoding human poses, aimed at improving the fidelity and realism of animatable human avatars. The work addresses a central challenge in avatar modeling: mapping low-frequency driving poses to high-frequency dynamic human appearances.

Methodology

PoseVocab introduces a joint-structured pose encoding approach that seeks to decompose the complex problem of pose-dependent dynamics into manageable components. The process begins with the analysis of multi-view RGB videos of subjects to construct key poses and corresponding latent embeddings. These embeddings are encapsulated within a pose vocabulary, termed PoseVocab. The innovation here lies in joint-structured pose embeddings, an approach that partitions global pose information into individual joint components, allowing more precise modeling of dynamic appearance changes at each joint.
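To make the joint-structured idea concrete, the sketch below shows one plausible way to store per-joint key rotations with learnable embeddings and blend them for a query pose. This is an illustrative assumption, not the authors' released implementation: the class name, tensor sizes, and the inverse-distance top-k blending are all placeholders.

```python
# A minimal sketch (not the authors' code) of a joint-structured pose
# vocabulary: each joint stores K key rotations and one learnable
# embedding per key rotation. All sizes are illustrative assumptions.
import torch
import torch.nn as nn

class JointPoseVocab(nn.Module):
    def __init__(self, num_joints=24, keys_per_joint=64, embed_dim=32):
        super().__init__()
        # Key rotations per joint as axis-angle vectors in so(3).
        # In the paper these come from the training poses; random here.
        self.register_buffer(
            "key_rotations", torch.randn(num_joints, keys_per_joint, 3)
        )
        # One learnable embedding per (joint, key rotation) pair.
        self.embeddings = nn.Parameter(
            torch.zeros(num_joints, keys_per_joint, embed_dim)
        )

    def forward(self, joint_rotations):
        # joint_rotations: (num_joints, 3) axis-angle pose of the query frame.
        # Blend each joint's embedding from its nearest key rotations
        # using inverse-distance weights over the top-k neighbors.
        dists = torch.cdist(
            joint_rotations.unsqueeze(1),  # (J, 1, 3)
            self.key_rotations             # (J, K, 3)
        ).squeeze(1)                       # (J, K)
        topk = torch.topk(dists, k=4, largest=False)
        weights = 1.0 / (topk.values + 1e-6)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # (J, 4)
        neighbor_embeds = torch.gather(
            self.embeddings, 1,
            topk.indices.unsqueeze(-1).expand(-1, -1, self.embeddings.shape[-1])
        )                                  # (J, 4, D)
        return (weights.unsqueeze(-1) * neighbor_embeds).sum(dim=1)  # (J, D)
```

Because each joint is queried independently, unseen global poses can still be encoded as long as each individual joint rotation lies near some key rotation, which is the intuition behind the method's pose generalization.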

Crucial to PoseVocab’s efficacy is the feature line representation of pose embeddings, which enhances memory efficiency while maintaining strong representation capacity. The authors also introduce a hierarchical query strategy that interpolates pose embeddings, allowing the synthesis of nuanced and temporally consistent human animations. This involves representing the pose as per-joint rotations in $so(3)$ and interpolating between the sampled key rotations.
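The feature-line idea can be pictured as three learnable 1D feature grids, one per coordinate axis, whose interpolated values are combined into a per-point feature. The sketch below is a minimal guess at such a structure; the resolution, channel count, and the choice to concatenate per-axis features are assumptions, not the paper's exact design.

```python
# A minimal sketch, assuming a "feature line" stores learnable features
# along each coordinate axis; a 3D point's feature is assembled by
# linearly interpolating each axis line and concatenating the results.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureLines(nn.Module):
    def __init__(self, resolution=128, channels=16):
        super().__init__()
        # Three 1D feature lines, one per axis: (3, channels, resolution).
        self.lines = nn.Parameter(torch.zeros(3, channels, resolution))

    def forward(self, points):
        # points: (N, 3) query positions normalized to [-1, 1]^3.
        feats = []
        for axis in range(3):
            coords = points[:, axis]  # (N,)
            # grid_sample expects a 4D input and an (x, y) grid, so the
            # 1D line is treated as a 1-pixel-tall image sampled along x.
            grid = torch.stack(
                [coords, torch.zeros_like(coords)], dim=-1
            ).view(1, 1, -1, 2)                          # (1, 1, N, 2)
            line = self.lines[axis].unsqueeze(0)         # (1, C, R)
            sampled = F.grid_sample(
                line.unsqueeze(2),                       # (1, C, 1, R)
                grid, align_corners=True
            )                                            # (1, C, 1, N)
            feats.append(sampled.view(-1, coords.shape[0]).t())  # (N, C)
        return torch.cat(feats, dim=-1)                  # (N, 3 * C)
```

Storing three lines of length R instead of a dense R^3 grid is what yields the memory savings: parameters grow linearly rather than cubically with resolution.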

Experimental Results

The authors provide strong empirical evidence for PoseVocab’s superior performance over existing methods, such as SCANimate, SMPL-based approaches, and Ani-NeRF, in capturing dynamic and detailed human appearances. Quantitatively, PoseVocab outperforms these methods in metrics including PSNR and SSIM, suggesting significant advancements in visual quality and fidelity. Qualitatively, PoseVocab demonstrates enhanced detail in garment wrinkles and dynamic textures, effectively closing the gap between captured input and modeled output under both known and novel poses.
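For readers unfamiliar with the two reported metrics, the snippet below computes PSNR and SSIM on toy data with scikit-image. It only illustrates the metrics themselves, not the paper's evaluation pipeline.

```python
# Illustration of the reported metrics on toy data (not the paper's code).
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

rendered = np.random.rand(256, 256, 3)  # stand-in for a synthesized frame
ground_truth = np.clip(rendered + 0.05 * np.random.randn(256, 256, 3), 0, 1)

psnr = peak_signal_noise_ratio(ground_truth, rendered, data_range=1.0)
ssim = structural_similarity(ground_truth, rendered, channel_axis=-1, data_range=1.0)
print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.4f}")  # higher is better for both
```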

Theoretical and Practical Implications

Theoretically, this research contributes to the broader understanding of pose encoding in parametric character modeling. It illustrates how localized encoding of pose information can lead to superior performance in capturing dynamic appearances. The introduction of feature lines and a hierarchical querying strategy enhances the adaptability and detail of animated avatars.

Practically, PoseVocab holds potential applications across various industries reliant on digital character modeling, from video game production to virtual reality environments and cinematic effects. Its ability to generalize well to novel poses while maintaining high fidelity makes it a valuable tool for animators and digital artists striving for realism and efficiency.

Future Directions

Future developments could extend PoseVocab to more complex clothing models, such as loose garments, which pose their own unique modeling challenges. Integrating PoseVocab into mixed reality frameworks, leveraging real-time streaming data, and further improving computational efficiency also remain promising avenues for exploration.

In summary, PoseVocab represents a significant advancement in human avatar modeling, providing an innovative solution for encoding pose-driven dynamics. Its introduction enriches both the theoretical landscape of computer graphics and its practical applications in creating highly realistic digital humans.