AvatarGO: Zero-shot 4D Human-Object Interaction Generation and Animation (2410.07164v1)

Published 9 Oct 2024 in cs.CV

Abstract: Recent advancements in diffusion models have led to significant improvements in the generation and animation of 4D full-body human-object interactions (HOI). Nevertheless, existing methods primarily focus on SMPL-based motion generation, which is limited by the scarcity of realistic large-scale interaction data. This constraint affects their ability to create everyday HOI scenes. This paper addresses this challenge using a zero-shot approach with a pre-trained diffusion model. Despite this potential, achieving our goals is difficult due to the diffusion model's lack of understanding of "where" and "how" objects interact with the human body. To tackle these issues, we introduce AvatarGO, a novel framework designed to generate animatable 4D HOI scenes directly from textual inputs. Specifically, 1) for the "where" challenge, we propose LLM-guided contact retargeting, which employs Lang-SAM to identify the contact body part from text prompts, ensuring precise representation of human-object spatial relations. 2) For the "how" challenge, we introduce correspondence-aware motion optimization that constructs motion fields for both human and object models using the linear blend skinning function from SMPL-X. Our framework not only generates coherent compositional motions, but also exhibits greater robustness in handling penetration issues. Extensive experiments with existing methods validate AvatarGO's superior generation and animation capabilities on a variety of human-object pairs and diverse poses. As the first attempt to synthesize 4D avatars with object interactions, we hope AvatarGO could open new doors for human-centric 4D content creation.

Authors (5)

Yukang Cao
Liang Pan
Kai Han
Kwan-Yee K. Wong
Ziwei Liu

Summary

Overview of AvatarGO: Zero-shot 4D Human-Object Interaction Generation and Animation

The paper "AvatarGO: Zero-shot 4D Human-Object Interaction Generation and Animation" presents an approach for synthesizing 4D avatars with human-object interactions (HOI). Existing methods in this domain rely mainly on SMPL-based motion generation, which is limited by the scarcity of large-scale, realistic interaction data and therefore struggles with everyday HOI scenes. To address this, the authors introduce AvatarGO, a framework that uses a pre-trained diffusion model for zero-shot generation of animatable 4D human-object interaction scenes from textual descriptions.

Key Innovations

LLM-guided Contact Retargeting: To address "where" objects interact with the human body, the paper proposes LLM-guided contact retargeting. It uses Lang-SAM to identify the contact body part from the text prompt, which makes the spatial relation between the human and the object more precise.
Correspondence-aware Motion Optimization: To address "how" objects interact with the human body, the framework introduces a motion optimization technique built on the linear blend skinning (LBS) function of SMPL-X. It constructs motion fields for both the human and the object model, which yields coherent compositional motion and greater robustness to penetration.
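To make the LBS idea above concrete, here is a minimal NumPy sketch of linear blend skinning applied to both human and object points. It is an illustration under stated assumptions, not the authors' implementation: the function names (`rodrigues`, `linear_blend_skinning`, `borrow_weights`, `demo`), the two-joint toy skeleton, and in particular the rule that each object point copies the skinning weights of its nearest human point are all hypothetical choices made for this example. The real SMPL-X model also includes shape and pose blend shapes and a kinematic chain, which are omitted here.

```python
import numpy as np


def rodrigues(axis_angle):
    """Convert an axis-angle vector of shape (3,) to a 3x3 rotation matrix."""
    theta = np.linalg.norm(axis_angle)
    if theta < 1e-8:
        return np.eye(3)
    k = axis_angle / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)


def linear_blend_skinning(points, weights, transforms):
    """Pose points with LBS.

    points:     (N, 3) rest-pose positions
    weights:    (N, J) skinning weights, each row sums to 1
    transforms: (J, 4, 4) rigid transform of each joint (rest -> posed)
    """
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    # Blend the joint transforms per point, then apply the blended transform.
    blended = np.einsum("nj,jab->nab", weights, transforms)
    posed = np.einsum("nab,nb->na", blended, homo)
    return posed[:, :3]


def borrow_weights(object_points, human_points, human_weights):
    """ASSUMPTION for this sketch: each object point copies the skinning
    weights of its nearest human point, so the object follows that body part."""
    dists = np.linalg.norm(
        object_points[:, None, :] - human_points[None, :, :], axis=-1)
    return human_weights[dists.argmin(axis=1)]


def demo():
    # Toy human: one point bound to joint 0 (static), one to joint 1 (the "hand").
    human_points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
    human_weights = np.array([[1.0, 0.0], [0.0, 1.0]])
    # Toy object: a single point held just beyond the hand.
    object_points = np.array([[1.1, 0.0, 0.0]])

    # Joint 0 stays fixed; joint 1 rotates 90 degrees about the z axis.
    transforms = np.stack([np.eye(4), np.eye(4)])
    transforms[1, :3, :3] = rodrigues(np.array([0.0, 0.0, np.pi / 2]))

    object_weights = borrow_weights(object_points, human_points, human_weights)
    posed_human = linear_blend_skinning(human_points, human_weights, transforms)
    posed_object = linear_blend_skinning(object_points, object_weights, transforms)
    return posed_human, posed_object
```

In this toy setup the object point inherits the hand's weights, so it rotates rigidly with the hand and keeps its offset from it. That shared motion field is the intuition behind driving the human and the object with the same skinning function.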

The authors report extensive experiments against existing methods, which they present as validating AvatarGO's stronger generation and animation quality across a variety of human-object pairs and diverse poses. Where previous methods often struggled with realistic depiction and motion coherence, AvatarGO is reported to be more robust in capturing complex interactions.

Implications and Future Developments

AvatarGO's contributions are relevant to fields such as AR/VR and game development, where realistic 4D human-object interactions are pivotal. By addressing both where and how objects interact with the body, it opens avenues for creating more lifelike virtual worlds with minimal manual modeling effort.

Future developments may focus on extending the framework to non-rigid object interactions and to scenarios without continuous contact, such as independently moving objects (e.g., dribbling a basketball). Further refinement of LLMs and diffusion models could also lead to more precise and dynamic interaction generation.

Conclusion

AvatarGO advances 4D human-object interaction generation by addressing two limitations of prior work: where objects contact the body and how they move with it. The authors describe it as the first attempt to synthesize 4D avatars with object interactions, combining LLM-guided contact retargeting with correspondence-aware motion optimization to generate animatable 4D avatars. The work provides a foundation for further exploration and application in human-centric 4D content creation.

