GOAL: Generating 4D Whole-Body Motion for Hand-Object Grasping (2112.11454v2)

Published 21 Dec 2021 in cs.CV

Abstract: Generating digital humans that move realistically has many applications and is widely studied, but existing methods focus on the major limbs of the body, ignoring the hands and head. Hands have been separately studied, but the focus has been on generating realistic static grasps of objects. To synthesize virtual characters that interact with the world, we need to generate full-body motions and realistic hand grasps simultaneously. Both sub-problems are challenging on their own and, together, the state-space of poses is significantly larger, the scales of hand and body motions differ, and the whole-body posture and the hand grasp must agree, satisfy physical constraints, and be plausible. Additionally, the head is involved because the avatar must look at the object to interact with it. For the first time, we address the problem of generating full-body, hand and head motions of an avatar grasping an unknown object. As input, our method, called GOAL, takes a 3D object, its position, and a starting 3D body pose and shape. GOAL outputs a sequence of whole-body poses using two novel networks. First, GNet generates a goal whole-body grasp with a realistic body, head, arm, and hand pose, as well as hand-object contact. Second, MNet generates the motion between the starting and goal pose. This is challenging, as it requires the avatar to walk towards the object with foot-ground contact, orient the head towards it, reach out, and grasp it with a realistic hand pose and hand-object contact. To achieve this, the networks exploit a representation that combines SMPL-X body parameters and 3D vertex offsets. We train and evaluate GOAL, both qualitatively and quantitatively, on the GRAB dataset. Results show that GOAL generalizes well to unseen objects, outperforming baselines. GOAL takes a step towards synthesizing realistic full-body object grasping.

Synthesis of Full-Body Motions for Object Grasping

This paper presents a novel approach for generating digital humans capable of executing realistic full-body movements aimed at grasping unknown 3D objects. The proposed methodology addresses a gap in existing techniques, which typically focus either on the major body limbs or on isolated hand grasps, without integrating full-body dynamics, object interaction, and head orientation. The central contribution of this work is a dual-network system that synthesizes coherent avatar motions, producing natural walking, reaching, and grasping behaviors given a 3D object, its position, and a starting body pose.

Methodological Framework

The research employs a two-stage network framework:

  1. Goal Network (GNet): This module uses a conditional variational auto-encoder to generate a final full-body grasp pose. The inputs to GNet include the 3D object, its spatial location, and the initial posture of the virtual human. It outputs parameters for the body, head, and hands, predicting realistic grasps from a learned distribution of body-object interactions.
  2. Motion Network (MNet): Conditioned on GNet's goal grasp pose, MNet infills the motion sequence between the start and goal poses. It generates sequential body poses auto-regressively, accounting for contextual interactions and producing realistic hand-object contact as the virtual human approaches and grasps the object (a minimal sketch of the two-stage pipeline follows this list).
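
The following is a minimal sketch of how such a two-stage pipeline could fit together, written in PyTorch. All layer sizes, feature dimensions, and interface names here are illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

BODY_DIM = 165    # assumed size of a flattened SMPL-X pose/shape vector
OBJ_DIM  = 1024   # assumed size of an object shape encoding
Z_DIM    = 16     # assumed latent size of the CVAE

class GNet(nn.Module):
    """CVAE decoder sketch: (latent, object, start pose) -> goal grasp pose."""
    def __init__(self):
        super().__init__()
        self.dec = nn.Sequential(
            nn.Linear(Z_DIM + OBJ_DIM + BODY_DIM, 512), nn.ReLU(),
            nn.Linear(512, BODY_DIM),
        )
    def forward(self, z, obj_feat, start_pose):
        return self.dec(torch.cat([z, obj_feat, start_pose], dim=-1))

class MNet(nn.Module):
    """Auto-regressive sketch: (current pose, goal pose) -> next pose."""
    def __init__(self):
        super().__init__()
        self.step = nn.Sequential(
            nn.Linear(2 * BODY_DIM, 512), nn.ReLU(),
            nn.Linear(512, BODY_DIM),
        )
    def forward(self, cur_pose, goal_pose):
        # Predict a residual so that standing still is easy to represent.
        return cur_pose + self.step(torch.cat([cur_pose, goal_pose], dim=-1))

def generate_motion(gnet, mnet, obj_feat, start_pose, n_frames=60):
    z = torch.randn(start_pose.shape[0], Z_DIM)   # sample the CVAE latent
    goal = gnet(z, obj_feat, start_pose)          # stage 1: goal grasp pose
    poses, cur = [start_pose], start_pose
    for _ in range(n_frames):                     # stage 2: infill the motion
        cur = mnet(cur, goal)
        poses.append(cur)
    return torch.stack(poses, dim=1)              # (batch, frames+1, BODY_DIM)
```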

To refine the network outputs, the paper introduces an optimization stage that uses predicted vertex offsets and other interaction cues, improving the realism and physical plausibility of the synthesized motion; a sketch of such a refinement step appears below.
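
As an illustration of this refinement idea, the sketch below fits body parameters to network-predicted vertex targets by gradient descent. The `body_model` callable, the contact weighting, and all hyperparameters are assumptions for illustration, not the paper's exact objective.

```python
import torch

def refine(params, body_model, target_verts, contact_mask, steps=100, lr=0.01):
    """Refine pose parameters so posed vertices match predicted targets.

    params:       initial body parameters (requires a differentiable body_model)
    body_model:   stand-in function mapping parameters to vertices (V, 3)
    target_verts: network-predicted vertex positions (V, 3)
    contact_mask: boolean mask of vertices predicted to be in contact
    """
    params = params.clone().requires_grad_(True)
    opt = torch.optim.Adam([params], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        verts = body_model(params)                        # (V, 3)
        err = (verts - target_verts).pow(2).sum(dim=-1)   # per-vertex error
        # Up-weight contact vertices (assumed weighting, for illustration).
        loss = err.mean() + 10.0 * err[contact_mask].mean()
        loss.backward()
        opt.step()
    return params.detach()
```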

Numerical Evaluation and Results

Key evaluations were conducted on the GRAB dataset, which contains comprehensive recordings of whole-body human-object interactions. GNet was found to produce poses of high fidelity, rated on par with ground-truth poses from the dataset, especially after the optimization step. A perceptual study conducted on Amazon Mechanical Turk confirmed that optimization significantly improved motion plausibility, including hand-object interaction and foot-ground contact.

Notably, the reported metrics indicate that MNet's motions approach the realism of recorded human movement, although the foot-ground evaluations identify residual foot sliding as an area needing improvement.
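
For concreteness, a foot-skating measure of the kind used in such foot-ground evaluations can be sketched as below; the exact metric in the paper may differ, and the z-up convention and thresholds here are assumptions.

```python
import torch

def foot_skate(foot_verts, ground_height=0.0, contact_thresh=0.005):
    """foot_verts: (frames, V, 3) trajectories of foot vertices, z-up assumed.

    Accumulates horizontal displacement while a vertex is within
    `contact_thresh` of the ground, i.e. while it should be planted.
    """
    in_contact = foot_verts[:-1, :, 2] < ground_height + contact_thresh
    disp = (foot_verts[1:, :, :2] - foot_verts[:-1, :, :2]).norm(dim=-1)
    return (disp * in_contact).sum() / in_contact.sum().clamp(min=1)
```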

Implications and Future Research

The proposed framework advances the ability to simulate realistic full-body motions for avatars in interactive 3D spaces, with substantial implications for gaming, VR, AR, and cinematic applications where virtual presence demands authenticity. The approach could also lay the groundwork for robotic control simulations that require human-like fluidity of interaction.

Moving forward, extending the model to longer approach distances and improving scene understanding and interaction are natural next steps. Furthermore, integration with broader human-environment interaction models could enable adaptive motion synthesis, extending the model's applicability across varied virtual and augmented reality scenarios.

In summary, this work contributes significantly to the understanding and implementation of holistic human movement synthesis, providing a nuanced and practical advancement in virtual human modeling for grasping interactions.

Authors (4)
  1. Omid Taheri (17 papers)
  2. Vasileios Choutas (12 papers)
  3. Michael J. Black (163 papers)
  4. Dimitrios Tzionas (35 papers)
Citations (105)