InterDreamer: Zero-Shot Text to 3D Dynamic Human-Object Interaction Generation Framework
Introduction
The synthesis of human motion conditioned on textual descriptions has made strides with diffusion models trained on large datasets that pair motion-capture data with textual annotations. Despite this advancement, generating 3D dynamic human-object interactions (HOIs) from textual inputs remains a challenge, primarily because large-scale interaction data fully annotated with detailed descriptions is scarce. "InterDreamer" is a novel framework designed to generate text-aligned 3D HOI sequences. By decoupling interaction semantics from interaction dynamics, it combines pre-trained LLMs with a world model that captures simple physics, enabling the generation of realistic interactions without training on paired text-interaction data.
Methodology
High-Level Planning
InterDreamer begins with a semantic analysis of the textual description, using LLMs to extract high-level interaction goals, including the target object and the nature of the interaction. The LLM also rewrites the input text, narrowing the distributional gap between free-form descriptions and the phrasing the downstream models were trained on, so that generated human motion and object interaction follow the textual guidance more closely.
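The planning stage described above can be sketched as follows. This is an illustrative reconstruction, not the paper's actual prompts or interface: `query_llm` is a hypothetical stub standing in for any chat-completion API, and the JSON schema (`object`, `action`, `body_parts`, `rewritten_text`) is an assumed structure for the extracted plan.

```python
# Hypothetical sketch of InterDreamer-style high-level planning: query an LLM
# to decompose a free-form description into a structured interaction plan.
import json

def query_llm(prompt: str) -> str:
    # Stub: a real system would call a chat-completion endpoint here.
    # We return a canned structured answer for illustration.
    return json.dumps({
        "object": "box",
        "action": "lift",
        "body_parts": ["left_hand", "right_hand"],
        "rewritten_text": "a person lifts a box with both hands",
    })

def plan_interaction(description: str) -> dict:
    """Extract the target object, action, and contacting body parts, and
    rewrite the text to match the downstream model's training distribution."""
    prompt = (
        "Given the interaction description below, answer in JSON with keys "
        "'object', 'action', 'body_parts', and 'rewritten_text'.\n"
        f"Description: {description}"
    )
    plan = json.loads(query_llm(prompt))
    # Basic validation: the downstream modules expect all four fields.
    for key in ("object", "action", "body_parts", "rewritten_text"):
        assert key in plan, f"LLM response missing '{key}'"
    return plan

plan = plan_interaction("someone picks the box up off the floor")
print(plan["object"], plan["action"])
```

The rewritten text is then what gets passed to the text-to-motion model, while the object and body-part fields steer the later stages.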
Low-Level Control
For generating initial human motion and object interaction states, InterDreamer integrates a text-to-motion model and an interaction retrieval model. These components process semantic information derived from LLMs to produce initial human poses and object states that are semantically aligned with the target interaction text, setting the stage for dynamic interaction rollouts.
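A minimal sketch of this initialization step, under simplifying assumptions: a stubbed text-to-motion model supplies an initial human pose, and retrieval is reduced to a lookup keyed by (object, action). All names, the pose parameterization, and the database contents are illustrative, not the paper's actual components.

```python
# Illustrative sketch of the low-level control stage: combine a text-to-motion
# stub with a toy interaction-retrieval database to seed the rollout.
from dataclasses import dataclass

@dataclass
class InitialState:
    human_pose: list   # e.g. flattened joint rotations (illustrative)
    object_pos: tuple  # object centroid in world coordinates
    object_rot: tuple  # object orientation as Euler angles

# Toy retrieval database mapping interaction goals to initial object states.
RETRIEVAL_DB = {
    ("box", "lift"): ((0.4, 0.0, 0.1), (0.0, 0.0, 0.0)),
    ("chair", "sit"): ((0.0, -0.5, 0.45), (0.0, 3.14, 0.0)),
}

def text_to_motion(rewritten_text: str) -> list:
    # Stub for a pre-trained text-to-motion model; returns a neutral pose.
    return [0.0] * 66  # 22 joints x 3 rotation parameters (SMPL-like)

def init_interaction(obj: str, action: str, rewritten_text: str) -> InitialState:
    """Combine the text-to-motion output with a retrieved object state."""
    pose = text_to_motion(rewritten_text)
    pos, rot = RETRIEVAL_DB[(obj, action)]
    return InitialState(human_pose=pose, object_pos=pos, object_rot=rot)

state = init_interaction("box", "lift", "a person lifts a box with both hands")
```

The resulting `InitialState` is what the world model advances frame by frame in the rollout stage.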
World Model
At the core of InterDreamer is a novel world model that learns interaction dynamics, decoupled from semantics, from motion capture data and predicts the future states of objects influenced by human interaction. Notably, these dynamics are steered by vertex-level control over contact regions sampled on the human body, allowing the model to forecast object motion by focusing on the areas of actual human-object contact.
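To make the vertex-level control concrete, here is an illustrative single step of such a rollout. This is not the paper's learned architecture: the learned dynamics are replaced with an assumed rigid-follow rule in which, while contact is active, the object translates with the mean displacement of the sampled contact vertices.

```python
# Illustrative world-model step: the object's next state is predicted from
# the motion of sampled contact vertices on the human body.
import numpy as np

def world_model_step(object_pos, contact_verts_t, contact_verts_t1, in_contact):
    """Predict the object's next centroid position.

    object_pos:        (3,) current object centroid
    contact_verts_t:   (K, 3) sampled contact-region vertices at time t
    contact_verts_t1:  (K, 3) the same vertices at time t+1
    in_contact:        whether the human-object contact is active
    """
    if not in_contact:
        return object_pos  # no contact: object stays put (gravity ignored)
    displacement = (contact_verts_t1 - contact_verts_t).mean(axis=0)
    return object_pos + displacement

# Example rollout: contact vertices all move up by 0.05 m per frame.
pos = np.zeros(3)
verts = np.random.default_rng(0).uniform(-0.1, 0.1, size=(8, 3))
for _ in range(10):
    next_verts = verts + np.array([0.0, 0.0, 0.05])
    pos = world_model_step(pos, verts, next_verts, in_contact=True)
    verts = next_verts
print(pos)  # object has risen 0.5 m after 10 frames
```

Restricting the prediction to sampled contact vertices, rather than the full body mesh, is what lets the dynamics model concentrate on the regions that actually drive the object's motion.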
Experimental Results
InterDreamer is extensively evaluated on the BEHAVE and CHAIRS datasets, demonstrating its ability to generate coherent and realistic interaction sequences that closely follow the textual directives. Its zero-shot text-to-HOI generation is compared against various baselines, showing substantial improvements in capturing the nuances of realistic interactions.
Implications and Future Directions
InterDreamer represents a leap towards intuitive and expressive methods for generating dynamic 3D human-object interactions directly from textual descriptions. Its innovative approach to decoupling interaction semantics and dynamics has broader implications for the development of more generalized and robust AI systems capable of understanding and interacting with the physical world in a human-like manner. The framework opens avenues for future research into more complex interactions, the integration of multi-modal data, and the exploration of advanced training strategies to further enhance the quality and diversity of generated interactions.
In conclusion, InterDreamer sets a new precedent for text-guided human-object interaction generation, paving the way for advancements in interactive applications, virtual reality, and animation, among other fields, by enabling more natural and intuitive creation of complex interactive scenes directly from textual descriptions. Its successful leveraging of existing models and novel world model architecture for zero-shot learning showcases the untapped potential of AI in understanding and mimicking complex real-world interactions.