Text-Driven Synthesis of 3D Human-Object Interactions using Diffusion Models
The paper "HOI-Diff: Text-Driven Synthesis of 3D Human-Object Interactions using Diffusion Models" addresses the synthesis of realistic 3D human-object interactions (HOIs) guided by textual descriptions. The work builds on denoising diffusion models, which have proven effective at generating high-quality outputs across varied domains, including image synthesis. The authors decompose the complex problem of HOI generation into distinct modules, making the broader problem space tractable.
Methodological Advances
The proposed methodology, termed HOI-Diff, adopts a modular approach targeting each sub-task within HOI synthesis. At its core is a dual-branch diffusion architecture, the Human-Object Interaction Diffusion Model (HOI-DM), which generates human and object motions in parallel. A cross-attention communication module exchanges information between the two branches, keeping the generated human and object motion trajectories coherent. A second component, the affordance prediction diffusion model (APDM), estimates the contacting regions between the human body and the object. Because the APDM operates independently of the HOI-DM's outputs, it can provide corrective feedback to recover from generative errors and introduce diversity into the predicted contact points.
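The cross-attention exchange between the two branches can be sketched as follows. This is a minimal NumPy illustration of the general mechanism, not the authors' implementation; the function name `cross_attend`, the residual update, and the dimensions are assumptions for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys, values):
    # Scaled dot-product attention: queries come from one branch,
    # keys/values from the other, so each branch can condition its
    # denoising on the other branch's intermediate features.
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ values

rng = np.random.default_rng(0)
T, d = 8, 16                        # motion length, feature dim (illustrative)
human_feats = rng.normal(size=(T, d))
object_feats = rng.normal(size=(T, d))

# Bidirectional exchange: each branch's features are refined with
# information attended from the other branch (residual update).
human_out = human_feats + cross_attend(human_feats, object_feats, object_feats)
object_out = object_feats + cross_attend(object_feats, human_feats, human_feats)
```

In a full model this exchange would sit inside transformer layers of each denoising branch; the sketch only shows how one branch's features can inform the other's.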
Affordance-guided interaction correction underpins the efficacy of the approach. By incorporating the estimated contact points into a classifier-guidance mechanism, the method improves both the physical plausibility and the semantic fidelity of the generated interactions, reducing unrealistic artifacts such as floating objects.
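The guidance idea can be illustrated with a toy gradient step: at each denoising iteration, the sample is nudged along the negative gradient of a contact loss that pulls an estimated body contact joint toward the corresponding point on the object. This is a simplified sketch of classifier guidance in general, with hypothetical names (`contact_loss`, `guided_step`) and a single 3D joint standing in for a full motion sequence.

```python
import numpy as np

def contact_loss(human_joint, object_point):
    # Squared distance between an estimated contact joint on the
    # body and the matching point on the object surface.
    return np.sum((human_joint - object_point) ** 2)

def guided_step(human_joint, object_point, scale=0.1):
    # One guidance update: move the sample along the negative
    # gradient of the contact loss (the gradient of the squared
    # distance is 2 * (human_joint - object_point)).
    grad = 2.0 * (human_joint - object_point)
    return human_joint - scale * grad

joint = np.array([0.5, 1.2, 0.3])     # e.g. a wrist joint (illustrative)
surface = np.array([0.6, 1.0, 0.35])  # estimated contact point on the object

before = contact_loss(joint, surface)
for _ in range(20):
    joint = guided_step(joint, surface)
after = contact_loss(joint, surface)
```

In the actual method this gradient would be blended into the diffusion sampling loop alongside the model's own denoising prediction, rather than applied in isolation.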
Experimental Evaluation
Evaluation is conducted on the BEHAVE dataset, augmented with textual descriptions to support text-driven synthesis. The model is also tested on the OMOMO dataset, which features hand-object manipulation. The experimental results demonstrate the model's ability to generate diverse 3D motions that are semantically aligned with the input text prompts. Generation quality is quantified using metrics adopted from HumanML3D, including Fréchet Inception Distance (FID), R-Precision, and Diversity, supplemented by user studies that qualitatively compare model outputs against baselines.
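The FID metric compares the statistics of feature embeddings of generated and real motions by fitting a Gaussian to each set and computing the Fréchet distance between them. The sketch below simplifies to diagonal covariances for clarity; the full metric uses the matrix square root of the covariance product. The data here is random stand-in embeddings, not motion features.

```python
import numpy as np

def frechet_distance_diag(feats_real, feats_gen):
    # Frechet distance between Gaussians fit to two feature sets,
    # simplified to diagonal covariances for illustration.
    mu1, mu2 = feats_real.mean(0), feats_gen.mean(0)
    v1, v2 = feats_real.var(0), feats_gen.var(0)
    return np.sum((mu1 - mu2) ** 2) + np.sum(v1 + v2 - 2 * np.sqrt(v1 * v2))

rng = np.random.default_rng(1)
real = rng.normal(0.0, 1.0, size=(1000, 32))    # stand-in "real" embeddings
close = rng.normal(0.05, 1.0, size=(1000, 32))  # distribution near the real one
far = rng.normal(1.0, 2.0, size=(1000, 32))     # distribution far from it
```

A lower score indicates the generated distribution is closer to the real one, which is why FID decreases as generation quality improves.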
Implications and Future Directions
The implications of HOI-Diff extend to virtual reality (VR), augmented reality (AR), and filmmaking applications, where realistic human-object interactions are paramount. By demonstrating generalization across diverse objects, the research paves the way for exploring more dynamic and complex interaction scenarios involving multiple objects and humans. Future work may improve generation accuracy by leveraging affordance prediction models pre-trained on large-scale 3D object interaction datasets, enabling more contextually appropriate physical interactions between humans and objects.
Additionally, while HOI-Diff demonstrates considerable robustness in testing, limitations such as increased inference cost and substantial computational requirements could motivate future work on more efficient model architectures. Overall, the paper establishes a well-grounded framework that bridges the gap between textual descriptions and realistic 3D motion synthesis, laying a foundation for future advances in this area of AI research.