HOIDiNi: Human-Object Interaction through Diffusion Noise Optimization

Published 18 Jun 2025 in cs.CV | (2506.15625v1)

Abstract: We present HOIDiNi, a text-driven diffusion framework for synthesizing realistic and plausible human-object interaction (HOI). HOI generation is extremely challenging since it induces strict contact accuracies alongside a diverse motion manifold. While current literature trades off between realism and physical correctness, HOIDiNi optimizes directly in the noise space of a pretrained diffusion model using Diffusion Noise Optimization (DNO), achieving both. This is made feasible thanks to our observation that the problem can be separated into two phases: an object-centric phase, primarily making discrete choices of hand-object contact locations, and a human-centric phase that refines the full-body motion to realize this blueprint. This structured approach allows for precise hand-object contact without compromising motion naturalness. Quantitative, qualitative, and subjective evaluations on the GRAB dataset alone clearly indicate HOIDiNi outperforms prior works and baselines in contact accuracy, physical validity, and overall quality. Our results demonstrate the ability to generate complex, controllable interactions, including grasping, placing, and full-body coordination, driven solely by textual prompts. https://hoidini.github.io.


Authors (4)

Summary

The paper introduces a novel framework, HOIDiNi, that leverages Diffusion Noise Optimization to synthesize realistic human-object interactions.
It employs a two-phase strategy with an object-centric phase for contact prediction and a human-centric phase for natural full-body motion.
Experimental results on the GRAB dataset demonstrate superior interaction accuracy and motion realism compared to baseline methods.

HOIDiNi: Human-Object Interaction through Diffusion Noise Optimization

The paper introduces HOIDiNi, a text-driven diffusion framework for generating human-object interactions that builds on Diffusion Noise Optimization (DNO). Human-object interaction modeling is difficult because it requires both accurate contacts and natural body motion, and existing methods typically trade realism for precision or vice versa. HOIDiNi aims to achieve both by optimizing directly in the noise space of a pretrained diffusion model.

Diffusion Noise Optimization (DNO)

The core of HOIDiNi's approach is DNO, a test-time optimization method built on denoising diffusion models. DNO optimizes in the noise space to steer generation toward desired outcomes while staying within the distribution learned by the diffusion model. This makes it possible to satisfy demanding motion synthesis constraints without sacrificing realism.
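
The idea of optimizing noise rather than the output can be illustrated with a toy sketch. This is not the paper's implementation: `toy_sampler`, `numerical_grad`, and `optimize_noise` are hypothetical names, the hand-written update stands in for a learned denoiser, and finite differences stand in for the backpropagation a real system would likely use.

```python
import numpy as np

DATA_MEAN = np.array([1.0, -1.0, 0.5])  # assumed "learned" data mean (toy)

def toy_sampler(z, steps=5):
    """Deterministic stand-in for a diffusion sampler: noise z -> sample x."""
    x = np.array(z, dtype=float)
    for _ in range(steps):
        x = x + 0.2 * (DATA_MEAN - x)  # a real model would call a denoiser here
    return x

def numerical_grad(f, z, eps=1e-5):
    """Central finite-difference gradient of a scalar function f at z."""
    g = np.zeros_like(z)
    for i in range(z.size):
        d = np.zeros_like(z)
        d[i] = eps
        g[i] = (f(z + d) - f(z - d)) / (2 * eps)
    return g

def optimize_noise(z0, loss_fn, lr=2.0, iters=100):
    """Gradient descent on the noise z, not on the sample itself."""
    z = np.array(z0, dtype=float)
    for _ in range(iters):
        z = z - lr * numerical_grad(lambda v: loss_fn(toy_sampler(v)), z)
    return z

# Task loss (assumed for illustration): first output coordinate should reach 1.5.
loss_fn = lambda x: (x[0] - 1.5) ** 2

z0 = np.random.default_rng(0).standard_normal(3)
z_opt = optimize_noise(z0, loss_fn)
x_opt = toy_sampler(z_opt)
```

Only the noise is updated, so every candidate output still passes through the sampler. In this toy, the noise coordinates that the loss does not touch are left exactly as sampled.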

Structured Optimization Phases

The paper separates the problem into two optimization phases, each addressing a different demand of human-object interaction (HOI):

Object-Centric Phase: This phase primarily makes the discrete choices of where the hand contacts the object. It produces the object's motion together with hand-object contact pairs, forming a blueprint that guides the full-body motion. Rather than relying on heuristics, the model predicts the contact pairs, which keeps them consistent across frames and adapted to the object's shape and motion.
Human-Centric Phase: This phase refines the full-body motion so that it realizes the blueprint with precise hand-object contact. The motion is steered to follow the contacts and object trajectory from the previous phase while preserving natural body posture and keeping the human pose synchronized with the object.

This split handles the discrete and continuous parts of the optimization separately, improving both contact precision and overall motion realism while staying on the learned motion manifold.
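
A minimal sketch of how contact pairs could turn into an optimization target for the second phase is shown here. The pair format `(hand_index, object_index)`, the function names, and the squared-distance loss are assumptions for illustration, not the paper's definitions; the nearest-neighbor function is the kind of heuristic pairing that predicted pairs are meant to replace.

```python
import numpy as np

def contact_loss(hand_pts, obj_pts, pairs):
    """Mean squared distance over (hand_index, object_index) contact pairs."""
    pairs = np.asarray(pairs, dtype=int)
    diff = hand_pts[pairs[:, 0]] - obj_pts[pairs[:, 1]]
    return float(np.mean(np.sum(diff ** 2, axis=1)))

def nearest_neighbor_pairs(hand_pts, obj_pts):
    """Heuristic baseline: pair each hand point with its closest object point."""
    dists = np.linalg.norm(hand_pts[:, None, :] - obj_pts[None, :, :], axis=2)
    return np.stack([np.arange(len(hand_pts)), dists.argmin(axis=1)], axis=1)

# Toy frame: two fingertip points and three object surface points.
hand = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
obj = np.array([[0.0, 0.0, 0.1], [1.0, 0.0, 0.3], [0.5, 1.0, 0.0]])

predicted = [(0, 0), (1, 2)]  # hypothetical stand-in for object-centric phase output
heuristic = nearest_neighbor_pairs(hand, obj)

loss_predicted = contact_loss(hand, obj, predicted)
loss_heuristic = contact_loss(hand, obj, heuristic)
```

The nearest-neighbor pairing always gives the smallest distance for the current pose, so it only reinforces wherever the hand already is. Fixed pairs instead specify where contact should occur, which gives the body optimization a target that does not shift from frame to frame.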

Strong Numerical Results and Evaluations

Empirical evaluation on the GRAB dataset shows that HOIDiNi outperforms baseline approaches in motion realism and interaction accuracy. It consistently achieves lower penetration and floating errors, indicating better physical validity. User studies show a substantial preference for HOIDiNi over competing methods when synthesizing complex interactions driven solely by textual prompts. The study also finds that predicted contact pairs produce more realistic results than nearest-neighbor heuristics.
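
This summary does not give the exact metric definitions, so the following is only one plausible way to measure penetration and floating for a single frame. It assumes signed distances from hand points to the object surface (negative inside the object) and uses a sphere as the object so the distance is analytic; all names are hypothetical.

```python
import numpy as np

def sphere_sdf(points, center, radius):
    """Signed distance to a sphere: negative inside, positive outside."""
    return np.linalg.norm(points - center, axis=1) - radius

def penetration_and_floating(sdf_values):
    """Return (deepest penetration, gap to the surface) for one frame."""
    closest = float(np.min(sdf_values))
    return max(0.0, -closest), max(0.0, closest)

center, radius = np.zeros(3), 1.0
inside = np.array([[0.0, 0.0, 0.9], [0.0, 0.0, 1.5]])    # one point is inside the sphere
hovering = np.array([[0.0, 0.0, 1.2], [0.0, 0.0, 2.0]])  # no point touches the sphere

pen_a, float_a = penetration_and_floating(sphere_sdf(inside, center, radius))
pen_b, float_b = penetration_and_floating(sphere_sdf(hovering, center, radius))
```

Under this assumed formulation a frame is penalized either for the hand sinking into the object or for hovering above it during an intended grasp, and both values are zero only when the closest hand point lies on the surface.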

Implications and Future Directions

HOIDiNi enhances the generation of human-object interactions with potential applications in animation, robotics, and virtual reality. Its ability to drive interactions through textual descriptions presents a compelling way to control and automate complex motion synthesis tasks.

Future work may extend HOIDiNi to larger datasets, diversifying interaction scenarios and improving robustness in real-world applications. Integrating HOIDiNi with physics engines or handling unfamiliar objects are promising directions for better generalization. Improving efficiency through autoregressive sampling or refining the optimization constraints could also improve the speed and quality of HOI synthesis.

In conclusion, by combining contact precision with motion plausibility, HOIDiNi provides a foundation for further work on human-object interaction synthesis and more interactive digital environments.
