- The paper introduces UniHSI, redefining interaction as chains of contacts for versatile, language-driven task planning.
- It integrates an LLM Planner with a Unified Controller to convert natural language into precise, executable contact events.
- Experimental results show higher success rates and better adaptability in complex, multi-object scenarios than prior frameworks.
Unified Human-Scene Interaction via Prompted Chain-of-Contacts: An Expert Analysis
This paper introduces a novel Human-Scene Interaction (HSI) framework, UniHSI, which addresses the need for versatile interaction control and a user-friendly interface in applications such as embodied AI and virtual reality. Despite prior advances in motion quality and physical plausibility, the authors identify critical shortcomings in existing HSI frameworks, chiefly their limited adaptability and restrictive interfaces. UniHSI leverages a unified definition of interaction, termed Chain of Contacts (CoC), to support diverse interaction control through language commands, offering a scalable and practical solution.
Framework Composition and Methodology
UniHSI comprises two main components: an LLM Planner and a Unified Controller. The framework redefines interaction as a sequence of human joint-object contact events, termed CoC, which exploits the correlation between interaction types and human-object contact regions. This structured formalism lets interaction plans be generated from language commands and executed through a uniform control mechanism.
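To make the CoC formalism concrete, here is a minimal sketch of an interaction plan as a chain of contact steps. The field names and contact types are illustrative assumptions, not the paper's exact schema.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical schema: the field names and contact types below are
# illustrative assumptions, not the paper's exact plan format.
@dataclass
class ContactPair:
    joint: str          # humanoid joint, e.g. "pelvis" or "right_hand"
    object_part: str    # target object part, e.g. "chair_seat"
    contact_type: str   # e.g. "contact" or "not contact"

@dataclass
class ContactStep:
    pairs: List[ContactPair]  # contacts that must hold simultaneously

# "Sit on the chair, then lean back" as a chain of contact steps.
sit_plan = [
    ContactStep([ContactPair("pelvis", "chair_seat", "contact")]),
    ContactStep([
        ContactPair("pelvis", "chair_seat", "contact"),
        ContactPair("torso", "chair_back", "contact"),
    ]),
]
```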
- LLM Planner: Utilizes an LLM to translate natural language prompts into detailed task plans formatted as CoC. This component relies on prompt engineering, using language to bridge high-level user commands and low-level task specifications without requiring annotated datasets (a hedged prompting sketch follows this list).
- Unified Controller: Employs the Adversarial Motion Priors framework to model realistic motion sequences while ensuring physical plausibility. The controller relies on a TaskParser to translate CoC steps into uniform task observations, dynamically adjusting contact weights for optimal interaction execution (see the TaskParser sketch below). It also incorporates an ego-centric heightmap for collision awareness and navigation, extending generalization to real-world scenes such as those in the ScanNet dataset.
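The paper does not publish its exact prompts, but the planner's role can be sketched as a single LLM call mapping a user command plus a scene description to a CoC plan. The prompt wording, JSON schema, and `query_llm` helper are hypothetical stand-ins.

```python
import json

# The prompt wording, output schema, and `query_llm` helper are
# hypothetical stand-ins, not the paper's actual prompts.
PLANNER_PROMPT = """You are an interaction planner. Given a scene and a command,
output a JSON list of steps; each step is a list of
[joint, object_part, contact_type] triples. Use only the listed parts.

Scene parts: {parts}
Command: {command}
JSON plan:"""

def plan_interaction(command, parts, query_llm):
    """Map a language command to a CoC plan with one LLM call.

    `query_llm` is any callable wrapping an LLM API: it takes a prompt
    string and returns the model's raw text response.
    """
    prompt = PLANNER_PROMPT.format(parts=", ".join(parts), command=command)
    return json.loads(query_llm(prompt))

# Usage: plan_interaction("sit on the chair and lean back",
#                         ["chair_seat", "chair_back"], my_llm)
```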
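Similarly, the TaskParser's conversion of a CoC step into a uniform task observation can be sketched as follows. The observation layout and the weight-update rule are assumptions for illustration; the paper's exact formulation lives inside the full AMP training setup.

```python
import numpy as np

def task_observation(joint_pos, target_pos, contact_flags, weights):
    """Pack one CoC step into a fixed-size task observation vector.

    joint_pos, target_pos: (J, 3) arrays of current joint positions and
    their contact targets; contact_flags: (J,) array, 1 where contact is
    required; weights: (J,) adaptive contact weights. The observation
    layout is an illustrative assumption, not the paper's exact design.
    """
    offsets = target_pos - joint_pos  # per-joint displacement to target
    return np.concatenate([offsets.ravel(), contact_flags, weights])

def update_contact_weights(weights, errors, threshold=0.1, rate=0.05):
    """Toy adaptive-weight rule: emphasize joints still far from their
    targets and relax weights for joints already in contact."""
    unmet = errors > threshold
    weights = np.where(unmet, weights + rate, weights * (1.0 - rate))
    return np.clip(weights, 0.0, 1.0)
```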
UniHSI supports long-horizon, multi-step interactions, adapts to multi-object scenarios, and provides fine-granularity control, addressing limitations of previous methods that often restrict horizon length or require task-specific controllers.
Experimental Validation
The effectiveness of UniHSI is evaluated on a newly developed dataset, ScenePlan, which contains diverse interaction plans generated by the LLM from objects and scenarios drawn from the PartNet and ScanNet datasets. Performance metrics such as Success Rate and Contact Error demonstrate the framework's capability, especially in complex, multi-object scenarios. Ablation studies show the importance of components such as Adaptive Contact Weights and the ego-centric heightmap, underscoring the strength of the system's architecture.
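As a rough illustration of the reported metrics, assuming Success Rate counts trials in which every planned contact is satisfied within a distance threshold and Contact Error averages the residual joint-to-target distances (the paper's exact definitions may differ):

```python
import numpy as np

def evaluate(trials, threshold=0.2):
    """Toy Success Rate / Contact Error over a batch of trials.

    trials: list of (J,) arrays of final joint-to-target distances (in
    meters) for each trial's planned contacts. These definitions are
    assumptions; the paper's metrics may be computed differently.
    """
    success_rate = float(np.mean([np.all(d <= threshold) for d in trials]))
    contact_error = float(np.mean([d.mean() for d in trials]))
    return success_rate, contact_error

# e.g. evaluate([np.array([0.05, 0.12]), np.array([0.30, 0.08])])
# -> (0.5, 0.1375)
```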
Results and Implications
UniHSI achieves high adaptability and robustness, with significant improvements in interaction versatility and task execution efficiency over prior systems. The unified design enables easier and more efficient multi-task learning, setting the stage for future developments in scalable, language-driven HSI systems. By using LLMs for initial interaction planning and simplifying data annotation, the methodology offers a practical path to large-scale deployment in both simulated and real-world environments.
Future Directions
While UniHSI effectively handles interactions with stationary objects, future work could extend it to dynamic interactions with movable objects, enhancing the system's realism and applicability. Furthermore, integrating the LLM into the runtime loop, rather than as a pre-processing step, could offer greater adaptability and real-time responsiveness to unforeseen interaction patterns.
In conclusion, using LLM-guided plans and a unified interaction controller, UniHSI offers a robust, scalable solution to complex HSI challenges, paving the way for further exploration into holistic, language-driven interaction systems in AI and robotics. The paper's findings encourage ongoing research into more sophisticated interaction models, potentially driving significant advancements in embodied AI and related fields.