Analysis of TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization
The paper "TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization" presents a versatile framework for synthesizing physical human-scene interaction (HSI) tasks. The work contributes to computer animation and embodied AI by proposing a unified transformer-based policy, TokenHSI, that integrates multiple interaction skills into a single cohesive model.
The authors identify a key limitation of conventional approaches: they typically require a separate controller tailored to each HSI task. Such methods scale poorly and adapt awkwardly to dynamically changing environments and complex task sequences. TokenHSI addresses these challenges with a novel task tokenization strategy that models humanoid proprioception as a single shared token, which is combined with various task-specific tokens through a masking mechanism.
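The token-and-mask idea described above can be illustrated with a small sketch. This is not the paper's implementation: the linear "tokenizers", dimensions, and task names (`sit`, `carry`) are illustrative stand-ins for the learned tokenizer networks, but the structure matches the description, a shared proprioception token stacked with per-task tokens plus a validity mask that hides inactive tasks.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # shared token width (illustrative)

def make_tokenizer(obs_dim):
    """Toy linear tokenizer: maps an observation into the shared D-dim token space.
    In the paper these are learned networks; a fixed random projection stands in here."""
    W = rng.standard_normal((obs_dim, D))
    return lambda obs: obs @ W

tokenizers = {
    "proprio": make_tokenizer(6),  # shared humanoid proprioception
    "sit": make_tokenizer(4),      # task-specific observations
    "carry": make_tokenizer(5),
}

def build_tokens(proprio_obs, task_obs, active_task):
    """Stack the shared proprioception token with all task tokens,
    returning a boolean mask that marks which tokens the policy may attend to."""
    tokens = [tokenizers["proprio"](proprio_obs)]
    mask = [True]  # proprioception is always visible to the policy
    for name in ("sit", "carry"):
        tokens.append(tokenizers[name](task_obs[name]))
        mask.append(name == active_task)  # masking hides inactive task tokens
    return np.stack(tokens), np.array(mask)

tokens, mask = build_tokens(
    np.ones(6), {"sit": np.ones(4), "carry": np.ones(5)}, active_task="carry"
)
print(tokens.shape)  # (3, 8): proprioception + two task tokens
print(mask)          # [ True False  True]
```

In a transformer encoder, a mask like this would typically be applied by setting attention logits of masked tokens to negative infinity, so the fused representation depends only on the active task.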
Methodology and Key Insights
- Transformer-Based Policy Architecture: The core innovation of TokenHSI is a transformer-based policy that handles variable-length inputs. This flexibility is pivotal for accommodating diverse HSI tasks within a unified model. The transformer encoder's self-attention fuses humanoid proprioceptive data with task-related tokens, promoting effective cross-task knowledge sharing.
- Proprioception and Task Tokenization: Distinct from prior controllers, the proposed method employs tokenization to encapsulate both humanoid proprioception and the distinct states pertinent to individual tasks. This design choice enhances policy extensibility while maintaining robust training dynamics across multiple HSI tasks.
- Multi-Task Training and Adaptation: TokenHSI excels at learning foundational skills such as following, sitting, climbing, and carrying within a single network through multi-task training. Furthermore, the framework supports rapid adaptation to novel HSI tasks by training additional task tokenizers, thereby extending its applicability to complex scenarios that involve skill compositionality and variations in object or terrain shapes.
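The adaptation recipe in the last point can be sketched as a parameter-partition question: which weights train, and which stay frozen? The sketch below is an assumption-laden illustration, not the paper's code; the class, task names (`carry_irregular`), and shapes are hypothetical, but it shows the key idea that a new task tokenizer is the only trainable component while the shared policy and existing tokenizers are reused frozen.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8  # shared token width (illustrative)

class LinearTokenizer:
    """Toy stand-in for a learned task tokenizer, with a trainable/frozen flag."""
    def __init__(self, obs_dim, trainable):
        self.W = rng.standard_normal((obs_dim, D)) * 0.1
        self.trainable = trainable

    def __call__(self, obs):
        return obs @ self.W

# Foundational tokenizers (and, in the full system, the transformer policy)
# are frozen after multi-task pre-training.
frozen = {
    "proprio": LinearTokenizer(6, trainable=False),
    "carry": LinearTokenizer(5, trainable=False),
}

# Adapting to a novel task: add one new tokenizer and train only its weights.
new_task = {"carry_irregular": LinearTokenizer(7, trainable=True)}

all_tokenizers = {**frozen, **new_task}
trainable_params = [name for name, t in all_tokenizers.items() if t.trainable]
print(trainable_params)  # ['carry_irregular']

# The new token lives in the same D-dim space, so the frozen policy can consume it.
print(new_task["carry_irregular"](np.ones(7)).shape)  # (8,)
```

Because the new token is projected into the same space the policy already understands, adaptation reduces to learning a small tokenizer rather than retraining the whole controller.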
Experimental Validation and Results
The authors substantiate the versatility and performance of TokenHSI through extensive empirical evaluation. Key findings from these experiments include:
- Foundational Skill Mastery: TokenHSI achieves high success rates in core HSI tasks, demonstrating comparable performance to specialist controllers that are trained individually for each task.
- Superior Adaptation Efficiency: When adapting to new challenges such as composed tasks (e.g., sitting while carrying) or interactions with non-standard objects, TokenHSI adapts efficiently, requiring only lightweight training of new task tokenizers while reusing the shared policy, and converging faster than training specialist controllers from scratch.
- Enhanced Task Completion: The policy's ability to efficiently tackle long-horizon tasks by composing skills learned from different foundational tasks underlines its robustness and scalability.
Implications and Future Directions
TokenHSI's unified approach to HSI has implications for both theoretical research and practical applications in simulated environments for animation and robotics. It sets a precedent for future frameworks that handle complex human-scene interactions without depending on many separate, task-specific controllers.
In conclusion, TokenHSI marks a step toward adaptable, scalable HSI synthesis systems capable of addressing diverse real-world scenarios. Future research could explore integrating natural language processing to automate task token generation, bridging the gap between task definition and execution. Extending the model to more dynamic and cluttered environments remains another avenue for continued advancement.