
Scaling Up Dynamic Human-Scene Interaction Modeling (2403.08629v2)

Published 13 Mar 2024 in cs.CV

Abstract: Confronting the challenges of data scarcity and advanced motion synthesis in human-scene interaction modeling, we introduce the TRUMANS dataset alongside a novel HSI motion synthesis method. TRUMANS stands as the most comprehensive motion-captured HSI dataset currently available, encompassing over 15 hours of human interactions across 100 indoor scenes. It intricately captures whole-body human motions and part-level object dynamics, focusing on the realism of contact. This dataset is further scaled up by transforming physical environments into exact virtual models and applying extensive augmentations to appearance and motion for both humans and objects while maintaining interaction fidelity. Utilizing TRUMANS, we devise a diffusion-based autoregressive model that efficiently generates HSI sequences of any length, taking into account both scene context and intended actions. In experiments, our approach shows remarkable zero-shot generalizability on a range of 3D scene datasets (e.g., PROX, Replica, ScanNet, ScanNet++), producing motions that closely mimic original motion-captured sequences, as confirmed by quantitative experiments and human studies.


Summary

  • The paper introduces TRUMANS—a large-scale, motion-captured dataset—and a novel autoregressive model to synthesize realistic human-scene interactions.
  • It employs a diffusion-based autoregressive algorithm conditioned on scene context and action intentions to generate dynamic sequences.
  • Quantitative experiments and human studies validate the method's performance in closely replicating authentic motion-capture data.

Scaling Up Dynamic Human-Scene Interaction Modeling

This paper addresses the study of Human-Scene Interaction (HSI) modeling, particularly the challenges of data scarcity and advanced motion synthesis. It introduces an extensive dataset called TRUMANS and a novel motion synthesis method, both of which contribute significantly to the field of HSI.

The TRUMANS dataset is described as the most comprehensive motion-captured HSI dataset currently available. It encompasses over 15 hours of human interactions across 100 diverse indoor scenes. Motion capture in this dataset includes whole-body human motions and part-level object dynamics, emphasizing the realism of contact. The dataset is further scaled by transforming physical environments into accurate virtual models and applying augmentations to both human and object appearance and motion, while maintaining interaction fidelity.
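The paper does not spell out its augmentation pipeline in this summary, but the core idea, varying objects while keeping interactions faithful, can be sketched. The helper below is a hypothetical illustration: it shifts an end-effector trajectory so that frames marked as in-contact land on the contact point of an augmented (e.g. rescaled) object, blending the offset in and out for smoothness. The function name, the blending scheme, and all shapes are assumptions, not the paper's implementation.

```python
import numpy as np

def retarget_contact(traj, contact, old_pt, new_pt, ramp=5):
    """Shift an end-effector trajectory so in-contact frames land on the
    corresponding point of an augmented object.

    traj:    (T, 3) positions; contact: (T,) boolean mask
    old_pt / new_pt: (3,) contact point on the original vs. augmented object
    The full offset is applied during contact and blended in linearly over
    roughly `ramp` frames on either side so the motion stays smooth.
    """
    offset = np.asarray(new_pt, dtype=float) - np.asarray(old_pt, dtype=float)
    # Moving-average the 0/1 contact mask to create entry/exit ramps,
    # then force full weight on the contact frames themselves.
    kernel = np.ones(2 * ramp + 1) / (2 * ramp + 1)
    w = np.maximum(contact.astype(float),
                   np.convolve(contact.astype(float), kernel, mode="same"))
    return traj + w[:, None] * offset
```

A real pipeline would additionally re-solve limb poses with inverse kinematics after shifting the end effector; this sketch only shows the trajectory-level correction.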

The TRUMANS dataset serves as a foundation for a novel computational model employing a diffusion-based autoregressive algorithm for generating HSI sequences of any length. This model is conditioned on scene context and action intentions, demonstrating remarkable zero-shot generalizability on various 3D scene datasets such as PROX, Replica, ScanNet, and ScanNet++. Quantitative experiments and human studies confirm the model's efficacy: its output closely mimics motion-captured sequences.
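At a high level, an autoregressive diffusion sampler of this kind denoises one fixed-length episode at a time, conditioning each episode on the scene, the action label, and the tail of the previous episode, then stitches the episodes into a sequence of arbitrary length. The sketch below illustrates only this control flow; the denoiser is a trivial placeholder, and all names, shapes, and the overlap scheme are assumptions rather than the paper's architecture.

```python
import numpy as np

def denoise_step(x_t, t, context):
    """Placeholder for a learned denoiser; a real model would predict the
    noise from x_t, the timestep, and the conditioning context."""
    return x_t * (1.0 - 1.0 / (t + 1))

def sample_episode(context, horizon=16, dim=69, steps=50, seed=0):
    """Reverse-diffusion sampling of one episode of `horizon` frames,
    each a `dim`-dimensional pose vector."""
    x = np.random.default_rng(seed).standard_normal((horizon, dim))
    for t in reversed(range(steps)):
        x = denoise_step(x, t, context)
    return x

def generate_motion(scene_feat, action_label, n_episodes=4, overlap=4):
    """Autoregressive rollout: each episode is conditioned on the scene,
    the action, and the tail of the previous episode; the overlapping
    frames are dropped when stitching, yielding arbitrary-length motion."""
    motion, prev_tail = [], None
    for _ in range(n_episodes):
        episode = sample_episode((scene_feat, action_label, prev_tail))
        prev_tail = episode[-overlap:]  # conditioning for the next episode
        motion.append(episode if not motion else episode[overlap:])
    return np.concatenate(motion, axis=0)
```

Because each rollout step only ever samples one short episode, this structure is what lets such a model stream motion of unbounded length.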

A structured review of related work reveals the field's limitations, notably the scarcity of high-quality datasets with realistic HSI. Previous datasets like PiGraphs and PROX initiated exploration but faced constraints in scalability and data quality. MoCap datasets, though high in quality, often lacked environmental interaction diversity. Recent developments in synthetic datasets offered cost efficiencies and adaptability but struggled to fully capture realistic 3D HSIs, especially in dynamic contacts and object tracking.

The introduction of the TRUMANS dataset marks a substantial advance. It provides accurate HSI modeling through extensive motion capture and photorealistic rendering, promising enhancements in human pose and contact estimation tasks. Moreover, a diffusion-based autoregressive motion synthesis method is proposed, generating HSIs conditioned on both 3D scene and action labels. This method excels in producing physically plausible and controllable HSI, achieving arbitrary sequence lengths in real time.

Empirical evaluations highlight the effectiveness of both the dataset and the proposed synthesis method. In static settings, the model trained on TRUMANS surpasses baselines with significant improvements in motion synthesis metrics such as contact and penetration. The dynamic setting exhibits notable performance in handling human-object interactions. A human study further corroborates the superiority of the model, as participants often failed to distinguish the synthesized data from actual motion capture.
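Contact and penetration metrics of this kind are commonly computed against a signed-distance field (SDF) of the scene; the exact definitions used in the paper may differ, so the following is a generic sketch with assumed shapes, thresholds, and a toy floor-plane SDF.

```python
import numpy as np

def penetration_and_contact(joints, scene_sdf, contact_eps=0.05):
    """Evaluate a motion clip against a signed-distance field of the scene.

    joints:    (T, J, 3) joint positions over T frames
    scene_sdf: callable mapping (N, 3) points to signed distances,
               negative inside scene geometry
    Returns the mean penetration depth and the fraction of frames in
    which at least one joint lies within `contact_eps` of a surface.
    """
    T = joints.shape[0]
    d = scene_sdf(joints.reshape(-1, 3)).reshape(T, -1)
    penetration = np.clip(-d, 0.0, None).mean()   # depth inside geometry
    contact = (np.abs(d).min(axis=1) < contact_eps).mean()
    return penetration, contact

# Toy scene: a floor plane at z = 0, whose SDF is simply the z coordinate.
floor_sdf = lambda p: p[:, 2]
```

Lower penetration with a high contact ratio is the desired pattern: the body touches surfaces without sinking into them.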

The implications of this research are notable both theoretically and practically. The introduction of TRUMANS and the corresponding motion synthesis method significantly elevates the quality of HSI modeling. This advancement fosters better generalization in novel environments and suggests potential improvements in related fields such as vision-based perception and interaction modeling. Future developments could see expanded applicability of this fundamental research in various AI-driven tasks requiring sophisticated interaction understanding and prediction.

In conclusion, this paper makes valuable contributions to the field of HSI, offering a robust dataset and an innovative method that mark a step forward in modeling complex human-scene interactions. The research promises to serve as a foundation for continued exploration and enhancement of human interaction capabilities within virtual environments.