Putting Humans in a Scene: Learning Affordance in 3D Indoor Environments (1903.05690v2)

Published 13 Mar 2019 in cs.CV

Abstract: Affordance modeling plays an important role in visual understanding. In this paper, we aim to predict affordances of 3D indoor scenes, specifically what human poses are afforded by a given indoor environment, such as sitting on a chair or standing on the floor. In order to predict valid affordances and learn possible 3D human poses in indoor scenes, we need to understand the semantic and geometric structure of a scene as well as its potential interactions with a human. To learn such a model, a large-scale dataset of 3D indoor affordances is required. In this work, we build a fully automatic 3D pose synthesizer that fuses semantic knowledge from a large number of 2D poses extracted from TV shows as well as 3D geometric knowledge from voxel representations of indoor scenes. With the data created by the synthesizer, we introduce a 3D pose generative model to predict semantically plausible and physically feasible human poses within a given scene (provided as a single RGB, RGB-D, or depth image). We demonstrate that our human affordance prediction method consistently outperforms existing state-of-the-art methods.

Overview of Affordance Prediction in 3D Indoor Environments

The paper "Putting Humans in a Scene: Learning Affordance in 3D Indoor Environments" presents a method for predicting human affordances in 3D interior spaces. It primarily focuses on modeling what human poses are facilitated by a given indoor scene, taking into account its semantic and geometric structure. The authors synthesize a large-scale dataset of 3D poses through an automated process that integrates semantic knowledge from 2D poses in TV shows with 3D data from voxel representations of indoor environments.

Key components of the paper are a fully automatic 3D pose synthesizer and a 3D pose generative model. The synthesizer produces diverse 3D pose samples by extracting semantic information from 2D video datasets and aligning the resulting poses with geometrically feasible positions in voxelized scenes. The generative model is then trained end-to-end on this synthesized data to predict plausible human poses from a single image of a scene (RGB, RGB-D, or depth). It follows a two-stage approach: first predicting where in the scene a pose should be placed, then generating the pose conditioned on that location and the scene context. A geometry-aware discriminator further refines the prediction, enforcing physical constraints such as that the generated pose does not intersect solid objects.
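
The geometry-aware refinement can be thought of as a penalty on poses that penetrate solid geometry. The sketch below shows one plausible differentiable form of such a term, assuming a soft occupancy volume and PyTorch's `grid_sample`; the discriminator in the paper is a learned network, so this is only an illustrative stand-in for the physical constraint it enforces, and all names here are assumptions.

```python
import torch
import torch.nn.functional as F

def penetration_penalty(joints, occupancy, scene_min, scene_max):
    """Illustrative penalty on joints that fall inside solid geometry.

    joints:    (B, J, 3) predicted joint positions in world coordinates.
    occupancy: (B, 1, D, H, W) soft occupancy volume, ~1 inside objects, ~0 in free space.
    scene_min, scene_max: (3,) tensors giving the world-space bounds of the grid,
                          with axes ordered (x, y, z) to match grid_sample's convention.
    """
    b = joints.shape[0]

    # Normalise joint coordinates to [-1, 1], as grid_sample expects.
    norm = 2.0 * (joints - scene_min) / (scene_max - scene_min) - 1.0
    grid = norm.reshape(b, 1, 1, -1, 3)                        # (B, 1, 1, J, 3)

    # Trilinearly sample the occupancy volume at each joint location.
    occ = F.grid_sample(occupancy, grid, align_corners=True)   # (B, 1, 1, 1, J)

    # Zero when every joint is in free space; grows as joints enter solid voxels.
    return occ.reshape(b, -1).mean()
```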

Strong Numerical Results and Claims

The paper demonstrates that the proposed method consistently outperforms existing methods on human affordance prediction. Quantitative evaluation shows substantial improvements in both the semantic and the geometric plausibility of generated poses. The semantic score, which measures how often generated poses are accepted as realistic human poses, is notably higher with the authors' approach (91.69% with RGB input) than with baseline methods. The geometry score, which reflects whether poses interact feasibly with the scene, also improves significantly (66.40% with RGB input).
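
To make the reported metrics concrete, one way a geometry-style score could be computed is as the percentage of generated poses that pass a feasibility test such as the `pose_is_feasible` sketch above. This is an assumed evaluation scheme for illustration only, not necessarily the paper's exact protocol.

```python
def geometry_score(pose_batch, voxels, voxel_size, origin):
    """Percentage of generated poses that pass the voxel feasibility test (assumed metric)."""
    passed = [pose_is_feasible(p, voxels, voxel_size, origin) for p in pose_batch]
    return 100.0 * sum(passed) / len(passed)
```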

Implications and Future Directions

This paper has practical implications for fields such as robotics and virtual reality, where understanding and predicting human interaction with environments are fundamental. The ability to predict where and how humans fit into a scene can enhance human-robot interaction scenarios and improve the realism of characters in virtual environments.

From a theoretical standpoint, the paper contributes to semantic and geometric understanding through deep learning. It highlights the importance of integrating multiple data sources (2D semantic information and 3D geometric data) to improve model performance. Future work can explore further improvements in predicting interactions in complex scenes, accounting for more dynamic elements and using these predictions to simulate interactions for different applications. Expanding on the geometry-aware methodology could also lead to stronger models for predicting human-environment interactions beyond static scenes.

In conclusion, the paper presents a robust methodology for predicting human affordances, demonstrating significant enhancements over prior techniques. The synthesized datasets and model architectures are commendable contributions toward strengthening the connection between humans and 3D environments in computational models.

Authors (6)
  1. Xueting Li (32 papers)
  2. Sifei Liu (64 papers)
  3. Kihwan Kim (67 papers)
  4. Xiaolong Wang (243 papers)
  5. Ming-Hsuan Yang (377 papers)
  6. Jan Kautz (215 papers)
Citations (98)