
Generative Proxemics: A Prior for 3D Social Interaction from Images (2306.09337v2)

Published 15 Jun 2023 in cs.CV

Abstract: Social interaction is a fundamental aspect of human behavior and communication. The way individuals position themselves in relation to others, also known as proxemics, conveys social cues and affects the dynamics of social interaction. Reconstructing such interaction from images presents challenges because of mutual occlusion and the limited availability of large training datasets. To address this, we present a novel approach that learns a prior over the 3D proxemics of two people in close social interaction and demonstrate its use for single-view 3D reconstruction. We start by creating 3D training data of interacting people using image datasets with contact annotations. We then model the proxemics using a novel denoising diffusion model called BUDDI that learns the joint distribution over the poses of two people in close social interaction. Sampling from our generative proxemics model produces realistic 3D human interactions, which we validate through a perceptual study. We use BUDDI to reconstruct two people in close proximity from a single image, without any contact annotation, via an optimization approach that uses the diffusion model as a prior. Our approach recovers accurate and plausible 3D social interactions from noisy initial estimates, outperforming state-of-the-art methods. Our code, data, and model are available at our project website: muelea.github.io/buddi.

Authors (5)
  1. Lea Müller (10 papers)
  2. Vickie Ye (10 papers)
  3. Georgios Pavlakos (45 papers)
  4. Michael Black (17 papers)
  5. Angjoo Kanazawa (84 papers)
Citations (17)

Summary

Generative Proxemics: A Prior for 3D Social Interaction from Images

The paper "Generative Proxemics: A Prior for 3D Social Interaction from Images" presents an innovative approach to modeling human social interactions within a three-dimensional space using image inputs. This research focuses on capturing and reconstructing the proxemics—spatial relationships—between individuals engaged in close interactions. The primary method employed is a diffusion model, referred to as BUDDI, designed to generate and utilize 3D social interaction priors for reconstructing interactions directly from image data, without the requirement for manual annotations during testing.

Overview

The authors address a difficult challenge in computer vision: accurately modeling social interactions in 3D from 2D images. Particular emphasis is placed on proxemics as a carrier of the social cues intrinsic to human interaction. The proposed method trains a diffusion-based generative model on 3D data derived from image collections with body-contact annotations, together with motion-capture data, and uses it to produce realistic 3D reconstructions of interacting people.
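
To make the generative step concrete, the following is a minimal, hypothetical sketch of a DDPM-style training step for a two-person pose prior. The denoiser architecture, the parameter dimensionality (`dim=2 * 151`), and the noise schedule are illustrative assumptions, not the paper's exact design; like BUDDI, the sketch uses the x0-prediction parameterization, in which the network predicts the clean sample rather than the noise.

```python
import torch
import torch.nn as nn

T = 1000                                            # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)               # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # alpha-bar_t

class TwoPersonDenoiser(nn.Module):
    """Predicts clean two-person parameters x0 from a noisy sample x_t."""
    def __init__(self, dim=2 * 151, hidden=512):
        # dim: concatenated body-model parameters for both people
        # (151 per person is a hypothetical size, not the paper's count)
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x_t, t):
        t_emb = t.float().unsqueeze(-1) / T          # crude timestep embedding
        return self.net(torch.cat([x_t, t_emb], dim=-1))

def training_step(model, x0):
    """One denoising-diffusion training step on clean interaction params x0."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))
    a_bar = alphas_cumprod[t].unsqueeze(-1)
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise  # forward process
    x0_pred = model(x_t, t)                          # predict the clean sample
    return ((x0_pred - x0) ** 2).mean()              # reconstruction loss
```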

Key components of their methodology include:

  • Generative Model: The model is trained on data derived from existing datasets such as FlickrCI3D, CHI3D, and Hi4D, which provide contact annotations and motion-capture data. This forms the basis for the BUDDI diffusion model, which learns the joint distribution over the poses and spatial arrangement of two people in close social interaction.
  • 3D Reconstruction: Utilizing BUDDI as a prior, the method reconstructs the 3D bodies of two people from noisy initial estimates obtained from a single image. This is achieved through an optimization process that iteratively refines the body parameters, balancing image evidence against the learned prior to improve the accuracy and plausibility of the reconstructed interaction (a minimal sketch follows this list).
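
The following sketch shows one way a diffusion model can act as a prior inside such a fitting loop, reusing the denoiser and noise schedule from the previous sketch: the current estimate is noised, denoised by the model, and pulled toward the denoised (more plausible) version. `reprojection_loss`, the guidance timestep `t_guide`, and the weighting are hypothetical stand-ins, not the paper's exact procedure.

```python
def fit_with_diffusion_prior(model, params_init, reprojection_loss,
                             steps=200, t_guide=100, w_prior=0.1, lr=0.01):
    """Fit two-person body parameters to image evidence with a diffusion prior."""
    params = params_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([params], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Data term: keep the bodies consistent with 2D image evidence.
        loss = reprojection_loss(params)
        # Prior term: noise the current estimate, denoise it with the model,
        # and pull the estimate toward the denoised version.
        t = torch.full((params.shape[0],), t_guide, dtype=torch.long)
        a_bar = alphas_cumprod[t].unsqueeze(-1)
        x_t = (a_bar.sqrt() * params
               + (1.0 - a_bar).sqrt() * torch.randn_like(params))
        with torch.no_grad():
            x0_denoised = model(x_t, t)
        loss = loss + w_prior * ((params - x0_denoised) ** 2).mean()
        loss.backward()
        opt.step()
    return params.detach()
```

In this simplified form, the fixed timestep `t_guide` trades off the strength of the prior against fidelity to the initialization; the paper's actual optimization couples the refinement with the diffusion schedule more closely.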

Results and Implications

The authors conducted extensive experiments across multiple datasets, demonstrating accurate 3D pose reconstruction and plausible interpersonal distances that match the depicted social interactions. The use of diffusion models for capturing 3D proxemics is further validated through perceptual studies confirming the realism of samples drawn from the generative model.
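
Drawing such samples amounts to running the reverse diffusion process from Gaussian noise. Below is a sketch of standard DDPM ancestral sampling under the x0-prediction parameterization from the training sketch above; it is illustrative, not the paper's exact sampler.

```python
@torch.no_grad()
def sample_interactions(model, n=8, dim=2 * 151):
    """Draw n two-person interaction samples by reverse (ancestral) diffusion."""
    x_t = torch.randn(n, dim)
    for i in reversed(range(T)):
        t = torch.full((n,), i, dtype=torch.long)
        a_bar = alphas_cumprod[t].unsqueeze(-1)
        a_bar_prev = (alphas_cumprod[t - 1].unsqueeze(-1) if i > 0
                      else torch.ones(n, 1))
        beta_t = betas[i]
        alpha_t = 1.0 - beta_t
        x0_pred = model(x_t, t)                 # predicted clean sample
        # Posterior mean of q(x_{t-1} | x_t, x0) in the DDPM formulation.
        mean = (a_bar_prev.sqrt() * beta_t / (1.0 - a_bar) * x0_pred
                + alpha_t.sqrt() * (1.0 - a_bar_prev) / (1.0 - a_bar) * x_t)
        sigma = ((1.0 - a_bar_prev) / (1.0 - a_bar) * beta_t).sqrt()
        noise = torch.randn_like(x_t) if i > 0 else torch.zeros_like(x_t)
        x_t = mean + sigma * noise
    return x_t  # samples of two-person interaction parameters
```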

Numerical results indicate that BUDDI outperforms existing methods, especially in challenging scenarios involving close interactions such as embracing or posing together. Particularly noteworthy is the model's ability to handle interactions that are heavily occluded in image space, one of the more formidable challenges in computer vision.

This work paves the way for further exploration of digital human synthesis and offers a deeper, data-driven understanding of human social behavior. The approach, which combines perceptual validation, learned priors, and state-of-the-art generative modeling, could be extended to incorporate hand and facial expressions. Future work could also address interactions involving more than two people and condition generation on diverse inputs, such as textual descriptions of the intended interaction.

Conclusion

The proposed method demonstrates how diffusion-based generative models can serve as priors for reconstruction tasks in computer vision. By focusing on 3D proxemics, the paper extends how AI systems can interpret and recreate human social interactions from visual data, with promising applications in animation, social robotics, and augmented reality. As such, BUDDI could serve as a useful building block for embodied AI systems.