
You2Me: Inferring Body Pose in Egocentric Video via First and Second Person Interactions (1904.09882v2)

Published 22 Apr 2019 in cs.CV

Abstract: The body pose of a person wearing a camera is of great interest for applications in augmented reality, healthcare, and robotics, yet much of the person's body is out of view for a typical wearable camera. We propose a learning-based approach to estimate the camera wearer's 3D body pose from egocentric video sequences. Our key insight is to leverage interactions with another person---whose body pose we can directly observe---as a signal inherently linked to the body pose of the first-person subject. We show that since interactions between individuals often induce a well-ordered series of back-and-forth responses, it is possible to learn a temporal model of the interlinked poses even though one party is largely out of view. We demonstrate our idea on a variety of domains with dyadic interaction and show the substantial impact on egocentric body pose estimation, which improves the state of the art. Video results are available at http://vision.cs.utexas.edu/projects/you2me/

Authors (4)
  1. Evonne Ng (8 papers)
  2. Donglai Xiang (17 papers)
  3. Hanbyul Joo (37 papers)
  4. Kristen Grauman (136 papers)
Citations (77)

Summary

  • The paper presents a novel approach that leverages first- and second-person interactions to infer the largely unseen 3D body pose in egocentric video.
  • It uses a recurrent neural network that integrates dynamic motion, static scene features, and inferred second-person poses for robust estimation.
  • Experimental results on Panoptic Studio and Kinect datasets show average joint errors reduced to as low as 8.6 cm, outperforming existing methods.

Body Pose Estimation in Egocentric Video through First and Second Person Interactions

The paper "You2Me: Inferring Body Pose in Egocentric Video via First and Second Person Interactions" presents an approach for estimating the 3D body pose of a camera wearer from egocentric video. The work addresses a central challenge of wearable cameras: most of the wearer's body falls outside the field of view. Its key contribution is to leverage the interaction between the camera wearer and another person to infer the wearer's hidden pose, advancing the state of the art in egocentric video analysis.

Overview

The primary focus of this research is to develop a learning-based model that can estimate the 3D pose of a person wearing a chest-mounted camera. The approach, termed "You2Me," is predicated on the observation that human interactions often involve synchronized poses between individuals. By modeling the interactions between the first person (camera wearer) and the second person (interactee), the system can infer the largely unseen first-person body pose.

The proposed approach utilizes a recurrent neural network (RNN) architecture to learn a temporal model of these interactions. The model incorporates three critical inputs: dynamic motion features from the egocentric video, static scene features, and inferred second-person body poses. By integrating these inputs, the system produces robust predictions of the camera wearer's pose even when it is largely out of the field of view.
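To make the fusion concrete, the sketch below is a minimal, hypothetical PyTorch model and not the authors' released code: the feature dimensions, the concatenation-based fusion, and the direct regression of 3D joints are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class EgoPoseRNN(nn.Module):
    """Hypothetical fusion model: a sketch, not the paper's implementation."""

    def __init__(self, motion_dim=128, scene_dim=128, pose2_dim=75,
                 hidden_dim=512, num_joints=25):
        super().__init__()
        # Single LSTM over the per-frame concatenation of dynamic motion
        # features, static scene features, and the second person's pose.
        self.rnn = nn.LSTM(motion_dim + scene_dim + pose2_dim,
                           hidden_dim, batch_first=True)
        # Regress the camera wearer's 3D joints at every time step.
        self.head = nn.Linear(hidden_dim, num_joints * 3)

    def forward(self, motion, scene, pose2):
        # motion: (B, T, motion_dim); scene: (B, T, scene_dim); pose2: (B, T, pose2_dim)
        x = torch.cat([motion, scene, pose2], dim=-1)
        h, _ = self.rnn(x)
        out = self.head(h)  # (B, T, num_joints * 3)
        return out.view(out.shape[0], out.shape[1], -1, 3)

# Example: one 60-frame sequence with randomly generated features.
model = EgoPoseRNN()
pose = model(torch.randn(1, 60, 128), torch.randn(1, 60, 128), torch.randn(1, 60, 75))
print(pose.shape)  # torch.Size([1, 60, 25, 3])
```

The point of the sketch is only the temporal fusion of the three per-frame cues; how each cue is extracted is described qualitatively above.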

Numerical Results

The effectiveness of You2Me is demonstrated on two datasets, captured with the Panoptic Studio and with Kinect sensors. The model outperforms current state-of-the-art methods, including the ego-pose motion graph approach, third-person pose networks, and several baselines. Average joint errors drop to as low as 8.6 cm in the Panoptic Studio setting, underscoring the strength of interaction-based pose inference.
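For context on the centimeter figures, the average joint error reported here is the mean Euclidean distance between predicted and ground-truth 3D joint positions. A minimal sketch of that computation follows, assuming centimeter-scale inputs and no additional alignment step; both are assumptions of the example, not statements about the paper's evaluation protocol.

```python
import numpy as np

def mean_joint_error_cm(pred, gt):
    """pred, gt: (T, J, 3) arrays of 3D joint positions, in centimeters."""
    # Euclidean distance per joint and frame, then averaged over everything.
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Example on random 100-frame, 25-joint sequences (purely illustrative numbers).
rng = np.random.default_rng(0)
pred = rng.normal(scale=10.0, size=(100, 25, 3))
gt = rng.normal(scale=10.0, size=(100, 25, 3))
print(f"average joint error: {mean_joint_error_cm(pred, gt):.1f} cm")
```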

Implications

This research has significant implications for various fields, including augmented reality, healthcare, and human-robot interaction. By enabling accurate estimation of body pose from egocentric video, the work presents opportunities for enhancing remote therapy sessions, improving human-robot collaboration, and enriching immersive experiences in AR applications. Furthermore, the approach opens the door to more sophisticated models of human interaction and pose estimation by integrating contextual and interactive cues in egocentric settings.

Future Directions

The success of You2Me paves the way for further exploration into interaction-based egocentric video analysis. Future developments may focus on accommodating scenarios involving multiple interactees, handling occlusions more effectively, and exploring the reciprocal benefits of ego-pose estimates on second-person pose estimation. Additionally, advancements in this domain could lead to more generalized models capable of adapting to diverse environmental settings and interaction dynamics.

This work signifies a step forward in the domain of egocentric video processing, illustrating the value of inter-person dynamics for inferring non-visible pose information, and laying the groundwork for future innovations in wearable technology and 3D pose estimation.
