Sparsely Supervised Hand-Object Reconstruction via Photometric Consistency
This paper introduces a novel approach to 3D hand-object reconstruction from color images, targeting pose estimation in scenes with the substantial mutual occlusions that arise during hand-object interactions. The primary contribution is a method that leverages photometric consistency across time, enabling a sparsely supervised learning paradigm that reduces dependence on costly, labor-intensive 3D ground-truth annotations.
Summary of the Approach
The proposed method integrates a photometric consistency loss within a framework that takes monocular RGB videos as input and outputs dense 3D reconstructions of hands and objects. The approach assumes known 3D models of the manipulated objects and ground-truth annotations for only a sparse subset of frames. Key components include:
- Differentiable Rendering of Optical Flow: The method computes the optical flow between consecutive video frames that is induced by the inferred hand and object poses. This flow is used to warp one frame onto its neighbor, enforcing cross-frame photometric consistency.
- Self-supervised Photometric Loss: By defining a loss in image space that penalizes discrepancies between real and warped images, the approach propagates supervision from annotated to unannotated frames (see the loss sketch after this list).
- Joint Hand-Object Reconstruction: A feed-forward neural network outputs dense 3D hand and object models in each frame, using MANO for hand pose and shape estimation coupled with direct regression of the object's 6D pose (see the architecture sketch below).
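To make the flow-based supervision concrete, here is a minimal PyTorch sketch of the photometric consistency idea. It is a simplification under stated assumptions: it compares colors at sparse mesh-vertex correspondences via bilinear sampling, standing in for the dense, differentiably rendered flow field described in the paper, and it omits visibility reasoning. The function and tensor names (`project`, `sample_colors`, `verts_t`, `verts_t1`) are illustrative, not the paper's API.

```python
import torch
import torch.nn.functional as F

def project(points, K):
    """Pinhole projection: (B, N, 3) camera-space points -> (B, N, 2) pixels."""
    uvw = torch.einsum("bij,bnj->bni", K, points)
    return uvw[..., :2] / uvw[..., 2:].clamp(min=1e-6)

def sample_colors(img, uv, H, W):
    """Bilinearly sample colors of img (B, C, H, W) at pixel coords uv (B, N, 2)."""
    grid = torch.stack([uv[..., 0] / (W - 1) * 2 - 1,   # x to [-1, 1]
                        uv[..., 1] / (H - 1) * 2 - 1],  # y to [-1, 1]
                       dim=-1)
    # grid_sample expects (B, H_out, W_out, 2); treat the N points as an N x 1 grid
    return F.grid_sample(img, grid.unsqueeze(2), align_corners=True)  # (B, C, N, 1)

def photometric_loss(img_t, img_t1, verts_t, verts_t1, K):
    """Penalize color differences between corresponding surface points in frames
    t and t+1. The poses that produced verts_t / verts_t1 receive gradients
    through the differentiable projection and sampling."""
    B, C, H, W = img_t.shape
    colors_t = sample_colors(img_t, project(verts_t, K), H, W)
    colors_t1 = sample_colors(img_t1, project(verts_t1, K), H, W)
    return (colors_t - colors_t1).abs().mean()
```

In training, this term can be applied between an annotated frame and its unannotated neighbors, which is how supervision propagates to frames without labels; masking out occluded vertices would be needed in practice.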
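Likewise, a minimal sketch of the feed-forward predictor: a shared image encoder with one head regressing MANO hand parameters and one regressing the object's 6D pose. The backbone choice, head sizes, and output parameterization are illustrative assumptions, not the paper's exact architecture.

```python
import torch.nn as nn
import torchvision.models as models

class HandObjectNet(nn.Module):
    """Hypothetical single-frame regressor: shared CNN features feed a hand
    head (MANO pose + shape + root translation) and an object head
    (axis-angle rotation + translation). Dimensions are illustrative."""
    def __init__(self, n_mano_pose=45, n_mano_shape=10):
        super().__init__()
        backbone = models.resnet18(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # drop fc
        feat_dim = 512
        self.hand_head = nn.Linear(feat_dim, n_mano_pose + n_mano_shape + 3)
        self.obj_head = nn.Linear(feat_dim, 6)  # 3 rotation + 3 translation

    def forward(self, img):
        feat = self.encoder(img).flatten(1)  # (B, 512)
        hand = self.hand_head(feat)
        obj = self.obj_head(feat)
        return {"mano_params": hand,
                "obj_rot": obj[:, :3],
                "obj_trans": obj[:, 3:]}
```

The predicted MANO parameters are decoded to a hand mesh by the MANO layer, and the regressed 6D pose transforms the known object model; both meshes can then drive the photometric term sketched above.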
Evaluation and Results
The method achieves state-of-the-art results on established datasets such as FPHAB and HO-3D, with significant improvements in pose estimation accuracy under sparsely annotated training regimes. Notably, the photometric consistency term substantially improves performance when only limited ground-truth annotations are available:
- Quantitative Improvement: The paper reports consistent gains in both 2D and 3D pose estimation metrics as the fraction of annotated frames decreases, with the most pronounced benefits when fewer than 10% of training frames are fully supervised.
- Qualitative Analysis: Although the photometric assumption can break down under fast motion or abrupt lighting changes, the approach remains robust in challenging scenarios with large inter-frame movements, as long as visual coherence between frames is largely retained.
Implications and Future Directions
The implications of this research span augmented reality, robotics, and video surveillance, where precise hand-object interaction modeling is paramount. By reducing reliance on extensively labeled datasets, this sparsely supervised strategy improves applicability and scalability in real-world settings.
The paper suggests avenues for future research, including extending the model to full-body reconstruction and to interactions with complex environments, broadening the scope of human-centric scene understanding. Exploring additional self-supervised constraints related to 3D interpenetration and scene interactions may further improve reconstruction quality and generalization.
This work thus presents a significant step towards efficient and effective 3D pose estimation with sparse data, paving the way for more autonomous systems capable of understanding intricate human behaviors with minimal supervision.