Appearance Consensus Driven Self-Supervised Human Mesh Recovery (2008.01341v1)

Published 4 Aug 2020 in cs.CV

Abstract: We present a self-supervised human mesh recovery framework to infer human pose and shape from monocular images in the absence of any paired supervision. Recent advances have shifted the interest towards directly regressing parameters of a parametric human model by supervising them on large-scale datasets with 2D landmark annotations. This limits the generalizability of such approaches to operate on images from unlabeled wild environments. Acknowledging this we propose a novel appearance consensus driven self-supervised objective. To effectively disentangle the foreground (FG) human we rely on image pairs depicting the same person (consistent FG) in varied pose and background (BG) which are obtained from unlabeled wild videos. The proposed FG appearance consistency objective makes use of a novel, differentiable Color-recovery module to obtain vertex colors without the need for any appearance network; via efficient realization of color-picking and reflectional symmetry. We achieve state-of-the-art results on the standard model-based 3D pose estimation benchmarks at comparable supervision levels. Furthermore, the resulting colored mesh prediction opens up the usage of our framework for a variety of appearance-related tasks beyond the pose and shape estimation, thus establishing our superior generalizability.

Citations (37)

Summary

  • The paper introduces a self-supervised framework that leverages co-salient foreground appearance from diverse image pairs to recover accurate 3D human meshes.
  • It employs a differentiable Color-recovery module to interpolate image colors directly onto 3D mesh vertices without relying on 3D pose annotations.
  • Utilizing reflectional symmetry, the method ensures complete appearance recovery and achieves state-of-the-art results on multiple benchmark datasets.

Overview of "Appearance Consensus Driven Self-Supervised Human Mesh Recovery"

This paper introduces a self-supervised framework for reconstructing 3D human meshes from monocular images, a task relevant to applications such as virtual reality, robotics, and video game development. The authors aim to overcome the restrictions of fully supervised methods, which depend heavily on 2D or 3D annotated datasets. Instead of relying on paired data, their approach uses image pairs obtained from unlabeled videos to formulate an appearance-consensus-driven self-supervised learning objective, significantly enhancing generalizability to novel environments.
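To make the objective concrete, the following is a minimal sketch (not the authors' code) of how such an appearance-consensus loss could be wired up in PyTorch. Here `regressor`, `recover_colors`, and `render` are hypothetical stand-ins for the paper's regression network, Color-recovery module, and differentiable renderer: colors recovered from one image of a person, rendered in the pose predicted for a second image of the same person, should reproduce that second image's foreground.

```python
# Minimal sketch of an appearance-consensus objective (illustrative only).
# `regressor`, `recover_colors`, and `render` are hypothetical stand-ins.
import torch

def appearance_consensus_loss(img_a, img_b, regressor, recover_colors, render):
    """img_a, img_b: (B, 3, H, W) images of the same person in different
    poses/backgrounds. Cross-rendered foregrounds should match."""
    verts_a, cam_a = regressor(img_a)                  # predicted mesh + camera
    verts_b, cam_b = regressor(img_b)
    colors_a = recover_colors(img_a, verts_a, cam_a)   # (B, V, 3) vertex colors
    colors_b = recover_colors(img_b, verts_b, cam_b)
    # Swap appearances: A's colors on B's pose, and vice versa.
    fg_ab, mask_ab = render(verts_b, cam_b, colors_a)  # rendered image + FG mask
    fg_ba, mask_ba = render(verts_a, cam_a, colors_b)
    loss_ab = (mask_ab * (fg_ab - img_b).abs()).sum() / mask_ab.sum().clamp(min=1)
    loss_ba = (mask_ba * (fg_ba - img_a).abs()).sum() / mask_ba.sum().clamp(min=1)
    return loss_ab + loss_ba
```

Because every step in such a pipeline (projection, color picking, rendering) is differentiable, a loss of this form can train the regressor end-to-end from unpaired video frames.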

Key Contributions

  1. Self-Supervised Framework: The paper proposes a novel methodology to disentangle foreground (FG) appearances from their backgrounds (BG) by leveraging image pairs portraying the same individual in varied poses and backgrounds. This is achieved without any paired pose supervision, using co-salient FG appearance as the training signal, which bolsters generalization to unseen data.
  2. Color-Recovery Module: A differentiable Color-recovery module removes the conventional dependency on a separate appearance network. It interpolates image colors directly onto 3D mesh vertices via spatial registration on the image plane, delivering accurate reconstructions in the absence of 3D pose annotations (a minimal sketch follows this list).
  3. Reflectional Symmetry: The framework exploits the reflectional symmetry of the human body to propagate colors from visible to occluded mesh vertices, ensuring complete and consistent appearance recovery (also sketched below).
  4. Superior Performance and Application Scope: The results on several benchmarks indicate state-of-the-art performance for model-based methodologies, reinforcing the potential for its application in multiple tasks beyond pose and shape estimation, such as mesh coloration, due to its robust FG appearance consistency.
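As referenced above, here is a hedged sketch of how the color-picking and symmetry steps could be realized with standard differentiable operations. It is an illustration under assumptions, not the paper's implementation: a weak-perspective camera (s, tx, ty), a per-vertex visibility signal (e.g., from a rasterizer's z-buffer), and a precomputed template-level left/right vertex correspondence `sym_idx` are all assumed here.

```python
# Hedged sketch of differentiable color picking + symmetry fill-in
# (an illustration under assumptions, not the paper's implementation).
import torch
import torch.nn.functional as F

def pick_vertex_colors(image, verts, cam):
    """image: (B, 3, H, W); verts: (B, V, 3) camera-frame vertices;
    cam: (B, 3) weak-perspective (s, tx, ty). Returns (B, V, 3) colors
    bilinearly picked at the projected vertex locations."""
    s = cam[:, :1].unsqueeze(1)                # (B, 1, 1) scale
    t = cam[:, 1:].unsqueeze(1)                # (B, 1, 2) translation
    xy = s * verts[:, :, :2] + t               # project into normalized [-1, 1]
    grid = xy.unsqueeze(2)                     # (B, V, 1, 2) for grid_sample
    colors = F.grid_sample(image, grid, align_corners=False)  # (B, 3, V, 1)
    return colors.squeeze(-1).permute(0, 2, 1)                # (B, V, 3)

def symmetry_fill(colors, visibility, sym_idx):
    """colors: (B, V, 3) picked colors; visibility: (B, V) in [0, 1];
    sym_idx: (V,) index of each vertex's left/right mirror on the template.
    Occluded vertices inherit the color of their visible mirror vertex."""
    mirrored = colors[:, sym_idx]              # mirror-vertex colors
    vis = visibility.unsqueeze(-1)             # (B, V, 1)
    vis_m = visibility[:, sym_idx].unsqueeze(-1)
    w = vis / (vis + vis_m).clamp(min=1e-6)    # visibility-weighted blend
    return w * colors + (1.0 - w) * mirrored
```

The key property of both functions is differentiability: gradients flow from the rendered appearance back into the predicted mesh and camera, which is what lets appearance consistency alone supervise pose and shape.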

Results and Implications

The framework demonstrates impressive numerical performance across diverse datasets, including Human3.6M, LSP, and 3DPW. Its predictions attain state-of-the-art results at comparable levels of supervision, outperforming existing model-based approaches. The framework's capacity to generalize to unseen datasets highlights its robustness, a necessary quality for practical deployment in the wild.

This work advocates a shift towards self-supervised learning paradigms, which require less domain-specific annotated data, lower costs, and enable continuous adaptation to ever-changing visual scenes. Removing the dependency on paired 2D pose annotations significantly mitigates the historical domain-gap issues between in-studio and in-the-wild scenarios, paving the way for more adaptable modeling techniques.

Future Developments

By setting a precedent for self-supervised strategies in human mesh recovery, this work opens avenues for further exploration. Future research can focus on improving robustness under occlusion by external objects and refining mesh recovery in partially visible scenarios. Enhancing the realism of texture representation and extending the framework to varied body morphologies also present compelling challenges for subsequent work.

In conclusion, this paper lays a significant foundational step toward more generalized, versatile, and resource-efficient human pose and shape estimation techniques, with broad implications for the evolving AI landscape in vision-centered applications.
