- The paper introduces a model-free approach that reconstructs realistic 3D models of multiple people from a single image, capturing details such as loose clothing and hair.
- It employs a two-stage learning architecture combining multitask depth and segmentation estimation with implicit volumetric refinement, boosting reconstruction accuracy.
- The novel synthetic dataset (MPSD) and rigorous evaluation validate its effectiveness for applications in AR/VR, surveillance, and dynamic scene analysis.
Overview of "Multi-person Implicit Reconstruction from a Single Image"
The paper "Multi-person Implicit Reconstruction from a Single Image" introduces a novel approach for generating detailed 3D reconstructions of multiple people from a single 2D image. The authors have developed an end-to-end learning framework that bypasses the limitations of existing model-based methods, which often fail to capture loose clothing and hair on account of their reliance on parametric body models. Additionally, many current techniques require manual intervention to resolve occlusions, a problem the proposed method efficiently addresses.
Main Contributions
The authors present several key innovations:
- Model-Free Multi-Person 3D Reconstruction: The paper pioneers an end-to-end system that reconstructs multiple humans with realistic clothing and hairstyles. The system does not rely on predefined parametric body models and faithfully reconstructs individuals in crowded or occluded scenes.
- Synthetic Dataset: A new synthetic dataset, MPSD, is introduced to support training and evaluation. It covers a diverse range of scenarios with multiple occluded individuals, varied attire, and complex hairstyles.
- Two-Stage Learning Architecture: The method comprises two components: a multitask network for joint depth and segmentation estimation, and an implicit 3D reconstruction network that refines intermediate volumetric representations via implicit functions (a schematic sketch follows this list).
- Robust Evaluation: Quantitative results demonstrate significant improvements over state-of-the-art techniques in terms of 3D reconstruction accuracy and coherence, validated against both synthetic and real-world data.
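As a rough, hypothetical sketch of how the two stages might be wired together (module names, the single-conv "backbone", and the pooling shortcut are placeholders, not the authors' implementation): stage 1 predicts a depth map and person masks from the image, and stage 2 queries an implicit function, conditioned on image features, to produce the refined occupancy.

```python
# Schematic two-stage data flow (hypothetical names and layer choices).
import torch
import torch.nn as nn

class TwoStageReconstructor(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Stage 1: a shared encoder with two task heads (multitask depth +
        # segmentation). A single conv stands in for a full backbone here.
        self.encoder = nn.Conv2d(3, feat_dim, kernel_size=3, padding=1)
        self.depth_head = nn.Conv2d(feat_dim, 1, kernel_size=1)
        self.seg_head = nn.Conv2d(feat_dim, 1, kernel_size=1)
        # Stage 2: an implicit function queried per 3D point, refining the
        # intermediate volumetric representation (cf. the MLP sketched above).
        self.refiner = nn.Sequential(
            nn.Linear(3 + feat_dim, 512), nn.ReLU(), nn.Linear(512, 1)
        )

    def forward(self, image: torch.Tensor, query_points: torch.Tensor):
        feats = torch.relu(self.encoder(image))       # (B, C, H, W)
        depth = self.depth_head(feats)                # stage-1 depth map
        seg = torch.sigmoid(self.seg_head(feats))     # stage-1 person masks
        # Condition each 3D query point on image features; globally pooled
        # here for brevity, whereas aligned per-point features would be used
        # in a real pipeline.
        pooled = feats.mean(dim=(2, 3))                               # (B, C)
        per_point = pooled[:, None, :].expand(-1, query_points.shape[1], -1)
        occ = torch.sigmoid(
            self.refiner(torch.cat([query_points, per_point], dim=-1))
        )
        return depth, seg, occ                        # occ: (B, N, 1)

# Usage example with random inputs standing in for an image and 3D queries.
model = TwoStageReconstructor()
depth, seg, occ = model(torch.rand(1, 3, 256, 256), torch.rand(1, 4096, 3))
```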
Quantitative Results
The system delivers clear gains in the accuracy and completeness of reconstructions, outperforming several existing methods. It reconstructs scenes containing multiple individuals with varied clothing and hairstyles, even under significant inter-person occlusion.
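The paper's exact evaluation protocol and numbers are not reproduced here. As an illustration of how surface accuracy is commonly measured in this setting, the sketch below computes a symmetric Chamfer distance between predicted and ground-truth point clouds; treating this as the paper's metric is an assumption.

```python
# Symmetric Chamfer distance between predicted and ground-truth point
# clouds -- a standard surface-accuracy metric in monocular reconstruction
# (a plausible example, not the paper's exact evaluation protocol).
import torch

def chamfer_distance(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    # pred: (N, 3), gt: (M, 3) point clouds.
    d = torch.cdist(pred, gt)                  # (N, M) pairwise distances
    pred_to_gt = d.min(dim=1).values.mean()    # each prediction to nearest GT
    gt_to_pred = d.min(dim=0).values.mean()    # each GT point to nearest prediction
    return pred_to_gt + gt_to_pred

# Example: random clouds stand in for reconstructed and reference surfaces.
pred = torch.rand(2048, 3)
gt = torch.rand(2048, 3)
print(f"Chamfer distance: {chamfer_distance(pred, gt):.4f}")
```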
Implications and Future Research
This work has both practical and theoretical implications. Practically, it enhances applications in surveillance, AR/VR content generation, and entertainment by enabling cost-effective, high-fidelity human modeling from single-camera setups. Theoretically, it pushes the boundaries of computer vision, particularly in interpreting monocular cues for complex human-centric scenes.
Future research could focus on further refining the implicit reconstruction process, exploring temporal coherence in video data, and improving robustness against extreme occlusions or poses. Given its ability to handle occlusions and loose clothing, extending this approach to real-time systems and integrating it with motion capture technologies could open new prospects in dynamic scene understanding and interactive content creation.
Overall, this work paves the way for more sophisticated multi-person modeling frameworks that transcend the limitations of current parametric approaches, offering a comprehensive toolset for tackling complex monocular human reconstruction challenges.