- The paper introduces OSX, a one-stage framework with a Component Aware Transformer that unifies global and local body component estimation.
- It employs innovative feature upsampling and keypoint-guided deformable attention to precisely recover hand and face parameters.
- Experimental results show error reductions of 9.5% on AGORA and 7.8% on EHF, demonstrating accuracy and efficiency advantages over multi-stage methods.
One-Stage 3D Whole-Body Mesh Recovery with Component Aware Transformer
Introduction
This paper addresses the challenging task of recovering 3D whole-body meshes, comprising body pose, hand gestures, and facial expressions, from monocular images. Current state-of-the-art methods largely rely on multi-stage pipelines in which separate models handle different body components such as the face and hands. These pipelines incur high computational cost and often produce disjointed results in which the parts do not compose coherently. The authors propose OSX, a one-stage framework that leverages a Component Aware Transformer (CAT) for more efficient, integrated processing.
Methodology
One-Stage Framework
OSX aims to replace the cumbersome processes inherent in multi-stage methods with a single model. It builds on a Vision Transformer (ViT) backbone, pairing a global body encoder with a component-specific decoder for the hands and face.
- Component Aware Transformer (CAT): The CAT pairs an encoder, which processes body tokens to capture global correlations and predict body parameters, with a decoder that regresses hand and face parameters from higher-resolution features.
- Feature Upsampling: The authors introduce a differentiable feature-level upsampling mechanism in place of traditional image-level cropping and rescaling. This yields higher-resolution feature maps, allowing more precise localization of the hands and face.
- Keypoint-Guided Deformable Attention: The decoder uses predicted keypoints to guide deformable attention, concentrating sampling on the relevant regions and sharpening hand and face parameter regression.
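The core idea of the last component, sampling an upsampled feature map at learned offsets around predicted keypoints and aggregating with attention weights, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single-head formulation, the shapes, and the explicit softmax over sampling points are simplifying assumptions.

```python
import numpy as np

def bilinear_sample(fmap, x, y):
    """Bilinearly sample fmap of shape (H, W, C) at continuous coords (x, y)."""
    H, W, _ = fmap.shape
    x = float(np.clip(x, 0, W - 1))
    y = float(np.clip(y, 0, H - 1))
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * fmap[y0, x0]
            + wx * (1 - wy) * fmap[y0, x1]
            + (1 - wx) * wy * fmap[y1, x0]
            + wx * wy * fmap[y1, x1])

def keypoint_deformable_attention(fmap, keypoints, offsets, attn_logits):
    """
    fmap:        (H, W, C) upsampled feature map
    keypoints:   (K, 2)    predicted (x, y) keypoint locations guiding the attention
    offsets:     (K, S, 2) learned sampling offsets around each keypoint
    attn_logits: (K, S)    learned attention logits, one per sampling point
    returns:     (K, C)    aggregated per-keypoint features
    """
    K, S, _ = offsets.shape
    # Softmax over the S sampling points of each keypoint.
    w = np.exp(attn_logits - attn_logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    out = np.zeros((K, fmap.shape[2]))
    for k in range(K):
        for s in range(S):
            x, y = keypoints[k] + offsets[k, s]
            out[k] += w[k, s] * bilinear_sample(fmap, x, y)
    return out
```

In the full model the offsets and attention logits would be predicted by the decoder from the component tokens; here they are simply passed in as arrays to keep the sketch self-contained.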
Results and Evaluation
The method was extensively tested against baseline methods on datasets such as AGORA, EHF, and 3DPW. OSX demonstrated marked improvements over state-of-the-art models in mean per-vertex position error (MPVPE), with error reductions of 9.5% on AGORA and 7.8% on EHF. These gains are attributed to the integrated processing of the one-stage framework, which inherently yields more coherent mesh predictions.
Introduction of UBody
The authors recognize a gap in current datasets, which do not adequately represent scenarios where upper-body expression is critical, such as gesture and sign language recognition. To address this, they introduce UBody, a large-scale dataset featuring a diverse set of real-life scenes focused on expressive upper-body motions. UBody includes comprehensive annotations, allowing for precise benchmarking and training, especially in scenarios with frequent occlusions and varying resolutions of body parts.
Implications and Future Directions
The proposed one-stage approach has significant implications for simplifying the 3D mesh recovery pipeline, reducing computational overhead, and improving integration across body components. Future work may explore greater use of hand- and face-specific datasets to further refine the predictive capabilities of the OSX framework. Applying the framework to real-world applications such as virtual reality, gesture recognition, and human-computer interaction could also yield practical benefits.
The introduction of UBody facilitates further exploration in expressive body mesh recovery, encouraging methods that can generalize to various real-world applications. The dataset is set to inspire further research into domains where precise upper body representation is crucial.
In summary, the paper contributes a novel perspective on whole-body mesh recovery with a streamlined, efficient approach and augments the field with a robust dataset tailored for expressive upper body applications.