- The paper introduces OSX, a one-stage framework with a Component Aware Transformer that unifies global and local body component estimation.
- It employs innovative feature upsampling and keypoint-guided deformable attention to precisely recover hand and face parameters.
- Experimental results show error reductions of 9.5% on AGORA and 7.8% on EHF, demonstrating accuracy and efficiency advantages over multi-stage methods.
One-Stage 3D Whole-Body Mesh Recovery with Component Aware Transformer
Introduction
This paper addresses the challenging task of recovering 3D whole-body meshes, comprising body pose, hand gestures, and facial expressions, from monocular images. Current state-of-the-art methods largely rely on multi-stage pipelines in which separate models handle different body components such as the face and hands. These pipelines incur high computational cost and often produce disjointed results in which the parts do not compose coherently. The authors propose OSX, a one-stage framework that leverages a Component Aware Transformer (CAT) for more efficient, integrated processing.
Methodology
One-Stage Framework
OSX aims to replace the cumbersome processes inherent in multi-stage methods with a single model. It builds on a Vision Transformer (ViT) backbone, pairing a global body encoder with a component-specific decoder for the hands and face.
- Component Aware Transformer (CAT): The CAT pairs an encoder, which processes body tokens to capture global correlations and predict body parameters, with a decoder that regresses hand and face parameters from higher-resolution features.
- Feature Upsampling: The authors introduce a differentiable feature-level upsampling mechanism in place of traditional image-level cropping and rescaling. This yields higher-resolution feature maps, allowing more precise localization of the hands and face.
- Keypoint-Guided Deformable Attention: The decoder uses predicted keypoints to guide deformable attention, concentrating sampling on the relevant regions and sharpening hand and face parameter regression.
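The core idea of the last component, sampling an upsampled feature map at learned offsets around predicted keypoints and aggregating with attention weights, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single-head formulation, the shapes, and the explicit softmax over sampling points are simplifying assumptions.

```python
import numpy as np

def bilinear_sample(fmap, x, y):
    """Bilinearly sample fmap of shape (H, W, C) at continuous coords (x, y)."""
    H, W, _ = fmap.shape
    x = float(np.clip(x, 0, W - 1))
    y = float(np.clip(y, 0, H - 1))
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * fmap[y0, x0]
            + wx * (1 - wy) * fmap[y0, x1]
            + (1 - wx) * wy * fmap[y1, x0]
            + wx * wy * fmap[y1, x1])

def keypoint_deformable_attention(fmap, keypoints, offsets, attn_logits):
    """
    fmap:        (H, W, C) upsampled feature map
    keypoints:   (K, 2)    predicted (x, y) keypoint locations guiding the attention
    offsets:     (K, S, 2) learned sampling offsets around each keypoint
    attn_logits: (K, S)    learned attention logits, one per sampling point
    returns:     (K, C)    aggregated per-keypoint features
    """
    K, S, _ = offsets.shape
    # Softmax over the S sampling points of each keypoint.
    w = np.exp(attn_logits - attn_logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    out = np.zeros((K, fmap.shape[2]))
    for k in range(K):
        for s in range(S):
            x, y = keypoints[k] + offsets[k, s]
            out[k] += w[k, s] * bilinear_sample(fmap, x, y)
    return out
```

In the full model the offsets and attention logits would be predicted by the decoder from the component tokens; here they are simply passed in as arrays to keep the sketch self-contained.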
Results and Evaluation
The method was extensively tested against baseline methods on datasets such as AGORA, EHF, and 3DPW. OSX demonstrated marked improvements over state-of-the-art models in mean per-vertex position error (MPVPE), with error reductions of 9.5% on AGORA and 7.8% on EHF. These gains are attributed to the integrated processing of the one-stage framework, which inherently yields more coherent mesh predictions.
Introduction of UBody
The authors recognize a gap in current datasets, which do not adequately represent scenarios where upper-body expression is critical, such as gesture and sign language recognition. To address this, they introduce UBody, a large-scale dataset featuring a diverse set of real-life scenes focused on expressive upper-body motions. UBody includes comprehensive annotations, allowing for precise benchmarking and training, especially in scenarios with frequent occlusions and varying resolutions of body parts.
Implications and Future Directions
The proposed one-stage approach has significant implications for simplifying the 3D mesh recovery pipeline, reducing computational overhead, and improving integration across body components. Future work may explore greater use of hand- and face-specific datasets to further refine the predictive capabilities of the OSX framework. Applying the framework to real-world applications such as virtual reality, gesture recognition, and human-computer interaction could also yield practical benefits.
The introduction of UBody facilitates further exploration in expressive body mesh recovery, encouraging methods that can generalize to various real-world applications. The dataset is set to inspire further research into domains where precise upper body representation is crucial.
In summary, the paper contributes a novel perspective on whole-body mesh recovery with a streamlined, efficient approach and augments the field with a robust dataset tailored for expressive upper body applications.