End-to-End Human Pose and Mesh Reconstruction with Transformers (2012.09760v3)

Published 17 Dec 2020 in cs.CV

Abstract: We present a new method, called MEsh TRansfOrmer (METRO), to reconstruct 3D human pose and mesh vertices from a single image. Our method uses a transformer encoder to jointly model vertex-vertex and vertex-joint interactions, and outputs 3D joint coordinates and mesh vertices simultaneously. Compared to existing techniques that regress pose and shape parameters, METRO does not rely on any parametric mesh models like SMPL, thus it can be easily extended to other objects such as hands. We further relax the mesh topology and allow the transformer self-attention mechanism to freely attend between any two vertices, making it possible to learn non-local relationships among mesh vertices and joints. With the proposed masked vertex modeling, our method is more robust and effective in handling challenging situations like partial occlusions. METRO generates new state-of-the-art results for human mesh reconstruction on the public Human3.6M and 3DPW datasets. Moreover, we demonstrate the generalizability of METRO to 3D hand reconstruction in the wild, outperforming existing state-of-the-art methods on FreiHAND dataset. Code and pre-trained models are available at https://github.com/microsoft/MeshTransformer.

Summary

The paper presents METRO, a transformer encoder-based method that reconstructs 3D human pose and mesh vertices without relying on traditional parametric models.
It introduces Masked Vertex Modeling, a training objective that improves robustness under partial occlusion, while self-attention captures non-local interactions among mesh vertices and joints.
Experimental results demonstrate state-of-the-art accuracy on datasets like Human3.6M, 3DPW, and FreiHAND, showcasing METRO’s versatility in 3D reconstruction.

End-to-End Human Pose and Mesh Reconstruction with Transformers: An Expert Overview

The paper introduces MEsh TRansfOrmer (METRO), a method for reconstructing 3D human pose and mesh vertices from a single image. METRO uses a transformer encoder to jointly model vertex-vertex and vertex-joint interactions, and outputs 3D joint coordinates and mesh vertices simultaneously. Unlike methods that regress pose and shape parameters, METRO does not rely on a parametric mesh model such as SMPL, so it can be extended to other objects such as hands.

Methodology

METRO uses a transformer encoder to capture non-local relationships among mesh vertices and joints. Earlier methods typically regress the pose and shape parameters of a parametric model, which limits the output to the pose and shape space that the model can express. METRO instead predicts joint and vertex coordinates directly, so it is not bound by those limits.
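To make the idea of unrestricted attention between joints and vertices concrete, here is a minimal sketch of single-head scaled dot-product self-attention over one sequence that holds both joint tokens and vertex tokens. It is not taken from the METRO code base: the toy sizes, the random weights, and the helper names `softmax`, `matmul`, `rand_matrix`, and `self_attention` are assumptions made for illustration, and it uses only the Python standard library.

```python
import math
import random

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply an n x k matrix by a k x m matrix (lists of lists)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    Every token (joint or vertex) attends to every other token; no mesh
    adjacency is used to restrict which pairs may interact.
    """
    q = matmul(tokens, w_q)
    k = matmul(tokens, w_k)
    v = matmul(tokens, w_v)
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    weights = [softmax(r) for r in scores]
    return matmul(weights, v), weights

# Toy sizes (hypothetical): 3 joint tokens followed by 5 vertex tokens.
rng = random.Random(0)
NUM_JOINTS, NUM_VERTICES, DIM = 3, 5, 4

def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

tokens = rand_matrix(NUM_JOINTS + NUM_VERTICES, DIM)
w_q, w_k, w_v = (rand_matrix(DIM, DIM) for _ in range(3))
out, attn = self_attention(tokens, w_q, w_k, w_v)

# attn[i][j] is how much token i attends to token j; the matrix is dense,
# so a vertex row has nonzero weight on every joint and every other vertex.
```

A graph-convolution layer would, by contrast, zero out every entry of `attn` that does not correspond to an edge of the mesh; the dense matrix is what lets distant body parts inform each other.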

A key component of METRO is Masked Vertex Modeling. During training, a random subset of the input joint and vertex queries is masked, and the transformer is still asked to regress all joints and vertices, so it must infer the hidden ones from the rest. This improves robustness under challenging conditions such as partial occlusion. Because self-attention is not restricted by the mesh topology, any query can attend to any other vertex or joint, which helps the model recover body parts that are not visible in the image.
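The masking step can be sketched as follows, under the assumption that Masked Vertex Modeling hides a random fraction of the input queries during training while the loss is still computed over every joint and vertex. The function names `mask_queries` and `l1_loss`, the zero-valued mask token, the 30% ratio, and the choice of an L1 loss are illustrative assumptions, not details confirmed by the text above.

```python
import random

def mask_queries(queries, mask_ratio, mask_token, rng):
    """Replace a random fraction of the queries with a mask token.

    Returns the masked copy and the set of masked indices. The input list
    is left unmodified.
    """
    n = len(queries)
    num_masked = int(round(n * mask_ratio))
    masked_idx = set(rng.sample(range(n), num_masked))
    masked = [list(mask_token) if i in masked_idx else list(q)
              for i, q in enumerate(queries)]
    return masked, masked_idx

def l1_loss(pred, target):
    """Mean absolute error over ALL coordinates, masked or not."""
    total = sum(abs(p - t) for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    return total / (len(pred) * len(pred[0]))

# Hypothetical toy input: 10 queries with 3 features each.
rng = random.Random(0)
queries = [[float(i + 1)] * 3 for i in range(10)]
mask_token = [0.0, 0.0, 0.0]   # assumed placeholder value for hidden queries
masked, masked_idx = mask_queries(queries, 0.3, mask_token, rng)

# A model would take `masked` as input and predict all 10 outputs; the loss
# compares those predictions with the full ground truth, so the hidden
# queries must be inferred from the visible ones.
```

Computing the loss over every output, rather than only the visible ones, is what forces the model to use context from the unmasked queries, which mirrors the situation where part of the body is occluded in the image.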

Experimental Results

METRO reports state-of-the-art results on Human3.6M and 3DPW, two public benchmarks for human mesh reconstruction. These results support the value of learning vertex-vertex and vertex-joint interactions jointly. Among earlier approaches, SPIN regresses SMPL parameters, and GraphCMR regresses vertices with graph convolutions that follow the fixed mesh topology; METRO instead lets self-attention relate any pair of vertices and joints.

Further, the framework generalizes well to the domain of 3D hand reconstruction, as evidenced by its performance on the FreiHAND dataset, indicating its versatility beyond human body mesh reconstruction.

Implications and Future Directions

The work has both practical and theoretical implications. Removing the dependence on parametric models such as SMPL allows the same approach to be applied to other mesh reconstruction tasks. In practice, METRO could benefit fields such as virtual reality and biomedical applications, where accurate and flexible human shape modeling is important.

On the theoretical side, the use of transformers invites further study of non-local interactions in 3D mesh reconstruction. Future research could explore transformer architectures tailored to 3D computer vision tasks, or improve robustness to more complex occlusion scenarios.

Conclusion

In summary, METRO offers a flexible framework for 3D mesh reconstruction that replaces parametric model regression with a transformer encoder predicting joints and vertices directly. Its ability to model non-local interactions and to extend to other objects such as hands illustrates the value of attention mechanisms for 3D human and object modeling. Further refinements in transformer architectures and training strategies could extend the approach.

GitHub

GitHub - microsoft/MeshTransformer: Research code for CVPR 2021 paper "End-to-End Human Pose and Mesh Reconstruction with Transformers" (616 stars)