- The paper introduces Mesh Graphormer, a novel hybrid architecture that combines transformers with graph convolutional networks for precise 3D human mesh reconstruction.
- The approach integrates global long-range dependencies and local mesh topology through graph-enhanced transformer blocks, achieving a PA-MPJPE of 34.5 mm on Human3.6M.
- Comprehensive ablation studies and experiments across multiple datasets highlight the model's effectiveness and its potential impact on VR, motion capture, and interactive applications.
The paper introduces a novel architecture named Mesh Graphormer, which merges the strengths of Transformers and Graph Convolutional Neural Networks (GCNNs) to achieve superior results in 3D human pose and mesh reconstruction from single images. The approach stands out for its ability to model both global and local interactions efficiently, addressing the inherent complexities of human body articulation.
Motivation and Methodology
Human pose and mesh reconstruction have seen significant advancements through the use of Transformers and GCNNs. Transformers excel in capturing long-range dependencies among 3D mesh vertices, whereas GCNNs efficiently handle local interactions using mesh topology. The Mesh Graphormer innovatively integrates these methodologies to capitalize on their individual strengths.
The core of Mesh Graphormer's architecture is a graph-convolution-reinforced transformer encoder. This design incorporates graph convolutions into the transformer blocks, improving the representation of both local neighborhood interactions and broader spatial dependencies. This dual capability is achieved without imposing significant computational overhead, striking a favorable balance between model complexity and performance.
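To make the design concrete, here is a minimal PyTorch sketch of a graph-convolution-reinforced encoder block: self-attention handles global long-range dependencies, and a graph convolution over the mesh adjacency handles local neighborhoods. The names (`GraphormerBlock`, `SimpleGraphConv`), the placeholder adjacency, and the exact placement of normalization are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn


class SimpleGraphConv(nn.Module):
    """One graph convolution: propagate token features along mesh edges."""

    def __init__(self, dim, adj):
        super().__init__()
        self.register_buffer("adj", adj)       # (N, N) normalized mesh adjacency (assumed given)
        self.linear = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (B, N, dim) joint/vertex tokens
        return self.adj @ self.linear(x)


class GraphormerBlock(nn.Module):
    """Transformer block with a graph convolution inserted after self-attention."""

    def __init__(self, dim, num_heads, adj, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.graph_conv = SimpleGraphConv(dim, adj)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)       # global long-range dependencies
        x = x + attn_out
        x = x + self.graph_conv(x)             # local interactions via mesh topology
        x = x + self.mlp(self.norm2(x))
        return x


# Example usage with a dummy adjacency over 445 joint/vertex tokens (illustrative size).
N, dim = 445, 64
adj = torch.eye(N)                             # placeholder; a real model uses the mesh adjacency
block = GraphormerBlock(dim, num_heads=4, adj=adj)
out = block(torch.randn(2, N, dim))            # -> (2, N, dim)
```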
Experimental Evaluation
The Mesh Graphormer was rigorously evaluated against existing methods on several prominent datasets, including Human3.6M, 3DPW, and FreiHAND, and delivered consistent accuracy improvements in each case. Notably, on the Human3.6M dataset, it achieved a PA-MPJPE of 34.5 mm, outperforming prior transformer-based models such as METRO.
The model also leverages image grid features, which proved instrumental in refining 3D coordinate prediction. By enabling joint and vertex tokens to attend to image grid features, the network better captures local details, addressing a significant limitation of previous transformer-based approaches.
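One way to picture this is to append the flattened backbone feature map to the joint/vertex token sequence so that self-attention can relate them; the sketch below shows this, with token counts and tensor shapes chosen purely for illustration rather than taken from the paper's configuration.

```python
import torch

B, C = 2, 64                                    # batch size and token dimension (illustrative)
grid = torch.randn(B, C, 7, 7)                  # CNN backbone feature map (assumed 7x7)
queries = torch.randn(B, 14 + 431, C)           # joint + coarse-vertex tokens (illustrative counts)

grid_tokens = grid.flatten(2).transpose(1, 2)   # (B, 49, C): each spatial cell becomes a token
tokens = torch.cat([queries, grid_tokens], dim=1)

# `tokens` would then pass through the encoder blocks, where self-attention lets every
# joint/vertex token attend to the grid tokens and recover local image detail.
print(tokens.shape)                             # torch.Size([2, 494, 64])
```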
Ablation Studies and Architectural Insights
The paper provides comprehensive ablation studies elucidating the contributions of various design elements. Inserting graph convolutions after the Multi-Head Self-Attention (MHSA) layers was particularly effective, enhancing the model's ability to capture both fine-grained and global context. Comparative studies also showed that a graph residual block outperforms basic graph convolution layers, underscoring its role in the proposed framework; a sketch of such a block follows.
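The graph residual block can be pictured as below. This is an assumed composition (two graph convolutions with normalization, an activation, and a skip connection); the authors' exact layer widths and normalization choices may differ.

```python
import torch
import torch.nn as nn


class GraphConv(nn.Module):
    """Single graph convolution over a fixed, normalized mesh adjacency."""

    def __init__(self, dim, adj):
        super().__init__()
        self.register_buffer("adj", adj)        # (N, N)
        self.linear = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (B, N, dim)
        return self.adj @ self.linear(x)


class GraphResidualBlock(nn.Module):
    """Two graph convolutions wrapped with normalization, activation, and a skip connection."""

    def __init__(self, dim, adj):
        super().__init__()
        self.body = nn.Sequential(
            nn.LayerNorm(dim), GraphConv(dim, adj), nn.GELU(),
            nn.LayerNorm(dim), GraphConv(dim, adj),
        )

    def forward(self, x):
        return x + self.body(x)                 # residual path keeps the original token features
```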
Implications and Future Directions
The integration of graph convolutions into transformer architectures offers a powerful paradigm for future research in mesh reconstruction and, potentially, other domains requiring sophisticated spatial modeling. The work opens avenues for broader exploration of hybrid architectures that combine the spatial precision of convolutional methods with the contextual depth afforded by attention mechanisms.
Practical implications include enhanced accuracy in applications such as virtual reality, motion capture, and interactive systems. Future research may further optimize the trade-off between computational efficiency and accuracy, and expand training-data diversity to encompass more varied human morphologies and interaction scenarios, which could improve robustness and adaptability across broader application contexts.
Overall, the Mesh Graphormer is a substantial step forward in the field of 3D reconstruction, offering a compelling argument for hybrid model architectures that leverage the best of both convolutional and attention-based worlds.