- The paper demonstrates that MeshFormer achieves superior 3D reconstruction by integrating 3D sparse voxels, transformers, and SDF supervision.
- The proposed model uses a unified single-stage training strategy that enhances efficiency and geometric accuracy over traditional dense-view methods.
- Empirical results show higher F-scores and lower Chamfer distances, highlighting its practical benefits for applications like VR, gaming, and digital content creation.
MeshFormer: High-Quality Mesh Generation with 3D-Guided Reconstruction Model
The paper "MeshFormer: High-Quality Mesh Generation with 3D-Guided Reconstruction Model" introduces a sparse-view reconstruction model for high-quality 3D textured meshes. The proposed model, MeshFormer, leverages explicit 3D native structure and a projective inductive bias to achieve efficient training and superior performance compared with existing methods.
Overview of Methodology
MeshFormer departs from conventional methods that typically require dense input views and lengthy processing. It is a sparse-view reconstruction model built around several novel components:
- 3D Sparse Voxels and Transformers: The model combines a 3D sparse voxel representation with transformer layers interleaved with 3D convolutions. This contrasts with the triplane representation adopted by some recent large reconstruction models, which often lacks spatial precision and efficiency. The explicit 3D structure gives MeshFormer a direct, projection-aware correspondence between 3D voxels and 2D multi-view features (see the feature-lifting sketch after this list).
- Integration of Signed Distance Function (SDF) Supervision and Surface Rendering: MeshFormer combines SDF supervision with surface rendering to learn high-quality meshes directly. This avoids the complex multi-stage training that hampers other approaches and yields faster, more stable training, while the explicit geometry supervision from SDF values improves mesh quality (a minimal combined-loss sketch also follows this list).
- Utilization of Multi-View Normal Maps: The input to MeshFormer includes not only sparse-view RGB images but also their corresponding normal maps. These normal maps provide crucial geometric cues and aid in refining the learned geometry. They can be produced by 2D diffusion models such as Zero123++, leading to more informed and detailed reconstruction.
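As a rough illustration of the projection-aware correspondence described above, the sketch below lifts per-view 2D features (e.g., encoded RGB and normal images) onto sparse voxel centers by projecting each voxel into every camera and bilinearly sampling the feature maps. The function name, tensor shapes, and sampling scheme are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn.functional as F

def lift_multiview_features(voxel_xyz, feat_maps, K, w2c):
    """Project sparse voxel centers into each input view and gather 2D features.

    voxel_xyz: (N, 3)        world-space centers of occupied voxels
    feat_maps: (V, C, H, W)  per-view feature maps (e.g., encoded RGB + normal images)
    K:         (V, 3, 3)     camera intrinsics
    w2c:       (V, 4, 4)     world-to-camera extrinsics
    Returns:   (N, V, C)     per-voxel, per-view features for projection-aware attention
    """
    V, C, H, W = feat_maps.shape
    ones = torch.ones_like(voxel_xyz[:, :1])
    homo = torch.cat([voxel_xyz, ones], dim=1)                        # (N, 4)
    cam = torch.einsum('vij,nj->vni', w2c, homo)[..., :3]             # (V, N, 3)
    pix = torch.einsum('vij,vnj->vni', K, cam)                        # (V, N, 3)
    uv = pix[..., :2] / pix[..., 2:3].clamp(min=1e-6)                 # (V, N, 2) pixel coords
    # Normalize pixel coordinates to [-1, 1] for grid_sample (x along width, y along height).
    grid = torch.stack([uv[..., 0] / (W - 1), uv[..., 1] / (H - 1)], dim=-1) * 2 - 1
    sampled = F.grid_sample(feat_maps, grid.unsqueeze(2), align_corners=True)  # (V, C, N, 1)
    return sampled.squeeze(-1).permute(2, 0, 1)                       # (N, V, C)
```

In a transformer block these lifted features could serve as keys and values for cross-attention from the voxel tokens, which is one way to realize the voxel-to-image correspondence the paper emphasizes.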
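The single-stage objective described in the second bullet can be pictured as one weighted sum of rendering and geometry terms. The dictionary keys and weights below are illustrative assumptions, not the paper's exact output heads or hyperparameters.

```python
import torch.nn.functional as F

def single_stage_loss(pred, gt, w_sdf=1.0, w_rgb=1.0, w_normal=0.5):
    """Combine surface-rendering losses with explicit SDF supervision in one stage.

    pred / gt are dicts of tensors; the keys are hypothetical:
      'sdf'    - signed distance values queried near the surface
      'rgb'    - colors rendered from the predicted surface
      'normal' - normal maps rendered from the predicted surface
    """
    loss_sdf = F.l1_loss(pred['sdf'], gt['sdf'])           # explicit geometry supervision
    loss_rgb = F.mse_loss(pred['rgb'], gt['rgb'])          # surface rendering: color
    loss_normal = F.l1_loss(pred['normal'], gt['normal'])  # surface rendering: normals
    return w_sdf * loss_sdf + w_rgb * loss_rgb + w_normal * loss_normal
```

Because every term is differentiable with respect to the same network outputs, a single optimizer pass suffices, which is the sense in which training is "single-stage."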
Numerical Results and Comparative Analysis
MeshFormer demonstrates strong numerical results, significantly outperforming its contemporaries on key metrics. Evaluated on datasets like GSO and OmniObject3D, MeshFormer achieves:
- Higher F-Scores: Indicative of better mesh precision and recall.
- Lower Chamfer Distances (CD): Reflecting improved geometric accuracy.
- Higher PSNR and Lower LPIPS: Illustrating enhanced visual quality in rendered views.
For instance, on the GSO dataset, MeshFormer attains a Chamfer distance of 0.031 and an F-score of 0.963, outperforming competitive methods like One-2-3-45++ and MeshLRM. Moreover, it achieves this performance with significantly reduced computational resources, requiring just 8 GPUs for training compared to the over one hundred GPUs necessary for some baseline models.
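For readers who want to reproduce these geometry metrics on their own meshes, the sketch below computes a Chamfer distance and an F-score between point clouds sampled from predicted and ground-truth surfaces. Thresholds, sampling density, and normalization conventions differ between papers, so this is a generic illustration rather than the paper's exact evaluation protocol.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_and_fscore(pred_pts, gt_pts, tau=0.05):
    """Chamfer distance and F-score between two (N, 3) point clouds.

    tau is the distance threshold used for precision/recall; its value (and any
    rescaling of the shapes beforehand) is an assumption here.
    """
    d_pred_to_gt = cKDTree(gt_pts).query(pred_pts)[0]   # nearest-GT distance per predicted point
    d_gt_to_pred = cKDTree(pred_pts).query(gt_pts)[0]   # nearest-prediction distance per GT point
    chamfer = d_pred_to_gt.mean() + d_gt_to_pred.mean()
    precision = (d_pred_to_gt < tau).mean()
    recall = (d_gt_to_pred < tau).mean()
    fscore = 2 * precision * recall / max(precision + recall, 1e-8)
    return chamfer, fscore
```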
Technical Contributions
The key technical contributions of MeshFormer can be summarized as follows:
- Efficient Training with Explicit 3D Structures:
MeshFormer's architecture trains more efficiently by operating on 3D sparse voxels rather than triplanes, yielding a more faithful representation of complex structures with fewer artifacts. Because only voxels near the surface need to be stored, the memory footprint grows far more slowly with resolution than a dense grid (see the memory sketch after this list).
- Unified Single-Stage Training Strategy:
By combining surface rendering with explicit 3D SDF supervision, MeshFormer avoids the instability and inefficiency of multi-stage training processes, thereby streamlining the learning of high-quality meshes.
- Geometric Enhancement through Normal Textures:
The model predicts an additional 3D normal texture that can be used to enhance the mesh geometry post hoc, helping the reconstructed meshes retain sharp, fine-grained geometric detail (see the normal-sampling sketch after this list).
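To make the first contribution concrete, the toy comparison below contrasts the memory of a dense feature grid with a sparse layout that stores only near-surface voxels as coordinate/feature pairs. The resolution, channel count, and occupancy ratio are made-up numbers chosen only to show the order-of-magnitude gap.

```python
import numpy as np

R, C = 256, 32                       # grid resolution and feature channels (illustrative)
surface_ratio = 0.02                 # assumed fraction of voxels near the surface

n_occupied = int(surface_ratio * R**3)
coords = np.zeros((n_occupied, 3), dtype=np.int32)    # integer voxel coordinates
feats = np.zeros((n_occupied, C), dtype=np.float32)   # per-voxel feature vectors

dense_bytes = R**3 * C * 4           # a dense float32 grid stores every cell
sparse_bytes = coords.nbytes + feats.nbytes
print(f"dense: {dense_bytes / 1e9:.2f} GB, sparse: {sparse_bytes / 1e9:.2f} GB")
# dense: 2.15 GB, sparse: 0.05 GB
```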
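As for the third contribution, one simple way to use a predicted 3D normal texture is to query it at each mesh vertex and attach the result as shading normals (or feed it to a separate refinement step). The grid layout and nearest-neighbor lookup below are assumptions for illustration, not the paper's enhancement algorithm.

```python
import numpy as np

def sample_normal_texture(vertices, normal_grid, grid_min, voxel_size):
    """Nearest-neighbor sample a predicted 3D normal texture at mesh vertices.

    vertices:    (V, 3) mesh vertex positions in world space
    normal_grid: (R, R, R, 3) predicted normals on a voxel grid (hypothetical output head)
    grid_min:    (3,) world position of the corner voxel's center
    voxel_size:  edge length of one voxel
    """
    res = normal_grid.shape[0]
    idx = np.round((vertices - grid_min) / voxel_size).astype(int)
    idx = np.clip(idx, 0, res - 1)
    normals = normal_grid[idx[:, 0], idx[:, 1], idx[:, 2]]
    # Re-normalize; the result can be exported as per-vertex shading normals
    # or used to drive a normal-guided geometry refinement pass.
    return normals / (np.linalg.norm(normals, axis=1, keepdims=True) + 1e-8)
```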
Practical and Theoretical Implications
Practically, MeshFormer democratizes high-quality 3D asset creation, making it accessible to users with limited resources while reducing training times and computational demands. Theoretically, it challenges the prevailing reliance on dense input views, showing that sparse-view models with 3D guidance can achieve competitive or superior results. This paradigm shift opens new avenues for sparse-view 3D reconstruction research and applications, including in areas such as virtual reality, gaming, and digital content creation.
Future Speculations
Looking ahead, future developments could focus on further improving multi-view image prediction from 2D diffusion models to reduce sensitivity to imperfect normal maps. Additionally, integrating more robust mechanisms to handle occlusion and visibility challenges in sparse-view settings could broaden the model's practical applicability. Further research could also explore adaptive 3D representations and dynamic attention mechanisms within the transformer architecture to enhance flexibility and performance.
In conclusion, MeshFormer represents a significant step forward in the domain of 3D mesh generation, offering a balanced approach that leverages explicit 3D knowledge, efficient training processes, and multi-view geometric guidance to produce high-quality outputs in a computationally efficient manner. This model demonstrates that high-quality 3D reconstruction is achievable with sparsely available views, heralding a new era of efficiency and quality in 3D reconstruction methodologies.