- The paper introduces CRM, a novel convolutional model that transforms a single image into a high-fidelity 3D textured mesh by leveraging orthographic images and canonical coordinate maps in a U-Net framework.
- It decodes the triplane into SDF values, texture colors, and Flexicubes parameters, then extracts a textured mesh via dual marching cubes in roughly 10 seconds.
- The approach reduces training costs and enhances output quality, providing significant potential for applications in VR, gaming, and architectural visualization.
Introducing the Convolutional Reconstruction Model (CRM): A Fast and Efficient Method for Single-Image-to-3D Textured Mesh Transformation
Overview
The field of 3D generation has advanced considerably, driven largely by transformer-based models. These advances, however, have been held back by the scarcity of 3D training data and by the limited use of geometric priors in transformer-based architectures. Addressing these challenges, we introduce the Convolutional Reconstruction Model (CRM), a novel approach for rapidly transforming a single image into a high-fidelity 3D textured mesh. Unlike its predecessors, CRM is designed to fully exploit geometric priors, producing higher-quality meshes with shorter training times.
Insight and Innovation
Central to CRM is the observation that a triplane corresponds spatially to six orthographic images, a property underexploited by previous 3D generative models. Triplanes are pivotal because they represent high-resolution 3D content efficiently, and their three planes align geometrically with orthographic projections of the object. Recognizing this, CRM uses a convolutional network architecture tailored to this spatial alignment: the orthographic images and canonical coordinate maps (CCMs) are fed directly into a convolutional U-Net, yielding clear gains in both the quality and the efficiency of 3D mesh generation.
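To make the idea concrete, here is a minimal PyTorch sketch (not the authors' code) of a convolutional U-Net that stacks the six orthographic views and their CCMs channel-wise and predicts triplane features at the same spatial resolution. The channel counts, depth, and the simple channel-wise stacking are illustrative assumptions; CRM itself arranges inputs and outputs so that they align spatially with the rolled-out triplane.

```python
# Minimal sketch, not the authors' implementation: a small convolutional U-Net
# mapping six orthographic RGB views plus their canonical coordinate maps (CCMs)
# to triplane features. Channel counts, depth, and the channel-wise stacking of
# views are illustrative assumptions.
import torch
import torch.nn as nn

class TriplaneUNet(nn.Module):
    def __init__(self, in_ch=6 * (3 + 3), tri_ch=32, base=64):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, base, 3, padding=1), nn.SiLU(),
                                  nn.Conv2d(base, base, 3, padding=1), nn.SiLU())
        self.down = nn.Conv2d(base, base * 2, 3, stride=2, padding=1)
        self.enc2 = nn.Sequential(nn.Conv2d(base * 2, base * 2, 3, padding=1), nn.SiLU())
        self.up = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = nn.Sequential(nn.Conv2d(base * 2, base, 3, padding=1), nn.SiLU())
        self.head = nn.Conv2d(base, 3 * tri_ch, 1)  # 3 planes, tri_ch channels each

    def forward(self, views, ccms):
        # views, ccms: (B, 6, 3, H, W) -> stack all six views and CCMs channel-wise
        b, v, c, h, w = views.shape
        x = torch.cat([views, ccms], dim=2).reshape(b, v * 2 * c, h, w)
        e1 = self.enc1(x)
        e2 = self.enc2(self.down(e1))
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))  # skip connection
        tri = self.head(d1)                                  # (B, 3*tri_ch, H, W)
        return tri.reshape(b, 3, -1, h, w)                   # three feature planes

planes = TriplaneUNet()(torch.randn(2, 6, 3, 64, 64), torch.randn(2, 6, 3, 64, 64))
print(planes.shape)  # torch.Size([2, 3, 32, 64, 64])
```

Because every layer is convolutional, each output pixel of the triplane depends mainly on a local neighbourhood of the pixel-aligned input views, which is exactly the spatial prior the paper exploits.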
Moreover, CRM adopts Flexicubes as its geometric representation, enabling end-to-end optimization directly on textured meshes. Unlike the representations used in prior work (such as NeRF or Gaussian Splatting), this allows high-quality meshes to be extracted straightforwardly, without post-processing.
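The practical consequence is that supervision can be applied in image space on the extracted mesh itself. Below is a minimal, hypothetical sketch of such a rendering loss: `render_views` is a placeholder for a differentiable rasterizer (not a CRM or Flexicubes API), and the paper's full training objective contains more terms than shown here.

```python
# Sketch of end-to-end image-space supervision on the extracted mesh.
# `render_views` is a hypothetical differentiable renderer passed in by the
# caller; gradients flow from the rendered pixels back through the mesh
# vertices into the reconstruction network.
import torch.nn.functional as F

def mesh_render_loss(render_views, mesh, cameras, gt_rgb, gt_mask):
    pred_rgb, pred_mask = render_views(mesh, cameras)  # differentiable rendering
    return F.mse_loss(pred_rgb, gt_rgb) + F.mse_loss(pred_mask, gt_mask)
```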
Methodology
CRM's pipeline begins by generating six orthographic images and their canonical coordinate maps (CCMs) from a single input image, using a purpose-built multi-view diffusion model. These images and CCMs then serve as the input to the reconstruction network, a convolutional U-Net that outputs a triplane.
The convolutional design is chosen for its strong pixel-level alignment: input views map onto the triplane closely enough to preserve intricate detail in the final mesh. Including CCMs as input gives the network explicit geometric cues and further improves mesh quality.
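Putting the stages together, the inference path can be summarized as below. Every function and method name in this sketch is a hypothetical placeholder for one of the paper's components, not an identifier from the CRM codebase.

```python
# High-level inference sketch with hypothetical placeholder components.
import torch

@torch.no_grad()
def image_to_mesh(input_image, diffusion, unet, decoder):
    # 1. Multi-view diffusion: single image -> six orthographic views + CCMs.
    views, ccms = diffusion.sample(input_image)   # (1, 6, 3, H, W) each
    # 2. Convolutional U-Net: pixel-aligned views/CCMs -> triplane features.
    triplane = unet(views, ccms)                  # (1, 3, C, H, W)
    # 3. MLP decoding + dual marching cubes (Flexicubes) -> textured mesh.
    vertices, faces, vertex_colors = decoder.extract_mesh(triplane)
    return vertices, faces, vertex_colors
```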
For 3D mesh generation, the rolled-out triplane produced by the U-Net is decoded by a set of MLPs into SDF values, texture colors, and Flexicubes parameters; a textured mesh is then extracted with dual marching cubes. The entire process takes roughly 10 seconds.
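The sketch below shows one plausible form of this decoding step: features for each 3D query point are gathered from the three planes and passed through small MLP heads. The projection convention, the summed feature aggregation, and the head sizes are assumptions; the 21 per-cube weights follow the Flexicubes formulation (8 alpha, 12 beta, 1 gamma), though CRM's exact heads may differ.

```python
# Sketch of decoding triplane features at 3D query points (e.g. Flexicubes grid
# vertices) into SDF, color, and Flexicubes parameters. Details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def sample_triplane(planes, pts):
    """planes: (B, 3, C, H, W); pts: (B, N, 3) in [-1, 1]. Returns (B, N, C)."""
    xy, xz, yz = pts[..., [0, 1]], pts[..., [0, 2]], pts[..., [1, 2]]
    feats = 0
    for plane, coords in zip(planes.unbind(dim=1), (xy, xz, yz)):
        # grid_sample expects a (B, N, 1, 2) grid; output is (B, C, N, 1)
        f = F.grid_sample(plane, coords.unsqueeze(2), align_corners=False)
        feats = feats + f.squeeze(-1).transpose(1, 2)
    return feats

class TriplaneDecoder(nn.Module):
    def __init__(self, c=32, hidden=64):
        super().__init__()
        def mlp(out_dim):
            return nn.Sequential(nn.Linear(c, hidden), nn.SiLU(), nn.Linear(hidden, out_dim))
        self.sdf_head = mlp(1)      # signed distance per grid vertex
        self.rgb_head = mlp(3)      # texture color
        self.deform_head = mlp(3)   # Flexicubes per-vertex deformation
        self.weight_head = mlp(21)  # Flexicubes per-cube weights (8 alpha + 12 beta + 1 gamma)

    def forward(self, planes, pts):
        f = sample_triplane(planes, pts)
        return (self.sdf_head(f), torch.sigmoid(self.rgb_head(f)),
                self.deform_head(f), self.weight_head(f))

dec = TriplaneDecoder()
sdf, rgb, deform, weights = dec(torch.randn(2, 3, 32, 128, 128),
                                torch.rand(2, 1024, 3) * 2 - 1)
print(sdf.shape, rgb.shape, deform.shape, weights.shape)
```

Given these outputs on a dense grid, dual marching cubes (as formulated in Flexicubes) extracts the vertices and faces of the textured mesh.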
Implications and Potential
CRM not only demonstrates a marked improvement in the fidelity of generated 3D meshes but also highlights the importance of geometric priors in enhancing generative models' efficiency and output quality. With its significantly reduced training costs and rapid generation capability, CRM stands as a promising advancement in the field of 3D content creation.
Furthermore, CRM's methodology paves the way for future developments in 3D generative models, particularly in how geometric considerations can be ingeniously integrated into model architectures. Its potential applications span from virtual reality and gaming to more practical uses such as product design and architectural visualization.
Conclusion
In summary, the Convolutional Reconstruction Model introduces a transformative approach to generating textured 3D meshes from single images. By using a convolutional architecture that exploits the spatial correspondence between triplanes and orthographic views, and by employing Flexicubes for direct mesh optimization, CRM sets a new bar for speed, efficiency, and quality in 3D generative models.