Pixel2Mesh: Generating 3D Mesh Models from Single RGB Images (1804.01654v2)

Published 5 Apr 2018 in cs.CV

Abstract: We propose an end-to-end deep learning architecture that produces a 3D shape in triangular mesh from a single color image. Limited by the nature of deep neural network, previous methods usually represent a 3D shape in volume or point cloud, and it is non-trivial to convert them to the more ready-to-use mesh model. Unlike the existing methods, our network represents 3D mesh in a graph-based convolutional neural network and produces correct geometry by progressively deforming an ellipsoid, leveraging perceptual features extracted from the input image. We adopt a coarse-to-fine strategy to make the whole deformation procedure stable, and define various of mesh related losses to capture properties of different levels to guarantee visually appealing and physically accurate 3D geometry. Extensive experiments show that our method not only qualitatively produces mesh model with better details, but also achieves higher 3D shape estimation accuracy compared to the state-of-the-art.

Citations (1,262)

Summary

  • The paper introduces an end-to-end deep learning model that transforms an initial ellipsoid into an accurate 3D triangular mesh using single RGB images.
  • It utilizes a graph-based convolutional neural network with a coarse-to-fine deformation strategy and multiple mesh-related losses to ensure visual and structural precision.
  • Evaluations on the ShapeNet dataset show that Pixel2Mesh outperforms competing methods like 3D-R2N2 and PSG in F-score, Chamfer Distance, and Earth Mover's Distance.

Pixel2Mesh: Generating 3D Mesh Models from Single RGB Images

Pixel2Mesh introduces an end-to-end deep learning architecture that generates 3D shape models as triangular meshes from single RGB images. The paper focuses on the mesh representation, which is more readily usable in real-world applications than the volumetric or point-cloud representations produced by most prior methods.

Methodology and Architecture

The proposed architecture is distinguished by its use of a graph-based convolutional neural network (GCN) to represent the 3D mesh. The network progressively deforms an initial ellipsoid to align with the target geometry, guided by perceptual features extracted from the input image. The framework adopts a coarse-to-fine strategy to keep the deformation process stable and combines multiple mesh-related losses to maintain both visual and physical accuracy.

The core components of Pixel2Mesh are as follows (a brief code sketch of a single graph-convolution step appears after the list):

  • Image Feature Network: A VGG-16-based convolutional network that extracts features from the input image.
  • Coarse-to-Fine Mesh Deformation Network: A series of mesh deformation blocks connected by two graph unpooling layers. The deformation blocks incrementally refine the mesh by using the perceptual features extracted from the image feature network.
  • Graph Unpooling Layers: These layers increase the mesh resolution, enabling the network to capture finer details progressively.
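
To make the graph-based deformation concrete, here is a minimal sketch, not the authors' released code, of one graph-convolution step of the kind used inside a mesh deformation block: each vertex feature is updated from its own features plus an aggregate over its 1-ring neighbours. The layer name, feature dimensions, and dense adjacency handling are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """One graph-convolution step over mesh vertices (illustrative sketch)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w_self = nn.Linear(in_dim, out_dim)   # transform of the vertex itself
        self.w_neigh = nn.Linear(in_dim, out_dim)  # transform of the neighbour aggregate

    def forward(self, x, adj):
        # x:   (V, in_dim) per-vertex features
        # adj: (V, V) row-normalized adjacency matrix of the mesh graph
        neigh = adj @ x                             # average features over 1-ring neighbours
        return torch.relu(self.w_self(x) + self.w_neigh(neigh))

# Toy usage: 4 vertices with 3-D coordinates as features, fully connected graph.
V = 4
x = torch.randn(V, 3)
adj = (torch.ones(V, V) - torch.eye(V)) / (V - 1)   # row-normalized, no self-loops
layer = GraphConv(3, 16)
out = layer(x, adj)                                  # (4, 16) updated vertex features
```

In the full network, stacks of such layers (with perceptual features from the image network concatenated to the vertex features) regress per-vertex coordinate offsets, and graph unpooling layers subdivide the mesh between stages.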

Loss Functions

The paper defines four loss functions used to train the model (sketches of the Chamfer and edge-length terms appear after the list):

  • Chamfer Loss: Ensuring vertex positions approximate the target mesh.
  • Surface Normal Loss: Promoting smooth surface characteristics by aligning predicted normals with those of the ground truth.
  • Laplacian Regularization: Preventing excessive local deformations and maintaining the mesh’s geometric regularity.
  • Edge Length Regularization: Reducing flying vertices by penalizing excessive edge lengths.
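
As an illustration of how two of these terms can be computed, below is a minimal PyTorch-style sketch of the symmetric Chamfer loss and the edge-length regularizer. This is an assumption about the standard formulations, not the paper's reference implementation; practical training code typically replaces the dense pairwise-distance matrix with a dedicated nearest-neighbour kernel, and the loss weight in the usage example is a placeholder.

```python
import torch

def chamfer_loss(pred, gt):
    # pred: (N, 3) predicted mesh vertices; gt: (M, 3) ground-truth surface points
    d = torch.cdist(pred, gt)                             # (N, M) pairwise Euclidean distances
    loss_pred_to_gt = (d.min(dim=1).values ** 2).mean()   # each prediction to its nearest GT point
    loss_gt_to_pred = (d.min(dim=0).values ** 2).mean()   # each GT point to its nearest prediction
    return loss_pred_to_gt + loss_gt_to_pred

def edge_length_loss(verts, edges):
    # verts: (N, 3) vertex positions; edges: (E, 2) vertex-index pairs of the mesh
    diff = verts[edges[:, 0]] - verts[edges[:, 1]]
    return (diff ** 2).sum(dim=1).mean()                  # penalize long edges / flying vertices

# Toy usage with random data; the 0.1 weight is an illustrative choice.
pred = torch.rand(256, 3)
gt = torch.rand(1024, 3)
edges = torch.randint(0, 256, (512, 2))
total = chamfer_loss(pred, gt) + 0.1 * edge_length_loss(pred, edges)
```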

Experimental Results

Pixel2Mesh was evaluated extensively using the ShapeNet dataset and compared against prominent state-of-the-art methods, such as 3D-R2N2 and PSG. The results highlight the superiority of the proposed method in terms of F-score, Chamfer Distance (CD), and Earth Mover's Distance (EMD). Specifically:

  • F-Score: Pixel2Mesh achieved average scores of 59.72% at threshold τ and 74.19% at threshold 2τ, outperforming other methods such as 3D-R2N2 and PSG.
  • Chamfer Distance and EMD: The model consistently showed lower CD and EMD values compared to competitors, indicating better reconstruction accuracy and surface regularity.
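
For reference, the F-score at a distance threshold τ is the harmonic mean of precision (the fraction of predicted points within τ of the ground truth) and recall (the fraction of ground-truth points within τ of a prediction). The sketch below is a plain PyTorch illustration of that definition; the threshold value and point counts in the usage lines are placeholders, not the paper's evaluation settings.

```python
import torch

def f_score(pred, gt, tau):
    # pred: (N, 3) predicted points; gt: (M, 3) ground-truth points; tau: distance threshold
    d = torch.cdist(pred, gt)                            # (N, M) pairwise distances
    precision = (d.min(dim=1).values < tau).float().mean()
    recall = (d.min(dim=0).values < tau).float().mean()
    return 2 * precision * recall / (precision + recall + 1e-8)

# Toy usage with random point sets and placeholder thresholds.
pred = torch.rand(1024, 3)
gt = torch.rand(1024, 3)
print(f_score(pred, gt, tau=0.01), f_score(pred, gt, tau=0.02))
```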

Implications and Future Developments

The paper affirms the potential of mesh representations in generating high-quality 3D models from 2D images. The ability to maintain surface details and avoid the resolution limitations inherent in voxel-based approaches underlines the practical value of the Pixel2Mesh architecture.

Future research could expand on this work in several directions:

  1. Topology Variability: Extending the architecture to handle diverse topologies beyond the genus-0 constraint imposed by deforming an initial ellipsoid.
  2. Scene-Level Reconstruction: Scaling the approach to reconstruct entire scenes from single or multiple images.
  3. Multi-View Learning: Enhancing performance and robustness by incorporating multi-view constraints, enabling more accurate reconstructions from varied perspectives.

In conclusion, Pixel2Mesh is a significant contribution to the field of 3D shape generation, effectively leveraging mesh representations for more reliable and detailed 3D reconstructions from single RGB images. The methods and insights provided pave the way for further advancements in computer vision and graphics applications.