Pixel2Mesh++: Multi-View Mesh Reconstruction
- The paper demonstrates that incorporating cross-view photometric consistency into iterative graph-based deformation yields robust and precise 3D shape recovery.
- The method leverages multi-view feature aggregation, hypothesis sampling, and a residual GCN scoring network to iteratively refine mesh vertices.
- Empirical results on ShapeNet show improved Chamfer Distance, F-score, and IoU, underscoring the technique’s efficiency and robustness compared to related approaches.
Pixel2Mesh++ denotes a class of multi-view 3D mesh reconstruction frameworks that extend the Pixel2Mesh approach of iterative graph-based mesh deformation by incorporating cross-view photometric consistency into the deformation reasoning process (Wen et al., 2022, Wen et al., 2019). Rather than regressing mesh geometry directly from color images, Pixel2Mesh++ infers a sequence of local deformations for each vertex, using multi-view perceptual features gathered by projecting through camera geometry that is either given or estimated by a pose network. The architecture is distinguished by its hypothesis sampling mechanism, statistical pooling over multiple views, and graph convolutional scoring, enabling robust, precise, and topology-agnostic 3D shape recovery from a few calibrated (or pose-estimated) images.
1. Architectural Framework
Pixel2Mesh++ is organized as a two-stage pipeline:
Stage 0: Coarse Mesh Initialization
Two initialization regimes are supported:
- MVP2M (initialization for the Ours-P variant): Multi-view extension of Pixel2Mesh, deforming an ellipsoid (2466-vertex mesh, fixed connectivity) per object, using features pooled from all views.
- MVDISN (initialization for the Ours-D variant): Multi-view Deep Implicit Surface Network with Marching Cubes extraction (variable topology), generating more diverse initial mesh structures.
Stage 1: Multi-View Deformation Network (MDN)
MDN operates iteratively, refining the mesh via per-vertex local graph convolutions. Each mesh vertex is surrounded by 42 offset hypotheses (from a scaled icosahedron) plus itself, forming a 43-node local graph. For each hypothesized location, early-layer VGG features are sampled per input image and aggregated via view-wise mean, max, and standard deviation statistics, producing hypothesis features invariant to view order or number. These features, concatenated with 3D coordinates, are input to a 6-block residual GCN scoring network that outputs unnormalized scores (one per hypothesis). A softmax over these scores produces a convex combination, repositioning the vertex toward the most photometrically consistent hypothesis. Typically, three MDN iterations are sufficient for convergence.
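The hypothesis shell described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: it assumes the 42 offsets are the 12 vertices of an icosahedron plus its 30 edge midpoints pushed onto the same sphere (one subdivision step), and the function names and the `radius` value used in the usage note are hypothetical.

```python
import itertools
import numpy as np

def icosahedron_vertices():
    """Return the 12 unit-length vertices of a regular icosahedron."""
    phi = (1.0 + np.sqrt(5.0)) / 2.0  # golden ratio
    verts = []
    for a, b in itertools.product((-1.0, 1.0), repeat=2):
        # Cyclic permutations of (0, +-1, +-phi)
        verts += [(0.0, a, b * phi), (a, b * phi, 0.0), (b * phi, 0.0, a)]
    verts = np.array(verts)
    return verts / np.linalg.norm(verts, axis=1, keepdims=True)

def hypothesis_offsets(radius):
    """Return (43, 3) offsets: the center (row 0) plus a 42-point shell.

    Assumption: the shell is a once-subdivided icosahedron scaled to `radius`.
    """
    v = icosahedron_vertices()
    # Edges are the vertex pairs at the smallest distance (about 1.05 on the
    # unit sphere; the next-nearest pairs are about 1.70 apart).
    d = np.linalg.norm(v[:, None] - v[None], axis=-1)
    i, j = np.where(np.triu(d < 1.1, k=1))
    mid = v[i] + v[j]
    mid /= np.linalg.norm(mid, axis=1, keepdims=True)  # push midpoints to the sphere
    shell = np.concatenate([v, mid]) * radius
    return np.concatenate([np.zeros((1, 3)), shell])
```

For a vertex at position `p`, the 43 candidate locations would then be `p + hypothesis_offsets(radius)`, with `radius` chosen relative to the mesh scale (for example `0.02`, an assumed value).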
Optionally, a differentiable rendering-based silhouette refinement step may be applied at test time by minimizing a silhouette loss between the mesh projection and the observed silhouettes.
2. Multi-View Feature Sampling and Pooling
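Pixel2Mesh++ explicitly projects each hypothesis for every mesh vertex into each image using either provided or predicted camera intrinsics and extrinsics. For a hypothesis at 3D position $\mathbf{h}$ and image $i$ with intrinsics $K_i$ and extrinsics $(R_i, \mathbf{t}_i)$, the projection is: $\tilde{\mathbf{x}}_i = K_i (R_i \mathbf{h} + \mathbf{t}_i)$, where $\tilde{\mathbf{x}}_i = (\tilde{x}_i, \tilde{y}_i, \tilde{z}_i)^\top$ and the pixel location is $(u_i, v_i) = (\tilde{x}_i / \tilde{z}_i,\ \tilde{y}_i / \tilde{z}_i)$. At these locations, three levels of VGG-16 features ("conv1_2", "conv2_2", "conv3_3") are bilinearly sampled and concatenated, yielding a feature $\mathbf{f}_i$ of dimension $64 + 128 + 256 = 448$. For all input views, hypothesis features are aggregated by channel-wise statistics:
- $\mathbf{f}_{\text{mean}}$ = mean over views
- $\mathbf{f}_{\text{std}}$ = standard deviation over views
- $\mathbf{f}_{\max}$ = max over views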
These statistics, concatenated with the hypothesis location, form the per-hypothesis feature, yielding invariance to both view order and view count.
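A minimal NumPy sketch of this projection, sampling, and pooling step follows. It is a simplification under stated assumptions: each view contributes a single feature map at image resolution (the actual method concatenates three VGG-16 levels at different resolutions), a standard pinhole camera model is used, and all function names are hypothetical.

```python
import numpy as np

def project(points, K, R, t):
    """Pinhole projection of (N, 3) world points to (N, 2) pixel coordinates."""
    cam = points @ R.T + t           # world frame -> camera frame
    uvw = cam @ K.T                  # camera frame -> homogeneous pixel coords
    return uvw[:, :2] / uvw[:, 2:3]  # perspective divide

def bilinear_sample(fmap, uv):
    """Bilinearly sample an (H, W, C) feature map at (N, 2) coords (u = x, v = y)."""
    H, W, _ = fmap.shape
    u = np.clip(uv[:, 0], 0, W - 1)  # clamp projections that fall outside the image
    v = np.clip(uv[:, 1], 0, H - 1)
    u0 = np.floor(u).astype(int)
    v0 = np.floor(v).astype(int)
    u1 = np.minimum(u0 + 1, W - 1)
    v1 = np.minimum(v0 + 1, H - 1)
    du = (u - u0)[:, None]
    dv = (v - v0)[:, None]
    top = fmap[v0, u0] * (1 - du) + fmap[v0, u1] * du
    bot = fmap[v1, u0] * (1 - du) + fmap[v1, u1] * du
    return top * (1 - dv) + bot * dv

def hypothesis_features(points, fmaps, cameras):
    """Pool per-view features into one (N, 3*C + 3) descriptor per hypothesis.

    points:  (N, 3) hypothesis locations
    fmaps:   sequence of V feature maps, each (H, W, C)
    cameras: sequence of V (K, R, t) tuples
    """
    per_view = np.stack([
        bilinear_sample(fmap, project(points, K, R, t))
        for fmap, (K, R, t) in zip(fmaps, cameras)
    ])  # (V, N, C)
    # Mean, std, and max are symmetric in the view axis, so the result does not
    # depend on view order, and its size does not depend on the view count.
    stats = [per_view.mean(0), per_view.std(0), per_view.max(0)]
    return np.concatenate(stats + [points], axis=1)
```

With 448-dimensional per-view features, this layout would give a per-hypothesis descriptor of size 3 × 448 + 3 = 1347.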
3. Deformation Reasoning via Local Graph Convolutions
For each vertex, hypothesis features are assembled into a 43-node local graph (42 in the icosahedral shell plus center), with edges mirroring icosahedral structure and spokes to the center. The deformation reasoning module is a stack of six residual GCN layers where each layer performs: $f_p^{(l+1)} = w_0 f_p^{(l)} + \sum_{q \in \mathcal{N}(p)} w_1 f_q^{(l)}$, where $\mathcal{N}(p)$ denotes the neighbors of node $p$ and $w_0$, $w_1$ are learnable weights, with skip connections as in the original Pixel2Mesh. After six GraphConv blocks, a linear layer predicts a scalar score $c_k$ per hypothesis. Final scores are softmax-normalized: $s_k = \exp(c_k) / \sum_{j} \exp(c_j)$, yielding the new vertex location as: $\mathbf{v}' = \sum_{k} s_k \mathbf{h}_k$, where $\mathbf{h}_k$ is the position of hypothesis $k$. This mechanism constitutes a differentiable "soft-argmax" over the local hypothesis cloud, producing robust and stable mesh refinements.
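The graph convolution and the soft-argmax update can be illustrated with a short NumPy sketch. It omits the residual connections, nonlinearities, and training, and the function and argument names are hypothetical.

```python
import numpy as np

def graph_conv(feats, adj, w_self, w_nbr):
    """One graph convolution: own features plus summed neighbor features.

    feats: (N, C_in) node features
    adj:   (N, N) 0/1 adjacency matrix without self-loops
    w_self, w_nbr: (C_in, C_out) weight matrices
    """
    return feats @ w_self + adj @ feats @ w_nbr

def soft_argmax_update(hypotheses, scores):
    """Move a vertex to the softmax-weighted average of its hypotheses.

    hypotheses: (M, 3) candidate positions; scores: (M,) unnormalized scores.
    Returns the new position and the softmax weights.
    """
    z = scores - scores.max()             # subtract the max for numerical stability
    weights = np.exp(z) / np.exp(z).sum()
    return weights @ hypotheses, weights
```

Because the weights are non-negative and sum to one, the updated vertex always lies inside the convex hull of its hypotheses. Equal scores leave it at the centroid of the hypothesis cloud, while one dominant score moves it essentially onto that hypothesis.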
4. Optimization Objectives and Training Regimes
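The supervised loss function combines several terms: $\mathcal{L} = \lambda_{\text{CD}} \mathcal{L}_{\text{CD}} + \lambda_{\text{normal}} \mathcal{L}_{\text{normal}} + \lambda_{\text{edge}} \mathcal{L}_{\text{edge}} + \lambda_{\text{lap}} \mathcal{L}_{\text{lap}}$ with scalar weighting coefficients $\lambda$. Details:
- Chamfer Distance $\mathcal{L}_{\text{CD}}$: Computed between point clouds resampled uniformly from predicted and ground-truth mesh surfaces (4000 points per mesh).
- Normal Consistency $\mathcal{L}_{\text{normal}}$: Penalizes angular deviation between predicted and ground-truth face normals.
- Edge Length Regularization $\mathcal{L}_{\text{edge}}$: Constrains edge lengths to discourage degenerate faces.
- Laplacian Smoothness $\mathcal{L}_{\text{lap}}$: Encourages local surface smoothness via deviation of each vertex from the mean position of its neighbors.
- Silhouette Loss $\mathcal{L}_{\text{sil}}$ (optional, test time only): Minimizes the difference between the rendered mesh silhouette and the observed mask under differentiable rendering.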
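Three of these terms are simple enough to sketch in NumPy. The versions below are illustrative assumptions rather than the paper's exact definitions: the Chamfer term uses squared distances averaged per point (implementations differ on sum versus mean), the Laplacian term follows the description above literally, and all function names are hypothetical.

```python
import numpy as np

def chamfer_distance(P, Q):
    """Symmetric Chamfer distance between point sets P (N, 3) and Q (M, 3)."""
    # Pairwise squared distances via |p|^2 + |q|^2 - 2 p.q to limit memory use.
    d2 = (P ** 2).sum(1)[:, None] + (Q ** 2).sum(1)[None, :] - 2.0 * P @ Q.T
    d2 = np.maximum(d2, 0.0)  # guard against tiny negatives from round-off
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def edge_length_loss(verts, edges):
    """Mean squared edge length; edges is an (E, 2) integer index array."""
    diff = verts[edges[:, 0]] - verts[edges[:, 1]]
    return (diff ** 2).sum(1).mean()

def laplacian_loss(verts, adj):
    """Mean squared offset of each vertex from the mean of its neighbors.

    adj is an (N, N) 0/1 adjacency matrix without self-loops.
    """
    deg = np.maximum(adj.sum(1, keepdims=True), 1)  # avoid division by zero
    mean_nbr = adj @ verts / deg
    return ((verts - mean_nbr) ** 2).sum(1).mean()
```

A training objective would combine such terms as a weighted sum, with the weights treated as hyperparameters.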
Training is performed on ShapeNet Core v2 (13 classes, 50k models, train/test split from 3D-R2N2) using the Adam optimizer with batch size 1 and typically 3 input views per object per iteration. The MDN is trained for an additional 20 epochs, at its own learning rate, following the coarse shape initialization.
5. Empirical Performance and Robustness
Pixel2Mesh++ demonstrates state-of-the-art performance for multi-view mesh reconstruction on ShapeNet. Principal quantitative metrics include Chamfer Distance (CD), F-score at various thresholds (τ, 2τ), and volumetric IoU. Representative results (mean over 13 categories) are:
| Model | Chamfer (×10⁻³) | F(τ) | F(2τ) | IoU |
|---|---|---|---|---|
| MVP2M | 0.456 | 61.20 | 76.94 | 0.411 |
| Ours-P | 0.381 | 67.23 | 81.22 | 0.436 |
| Ours-D | 0.390 | 75.24 | 85.04 | 0.508 |
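For reference, an F-score of the kind reported in the table can be computed along the following lines. This is a sketch under assumptions: it thresholds Euclidean nearest-neighbor distances between sampled point sets (some implementations threshold squared distances instead), it returns a fraction rather than a percentage, and the function name is hypothetical.

```python
import numpy as np

def f_score(pred, gt, tau):
    """Harmonic mean of precision and recall at distance threshold tau.

    pred: (N, 3) points sampled from the predicted mesh
    gt:   (M, 3) points sampled from the ground-truth mesh
    """
    d = np.sqrt(((pred[:, None, :] - gt[None, :, :]) ** 2).sum(-1))  # (N, M)
    precision = (d.min(axis=1) < tau).mean()  # predicted points close to GT
    recall = (d.min(axis=0) < tau).mean()     # GT points close to the prediction
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```

Evaluating the same point sets at `tau` and `2 * tau` gives the two F-score columns. The looser threshold can only match as many or more points, which is why F(2τ) is the larger value in every row.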
Qualitative assessments confirm recovery of thin structures (chair legs, lamp necks), fidelity from arbitrary viewpoints, and resilience to poor mesh initializations (including noisy or Marching Cubes-based meshes). Test-time silhouette refinement yields additional performance gains (+1–2% F-score). Compared to neural renderer-based MVS methods, Pixel2Mesh++ achieves similar accuracy with far lower inference times (seconds versus minutes for 24-view IDR). The architecture accommodates plug-and-play camera pose estimation via a dedicated "Camera Pose Network," and is robust to camera pose errors: switching from ground-truth to estimated poses produces less than a 1% drop in F-score.
Cross-category generalization (e.g., holding out one class during MDN training) incurs only a modest loss, and an MDN trained on one class can improve F-score on others by +5–10%. Performance increases with the number of views at test time (e.g., F(τ): 64.5 with two views, rising to 68.3 with five).
6. Relationship to and Distinction from Related Methods
Pixel2Mesh++ advances upon:
- Pixel2Mesh (P2M): Original single-view mesh deformation via GCN, extended here to pool multi-view perception with explicit geometric reasoning [Wang et al. ECCV'18].
- MVP2M: Naive multi-view extension pooling features across images but lacking hypothesis sampling/statistical aggregation.
- MVDISN: Implicit surface prediction with mesh extraction via Marching Cubes for initialization.
- Neural Renderer Approaches (DVR, IDR): Pixel2Mesh++ achieves competitive reconstruction with dramatically lower computational cost and without requiring dense multi-view supervision or extensive rendering loops.
A significant distinguishing factor is the use of cross-view feature-statistics pooling at each local hypothesis, which integrates classical multi-view geometry concepts (e.g., spatial consistency, soft correspondence) in an end-to-end GCN framework for direct mesh output.
7. Generalization and Practical Significance
Pixel2Mesh++ demonstrates notable robustness to dataset domain shift (e.g., transfer from ShapeNet to ABC CAD models) and mesh initializations corrupted by noise. The architecture's modularity allows seamless integration of explicit or learned camera pose, variable input view counts, and test-time optimization stages. The iterative hypothesis sampling and statistical aggregation enable the network to correct both global misalignment and recover fine-grained geometric detail, crucial for broad deployment in reconstruction tasks with uncertain or partial information.
By combining interpretable hypothesis-based geometry with GCN-based learnable refinement, Pixel2Mesh++ provides a template for future methodologies targeting high-fidelity, robust, and efficient 3D mesh generation from limited-view visual data (Wen et al., 2022, Wen et al., 2019).