
Pixel2Mesh++: Multi-View Mesh Reconstruction

Updated 17 May 2026
  • The paper demonstrates that incorporating cross-view photometric consistency into iterative graph-based deformation yields robust and precise 3D shape recovery.
  • The method leverages multi-view feature aggregation, hypothesis sampling, and a residual GCN scoring network to iteratively refine mesh vertices.
  • Empirical results on ShapeNet show improved Chamfer Distance, F-score, and IoU, underscoring the technique’s efficiency and robustness compared to related approaches.

Pixel2Mesh++ denotes a class of multi-view 3D mesh reconstruction frameworks that extend the Pixel2Mesh approach of iterative graph-based mesh deformation by incorporating cross-view photometric consistency into the deformation reasoning process (Wen et al., 2022, Wen et al., 2019). Rather than regressing mesh geometry directly from color images, Pixel2Mesh++ infers a sequence of local deformations for each vertex, using multi-view perceptual features that are sampled with explicit camera geometry, taken either from known calibration or from a pose network. The architecture is distinguished by its hypothesis sampling mechanism, statistical pooling over multiple views, and graph convolutional scoring, which together enable robust, precise, and topology-agnostic 3D shape recovery from a few calibrated (or pose-estimated) images.

1. Architectural Framework

Pixel2Mesh++ is organized as a two-stage pipeline:

Stage 0: Coarse Mesh Initialization

Two initialization regimes are supported:

  • MVP2M (Ours-P): Multi-view extension of Pixel2Mesh, deforming an ellipsoid (2466 vertex mesh, fixed connectivity) per object, using features pooled from all views.
  • MVDISN (Ours-D): Multi-view Deep Implicit Surface Network with Marching Cubes extraction (variable topology), generating more diverse initial mesh structures.

Stage 1: Multi-View Deformation Network (MDN)

MDN operates iteratively, refining the mesh via per-vertex local graph convolutions. Each mesh vertex is surrounded by 42 offset hypotheses (from a scaled icosahedron) plus itself, forming a 43-node local graph. For each hypothesized location, early-layer VGG features are sampled per input image and aggregated via view-wise mean, max, and standard deviation statistics, producing hypothesis features invariant to view order or number. These features, concatenated with 3D coordinates, are input to a 6-block residual GCN scoring network that outputs unnormalized scores (one per hypothesis). A softmax over these scores produces a convex combination, repositioning the vertex toward the most photometrically consistent hypothesis. Typically, three MDN iterations are sufficient for convergence.
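The hypothesis set described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: it assumes the 42 offsets are the 12 vertices of a regular icosahedron plus its 30 edge midpoints pushed onto the same sphere (one common way to obtain 42 points), and the function name `icosahedron_hypotheses` and the default `radius` are hypothetical.

```python
import itertools
import numpy as np

def icosahedron_hypotheses(radius=0.02):
    """Return 43 offsets: the zero offset (the vertex itself) plus 42 shell points.

    Assumption: the shell is a once-subdivided icosahedron; `radius` is an
    illustrative default, not a value taken from the text above.
    """
    phi = (1.0 + np.sqrt(5.0)) / 2.0
    # 12 icosahedron vertices: cyclic permutations of (0, +-1, +-phi).
    base = []
    for a, b in itertools.product((-1.0, 1.0), repeat=2):
        base += [(0.0, a, b * phi), (a, b * phi, 0.0), (b * phi, 0.0, a)]
    base = np.array(base)
    # Edges are the vertex pairs at the minimum pairwise distance, which is 2.
    dist = np.linalg.norm(base[:, None] - base[None], axis=-1)
    i, j = np.where(np.triu(np.isclose(dist, 2.0)))
    midpoints = 0.5 * (base[i] + base[j])          # 30 edge midpoints
    shell = np.concatenate([base, midpoints])      # 42 points
    shell /= np.linalg.norm(shell, axis=1, keepdims=True)  # project to unit sphere
    return np.concatenate([np.zeros((1, 3)), radius * shell])

offsets = icosahedron_hypotheses()
vertex = np.array([0.1, -0.2, 0.3])
hypotheses = vertex + offsets   # candidate positions for one mesh vertex
print(hypotheses.shape)
```

Adding the same offsets to every mesh vertex yields one 43-node local graph per vertex, which is the input to the scoring network.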

Optionally, a differentiable rendering-based silhouette refinement step may be applied at test time by minimizing an $\ell_2$ silhouette loss between the mesh projection and observed silhouettes.

2. Multi-View Feature Sampling and Pooling

Pixel2Mesh++ explicitly projects each hypothesis $h \in \mathbb{R}^3$ for every mesh vertex into each image using either provided or predicted camera intrinsics and extrinsics. The projection for image $m$ is:

$$[u_m, v_m]^\top = \left[\frac{f_x x_m}{z_m} + c_x,\ \frac{f_y y_m}{z_m} + c_y\right]^\top$$

where $[x_m, y_m, z_m]^\top = R_m h + t_m$. At these locations, three levels of VGG-16 features ("conv1_2", "conv2_2", "conv3_3") are bilinearly sampled and concatenated, yielding a feature of dimension $\approx 339$. For all $M$ input views, hypothesis features are aggregated by channel-wise statistics:

  • $\mu$ = mean
  • $\sigma$ = standard deviation
  • $M^{\max}$ = max

These statistics, concatenated with the hypothesis location, form the per-hypothesis feature, yielding invariance to both view order and view count.
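As a rough sketch of the projection and pooling steps, the NumPy code below projects hypotheses with a pinhole model, samples a feature map bilinearly, and pools over views. The function names, the border clamping, and the channel count of 112 are assumptions made for illustration; the only link to the text is the arithmetic observation that 3 × 112 pooled channels plus 3 coordinates gives 339.

```python
import numpy as np

def project(h, R, t, fx, fy, cx, cy):
    """Project points h (N, 3) into one image: [x, y, z] = R h + t, then pinhole."""
    cam = h @ R.T + t
    u = fx * cam[:, 0] / cam[:, 2] + cx
    v = fy * cam[:, 1] / cam[:, 2] + cy
    return np.stack([u, v], axis=1)

def bilinear_sample(feat, uv):
    """Bilinearly sample feat (H, W, C) at pixel coordinates uv (N, 2).

    Coordinates outside the image are clamped to the border (an assumption).
    """
    H, W, _ = feat.shape
    u = np.clip(uv[:, 0], 0, W - 1)
    v = np.clip(uv[:, 1], 0, H - 1)
    u0, v0 = np.floor(u).astype(int), np.floor(v).astype(int)
    u1, v1 = np.minimum(u0 + 1, W - 1), np.minimum(v0 + 1, H - 1)
    a, b = (u - u0)[:, None], (v - v0)[:, None]
    return ((1 - a) * (1 - b) * feat[v0, u0] + a * (1 - b) * feat[v0, u1]
            + (1 - a) * b * feat[v1, u0] + a * b * feat[v1, u1])

def pool_views(per_view):
    """Pool per-view features (M, N, C) into (N, 3C): mean, max, std over views."""
    return np.concatenate(
        [per_view.mean(axis=0), per_view.max(axis=0), per_view.std(axis=0)], axis=1)

# Toy usage: 3 views with random feature maps and identity poses (all hypothetical).
rng = np.random.default_rng(0)
h = rng.normal(size=(43, 3)) * 0.02 + np.array([0.0, 0.0, 2.0])
per_view = []
for _ in range(3):
    feat = rng.random((56, 56, 112))
    uv = project(h, np.eye(3), np.zeros(3), 60.0, 60.0, 28.0, 28.0)
    per_view.append(bilinear_sample(feat, uv))
pooled = pool_views(np.stack(per_view))
hypothesis_feature = np.concatenate([pooled, h], axis=1)  # statistics + 3D location
print(hypothesis_feature.shape)
```

Because mean, max, and standard deviation are symmetric functions of the view axis, shuffling the views or changing their number leaves the output dimension unchanged.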

3. Deformation Reasoning via Local Graph Convolutions

For each vertex, hypothesis features are assembled into a 43-node local graph (42 in the icosahedral shell plus center), with edges mirroring icosahedral structure and spokes to the center. The deformation reasoning module is a stack of six residual GCN layers where each layer performs:

$$f_p^{l+1} = w_0 f_p^{l} + \sum_{q \in \mathcal{N}(p)} w_1 f_q^{l}$$

with skip connections as in the original Pixel2Mesh. Here $f_p^{l}$ is the feature of node $p$ at layer $l$, $\mathcal{N}(p)$ is its neighbor set, and $w_0$, $w_1$ are learned weight matrices. After six GraphConv blocks, a linear layer predicts a scalar score $c_i$ per hypothesis. Final scores are softmax-normalized:

$$s_i = \frac{e^{c_i}}{\sum_{j=1}^{43} e^{c_j}}$$

yielding the new vertex location as:

$$v' = \sum_{i=1}^{43} s_i h_i$$

This mechanism constitutes a differentiable "soft-argmax" over the local hypothesis cloud, producing robust and stable mesh refinements.
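The two core operations of this module, a graph-convolution layer and the softmax-weighted vertex update, can be illustrated as follows. This is a minimal NumPy sketch with hypothetical function names; it omits nonlinearities, skip connections, and training, and the weights are plain arrays supplied by the caller.

```python
import numpy as np

def graph_conv(f, adj, w_self, w_nbr):
    """One graph-convolution layer on node features f (N, C_in).

    adj is a 0/1 adjacency matrix (N, N) without self-loops; each node
    combines its own transformed feature with the sum over its neighbors.
    """
    return f @ w_self + adj @ f @ w_nbr

def soft_argmax(scores, hypotheses):
    """Softmax-weighted average of hypothesis positions (N, 3) given scores (N,)."""
    z = scores - scores.max()            # shift for numerical stability
    weights = np.exp(z) / np.exp(z).sum()
    return weights @ hypotheses

# Toy usage with three hypotheses (values are illustrative only).
H = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(soft_argmax(np.zeros(3), H))                  # equal scores: centroid of H
print(soft_argmax(np.array([0.0, 50.0, 0.0]), H))   # dominant score: close to H[1]
```

Since the softmax weights are non-negative and sum to one, the updated vertex always lies inside the convex hull of its hypotheses, which bounds the step size of each MDN iteration.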

4. Optimization Objectives and Training Regimes

The supervised loss function combines several terms:

$$\mathcal{L} = \mathcal{L}_{\mathrm{CD}} + \lambda_1 \mathcal{L}_{\mathrm{normal}} + \lambda_2 \mathcal{L}_{\mathrm{edge}} + \lambda_3 \mathcal{L}_{\mathrm{lap}}$$

with scalar weighting coefficients $\lambda_1, \lambda_2, \lambda_3$. Details:

  • Chamfer Distance $\mathcal{L}_{\mathrm{CD}}$: Computed between point clouds resampled uniformly from predicted and ground-truth mesh surfaces (4000 points per mesh).
  • Normal Consistency $\mathcal{L}_{\mathrm{normal}}$: Penalizes angular deviation between predicted and ground-truth face normals.
  • Edge Length Regularization $\mathcal{L}_{\mathrm{edge}}$: Constrains edge lengths to discourage degenerate faces.
  • Laplacian Smoothness $\mathcal{L}_{\mathrm{lap}}$: Encourages local surface smoothness by penalizing the deviation of each vertex from the mean position of its neighbors.
  • Silhouette Loss $\mathcal{L}_{\mathrm{sil}}$ (optional at test): Minimizes the $\ell_2$ difference between mesh silhouette and observed mask under differentiable rendering.
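To make the Chamfer term concrete, the sketch below samples points uniformly from a triangle mesh and evaluates a symmetric squared-distance Chamfer loss. It uses the standard area-weighted barycentric sampling trick, which may differ in detail from the authors' resampling scheme; the function names and the use of means rather than sums are assumptions.

```python
import numpy as np

def sample_surface(verts, faces, n, rng):
    """Sample n points uniformly on a triangle mesh (verts (V, 3), faces (F, 3))."""
    tri = verts[faces]                                        # (F, 3, 3)
    cross = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])
    area = 0.5 * np.linalg.norm(cross, axis=1)
    idx = rng.choice(len(faces), size=n, p=area / area.sum())  # area-weighted faces
    r1 = np.sqrt(rng.random((n, 1)))                           # sqrt gives uniform density
    r2 = rng.random((n, 1))
    return ((1 - r1) * tri[idx, 0] + r1 * (1 - r2) * tri[idx, 1]
            + r1 * r2 * tri[idx, 2])

def chamfer_distance(p, q):
    """Symmetric Chamfer distance with squared Euclidean distances (mean-reduced)."""
    d2 = ((p[:, None, :] - q[None, :, :]) ** 2).sum(axis=-1)  # (|p|, |q|)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

# Toy usage: a unit square in the z = 0 plane made of two triangles.
verts = np.array([[0.0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]])
faces = np.array([[0, 1, 2], [0, 2, 3]])
rng = np.random.default_rng(0)
pts = sample_surface(verts, faces, 500, rng)
shifted = pts + np.array([0.0, 0.0, 0.1])
print(chamfer_distance(pts, pts), chamfer_distance(pts, shifted))
```

The dense pairwise matrix is fine for a few thousand points; larger clouds would normally use a k-d tree or a GPU nearest-neighbor kernel instead.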

Training is performed on ShapeNet Core v2 (13 classes, 50k models, train/test split from 3D-R2N2), using the Adam optimizer with batch size 1 and typically 3 input views per object per iteration. The MDN is trained for an additional 20 epochs after the coarse shape initialization.

5. Empirical Performance and Robustness

Pixel2Mesh++ demonstrates state-of-the-art performance for multi-view mesh reconstruction on ShapeNet. Principal quantitative metrics include Chamfer Distance, $F$-score at thresholds $\tau$ and $2\tau$, and volumetric IoU. Representative results (mean over 13 categories) are:

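| Model  | Chamfer (×10⁻³) | F(τ)  | F(2τ) | IoU   |
|--------|-----------------|-------|-------|-------|
| MVP2M  | 0.456           | 61.20 | 76.94 | 0.411 |
| Ours-P | 0.381           | 67.23 | 81.22 | 0.436 |
| Ours-D | 0.390           | 75.24 | 85.04 | 0.508 |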
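For readers unfamiliar with the F-score columns, the following NumPy sketch shows one common way to compute the metric between two point clouds: precision is the share of predicted points within $\tau$ of the ground truth, recall is the converse, and the score is their harmonic mean. The function name is hypothetical, and whether $\tau$ is compared against the distance or the squared distance is a convention not fixed by the text; this sketch uses the plain Euclidean distance.

```python
import numpy as np

def f_score(pred, gt, tau):
    """F-score (in percent) between point clouds pred (N, 3) and gt (K, 3)."""
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)  # (N, K)
    precision = (d.min(axis=1) < tau).mean()   # predicted points near the GT
    recall = (d.min(axis=0) < tau).mean()      # GT points near the prediction
    if precision + recall == 0:
        return 0.0
    return 100.0 * 2 * precision * recall / (precision + recall)

# Toy usage: one of two predicted points matches the single GT point.
pred = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
gt = np.array([[0.0, 0.0, 0.0]])
print(f_score(pred, gt, tau=0.01))       # precision 0.5, recall 1.0
print(f_score(pred, gt, tau=0.01 * 2))   # same sets at the looser threshold
```

Unlike Chamfer Distance, which can be dominated by a few distant outliers, the F-score counts how many points fall within tolerance, which is why the two metrics can rank methods differently (as with Ours-P and Ours-D above).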

Qualitative assessments confirm recovery of thin structures (chair legs, lamp necks), fidelity from arbitrary viewpoints, and resilience to poor mesh initializations (including noisy or Marching Cubes-based meshes). Test-time silhouette refinement yields additional performance gains (+1–2% $F$-score). Compared to neural renderer-based MVS methods, Pixel2Mesh++ achieves similar accuracy with far lower inference times (seconds versus minutes for 24-view IDR). The architecture accommodates plug-and-play camera pose estimation via a dedicated "Camera Pose Network," and is robust to camera pose errors: switching from ground-truth to estimated poses produces less than a 1% drop in $F$-score.

Cross-category generalization (e.g., holding out one class during MDN training) incurs only modest loss, and an MDN trained on one class can improve $F$-score on others by +5–10%. Performance increases with the number of views at test time (e.g., $F(\tau)$: 64.5 with two views $\rightarrow$ 68.3 with five).

6. Relation to Prior Work

Pixel2Mesh++ advances upon:

  • Pixel2Mesh (P2M): Original single-view mesh deformation via GCN, extended here to pool multi-view perception with explicit geometric reasoning [Wang et al. ECCV'18].
  • MVP2M: Naive multi-view extension pooling features across images but lacking hypothesis sampling/statistical aggregation.
  • MVDISN: Implicit surface prediction with mesh extraction via Marching Cubes for initialization.
  • Neural Renderer Approaches (DVR, IDR): Pixel2Mesh++ achieves competitive reconstruction with dramatically lower computational cost and without requiring dense multi-view supervision or extensive rendering loops.

A significant distinguishing factor is the use of cross-view feature-statistics pooling at each local hypothesis, which integrates classical multi-view geometry concepts (e.g., spatial consistency, soft correspondence) in an end-to-end GCN framework for direct mesh output.

7. Generalization and Practical Significance

Pixel2Mesh++ demonstrates notable robustness to dataset domain shift (e.g., transfer from ShapeNet to ABC CAD models) and mesh initializations corrupted by noise. The architecture's modularity allows seamless integration of explicit or learned camera pose, variable input view counts, and test-time optimization stages. The iterative hypothesis sampling and statistical aggregation enable the network to correct both global misalignment and recover fine-grained geometric detail, crucial for broad deployment in reconstruction tasks with uncertain or partial information.

By combining interpretable hypothesis-based geometry with GCN-based learnable refinement, Pixel2Mesh++ provides a template for future methodologies targeting high-fidelity, robust, and efficient 3D mesh generation from limited-view visual data (Wen et al., 2022, Wen et al., 2019).
