MeshPad: Interactive 3D Mesh Editing
- MeshPad is a generative framework for interactive 3D mesh creation, using user-provided 2D sketches to guide localized deletion and addition operations.
- It combines an autoregressive triangle-sequence Transformer with a vertex-aligned speculator that accelerates token generation, yielding consistent, fine-grained mesh edits at interactive speeds.
- Empirical results show MeshPad outperforms previous methods on metrics like Chamfer Distance, FID, and LPIPS, delivering superior quality and interactive runtimes.
MeshPad is a generative framework for interactive 3D mesh creation and editing, directly conditioned on user-provided 2D sketches. By decomposing editing into deletion and addition operations, MeshPad achieves fine-grained, region-specific manipulation of artist-designed triangle meshes, enabling construction of complex 3D forms through an iterative sketch-based interface. The system combines an autoregressive triangle-sequence Transformer with a vertex-aligned speculative prediction module, leading to consistent edits and interactive runtimes while outperforming previous sketch-to-mesh methods in both quantitative and perceptual evaluations (Li et al., 3 Mar 2025).
1. Motivation and Problem Setting
Traditional generative mesh synthesis techniques, including MeshGPT and MeshAnything, are capable of producing plausible 3D models from input prompts. However, these methods lack support for localized, region-specific editing; they typically require full mesh regeneration for adjustments, disrupting workflow continuity and discarding unedited details. Artistic workflows, in contrast, often depend on iterative manipulation—sketching, modifying, and refining certain regions without altering the remainder. MeshPad addresses this gap by enabling interactive, sketch-conditioned edits such that only specified mesh regions are changed while the rest of the structure is preserved. Sketches serve as a highly expressive and familiar modality for conveying 3D intent, permitting precise yet intuitive content manipulation.
2. Mesh Representation and Tokenization
MeshPad represents a triangular mesh as an ordered sequence of faces $\mathcal{M} = (f_1, \dots, f_N)$, each face a triple of 3D vertices. To facilitate generative modeling, a tokenizer $T$ encodes $\mathcal{M}$ into a sequence of discrete tokens $S = T(\mathcal{M}) = (s_1, \dots, s_L)$, where each vertex coordinate ($x$, $y$, $z$) is quantized to a vocabulary index and structural control tokens (e.g., sequence delimiters) delimit triangle fans or connectivity groups. The inverse mapping $T^{-1}$ reconstructs the mesh from its token sequence. This discrete, autoregressive representation enables sequence-based Transformer architectures to operate on 3D geometric content.
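The quantize-and-flatten scheme above can be sketched as follows. This is an illustrative round-trip, not the paper's exact vocabulary: the bin count, control-token ids (`bos`, `eos`), and coordinate offset are assumptions.

```python
import numpy as np

def tokenize_mesh(vertices, faces, num_bins=128, bos=0, eos=1, offset=2):
    """Quantize vertex coordinates in [-1, 1] to discrete bins and flatten
    faces into a token sequence (illustrative layout, not MeshPad's exact
    vocabulary)."""
    # Map each coordinate from [-1, 1] to an integer bin index.
    quantized = np.clip(
        ((vertices + 1.0) / 2.0 * (num_bins - 1)).round().astype(int),
        0, num_bins - 1)
    tokens = [bos]
    for face in faces:                  # faces in a fixed canonical order
        for v in face:                  # three vertices per triangle
            for coord in quantized[v]:  # (x, y, z) -> three tokens each
                tokens.append(int(coord) + offset)  # ids 0..1 reserved for control tokens
    tokens.append(eos)
    return tokens

def detokenize(tokens, num_bins=128, offset=2):
    """Inverse mapping: token sequence back to a triangle soup."""
    coords = [t - offset for t in tokens[1:-1]]
    verts = (np.array(coords, dtype=float).reshape(-1, 3, 3)
             / (num_bins - 1)) * 2.0 - 1.0
    return verts  # shape (num_faces, 3 vertices, 3 coords)
```

Quantization bounds the reconstruction error to half a bin width, which is why a fixed vocabulary trades geometric precision for a tractable token space.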
3. Generative Transformer Architecture
MeshPad employs an Open Pre-trained Transformer (OPT) as its core backbone. The input sequence comprises (a) sketch tokens, extracted by a frozen RADIO image encoder; (b) tokens $S_{\text{keep}}$ representing the unedited mesh region; and (c) previously generated tokens $S_{\text{add}}$ for the newly added mesh region:
- Addition (autoregressive): $p(S_{\text{add}} \mid S_{\text{sketch}}, S_{\text{keep}}) = \prod_{t} p(s_t \mid s_{<t}, S_{\text{sketch}}, S_{\text{keep}})$, producing an output sequence that is integrated with the preserved mesh to yield the updated mesh $\mathcal{M}'$.
- Deletion (classification): OPT, equipped with bi-directional attention, labels each vertex token for retention or removal. A small classification head, pooling the three coordinate-token embeddings per vertex, outputs a removal probability $p_i \in [0, 1]$ for each vertex $v_i$.
This dual-mode model supports both sequence generation for mesh addition and classification for localized deletion.
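The deletion pathway can be sketched as a pool-then-score head over the backbone's per-token features. The shapes, mean pooling, and single linear layer below are simplifying assumptions; the paper's head is a small MLP over bi-directionally attended OPT features.

```python
import numpy as np

def deletion_logits(hidden_states, w, b):
    """Pool the three coordinate-token embeddings of each vertex and score
    it for removal (hypothetical shapes and a single linear scoring layer).

    hidden_states: (num_vertices * 3, d) transformer outputs, one row per
                   coordinate token, grouped x, y, z per vertex.
    w, b:          weights (d,) and bias of the scoring layer.
    """
    d = hidden_states.shape[1]
    per_vertex = hidden_states.reshape(-1, 3, d).mean(axis=1)  # pool x/y/z
    logits = per_vertex @ w + b                                # (num_vertices,)
    return 1.0 / (1.0 + np.exp(-logits))                       # removal probability
```

Pooling the three coordinate tokens gives the classifier one decision per vertex, which is what lets deletion run in a single bidirectional pass rather than token by token.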
4. Edit Operations and Workflow
MeshPad interprets sketch edits as a partition of sketch strokes and mesh regions into those to be kept ($C_{\text{keep}}$, $M_{\text{keep}}$) and those to be added or removed ($C_{\text{edit}}$, $M_{\text{edit}}$). The deletion operation predicts removable vertices by thresholding the per-vertex probabilities $p_i$, then discards those vertices together with their incident faces to obtain the preserved mesh $M_{\text{keep}}$. Supervision utilizes a binary cross-entropy loss per vertex. The addition operation, conditioned on the preserved mesh region and the current sketch, auto-regressively generates token sequences for the new geometry, with cross-entropy loss against ground-truth tokens.
This "delete-then-add" paradigm enables iteration and localized refinement, in alignment with artistic workflows.
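One delete-then-add step can be sketched as below. The mesh layout (vertex array plus index triples) and the `generate_new` callback standing in for the addition model are illustrative assumptions; the real system additionally conditions on the sketch and emits full connectivity between old and new geometry.

```python
import numpy as np

def delete_then_add(vertices, faces, p_remove, generate_new, threshold=0.5):
    """Sketch of MeshPad's two-step edit (data layout and callback are
    assumptions). vertices: (V, 3); faces: list of index triples;
    p_remove: per-vertex removal probabilities from the deletion head;
    generate_new: callable returning (new_vertices, new_faces), with face
    indices into its own new-vertex array."""
    keep = p_remove < threshold                       # vertices to preserve
    remap = {old: new for new, old in enumerate(np.flatnonzero(keep))}
    kept_verts = vertices[keep]
    # A face survives only if all three of its vertices are kept.
    kept_faces = [tuple(remap[i] for i in f) for f in faces
                  if all(keep[i] for i in f)]
    # Addition: the generator sees only the preserved region (sketch
    # conditioning omitted here) and produces geometry to merge back in.
    new_verts, new_faces = generate_new(kept_verts, kept_faces)
    merged_verts = np.vstack([kept_verts, new_verts]) if len(new_verts) else kept_verts
    offset = len(kept_verts)
    merged_faces = kept_faces + [tuple(i + offset for i in f) for f in new_faces]
    return merged_verts, merged_faces
```

Because deletion only filters and addition only appends, unedited faces pass through both steps untouched, which is the property that keeps iterative edits local.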
5. Vertex-Aligned Speculative Prediction
To accelerate the autoregressive decoding of mesh tokens, MeshPad introduces a vertex-aligned "speculator" head, a multilayer perceptron applied at vertex token positions. Given the backbone's hidden state $h$ at an $x$-coordinate token, the speculator predicts the corresponding $y$ and $z$ tokens, $(\hat{s}_y, \hat{s}_z) = \mathrm{MLP}(h)$. This enables decoding in units of vertices (3 tokens per vertex) rather than single tokens, permitting speculative generation of all three coordinates in a single backbone pass. Joint training ensures the transformer's hidden states adapt to this structure. Empirically, the approach substantially increases token generation throughput on an A100 GPU without quality degradation.
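The decoding loop can be sketched as below. Both callable interfaces are hypothetical stand-ins: `backbone_step` runs one full autoregressive transformer step, while `speculator` is the cheap MLP head; the real system also feeds the speculated tokens back so the backbone's states remain consistent.

```python
def decode_vertex_aligned(backbone_step, speculator, prefix, num_vertices):
    """Sketch of vertex-aligned speculative decoding (interfaces are
    hypothetical). backbone_step(tokens) -> (next_token, hidden) runs one
    autoregressive step of the backbone; speculator(hidden) -> (y, z)
    predicts the two remaining coordinate tokens of the current vertex
    from the hidden state at its x token."""
    tokens = list(prefix)
    for _ in range(num_vertices):
        x_tok, hidden = backbone_step(tokens)   # one full transformer pass
        y_tok, z_tok = speculator(hidden)       # cheap MLP: no extra passes
        tokens += [x_tok, y_tok, z_tok]         # emit a whole vertex at once
    return tokens
```

One expensive backbone pass per vertex instead of per token is where the roughly threefold reduction in sequential steps comes from.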
6. Training Objectives and Evaluation Metrics
MeshPad is trained in a self-supervised manner, using:
- Addition head: standard cross-entropy over the token vocabulary for next-token prediction.
- Deletion head: binary cross-entropy per vertex, $\mathcal{L}_{\text{del}} = -\frac{1}{V}\sum_{i=1}^{V} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$, where $y_i \in \{0, 1\}$ marks whether vertex $v_i$ should be removed.
- Speculator: cross-entropy for the two predicted tokens; the OPT head is used for control tokens and $x$-coordinates.
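The two loss terms can be sketched numerically as follows; the reduction (mean) and numerical-stability details are implementation assumptions.

```python
import numpy as np

def bce_per_vertex(p, y):
    """Deletion loss: binary cross-entropy over per-vertex removal
    probabilities p against binary labels y (both shape (V,))."""
    eps = 1e-9  # guard against log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def next_token_ce(logits, targets):
    """Addition loss: cross-entropy of next-token logits (L, vocab)
    against ground-truth token ids (L,)."""
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(logp[np.arange(len(targets)), targets])
```

Both are standard maximum-likelihood objectives; what differs is only the prediction unit (a binary label per vertex versus a vocabulary index per token).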
Evaluation leverages multiple metrics:
- Chamfer Distance: $\mathrm{CD}(P, Q) = \frac{1}{|P|} \sum_{p \in P} \min_{q \in Q} \lVert p - q \rVert_2^2 + \frac{1}{|Q|} \sum_{q \in Q} \min_{p \in P} \lVert p - q \rVert_2^2$, computed between point sets sampled from generated and ground-truth surfaces.
- FID: Fréchet Inception Distance between renderings of generated and reference meshes.
- LPIPS and CLIP similarity: perceptual distance and semantic agreement between renderings and the conditioning sketch.
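The symmetric Chamfer Distance can be computed directly from its definition; the brute-force pairwise version below is fine for small point sets (real evaluation pipelines typically use a KD-tree for the nearest-neighbor queries).

```python
import numpy as np

def chamfer_distance(P, Q):
    """Symmetric Chamfer Distance between point sets P (N, 3) and Q (M, 3):
    mean squared nearest-neighbor distance in both directions."""
    # (N, M) matrix of pairwise squared distances via broadcasting.
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```

The two directional terms matter: omitting either one lets a degenerate mesh score well by covering only part of the target surface.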
7. Empirical Results and Analysis
On the ShapeNet test set, MeshPad demonstrates substantial improvements over prior art:
- Chamfer Distance (×10⁻³): LAS 22.46, SENS 8.95, MeshPad 6.20, a reduction of roughly 31% versus the best prior method (SENS).
- FID: LAS 47.1, SENS 81.9, MeshPad 9.4.
- LPIPS/CLIP: MeshPad achieves the lowest LPIPS and highest CLIP similarity to the input.
User studies with 35 participants rate MeshPad at 4.3/5 for mesh quality and 4.2–4.3/5 for edit consistency, exceeding baseline methods (2.7–3.5). Binary preference tests select MeshPad over LAS/SENS (and MeshAnythingV2 post-processing variants) 83–96% of the time, both for generation and editing.
8. Strengths, Limitations, and Future Prospects
MeshPad’s two-stage editing operation (deletion followed by addition) preserves unedited mesh regions and supports incremental, part-wise construction. The vertex-aligned speculator yields interactive runtimes of seconds per edit, compatible with iterative creative workflows. Sketch-based conditioning provides precise, accessible user control over mesh geometry.
Key limitations include a Transformer-imposed cap on sequence length, restricting mesh complexity to approximately 768 faces and precluding extremely large-scale scene synthesis. The reliance on a fixed token vocabulary and quantization may reduce capacity for ultra-fine geometric detail. Prospective directions include hierarchical or sparse mesh representations (e.g., tile-based meshes) to address scalability while preserving interactivity (Li et al., 3 Mar 2025).