ShapeUP: Scalable 3D Editing Framework
- ShapeUP is a scalable, image-conditioned 3D editing framework that formulates mesh manipulation as supervised latent-to-latent translation.
- It employs a diffusion transformer backbone with low-rank adapters to efficiently fuse geometry and image tokens for robust visual control.
- ShapeUP delivers superior performance and consistency in 3D edits, validated by improved SSIM, LPIPS, and CLIP-I metrics in benchmark studies.
ShapeUP is a scalable, image-conditioned 3D editing framework that formulates precise 3D manipulation as supervised latent-to-latent translation within a native 3D representation. Unlike optimization-based techniques or multi-view 2D propagation, ShapeUP offers a direct, efficient approach built atop a pretrained 3D foundation model, enabling robust visual control, geometric consistency, and inference-time scalability for 3D content creation (Gat et al., 5 Feb 2026).
1. Problem Formulation and Core Approach
ShapeUP addresses the challenge of image-conditioned 3D editing, where the objective is to modify a source 3D mesh $M_s$ to match a desired edit depicted in a single-view 2D image $I$, producing a target mesh $M_t$ that reflects the intended change while preserving the original asset's geometry and texture. The task is cast as a latent-to-latent mapping: both source and target meshes are encoded into latent representations using a pretrained 3D variational autoencoder (VAE) encoder, $z_s = \mathcal{E}(M_s)$ and $z_t = \mathcal{E}(M_t)$, and the edit image is embedded via an image encoder, $c = \mathcal{E}_{\text{img}}(I)$. The editing function, denoted $f_\theta$, maps the concatenation of the source geometry latents $z_s$ and the image latents $c$ to a predicted target geometry latent $\hat{z}_t$, which is then decoded by the 3D VAE decoder $\mathcal{D}$ to yield the edited mesh. This formulation allows image-as-prompt control, supports both local and global edits, and achieves mask-free, implicit spatial localization (Gat et al., 5 Feb 2026).
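The latent-to-latent formulation can be sketched schematically as follows. This is a minimal illustration of the mapping $f_\theta(z_s, c) \to \hat{z}_t$; the module, dimensions, and token counts are hypothetical stand-ins, not the released ShapeUP code.

```python
import torch
import torch.nn as nn

class LatentEditor(nn.Module):
    """Toy stand-in for the editing function f: (z_s, c) -> z_t_hat."""
    def __init__(self, geo_dim=64, img_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(geo_dim + img_dim, 256),
            nn.GELU(),
            nn.Linear(256, geo_dim),
        )

    def forward(self, z_src, c_img):
        # Concatenate source geometry latents with image latents per token,
        # then predict the target geometry latent.
        return self.net(torch.cat([z_src, c_img], dim=-1))

# z_src: tokens from the pretrained 3D VAE encoder E(M_s)
# c_img: tokens from the image encoder E_img(I)
z_src = torch.randn(1, 512, 64)   # (batch, tokens, dim)
c_img = torch.randn(1, 512, 64)
z_tgt_hat = LatentEditor()(z_src, c_img)  # predicted target latent, shape (1, 512, 64)
```

In the actual system the decoded mesh would then be obtained by passing `z_tgt_hat` through the 3D VAE decoder.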
2. Model Architecture: Diffusion Transformer Backbone
The ShapeUP editing module is instantiated as a 3D Diffusion Transformer (DiT) built on the Step1X-3D architecture. This backbone processes sets of latent tokens representing geometry ($z$) and image conditions ($c$) using alternating double-stream (separate cross-attention for geometry and conditions) and single-stream (joint processing) transformer blocks. Adaptation to the editing task leverages low-rank adapters (LoRA), which are injected into the cross-attention weights of both block types, enabling efficient finetuning while retaining the pretrained generator's capabilities. Source shape tokens and image tokens are concatenated and supplied as conditioning at every attention layer, providing explicit 3D supervision throughout the editing process (Gat et al., 5 Feb 2026).
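The low-rank adaptation step can be illustrated with a generic LoRA wrapper around an attention projection. This is a sketch of the general LoRA technique as described above, not ShapeUP's actual adapter code; the rank and scaling values are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base projection W plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep pretrained weights frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # y = base(x) + scale * x A^T B^T; B starts at zero, so the
        # adapter is a no-op at initialization.
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

# Inject into a (stand-in) cross-attention projection of a pretrained block:
attn_q = LoRALinear(nn.Linear(64, 64), rank=8)
x = torch.randn(2, 16, 64)
out = attn_q(x)
```

Because `B` is initialized to zeros, finetuning starts exactly from the pretrained generator's behavior and only the small `A`/`B` matrices receive gradients.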
3. Training Objectives and Loss Functions
ShapeUP optimizes a composite loss that combines geometric and generative criteria:
- Diffusion Loss ($\mathcal{L}_{\text{diff}}$): Supervises the DiT denoising process by matching the predicted noise to the noise used to corrupt the target latent under the standard forward diffusion process.
- Latent Reconstruction Loss ($\mathcal{L}_{\text{lat}}$): Ensures that the predicted latent representation matches the actual target mesh latent.
- Mesh Reconstruction Loss ($\mathcal{L}_{\text{mesh}}$): (Optional) Enforces consistency between the decoded mesh and the ground-truth target mesh.
The total geometry loss is expressed as

$$\mathcal{L}_{\text{geo}} = \lambda_{\text{diff}}\,\mathcal{L}_{\text{diff}} + \lambda_{\text{lat}}\,\mathcal{L}_{\text{lat}} + \lambda_{\text{mesh}}\,\mathcal{L}_{\text{mesh}},$$

with empirically tuned weights $\lambda_{\text{diff}}$, $\lambda_{\text{lat}}$, and $\lambda_{\text{mesh}}$. Texture synthesis is handled in a separate, analogous stage using the same adapter losses as the base texture model (Gat et al., 5 Feb 2026).
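A minimal sketch of such a composite objective follows; the function signature, MSE choice for each term, and the default weights are illustrative placeholders, not the paper's exact formulation or values.

```python
import torch
import torch.nn.functional as F

def geometry_loss(noise_pred, noise_true, z_pred, z_target,
                  mesh_pred=None, mesh_target=None,
                  w_diff=1.0, w_lat=1.0, w_mesh=1.0):
    """Weighted sum of diffusion, latent, and (optional) mesh terms."""
    loss = w_diff * F.mse_loss(noise_pred, noise_true)       # L_diff
    loss = loss + w_lat * F.mse_loss(z_pred, z_target)       # L_lat
    if mesh_pred is not None:                                # optional L_mesh
        loss = loss + w_mesh * F.mse_loss(mesh_pred, mesh_target)
    return loss

loss = geometry_loss(torch.randn(4, 8), torch.randn(4, 8),
                     torch.randn(4, 8), torch.randn(4, 8))
```

Keeping the mesh term optional matches the description above: latent-space supervision is always applied, while decoded-mesh consistency can be toggled on when the decoder is in the training loop.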
4. Data Generation and Supervision
Supervised training employs a synthetic triplet dataset constructed from Objaverse assets. Each triplet consists of a source mesh, an edited 2D image, and a corresponding target mesh. Two principal protocols are used:
- Parts Dataset: For ~6870 shapes, semantic parts are systematically added or removed to generate mesh variants.
- Distant Frames in Motion (DFM): For 560 animated assets, temporally distant keyframes are extracted to capture significant global pose or deformation changes. Each variant is rendered from a random viewpoint as the edited image. DFM samples are upsampled by 3× during training to increase the prevalence of global edits.
All meshes are converted to 64k-point point clouds for VAE training. This synthetic design ensures coverage of both localized and global transformations (Gat et al., 5 Feb 2026).
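The triplet supervision, including the 3× upsampling of DFM samples, can be organized as a simple dataset abstraction. This is an illustrative sketch; the field names and dictionary layout are assumptions, not the paper's data format.

```python
import torch
from torch.utils.data import Dataset

class EditTripletDataset(Dataset):
    """Yields (source point cloud, edit image, target point cloud) triplets.

    Point clouds are assumed to be pre-sampled upstream (the paper uses
    64k-point clouds); tiny tensors are used here for illustration.
    """
    def __init__(self, triplets, dfm_repeat=3):
        # Repeat DFM (global-edit) samples 3x, mirroring the upsampling
        # protocol described above.
        self.items = []
        for t in triplets:
            reps = dfm_repeat if t.get("kind") == "dfm" else 1
            self.items.extend([t] * reps)

    def __len__(self):
        return len(self.items)

    def __getitem__(self, i):
        t = self.items[i]
        return t["src_points"], t["edit_image"], t["tgt_points"]

triplets = [
    {"kind": "parts", "src_points": torch.zeros(8, 3),
     "edit_image": torch.zeros(3, 4, 4), "tgt_points": torch.zeros(8, 3)},
    {"kind": "dfm", "src_points": torch.zeros(8, 3),
     "edit_image": torch.zeros(3, 4, 4), "tgt_points": torch.zeros(8, 3)},
]
ds = EditTripletDataset(triplets)   # 1 parts sample + 3 repeated DFM samples
```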
5. Inference and Editing Pipeline
At inference, given a source mesh and an edit image:
- Geometry encoding: the source mesh is encoded as $z_s = \mathcal{E}(M_s)$.
- Image encoding: the edit image is embedded as $c = \mathcal{E}_{\text{img}}(I)$.
- Diffusion sampling: the DiT samples the edited latent $\hat{z}_t$ using classifier-free guidance, fusing conditioning on both source geometry and edit image with separate, empirically set guidance scales for each condition.
- Decoding: $\hat{M}_t = \mathcal{D}(\hat{z}_t)$ reconstructs the edited mesh from the predicted latent.
- Texture transfer: The edited geometry is rendered to surface normals/positions, which are then used in the texture DiT to generate view-consistent RGBs, subsequently baked onto the mesh.
End-to-end inference typically requires 5–10 seconds per shape on high-end GPUs (Gat et al., 5 Feb 2026).
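The two-condition classifier-free guidance step inside the sampling loop can be sketched generically. This follows the standard CFG combination pattern; the denoiser interface and the guidance-scale values are placeholders, not ShapeUP's reported settings.

```python
import torch

def cfg_denoise(denoiser, z_t, t, z_src, c_img, w_geo=3.0, w_img=5.0):
    """Combine unconditional and conditional noise predictions with
    separate guidance scales for geometry and image conditions."""
    eps_uncond = denoiser(z_t, t, None, None)     # both conditions dropped
    eps_geo = denoiser(z_t, t, z_src, None)       # source geometry only
    eps_full = denoiser(z_t, t, z_src, c_img)     # geometry + edit image
    return (eps_uncond
            + w_geo * (eps_geo - eps_uncond)      # pull toward source shape
            + w_img * (eps_full - eps_geo))       # pull toward edit image

# Toy denoiser that always predicts zero noise, so the guided output is zero.
toy = lambda z, t, g, c: torch.zeros_like(z)
eps = cfg_denoise(toy, torch.randn(1, 16, 64), 0.5, None, None)
```

Separating the geometry and image guidance terms lets identity preservation (source-shape fidelity) and edit strength (image alignment) be tuned independently at inference time.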
6. Experimental Validation and Benchmarks
Evaluation leverages the “BenchUp” protocol: 24 diverse meshes with 100 image-driven edits each, spanning Parts, Global-Deformation, Global-Pose, and Texture categories. Performance is quantified with condition-alignment (SSIM, LPIPS, CLIP-I, DINO-I, CLIP-Dir) and occluded-region fidelity metrics. ShapeUP consistently achieves superior results across all benchmarks:
| Method | SSIM↑ | LPIPS↓ | CLIP-I↑ | DINO-I↑ | C-Dir↑ | occl-CLIP↑ | occl-DINO↑ |
|---|---|---|---|---|---|---|---|
| 3DEditFormer | 0.733 | 0.270 | 0.908 | 0.849 | 0.441 | 0.877 | 0.736 |
| EditP23 | 0.759 | 0.254 | 0.917 | 0.851 | 0.455 | 0.880 | 0.748 |
| ShapeUP | 0.763 | 0.198 | 0.943 | 0.915 | 0.520 | 0.928 | 0.878 |
User studies confirm a strong preference (>80%) for ShapeUP-edited results over competing approaches. Qualitative inspection highlights ShapeUP’s capacity for crisp part additions, continuous global deformations, and stable view-consistent edits (Gat et al., 5 Feb 2026).
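The embedding-based alignment metrics in the table (CLIP-I, DINO-I, and their occluded-region variants) all reduce to cosine similarity between image embeddings of the rendered edit and the condition image. A generic sketch of that scoring pattern, with random placeholder embeddings standing in for real CLIP/DINO features:

```python
import torch
import torch.nn.functional as F

def embedding_alignment(emb_render: torch.Tensor, emb_cond: torch.Tensor) -> float:
    """Mean cosine similarity between edited-render embeddings and the
    condition-image embedding (the pattern behind CLIP-I / DINO-I scores)."""
    return F.cosine_similarity(emb_render, emb_cond, dim=-1).mean().item()

e = torch.ones(1, 512)
score = embedding_alignment(e, e)   # identical embeddings give similarity 1.0
```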
7. Comparison to Related Frameworks and Limitations
ShapeUP outperforms prior paradigms:
- Optimization-based (SDS) methods: Prohibitively slow, prone to multi-view inconsistency.
- Multi-view 2D propagation (EditP23): Suffers from registration artifacts and identity drift, especially for occluded regions.
- Latent manipulation approaches (3DEditFormer, Nano3D): Constrained by frozen generative priors, which prove insufficient for semantic part-level or pose-level edits.
The ShapeUP approach leverages explicit 3D supervision, manipulation in a native latent space, and a diffusion-based transformer to achieve substantially greater editability, precision, and control than these baselines. Potential areas for future extension include broadening to open-world assets, incorporating text or sketch conditioning, and unified treatment of texture and geometry within a single pipeline (Gat et al., 5 Feb 2026).