
ShapeUP: Scalable 3D Editing Framework

Updated 7 February 2026
  • ShapeUP is a scalable, image-conditioned 3D editing framework that formulates mesh manipulation as supervised latent-to-latent translation.
  • It employs a diffusion transformer backbone with low-rank adapters to efficiently fuse geometry and image tokens for robust visual control.
  • ShapeUP delivers superior performance and consistency in 3D edits, validated by improved SSIM, LPIPS, and CLIP-I metrics in benchmark studies.

ShapeUP is a scalable, image-conditioned 3D editing framework that formulates precise 3D manipulation as supervised latent-to-latent translation within a native 3D representation. Unlike optimization-based techniques or multi-view 2D propagation, ShapeUP offers a direct, efficient approach built atop a pretrained 3D foundation model, enabling robust visual control, geometric consistency, and inference-time scalability for 3D content creation (Gat et al., 5 Feb 2026).

1. Problem Formulation and Core Approach

ShapeUP addresses the challenge of image-conditioned 3D editing, where the objective is to modify a source 3D mesh S_{\mathrm{src}} to match a desired edit depicted in a single-view 2D image I_{\mathrm{edit}}, producing a target mesh S_{\mathrm{tgt}} that reflects the intended change while preserving the original asset's geometry and texture. The task is cast as a latent-to-latent mapping: both source and target meshes are encoded into latent representations using a pretrained 3D variational autoencoder (VAE) encoder, E_{\mathrm{shape}}, and the edit image is embedded via an image encoder, E_{\mathrm{img}}. The editing function, denoted f_\theta, maps the concatenation of the source geometry latents and the image latents to a predicted target geometry latent, which is then decoded by the 3D VAE decoder D_{\mathrm{shape}} to yield the edited mesh. This formulation allows image-as-prompt control, supports both local and global edits, and achieves mask-free, implicit spatial localization (Gat et al., 5 Feb 2026).
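The data flow above can be sketched as follows. This is a minimal illustration of the latent-to-latent formulation only: the encoders, decoder, and editing function are hypothetical stand-ins (the real system uses a pretrained 3D VAE, an image encoder, and a diffusion transformer), and only the composition mirrors the description.

```python
# Sketch of ShapeUP's latent-to-latent formulation:
#   S_tgt = D_shape( f_theta( [E_shape(S_src); E_img(I_edit)] ) )
# All components below are placeholders for illustration.

def e_shape(mesh):
    """Placeholder 3D VAE encoder: mesh -> list of geometry latent tokens."""
    return [float(v) for v in mesh]

def e_img(image):
    """Placeholder image encoder: edit image -> list of condition tokens."""
    return [float(v) for v in image]

def f_theta(geometry_tokens, image_tokens):
    """Placeholder editing function over the concatenated token sequence.

    The real f_theta is a diffusion transformer; here we only show that it
    consumes the concatenation and predicts a target geometry latent of the
    same length as the source latent."""
    conditioned = geometry_tokens + image_tokens  # concatenation, as in the text
    return conditioned[:len(geometry_tokens)]

def d_shape(latent):
    """Placeholder 3D VAE decoder: latent -> edited mesh."""
    return list(latent)

def edit(source_mesh, edit_image):
    z_src = e_shape(source_mesh)
    c_img = e_img(edit_image)
    z_tgt = f_theta(z_src, c_img)
    return d_shape(z_tgt)
```

The key design point this captures is that editing is a single supervised forward mapping between latents, not a per-asset optimization loop.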

2. Model Architecture: Diffusion Transformer Backbone

The ShapeUP editing module is instantiated as a 3D Diffusion Transformer (DiT) using the Step1X-3D architecture. This backbone processes sets of latent tokens representing geometry (Z) and image conditions (C) using alternating double-stream (separate cross-attention for geometry and conditions) and single-stream (joint processing) transformer blocks. Adaptation to the editing task leverages low-rank adapters (LoRA), which are injected into the cross-attention weights for both block types, thus enabling efficient finetuning while retaining the pretrained generator's capabilities. Source shape tokens and image tokens are concatenated and supplied as conditioning at every attention layer, facilitating explicit 3D supervision in the editing process (Gat et al., 5 Feb 2026).
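The LoRA update injected into the cross-attention weights follows the standard low-rank formulation W' = W + (alpha / r) · B A, where only A and B are trained and the pretrained W stays frozen. The sketch below shows that formulation on toy matrices; the specific ranks and scaling values ShapeUP uses are not stated above, so the numbers here are purely illustrative.

```python
# Standard LoRA weight update on small nested-list matrices:
#   W' = W + (alpha / r) * (B @ A), A: r x d_in, B: d_out x r.

def matmul(a, b):
    """Dense matrix product for nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def lora_apply(w, a, b, alpha, r):
    """Return W + (alpha / r) * (B @ A); the pretrained W is left untouched."""
    delta = matmul(b, a)
    scale = alpha / r
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

# Rank-1 example on a 2x2 cross-attention weight (toy values).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]             # r x d_in  = 1 x 2
B = [[1.0], [0.0]]           # d_out x r = 2 x 1
W_adapted = lora_apply(W, A, B, alpha=2.0, r=1)
```

Because only the low-rank factors A and B carry gradients, finetuning touches a small fraction of the backbone's parameters, which is what makes adapting the pretrained Step1X-3D generator to editing efficient.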

3. Training Objectives and Loss Functions

ShapeUP optimizes a composite loss that combines geometric and generative criteria:

  • Diffusion Loss (L_{\mathrm{diff}}): Supervises the DiT denoising process by matching the predicted noise to the noise used to corrupt the target latent under the standard forward diffusion process.
  • Latent Reconstruction Loss (L_{\mathrm{rec}}): Ensures that the output latent representation matches the encoded target mesh latent.
  • Mesh Reconstruction Loss (L_{\mathrm{mesh}}): (Optional) Enforces consistency between the decoded mesh and the ground-truth target mesh.

The total geometry loss is expressed as

L_{\mathrm{geom}} = \lambda_{\mathrm{diff}} L_{\mathrm{diff}} + \lambda_{\mathrm{rec}} L_{\mathrm{rec}} + \lambda_{\mathrm{mesh}} L_{\mathrm{mesh}}

with empirical hyperparameters \lambda_{\mathrm{diff}} = 1, \lambda_{\mathrm{rec}} = 0.1, \lambda_{\mathrm{mesh}} = 0.01. Texture synthesis is handled in a separate, analogous stage using the same adapter losses as the base texture model (Gat et al., 5 Feb 2026).
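A worked example of the weighted combination, using the paper's weights. The individual loss terms are stand-ins computed as simple mean-squared errors over toy vectors; the real L_diff, L_rec, and L_mesh are the diffusion, latent, and mesh losses described above.

```python
# Composite geometry loss with the stated weights:
#   L_geom = 1 * L_diff + 0.1 * L_rec + 0.01 * L_mesh
# The toy MSE terms below are illustrative placeholders.

def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def geometry_loss(l_diff, l_rec, l_mesh,
                  lam_diff=1.0, lam_rec=0.1, lam_mesh=0.01):
    """Weighted sum of the three geometry loss terms."""
    return lam_diff * l_diff + lam_rec * l_rec + lam_mesh * l_mesh

# Toy values: predicted vs. true noise, latents, and mesh samples.
l_diff = mse([0.5, 0.5], [0.0, 1.0])   # 0.25
l_rec  = mse([1.0, 2.0], [1.0, 1.0])   # 0.5
l_mesh = mse([0.0, 0.0], [2.0, 0.0])   # 2.0
total = geometry_loss(l_diff, l_rec, l_mesh)
```

The weighting reflects the roles of the terms: the diffusion loss dominates, while the latent and mesh reconstruction terms act as lighter regularizers.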

4. Data Generation and Supervision

Supervised training employs a synthetic triplet dataset constructed from Objaverse assets. Each triplet consists of a source mesh, an edited 2D image, and a corresponding target mesh. Two principal protocols are used:

  • Parts Dataset: For ~6,870 shapes, semantic parts are systematically added or removed to generate mesh variants.
  • Distant Frames in Motion (DFM): For 560 animated assets, temporally distant keyframes are extracted to capture significant global pose or deformation changes. Each variant is rendered from a random viewpoint as the edited image. DFM samples are upsampled by 3× during training to increase the prevalence of global edits.

All meshes are converted to 64k-point point clouds for VAE training. This synthetic design ensures coverage of both localized and global transformations (Gat et al., 5 Feb 2026).
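The 3× upsampling of DFM triplets can be expressed as simple list replication when assembling an epoch's sample list. The triplet records and filenames below are hypothetical; only the repeat factor comes from the text.

```python
# Assemble a training sample list with DFM triplets repeated 3x, as described:
# Parts triplets appear once, DFM triplets three times, so global edits are
# seen more often during training. Triplets here are hypothetical
# (source mesh, edit image, target mesh) records.

def build_training_list(parts_triplets, dfm_triplets, dfm_repeat=3):
    """Concatenate Parts triplets with DFM triplets repeated dfm_repeat times."""
    return list(parts_triplets) + list(dfm_triplets) * dfm_repeat

parts = [("chair_src.obj", "chair_arm_added.png", "chair_tgt.obj")]
dfm = [("walk_f0.obj", "walk_f30.png", "walk_f30.obj")]
training = build_training_list(parts, dfm)  # 1 Parts + 3 DFM = 4 samples
```

In practice this replication would be combined with shuffling per epoch; the point is only that upsampling changes the sampling frequency, not the underlying data.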

5. Inference and Editing Pipeline

At inference, given a source mesh and an edit image:

  1. Geometry encoding: E_{\mathrm{shape}}(S_{\mathrm{src}}) yields Z_{\mathrm{src}}.
  2. Image encoding: E_{\mathrm{img}}(I_{\mathrm{edit}}) yields C_{\mathrm{img}}.
  3. Diffusion sampling: The DiT samples the edited latent using classifier-free guidance, fusing conditioning on both source geometry and edit image with empirically set guidance scales (s_i = 2.5, s_s = 3.5).
  4. Decoding: DshapeD_{\mathrm{shape}} reconstructs the edited mesh from the predicted latent.
  5. Texture transfer: The edited geometry is rendered to surface normals/positions, which are then used in the texture DiT to generate view-consistent RGBs, subsequently baked onto the mesh.
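Step 3 combines two guidance signals. The exact combination rule ShapeUP uses is not reproduced above, so the sketch below follows one common cascaded formulation for dual-condition classifier-free guidance, with each scale applied to its condition's contribution; treat it as illustrative, not authoritative.

```python
# Hypothetical dual-condition classifier-free guidance for step 3.
# eps_uncond:    denoiser output with both conditions dropped
# eps_img:       denoiser output conditioned on the edit image only
# eps_img_shape: denoiser output conditioned on image + source geometry
# One common cascaded rule (an assumption here, not the paper's stated one):
#   eps = eps_uncond + s_i * (eps_img - eps_uncond)
#                    + s_s * (eps_img_shape - eps_img)

def dual_cfg(eps_uncond, eps_img, eps_img_shape, s_i=2.5, s_s=3.5):
    """Combine the three denoiser outputs element-wise with the two scales."""
    return [eu + s_i * (ei - eu) + s_s * (eis - ei)
            for eu, ei, eis in zip(eps_uncond, eps_img, eps_img_shape)]

# Toy noise predictions, two latent dimensions.
guided = dual_cfg([0.0, 0.0], [1.0, 0.0], [1.0, 1.0])
```

With s_i = 2.5 and s_s = 3.5 as stated above, the source-geometry condition is weighted more strongly than the edit image, consistent with the goal of preserving the original asset.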

End-to-end inference typically requires 5–10 seconds per shape on high-end GPUs (Gat et al., 5 Feb 2026).

6. Experimental Validation and Benchmarks

Evaluation leverages the “BenchUp” protocol: 24 diverse meshes with 100 image-driven edits each, spanning Parts, Global-Deformation, Global-Pose, and Texture categories. Performance is quantified with condition-alignment (SSIM, LPIPS, CLIP-I, DINO-I, CLIP-Dir) and occluded-region fidelity metrics. ShapeUP consistently achieves superior results across all benchmarks:

Method        SSIM↑  LPIPS↓  CLIP-I↑  DINO-I↑  CLIP-Dir↑  occl-CLIP↑  occl-DINO↑
3DEditFormer  0.733  0.270   0.908    0.849    0.441      0.877       0.736
EditP23       0.759  0.254   0.917    0.851    0.455      0.880       0.748
ShapeUP       0.763  0.198   0.943    0.915    0.520      0.928       0.878

User studies confirm a strong preference (>80%) for ShapeUP-edited results over competing approaches. Qualitative inspection highlights ShapeUP’s capacity for crisp part additions, continuous global deformations, and stable view-consistent edits (Gat et al., 5 Feb 2026).

ShapeUP outperforms prior paradigms:

  • Optimization-based (SDS) methods: Prohibitively slow, prone to multi-view inconsistency.
  • Multi-view 2D propagation (EditP23): Suffers from registration artifacts and identity drift, especially for occluded regions.
  • Latent manipulation approaches (3DEditFormer, Nano3D): Constrained by frozen generative priors, which proves insufficient for semantic part-level or pose-level edits.

The ShapeUP approach leverages explicit 3D supervision, manipulation in a native latent space, and a diffusion-based transformer to achieve substantially greater editability, precision, and control than these baselines. Potential areas for future extension include broadening to open-world assets, incorporating text or sketch conditioning, and unified treatment of texture and geometry within a single pipeline (Gat et al., 5 Feb 2026).
