Native 3D Editing Framework

Updated 3 July 2026

A native 3D editing framework is a method that applies localized or global edits directly in a 3D representation so that geometry and appearance stay coherent.
Such frameworks operate on unified latents, dense point maps, or multimodal tokens to achieve high-fidelity edits while preserving unedited regions.
They balance edit fidelity, preservation of source structure, and multi-view consistency, and they report substantially lower test-time cost than per-scene optimization or 2D-lifting pipelines.

The term native 3D editing framework denotes a class of methods that modify existing 3D scenes or assets directly in native 3D representations, rather than editing rendered views independently and then lifting the result back into 3D. Recent work characterizes earlier alternatives as per-scene optimization over explicit 3D representations, cascaded edit-and-reconstruct pipelines, or 2D-lifting strategies, and associates them with high test-time cost, blurry textures, limited 3D awareness, structural inconsistencies, and multi-view drift (Zhu et al., 11 Jun 2026, Zhu et al., 14 May 2026). In response, native frameworks operate on unified RGB-geometry latents, dense point maps, sparse voxel or structured latents, Gaussian primitives, mesh-texture tokenizations, or shared multimodal token sequences, with the explicit aim of preserving unedited content while enforcing geometric coherence and multi-view consistency (Gat et al., 5 Feb 2026, Ye et al., 2 Apr 2026, Liu et al., 5 Jun 2026).

1. Problem formulation and scope

The central problem is to apply localized or global edits to a 3D scene or asset while maintaining the identity of untouched geometry and appearance. In the recent literature, this objective appears in several closely related forms: feed-forward scene editing in a unified latent space, residual-field prediction on a frozen 3D backbone, supervised latent-to-latent translation, explicit Gaussian manipulation, inversion-based latent editing, and conditional autoregressive or multimodal generation (Zhu et al., 11 Jun 2026, Zhu et al., 14 May 2026, Gat et al., 5 Feb 2026, Yan et al., 2024).

A defining property of these frameworks is that the edited variable is itself 3D. JointEdit3D edits a unified RGB-geometry latent by asymmetric latent inpainting from a single edited RGB reference frame (Zhu et al., 11 Jun 2026). VGGT-Edit predicts a dense residual displacement field over a base point map produced by a frozen feed-forward backbone (Zhu et al., 14 May 2026). ShapeUP edits a source 3D shape using a 3D Diffusion Transformer conditioned on an edited 2D image (Gat et al., 5 Feb 2026). 3DSceneEditor manipulates Gaussian splats directly through semantic labeling, zero-shot grounding, and Gaussian-level operators for deletion, recoloring, addition, movement, and replacement (Yan et al., 2024). VoxHammer, VS3D, and 3D-LATTE move the edit inside the sampler of a pretrained native 3D diffusion or rectified-flow model, replacing or modulating latent features, attention maps, or velocity fields rather than reconstructing 3D from independently edited 2D views (Li et al., 26 Aug 2025, Liu et al., 8 May 2026, Parelli et al., 29 Aug 2025).

This shared problem setting produces a technical emphasis on three constraints that recur across papers: edit fidelity in the edited region, preservation of unedited content, and cross-view or 3D structural coherence. A plausible implication is that “native” has come to denote where the edit is represented and enforced, rather than a single fixed choice of architecture, supervision, or conditioning modality.

2. Representational substrates

Recent native 3D editing frameworks differ most clearly in their choice of 3D representation and conditioning interface.

Method	Native representation	Conditioning signal
JointEdit3D	unified RGB-geometry latent $z(V) = [z^{rgb}; z^{geo}]$	one edited RGB reference frame $I_e$ plus optional text $p$
VGGT-Edit	dense point map $P_{base}$ and residual displacement field $\Delta P$	text instruction with depth-synchronized text injection
ShapeUP	shape-VAE latent edited by a 3D Diffusion Transformer	source 3D mesh and a single edited 2D image
VF-Editor	Gaussian attribute variations $\Delta$ over native Gaussian primitives	text instruction distilled from 2D editing knowledge
Native3D	unified mesh-texture joint representation $Z_{scene}$	text- or sketch-based instruction
Omni123	shared discrete sequence space for text, images, and 3D mesh tokens	source mesh tokens and natural-language edit instruction

JointEdit3D adopts perhaps the most explicit joint representation. A pretrained video VAE produces $z^{rgb}$, a frozen VGGT encoder plus Geometry Adapter encoder produces $z^{geo}$, and the two are concatenated along width to form $z(V) \in \mathbb R^{C \times F \times H \times 2W}$ (Zhu et al., 11 Jun 2026). This representation is designed so that appearance synthesis and geometry prediction are coupled during editing.
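The width-wise concatenation can be illustrated with a minimal sketch. It uses plain nested lists in place of real latents; the helper names (`concat_along_width`, `shape`, `filled`) and the toy sizes are assumptions for illustration, not part of JointEdit3D.

```python
def concat_along_width(z_rgb, z_geo):
    """Concatenate two [C][F][H][W] nested lists along the last (width) axis."""
    if len(z_rgb) != len(z_geo):
        raise ValueError("channel counts differ")
    return [
        [
            [row_rgb + row_geo for row_rgb, row_geo in zip(plane_rgb, plane_geo)]
            for plane_rgb, plane_geo in zip(frames_rgb, frames_geo)
        ]
        for frames_rgb, frames_geo in zip(z_rgb, z_geo)
    ]


def shape(t):
    """Return the shape of a rectangular nested list."""
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        t = t[0]
    return tuple(dims)


def filled(c, f, h, w, value):
    """Build a [C][F][H][W] nested list filled with one value."""
    return [[[[value] * w for _ in range(h)] for _ in range(f)] for _ in range(c)]


z_rgb = filled(2, 3, 4, 5, 1.0)   # stand-in for the video-VAE latent
z_geo = filled(2, 3, 4, 5, -1.0)  # stand-in for the geometry latent
z = concat_along_width(z_rgb, z_geo)
print(shape(z))  # width doubles; C, F, and H are unchanged
```

Each row of the result holds the RGB values in its first W entries and the geometry values in the last W, which is one way a single tensor can expose both halves to the same model.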

VGGT-Edit begins from a frozen, generalizable feed-forward model, described as $\pi^3$: Permutation-Equivariant Visual Geometry Learning, whose base output is a dense point map $P_{base}$. Editing is parameterized as a masked residual transform. Writing $M$ for the edit mask and $P_{edit}$ for the edited point map, it takes the form

$P_{edit} = P_{base} + M \odot \Delta P$

so geometry is altered only where the edit mask is active (Zhu et al., 14 May 2026).
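A masked residual update of this kind can be sketched in a few lines. The sketch assumes the edit has the form base plus mask times displacement, applied per point; the function name `apply_masked_residual` and the toy data are hypothetical, and in the actual method the displacement field comes from a learned head.

```python
def apply_masked_residual(p_base, delta_p, mask):
    """Return p_base + mask * delta_p, point by point."""
    if not (len(p_base) == len(delta_p) == len(mask)):
        raise ValueError("inputs must have the same number of points")
    return [
        tuple(b + m * d for b, d in zip(point, delta))
        for point, delta, m in zip(p_base, delta_p, mask)
    ]


p_base = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 2.0)]
delta_p = [(0.5, 0.0, 0.0), (0.0, 0.25, 0.0), (0.0, 0.0, -1.0)]
mask = [0, 1, 0]  # only the second point lies inside the edit region

p_edit = apply_masked_residual(p_base, delta_p, mask)
print(p_edit)
```

Points with a zero mask are returned unchanged even though their predicted displacement is non-zero, which is the preservation property the residual formulation is meant to provide.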

Other frameworks extend the native 3D idea beyond latent diffusion. Native3D models a scene in a unified mesh-texture representation, fusing geometry and texture embeddings into a joint scene latent $Z_{scene}$ with a Transformer-based scene encoder (Liu et al., 5 Jun 2026). Omni123 instead tokenizes text, images, and 3D meshes into a shared autoregressive sequence space, representing instruction-based editing as conditional generation over discrete 3D mesh tokens (Ye et al., 2 Apr 2026). EVA01 moves further toward multimodal large-model integration by extending a Mixture-of-Transformers architecture to native 3D mesh understanding, generation, and context-aware editing (Yang et al., 16 May 2026).
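To make the idea of editing as conditional generation over a shared token sequence concrete, the following sketch lays out a conditioning prefix. The special tokens (`<bos>`, `<text>`, `<mesh_src>`, `<mesh_tgt>`), the ordering, and the function name are all assumptions for illustration; the source does not specify Omni123's actual vocabulary or layout.

```python
def build_edit_sequence(instruction_tokens, source_mesh_tokens):
    """Lay out a prefix: instruction text, then source mesh, then a target marker.

    An autoregressive model would continue this prefix with the edited
    mesh tokens. All special tokens here are hypothetical.
    """
    return (
        ["<bos>", "<text>"]
        + list(instruction_tokens)
        + ["<mesh_src>"]
        + list(source_mesh_tokens)
        + ["<mesh_tgt>"]
    )


prefix = build_edit_sequence(["make", "it", "red"], [17, 4, 92])
print(prefix)
```

Under this layout, editing needs no dedicated edit head: the same next-token predictor that generates meshes from text is conditioned on the source mesh as additional context.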

3. Core architectural mechanisms

A central design axis is how a framework injects edit information while retaining source-scene or source-object structure. JointEdit3D uses asymmetric single-frame latent inpainting: at test time it observes a source video and one edited reference frame $I_e$, encodes $I_e$ into an edited RGB latent, places that slice into a zero-filled unified tensor, and predicts the remaining RGB views and the geometry half under source-scene anchoring (Zhu et al., 11 Jun 2026). Its SceneAnchor Branch is added to a frozen pretrained Wan backbone and injects source-scene structure by residual corrections rather than by direct copying. JointEdit3D also introduces edit/background-aware losses that decompose training into edited and background regions, with default values reported for five hyperparameters (Zhu et al., 11 Jun 2026).
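The single-frame inpainting input described above can be sketched as follows. The sketch assumes a single-channel tensor of shape [F][H][2W] in which the first W columns hold the RGB half and the last W columns the geometry half; that layout, the `ref_index` parameter, and the returned `known` mask are illustrative assumptions rather than details confirmed by the source.

```python
def build_inpainting_input(num_frames, height, width, edited_rgb_slice, ref_index=0):
    """Zero-filled [F][H][2W] tensor with one known RGB slice.

    Columns [0, W) hold the RGB half and columns [W, 2W) the geometry half
    (an assumed layout). Returns the tensor and a same-shaped 0/1 mask that
    marks which entries are observed.
    """
    if len(edited_rgb_slice) != height or any(len(r) != width for r in edited_rgb_slice):
        raise ValueError("edited slice must be [H][W]")
    tensor = [[[0.0] * (2 * width) for _ in range(height)] for _ in range(num_frames)]
    known = [[[0] * (2 * width) for _ in range(height)] for _ in range(num_frames)]
    for h in range(height):
        for w in range(width):
            tensor[ref_index][h][w] = edited_rgb_slice[h][w]
            known[ref_index][h][w] = 1
    return tensor, known


edited_slice = [[1.0, 2.0], [3.0, 4.0]]  # stand-in for the encoded edited frame
tensor, known = build_inpainting_input(3, 2, 2, edited_slice, ref_index=1)
print(tensor[1])  # RGB half filled, geometry half still zero
```

The asymmetry is visible in the mask: only one frame's RGB half is observed, while every other RGB view and the entire geometry half are left for the model to predict.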

VGGT-Edit addresses the same preservation-versus-editability tension through synchronized conditioning and residual prediction. Text is injected at selected backbone layers to align semantic guidance with the depths where spatial geometry is formed, while a view-aware importance weight modulates the text keys and values so occluded or boundary views are suppressed (Zhu et al., 14 May 2026). The edit is then confined to a residual transformation head, and training uses a five-term weighted objective, for which the paper reports typical weights (Zhu et al., 14 May 2026).
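One simple way a per-view weight could suppress text guidance is to scale the text keys and values before cross-attention. The sketch below assumes a scalar weight in [0, 1] per view and a single-query softmax attention; the function names and this particular modulation rule are assumptions, since the source does not give the exact formulation.

```python
import math


def attend(query, keys, values):
    """Single-query softmax attention over a short list of text tokens."""
    logits = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e / total * v[i] for e, v in zip(exps, values)) for i in range(dim)]


def modulate_text_kv(keys, values, view_weight):
    """Scale text keys and values for one view by a weight in [0, 1]."""
    if not 0.0 <= view_weight <= 1.0:
        raise ValueError("view_weight must lie in [0, 1]")
    scaled_k = [[view_weight * x for x in row] for row in keys]
    scaled_v = [[view_weight * x for x in row] for row in values]
    return scaled_k, scaled_v


query = [1.0, 0.0]
text_keys = [[2.0, 0.0], [0.0, 2.0]]
text_values = [[1.0, -1.0], [3.0, 5.0]]

visible = attend(query, *modulate_text_kv(text_keys, text_values, 1.0))
occluded = attend(query, *modulate_text_kv(text_keys, text_values, 0.0))
print(visible, occluded)
```

With a weight of 1 the attention output is unchanged, and with a weight of 0 the text contribution vanishes, so intermediate weights interpolate how strongly a view responds to the instruction.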

Feed-forward object editing methods introduce related mechanisms at different granularities. “Native 3D Editing with Full Attention” represents source and noisy target latents in a Full-DiT, and studies two conditioning strategies: a conventional cross-attention mechanism and a 3D token concatenation approach. The reported conclusion is that token concatenation is more parameter-efficient and achieves superior performance (Cai et al., 21 Nov 2025). VF-Editor predicts a global variation field from randomized Gaussian tokens and then decodes first the positional variation and then the remaining attribute variations for scale, opacity, color, and rotation. Its one-shot edit applies the predicted variations $\Delta$ to the source Gaussians, and the edit is executed in a single feed-forward pass on native Gaussian primitives (Qin et al., 12 Feb 2026).
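Applying predicted attribute variations to a Gaussian primitive can be sketched as below. The additive update, the clamping to [0, 1], and the `Gaussian` dataclass are assumptions for illustration; rotation is omitted because rotations generally compose rather than add, and the source does not state how VF-Editor combines them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gaussian:
    position: tuple
    scale: tuple
    opacity: float
    color: tuple


def add(a, b):
    """Element-wise sum of two equal-length tuples."""
    return tuple(x + y for x, y in zip(a, b))


def clamp01(x):
    """Clamp a value to the closed interval [0, 1]."""
    return min(1.0, max(0.0, x))


def apply_variation(g, d_position, d_scale, d_opacity, d_color):
    """Return a new Gaussian with additive variations applied (assumed rule)."""
    return Gaussian(
        position=add(g.position, d_position),
        scale=add(g.scale, d_scale),
        opacity=clamp01(g.opacity + d_opacity),
        color=tuple(clamp01(c) for c in add(g.color, d_color)),
    )


source = Gaussian((0.0, 0.0, 0.0), (1.0, 1.0, 1.0), 0.5, (0.5, 0.5, 0.5))
edited = apply_variation(
    source,
    d_position=(0.5, 0.0, 0.0),
    d_scale=(0.0, 0.0, 0.0),
    d_opacity=0.25,
    d_color=(0.75, 0.0, -0.75),
)
print(edited)
```

Because every primitive is updated independently from its own variation, the whole edit amounts to one pass over the Gaussian set with no per-scene optimization loop.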

These mechanisms share a common structural principle: the source 3D representation is not merely a reconstruction prior, but an explicit conditioning signal or anchor that constrains the permissible edit trajectory.

4. Inversion-based, training-free, and agentic variants

A second major branch of the literature performs native 3D editing during denoising or ODE sampling rather than by feed-forward prediction. VoxHammer first inverts a 3D asset into a pretrained 3D latent diffusion model, caches intermediate latents and all attention key/value tensors, and then re-denoises while replacing latent features and key/value tensors in preserved regions with their cached originals (Li et al., 26 Aug 2025). In the structure stage it blends latents with a binary mask; in the sparse-latent stage it copies the features at preserved coordinates exactly. This replacement-based design is intended to preserve both contextual features and coherent integration of edited parts.
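The two replacement rules can be sketched on toy data. The sketch assumes flattened dense latents for the structure stage and a coordinate-to-feature dictionary for the sparse stage; the function names and data layout are hypothetical, and a real sampler would apply such a step at every denoising iteration with the cached tensors for that timestep.

```python
def blend_dense(edited, cached, preserve_mask):
    """Keep cached values where preserve_mask is 1, edited values elsewhere."""
    return [c if m else e for e, c, m in zip(edited, cached, preserve_mask)]


def restore_sparse(edited, cached, preserved_coords):
    """Copy cached features back at preserved voxel coordinates."""
    out = dict(edited)
    for coord in preserved_coords:
        if coord in cached:
            out[coord] = cached[coord]
    return out


# Structure stage: flattened dense latents with a binary preservation mask.
dense = blend_dense(edited=[9.0, 9.0, 9.0], cached=[1.0, 2.0, 3.0], preserve_mask=[1, 0, 1])

# Sparse-latent stage: features keyed by voxel coordinate.
cached_feats = {(0, 0, 0): [0.1], (1, 0, 0): [0.2]}
edited_feats = {(0, 0, 0): [0.7], (1, 0, 0): [0.8], (2, 0, 0): [0.9]}
sparse = restore_sparse(edited_feats, cached_feats, preserved_coords=[(0, 0, 0)])

print(dense, sparse)
```

Only the voxel listed as preserved is restored from the cache, while edited and newly created voxels keep the values produced by the current denoising step.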

VS3D intervenes even more explicitly inside the rectified-flow sampler. Reconstruction-Anchored Source Injection absorbs identity leakage by replacing the unconditional embedding with a per-step, asset-specific anchor; Partial-Mean Guidance amplifies the edit direction only where a consistent edit exists; Twin-Agreement Residual injection restores source identity token by token in geometry and material stages (Liu et al., 8 May 2026). The framework is described as inversion-free, training-free, and mask-free.

3D-LATTE also edits directly in native 3D latent space but emphasizes attention control. It blends source-prompt and edit-prompt 3D attention maps, adds geometry-aware regularization over opacity and covariance, applies spectral modulation in the Fourier domain, and concludes with a 3D enhancement refinement loop (Parelli et al., 29 Aug 2025). Vinedresser3D combines native latent editing with agentic planning: a Gemini-2.5-flash MLLM produces decomposed structural and appearance-level guidance, selects a view, uses NanoBanana to create an edited reference image, localizes the edit region with PartField, and then edits Trellis latents through interleaved text and image sampling with mask injection at every denoising step (Chi et al., 23 Feb 2026).
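Blending two attention maps can be illustrated with a row-wise convex combination. The fixed blend factor `alpha` and the function name are assumptions; the source does not specify the blending rule or schedule that 3D-LATTE uses.

```python
def blend_attention(attn_src, attn_edit, alpha):
    """Row-wise convex blend of two attention maps whose rows each sum to 1."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [
        [(1.0 - alpha) * s + alpha * e for s, e in zip(row_src, row_edit)]
        for row_src, row_edit in zip(attn_src, attn_edit)
    ]


attn_src = [[0.75, 0.25], [0.5, 0.5]]   # attention under the source prompt
attn_edit = [[0.25, 0.75], [0.0, 1.0]]  # attention under the edit prompt

blended = blend_attention(attn_src, attn_edit, alpha=0.5)
print(blended)
```

A convex combination of row-normalized maps stays row-normalized, so the blended map can be used in place of either original without renormalization.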

The literature therefore does not use “native” to mean the absence of image or language guidance. JointEdit3D observes a single edited RGB reference latent (Zhu et al., 11 Jun 2026), ShapeUP is image-conditioned (Gat et al., 5 Feb 2026), VF-Editor is distilled from 2D editing knowledge (Qin et al., 12 Feb 2026), and Vinedresser3D explicitly applies an image editing model to obtain visual guidance (Chi et al., 23 Feb 2026). This suggests that native 3D editing is primarily defined by where the edit is enforced—inside a 3D representation, latent, sampler, or primitive set—rather than by the exclusion of 2D signals.

5. Datasets, benchmarks, and evaluation protocols

The recent shift toward native 3D editing has been accompanied by explicit dataset construction, largely because several papers identify a lack of paired resources for standardized evaluation.

Dataset or benchmark	Size / split	Notable properties
SceneEdit3D-15K	15 K paired before/after video samples; 13,799 train and 1,520 test/val	source video, edited target video, one edited reference RGB frame, natural-language instruction, renderer-provided 3D annotations
SceneEdit3D-Bench	100 carefully curated samples	Delete 29, Add 29, Move 14, Appearance 14, Multi-op 14
DeltaScene	~100 k pairs; 95 k train / 500 manually-verified test	Add, Delete, Modify, Move, and compositions; automated 4-stage pipeline with 3D agreement filtering
BenchUp	24 diverse meshes × 100 edit conditions	Parts, Global-Deformation, Global-Pose, Global-Texture
Edit3D-Bench	100 high-quality 3D models; 3 distinct editing prompts each	human-annotated 3D mask, corresponding 2D mask, edited 2D image
Nano3D-Edit-100k	100 000 pairs	~33 % Add, 33 % Remove, 34 % Replace

SceneEdit3D-15K and SceneEdit3D-Bench were introduced together with JointEdit3D specifically to provide paired editing samples and renderer-provided 3D annotations for scene editing (Zhu et al., 11 Jun 2026). DeltaScene was constructed for VGGT-Edit through an automated pipeline that includes instruction generation and verification, 3D mask refinement, sequential multi-view editing, and viewpoint selection with Re-projection Fidelity (Zhu et al., 14 May 2026). ShapeUP’s BenchUp evaluates both edit alignment and occluded-region fidelity (Gat et al., 5 Feb 2026). VoxHammer’s Edit3D-Bench emphasizes preservation quality in the unedited region, with human-labeled 3D editing regions (Li et al., 26 Aug 2025). Nano3D-Edit-100k is positioned as paired data for future feed-forward 3D editors (Ye et al., 16 Oct 2025).

Evaluation protocols reflect the dual nature of the task. Scene-based works report PSNR, LPIPS, and point-cloud metrics such as Accuracy, Completeness, Chamfer, and F-score (Zhu et al., 11 Jun 2026). Text-guided frameworks use CLIP-derived scores, C-FID, C-KID, DINO-I, CLIP-Dir, FID, and FVD (Zhu et al., 14 May 2026, Gat et al., 5 Feb 2026, Cai et al., 21 Nov 2025). Preservation-oriented benchmarks add masked PSNR, SSIM, LPIPS, or Chamfer on unchanged regions (Li et al., 26 Aug 2025). This metric diversity indicates that no single benchmark yet captures edit fidelity, 3D consistency, and background preservation simultaneously.
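Two of the preservation-oriented measures can be sketched with the standard library. The sketch assumes flattened images with values in [0, 1] and a Chamfer distance defined as the sum of mean nearest-neighbour distances in both directions; individual papers may use squared distances or different normalization, so absolute values are not comparable across definitions.

```python
import math


def masked_psnr(img_a, img_b, keep_mask, max_value=1.0):
    """PSNR restricted to pixels where keep_mask is 1 (flattened inputs)."""
    diffs = [(a - b) ** 2 for a, b, m in zip(img_a, img_b, keep_mask) if m]
    if not diffs:
        raise ValueError("mask selects no pixels")
    mse = sum(diffs) / len(diffs)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


def chamfer(points_a, points_b):
    """Symmetric Chamfer: mean nearest-neighbour distance, both directions."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(points_a, points_b) + one_way(points_b, points_a)


source = [0.0, 0.5, 1.0]
edited = [0.0, 0.5, 0.0]       # the last pixel was edited
unchanged_region = [1, 1, 0]   # evaluate only outside the edit

print(masked_psnr(source, edited, unchanged_region))
print(chamfer([(0.0, 0.0, 0.0)], [(0.0, 0.0, 1.0)]))
```

Restricting the metric to the unchanged region separates preservation from edit quality: a large intended change inside the edit region does not lower the masked score.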

6. Reported performance and efficiency

On the shared deletion task, JointEdit3D reports 31.92/0.151 PSNR/LPIPS on SceneEdit3D-Bench and 18.57/0.3426 on 360-USID, compared with 24.86/0.307 and 17.78/0.3655 for Omni-3DEdit. Its geometry comparison also reports Accuracy, Completeness, Chamfer, and F-score for JointEdit3D (full). For 49-frame inference on an H200, JointEdit3D reports 11.94 s total time and 29.1 GB GPU memory, compared with 43.25 s and 43.8 GB for MVInpainter and 237.15 s and 56.4 GB for Omni-3DEdit (Zhu et al., 11 Jun 2026).
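The reported 49-frame timings and memory figures can be turned into relative factors with a short calculation. The numbers are those quoted above; the helper name `ratios` is an assumption, and the resulting factors are derived here rather than reported by the paper.

```python
timings_s = {"JointEdit3D": 11.94, "MVInpainter": 43.25, "Omni-3DEdit": 237.15}
memory_gb = {"JointEdit3D": 29.1, "MVInpainter": 43.8, "Omni-3DEdit": 56.4}


def ratios(table, reference):
    """Express every other entry as a multiple of the reference entry."""
    ref = table[reference]
    return {name: value / ref for name, value in table.items() if name != reference}


speedup = ratios(timings_s, "JointEdit3D")
memory_factor = ratios(memory_gb, "JointEdit3D")

for name in speedup:
    print(f"{name}: {speedup[name]:.2f}x time, {memory_factor[name]:.2f}x memory")
```

On these figures the baselines take roughly 3.6 and 19.9 times as long as JointEdit3D and use roughly 1.5 and 1.9 times as much GPU memory.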

On the 500-pair DeltaScene test set, VGGT-Edit reports a CLIP Score of 30.2, C-FID of 122.4, C-KID of 0.048, and a runtime of ~5 s. The same table reports Omni-3DEdit at 28.5, 128.1, 0.085, and ~115 s, and NoPoSplat at 25.8, 135.4, 0.112, and ~20 s (Zhu et al., 14 May 2026). On BenchUp, ShapeUP reports SSIM 0.763, LPIPS 0.198, CLIP-I 0.943, DINO-I 0.915, CLIP-Dir 0.520, Occluded CLIP-I 0.928, and Occluded DINO-I 0.878, together with a user study of 664 pairwise comparisons over 34 participants in which ShapeUP was preferred ≈75% of the time against each baseline, with 95% CI ±4% (Gat et al., 5 Feb 2026).

“Native 3D Editing with Full Attention” reports average rendered-view metrics of FID 91.9, FVD 286.5, and CLIP 0.249 for its token-concatenation model, with ≈20 s inference time on a single A100 GPU (Cai et al., 21 Nov 2025). VF-Editor reports a runtime of approximately 0.3 s per edit versus 200–460 s for indirect 2D-to-3D methods, together with the highest IS, the highest IAA, and the highest scores on two further metrics in its multi-domain comparison; it also reports Chamfer and F-score under non-geometric edits (Qin et al., 12 Feb 2026). Compared with the best baseline, Instant3DiT, VoxHammer reports CD 0.012 versus 0.016, masked PSNR 41.68 dB versus 27.70 dB, masked SSIM 0.994 versus 0.957, LPIPS 0.027 versus 0.067, FID 23.05 versus 45.93, FVD 187.8 versus 450.1, DINO-I 0.947 versus 0.903, and CLIP-T 0.287 versus 0.260 (Li et al., 26 Aug 2025).

Because these numbers are reported on different datasets and tasks, they do not define a single cross-paper ranking. They do, however, consistently support the same empirical pattern: feed-forward or feature-replacement native 3D editors are evaluated as substantially more efficient than optimization-heavy or multi-view 2D-lifting baselines, while also improving preservation and multi-view coherence on their own benchmarks.

7. Limitations, misconceptions, and broader directions

The limitations reported by the literature are also highly consistent. JointEdit3D remains reference-guided: quality depends on the user-provided edited RGB frame, and errors or inconsistent edits in that frame propagate across views and into geometry (Zhu et al., 11 Jun 2026). ShapeUP requires paired supervision, is biased toward closed, object-centric meshes, and remains diffusion-limited at inference time (Gat et al., 5 Feb 2026). Nano3D currently handles only localized “Add/Remove/Replace” edits, while global shape or topology transformations remain challenging (Ye et al., 16 Oct 2025). 3DGS-Drag notes artifacts under overly aggressive drags, failures for small objects or highly complex scenes, and meaningless textures on unseen back sides (Dong et al., 12 Jan 2026). 3D-LATTE identifies reliance on accurate multi-view masks and computational overhead from iterative 3D refinement (Parelli et al., 29 Aug 2025).

A recurrent misconception is that “native 3D” is synonymous with “fully text-only” or “without any 2D component.” The papers do not support that interpretation. Several methods remain native at the representation level while depending on a single edited image, a rendered reference view, distilled 2D editing knowledge, or 2D-generated guidance (Zhu et al., 11 Jun 2026, Gat et al., 5 Feb 2026, Qin et al., 12 Feb 2026, Chi et al., 23 Feb 2026). Another misconception is that the term implies a single representation family. In practice, the literature spans Gaussian splats, sparse voxels, point maps, unified RGB-geometry latents, mesh-texture tokens, and autoregressive discrete mesh tokens (Yan et al., 2024, Li et al., 26 Aug 2025, Liu et al., 5 Jun 2026, Ye et al., 2 Apr 2026).

At the system level, current work points toward a broader unification of editing, generation, and multimodal reasoning. Omni123 frames instruction-based editing as conditional generation in a shared text-image-3D sequence space and presents this as a scalable path toward multimodal 3D world models (Ye et al., 2 Apr 2026). EVA01 extends the modality boundary of MLLMs to native 3D mesh understanding, generation, and context-aware editing through a Mixture-of-Transformers architecture (Yang et al., 16 May 2026). Native3D couples mesh and texture modeling with semantic alignment and reports editing flexibility in an end-to-end scene framework (Liu et al., 5 Jun 2026). This suggests that native 3D editing is evolving from a narrowly defined edit module into a component of unified multimodal 3D systems that combine understanding, generation, and stateful editing in a shared representational space.