- The paper introduces a training-free method for precise 3D local asset editing using a two-stage latent inversion process and contextual feature replacement.
- The approach leverages a pretrained structured 3D diffusion model (TRELLIS) to blend edited and unedited regions, achieving high-fidelity preservation of unedited content (Chamfer Distance 0.012, masked PSNR 41.68).
- The method generalizes to diverse 3D editing tasks and outperforms existing solutions in user studies, reducing processing time from tens of minutes (e.g., Vox-E at 32 min) to roughly two minutes.
VoxHammer: Training-Free Precise and Coherent 3D Editing in Native 3D Space
Introduction
VoxHammer introduces a training-free framework for precise and coherent local editing of 3D assets directly in native 3D latent space. The method leverages a pretrained structured 3D latent diffusion model (TRELLIS) and a two-stage process: (1) precise 3D inversion and (2) denoising-based editing with contextual feature replacement. This approach addresses the limitations of prior 3D editing pipelines, which either optimize 3D representations via Score Distillation Sampling (SDS) or edit multi-view images followed by 3D reconstruction, both suffering from inefficiency and poor preservation of unedited regions. VoxHammer achieves high-fidelity editing without retraining, enabling efficient synthesis of paired edited data and laying the groundwork for in-context 3D generation.
Methodology
Pipeline Overview
VoxHammer's pipeline takes an input 3D model, a user-specified editing region, and a text prompt. It first renders a view of the asset and inpaints the projected editing region with an off-the-shelf image diffusion model (e.g., FLUX). Native 3D editing is then performed in the latent space of the structured 3D diffusion model, conditioned on both the original 3D asset and the inpainted image.
Figure 1: The pipeline integrates image inpainting and native 3D latent editing for precise local modifications.
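To make the data flow concrete, the sketch below wires the three steps together. It is a minimal orchestration under stated assumptions: `render`, `inpaint`, `invert`, and `edit` are hypothetical stand-ins for the components described above, not VoxHammer's released API.

```python
def voxhammer_pipeline(render, inpaint, invert, edit, asset, mask3d, prompt):
    """Hypothetical orchestration of the pipeline; each callable stands in
    for a component described in the text."""
    view, mask2d = render(asset, mask3d)         # 1) render a view, project the 3D mask to 2D
    edited_view = inpaint(view, mask2d, prompt)  # 2) 2D inpainting (e.g., FLUX)
    noise, cache = invert(asset)                 # 3a) invert the asset to terminal noise, caching latents/K-V
    return edit(noise, cache, edited_view, mask3d)  # 3b) denoise with contextual replacement
```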
Architecture and Inversion
The framework adopts TRELLIS as the base model, which operates in a sparse voxel-based latent space. TRELLIS employs a two-stage denoising process: the structure (ST) stage predicts coarse voxel occupancy, and the sparse-latent (SLAT) stage refines fine-grained geometry and texture. VoxHammer performs inversion in both stages, mapping the textured 3D asset to its terminal noise and caching latents and key-value (K/V) tokens at each timestep.
Figure 2: VoxHammer architecture with two-stage inversion and contextual feature replacement for precise editing.
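A minimal sketch of such an inversion-with-caching loop is shown below, assuming a rectified-flow velocity predictor `velocity_fn(z, t)` standing in for one TRELLIS stage (ST or SLAT); the cache layout is illustrative, and K/V tokens would be captured analogously via attention hooks.

```python
import torch

@torch.no_grad()
def invert_stage(velocity_fn, z0, timesteps):
    """Deterministically map a clean latent z0 to terminal noise, caching the
    latent at every timestep for re-injection during the editing pass."""
    cache = {}
    z = z0
    for t, t_next in zip(timesteps[:-1], timesteps[1:]):
        cache[float(t)] = z.clone()   # snapshot for latent replacement
        v = velocity_fn(z, float(t))  # predicted velocity at (z, t)
        z = z + (t_next - t) * v      # first-order step toward noise
                                      # (see the Taylor-improved variant below)
    cache[float(timesteps[-1])] = z.clone()
    return z, cache
```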
Inversion is implemented using a Taylor-improved Euler scheme, inspired by RF-Solver, to minimize integration errors and ensure high-fidelity reconstruction. Classifier-free guidance (CFG) is applied only in late timesteps to stabilize inversion and enhance semantic sharpness.
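As a hedged illustration of the solver, a second-order Taylor-corrected Euler step in the spirit of RF-Solver could look like the following; `probe` is an assumed finite-difference step size, not a value from the paper.

```python
import torch

@torch.no_grad()
def taylor_euler_step(velocity_fn, z, t, dt, probe=1e-2):
    """One Taylor-improved Euler step: estimate the total derivative dv/dt
    along the flow with a small probe step, then add the second-order
    curvature correction to the plain Euler update."""
    v = velocity_fn(z, t)
    v_probe = velocity_fn(z + probe * v, t + probe)  # probe along the trajectory
    dv_dt = (v_probe - v) / probe                    # finite-difference dv/dt
    return z + dt * v + 0.5 * (dt ** 2) * dv_dt
```

Dropping the `dv_dt` term recovers the plain Euler step in the loop above; the correction term is what reduces the integration error that would otherwise accumulate over the inversion trajectory.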
Editing via Latent and Key-Value Replacement
During editing, denoising is initialized from the inverted noise. Latent replacement is performed using binary or soft masks to blend the edited and preserved regions. In the ST stage, latents are blended at each denoising step, while in the SLAT stage, features at unedited coordinates are replaced with their inverted counterparts. Key-value replacement in the attention mechanism further enforces feature-level consistency, preventing semantic leakage into preserved regions. All modifications are applied at inference time, without retraining.
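Both replacement operations are simple tensor edits applied inside the denoising loop. The sketch below assumes a (batch, tokens, dim) layout for attention keys/values and is illustrative rather than the paper's exact implementation.

```python
import torch

def blend_latents(z_edit, z_cached, mask):
    """Latent replacement: keep the freshly denoised latent inside the
    editing region (mask = 1) and the cached inverted latent elsewhere.
    `mask` may be binary or a soft weight in [0, 1]."""
    return mask * z_edit + (1.0 - mask) * z_cached

def replace_kv(k, v, k_cached, v_cached, preserved_idx):
    """K/V replacement: overwrite attention keys/values at token positions
    of unedited coordinates with their inverted counterparts, curbing
    semantic leakage into preserved regions."""
    k, v = k.clone(), v.clone()
    k[:, preserved_idx] = k_cached[:, preserved_idx]
    v[:, preserved_idx] = v_cached[:, preserved_idx]
    return k, v
```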
Experimental Results
Quantitative and Qualitative Evaluation
VoxHammer is evaluated on Edit3D-Bench, a human-annotated dataset with labeled 3D editing regions. Metrics cover three aspects: Chamfer Distance (CD) and masked PSNR/SSIM/LPIPS for unedited-region preservation; FID and FVD for overall 3D quality; and DINO-I and CLIP-T for condition alignment.
VoxHammer achieves the best scores across all metrics, with CD of 0.012, PSNR of 41.68, SSIM of 0.994, LPIPS of 0.027, FID of 23.05, and DINO-I of 0.947. These results demonstrate superior preservation of geometry and texture in unedited regions and high overall quality.
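For reference, the preservation metrics restrict the comparison to unedited content. Below is a minimal sketch of masked PSNR and a brute-force Chamfer Distance, assumed here in the symmetric nearest-neighbor form; the benchmark's exact variant may differ.

```python
import torch

def masked_psnr(pred, ref, mask, max_val=1.0):
    """PSNR over preserved (unedited) pixels only; mask = 1 where the
    region must stay intact."""
    mse = ((pred - ref) ** 2)[mask > 0.5].mean()
    return 10.0 * torch.log10(max_val ** 2 / mse)

def chamfer_distance(p, q):
    """Symmetric Chamfer Distance between point sets p (N, 3) and q (M, 3);
    a brute-force reference implementation."""
    d = torch.cdist(p, q)  # (N, M) pairwise Euclidean distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
```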
Figure 3: Qualitative comparisons on Edit3D-Bench show VoxHammer's superior editing precision and coherence.
Ablation Studies
Ablation studies confirm the necessity of two-stage inversion and key-value replacement. Inversion in both ST and SLAT stages yields significant improvements in reconstruction fidelity (CD: 0.0055, PSNR: 39.70, SSIM: 0.987, LPIPS: 0.012). Disabling key-value replacement or reinitializing noise degrades preservation quality and introduces artifacts.
Figure 4: Ablation studies highlight the impact of key-value and latent replacement on editing fidelity.
Figure 5: Inversion in both stages is critical for fine-grained geometry and texture reconstruction.
User Study
A user study with 30 participants shows a strong preference for VoxHammer over Instant3DiT and TRELLIS, with 70.3% favoring its text alignment and 81.2% its overall 3D quality.
Runtime Analysis
VoxHammer edits a 3D asset in approximately 133 seconds, far outpacing optimization-based methods (e.g., Vox-E: 32 min) and remaining competitive with multi-view editing approaches.
Generalization and Applications
VoxHammer generalizes to part-aware object editing, compositional 3D scene editing, and NeRF/3DGS asset editing. The method supports both text-conditioned and image-conditioned editing, with competitive performance in preserving unedited regions and maintaining overall quality.
Figure 6: VoxHammer generalizes to part-aware 3D object, scene, and NeRF/3DGS editing.
Figure 7: Visualization results of text-conditioned 3D editing.
Figure 8: Pipeline for text-conditioned (left) and image-conditioned (right) 3D editing.
Figure 9: More visualization results of image-conditioned 3D editing.
Limitations
VoxHammer's text alignment is suboptimal due to limited captioned 3D datasets, and editing fidelity is bounded by the resolution of the TRELLIS backbone. The rendering phase in 3D encoding remains a bottleneck for interactive use.
Conclusion
VoxHammer establishes a training-free paradigm for precise and coherent 3D local editing by leveraging accurate inversion and contextual feature replacement in the latent space of a pretrained structured 3D diffusion model. The method achieves state-of-the-art consistency and quality, generalizes across asset types, and enables efficient synthesis of paired edited data. Future work should address text-conditioned guidance robustness, backbone resolution, and pipeline efficiency to further advance interactive and in-context 3D generation.