- The paper introduces Tinker, a framework that produces multi-view consistent 3D edits using sparse inputs without requiring per-scene optimization.
- It employs a large-scale pretrained diffusion model and a newly synthesized multi-view consistent editing dataset to enhance semantic alignment and rendering quality.
- The framework supports efficient one-shot and few-shot editing, with applications in video reconstruction, quality refinement, and scalable scene optimization.
Introduction and Motivation
Tinker presents a unified framework for high-fidelity 3D scene editing that operates in both one-shot and few-shot regimes, eliminating the need for per-scene fine-tuning. The method leverages the latent 3D awareness of large-scale pretrained diffusion models, particularly those based on transformer and flow-matching architectures, to generate multi-view consistent edits from as few as one or two input images. This approach directly addresses the scalability and efficiency bottlenecks of prior 3D editing pipelines, which typically require labor-intensive per-scene optimization to ensure multi-view consistency or to synthesize a sufficient number of edited views for downstream 3D Gaussian Splatting (3DGS) or Neural Radiance Field (NeRF) optimization.
Figure 1: Compared with prior 3D editing approaches, Tinker removes the necessity of labor-intensive per-scene fine-tuning, supports both object-level and scene-level editing, and achieves high-quality results in few-shot and one-shot settings.
Methodology
Multi-View Consistent Editing Pipeline
Tinker’s pipeline consists of two core components:
- Referring Multi-View Editor: This module enables reference-driven edits that remain coherent across all viewpoints. The approach is built upon the observation that large-scale image editing models (e.g., FLUX Kontext) can achieve local multi-view consistency when provided with horizontally concatenated image pairs. However, global consistency across all views is not guaranteed, and the model lacks the ability to propagate edits from a reference view to other views without explicit training.
Figure 2: FLUX Kontext achieves pairwise consistency but fails to propagate edits across all views or perform reference-based editing.
To address this, Tinker introduces a data pipeline that synthesizes a large-scale multi-view consistent editing dataset. The dataset is constructed by generating image pairs from 3D-aware datasets, applying editing prompts via a foundation model, and filtering results using DINOv2 feature similarity to ensure both edit success and inter-view consistency. The model is then fine-tuned using LoRA on reference-based editing tasks, where the input is a concatenation of an unedited image and an edited reference image from a different view.
Figure 3: Data pipeline for multi-view consistent editing: (a) generation and filtering of consistent image pairs, (b) LoRA fine-tuning for reference-based editing.
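To make the filtering step concrete, here is a minimal sketch of the inter-view consistency check using DINOv2 feature similarity. Only the idea of filtering edited pairs by DINOv2 similarity comes from the paper; the model variant (ViT-S/14), the preprocessing, and the threshold value are assumptions, and the complementary edit-success check is omitted.

```python
# Illustrative DINOv2-based consistency filter for edited image pairs.
import torch
import torch.nn.functional as F
from torchvision import transforms
from PIL import Image

# Load a DINOv2 backbone from torch.hub (ViT-S/14 chosen here for speed).
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

preprocess = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def dino_similarity(img_a: Image.Image, img_b: Image.Image) -> float:
    """Cosine similarity between global DINOv2 features of two images."""
    feats = dinov2(torch.stack([preprocess(img_a), preprocess(img_b)]))
    feats = F.normalize(feats, dim=-1)
    return float((feats[0] * feats[1]).sum())

def keep_edited_pair(edited_view_1, edited_view_2, threshold: float = 0.9) -> bool:
    """Keep an edited image pair only if the two edited views remain consistent."""
    return dino_similarity(edited_view_1, edited_view_2) >= threshold
```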
- Any-View-to-Video Synthesizer (Scene Completion Model): To efficiently propagate edits from sparse reference views to a dense set of novel views, Tinker leverages a video diffusion model (WAN2.1) conditioned on depth maps and reference images. The model is trained to reconstruct the original scene from sparse views, with depth serving as a strong geometric constraint that encodes both structure and camera pose. During training, the model receives noisy latent tokens, depth tokens, and reference view tokens, concatenated along the sequence dimension, and is optimized using a flow matching loss.
Figure 4: Scene Completion Model architecture: colored contours indicate shared positional embeddings for reference views and target frames, enabling precise spatial-temporal conditioning.
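The following sketch illustrates the conditioning and training objective described above: noisy video latents, depth tokens, and reference-view tokens are concatenated along the sequence dimension and the model is supervised with a flow matching loss. The tensor shapes, the `transformer` callable, and the rectified-flow interpolation schedule are assumptions; only the token-concatenation and flow-matching ideas come from the paper.

```python
# Schematic training step for the depth- and reference-conditioned scene completion model.
import torch
import torch.nn.functional as F

def flow_matching_step(transformer, x0, depth_tokens, ref_tokens):
    """
    x0:           clean video latents,        shape (B, N_video, D)
    depth_tokens: tokenized depth maps,       shape (B, N_depth, D)
    ref_tokens:   tokenized reference views,  shape (B, N_ref, D)
    """
    B, N, _ = x0.shape
    noise = torch.randn_like(x0)
    t = torch.rand(B, 1, 1, device=x0.device)        # per-sample timestep in [0, 1]

    # Rectified-flow style interpolation between data and noise.
    x_t = (1.0 - t) * x0 + t * noise
    target_velocity = noise - x0

    # Condition by concatenating all token streams along the sequence dimension.
    tokens = torch.cat([x_t, depth_tokens, ref_tokens], dim=1)
    pred = transformer(tokens, t.squeeze(-1).squeeze(-1))  # assumed to predict per-token velocities

    # Supervise only the video-latent positions with the flow matching loss.
    return F.mse_loss(pred[:, :N], target_velocity)
```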
Editing Process
The editing workflow proceeds as follows:
- One or a few input views are edited with the Referring Multi-View Editor, driven by a text prompt and, when available, an edited reference image from another viewpoint.
- The Any-View-to-Video Synthesizer then propagates the sparse edited views to a dense, multi-view consistent set of novel views, conditioned on depth maps of the original scene.
- The dense edited views are finally used to optimize a 3DGS or NeRF representation of the edited scene.
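The sketch below summarizes this workflow at a high level. All function names (`edit_reference_view`, `complete_scene`, `fit_3dgs`) are hypothetical placeholders passed in as callables, not Tinker's actual API.

```python
# High-level sketch of the one-shot editing workflow; the three callables are
# hypothetical stand-ins for the editor, scene completion model, and 3DGS fitter.
def tinker_edit(scene_views, depth_maps, prompt,
                edit_reference_view, complete_scene, fit_3dgs,
                reference_index=0):
    # 1. Edit a single reference view with the Referring Multi-View Editor.
    edited_ref = edit_reference_view(scene_views[reference_index], prompt)

    # 2. Propagate the edit to a dense set of views with the depth-conditioned
    #    Any-View-to-Video Synthesizer (scene completion model).
    edited_views = complete_scene(reference=edited_ref, depths=depth_maps)

    # 3. Optimize a 3DGS (or NeRF) representation from the dense edited views.
    return fit_3dgs(edited_views)
```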
Experimental Results
Comparative Evaluation
Tinker is evaluated against state-of-the-art 3D editing methods (DGE, GaussCtrl, TIP-Editor, EditSplat) on the Mip-NeRF 360 and IN2N datasets. Metrics include CLIP text-image directional similarity (semantic alignment with the edit prompt), DINO similarity (cross-view consistency), and aesthetic score. Tinker achieves superior results in both one-shot and few-shot settings, with higher semantic alignment, cross-view consistency, and rendering quality. Notably, unlike several baselines, Tinker runs efficiently on a single consumer-grade GPU and requires no per-scene fine-tuning.
Figure 6: Qualitative comparisons of novel views across different methods.
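For reference, CLIP text-image directional similarity measures how well the change in image space matches the change described in text space. Below is a minimal, generic reimplementation of this standard metric with Hugging Face CLIP; it is not the paper's evaluation code, and the checkpoint choice is an assumption.

```python
# Generic CLIP text-image directional similarity (standard metric, illustrative only).
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_directional_similarity(img_src, img_edit, caption_src, caption_edit):
    inputs = processor(text=[caption_src, caption_edit],
                       images=[img_src, img_edit],
                       return_tensors="pt", padding=True)
    img_feats = F.normalize(model.get_image_features(pixel_values=inputs["pixel_values"]), dim=-1)
    txt_feats = F.normalize(model.get_text_features(input_ids=inputs["input_ids"],
                                                    attention_mask=inputs["attention_mask"]), dim=-1)
    # Direction of change in image embedding space vs. in text embedding space.
    d_img = F.normalize(img_feats[1] - img_feats[0], dim=-1)
    d_txt = F.normalize(txt_feats[1] - txt_feats[0], dim=-1)
    return float((d_img * d_txt).sum())
```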
Ablation Studies
- Fine-Tuning for Consistency: Fine-tuning the editor with the synthesized dataset substantially improves global consistency (DINO similarity from 0.862 to 0.943) while maintaining text-image alignment and aesthetic quality.
Figure 7: Qualitative comparisons before and after multi-view consistent image editing fine-tuning.
- Image Concatenation: Concatenating more than two images for editing degrades visual quality due to resolution constraints and downsampling. Two-image concatenation yields optimal results.
Figure 8: Effect of the number of horizontally concatenated images on visual quality.
- Depth vs. Ray-Map Conditioning: Conditioning the scene completion model on depth maps yields superior geometric consistency and detail preservation compared to ray-map or frame interpolation approaches.
Figure 9: Qualitative comparisons of different scene completion methods; depth conditioning achieves one-to-one camera pose correspondence and superior geometry.
- Comparison with VACE: Tinker outperforms VACE in both depth-guided video generation and mask-based editing, achieving better multi-view consistency and fine detail preservation.
Figure 10: Comparison with VACE in depth-guided video generation and mask-based editing.
Applications
Beyond editing, the same components enable additional applications, including few-shot video reconstruction and compression, rendering quality refinement, and scalable scene optimization from sparse inputs.
Dataset and Implementation
Tinker introduces the first large-scale multi-view consistent editing dataset, synthesized with the foundation editing model (FLUX Kontext) and covering diverse scenes, weather conditions, lighting, and artistic styles.
Figure 13: Examples from the synthesized multi-view consistent editing dataset.
Editing prompts are generated via a multi-modal LLM.
Figure 14: Input to the multi-modal large language model used to generate editing prompts.
Implementation leverages FLUX Kontext for image editing and WAN2.1 for scene completion, adapted with LoRA fine-tuning; depth maps are estimated with Video Depth Anything. Training is performed on NVIDIA H100 GPUs.
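As an illustration of the LoRA adaptation mentioned above, the sketch below uses the `peft` library. The rank, alpha, and target module names are assumptions (they depend on the exact FLUX Kontext / WAN2.1 implementations) and are not specified in the paper.

```python
# Illustrative LoRA setup with peft; hyperparameters and module names are assumed.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                 # low-rank dimension (assumed)
    lora_alpha=32,        # scaling factor (assumed)
    lora_dropout=0.0,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # attention projections (assumed names)
)

def add_lora_adapters(base_transformer):
    """Wrap a pretrained diffusion transformer with trainable LoRA adapters."""
    return get_peft_model(base_transformer, lora_config)
```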
Limitations and Future Directions
While Tinker significantly lowers the barrier to scalable 3D editing, limitations remain. The synthesized dataset may contain fine detail inconsistencies, and the depth-constrained scene completion model cannot handle large geometric deformations. Future work should address dataset quality and extend the model’s capabilities to more complex edits.
Conclusion
Tinker establishes a general-purpose, scalable framework for multi-view consistent 3D editing from sparse inputs, eliminating the need for per-scene optimization. By bridging advances in 2D diffusion models and 3D scene editing, Tinker enables high-quality object-level and scene-level edits in both one-shot and few-shot settings. The unified approach supports additional tasks such as video reconstruction and compression, and provides a foundation for future research in generalizable, user-friendly 3D content creation.