- The paper introduces Tinker, a framework that produces multi-view consistent 3D edits using sparse inputs without requiring per-scene optimization.
- It employs a large-scale pretrained diffusion model and a newly synthesized multi-view consistent editing dataset to enhance semantic alignment and rendering quality.
- The framework supports efficient one-shot and few-shot editing, with applications in video reconstruction, quality refinement, and scalable scene optimization.
Introduction and Motivation
Tinker presents a unified framework for high-fidelity 3D scene editing that operates in both one-shot and few-shot regimes, eliminating the need for per-scene fine-tuning. The method leverages the latent 3D awareness of large-scale pretrained diffusion models, particularly those based on transformer and flow-matching architectures, to generate multi-view consistent edits from as few as one or two input images. This approach directly addresses the scalability and efficiency bottlenecks of prior 3D editing pipelines, which typically require labor-intensive per-scene optimization to ensure multi-view consistency or to synthesize a sufficient number of edited views for downstream 3D Gaussian Splatting (3DGS) or Neural Radiance Field (NeRF) optimization.
Figure 1: Compared with prior 3D editing approaches, Tinker removes the necessity of labor-intensive per-scene fine-tuning, supports both object-level and scene-level editing, and achieves high-quality results in few-shot and one-shot settings.
Methodology
Multi-View Consistent Editing Pipeline
Tinker’s pipeline consists of two core components:
- Referring Multi-View Editor: This module enables reference-driven edits that remain coherent across all viewpoints. The approach is built upon the observation that large-scale image editing models (e.g., FLUX Kontext) can achieve local multi-view consistency when provided with horizontally concatenated image pairs. However, global consistency across all views is not guaranteed, and the model lacks the ability to propagate edits from a reference view to other views without explicit training.
Figure 2: FLUX Kontext achieves pairwise consistency but fails to propagate edits across all views or perform reference-based editing.
To address this, Tinker introduces a data pipeline that synthesizes a large-scale multi-view consistent editing dataset. The dataset is constructed by generating image pairs from 3D-aware datasets, applying editing prompts via a foundation model, and filtering results using DINOv2 feature similarity to ensure both edit success and inter-view consistency. The model is then fine-tuned using LoRA on reference-based editing tasks, where the input is a concatenation of an unedited image and an edited reference image from a different view.
Figure 3: Data pipeline for multi-view consistent editing: (a) generation and filtering of consistent image pairs, (b) LoRA fine-tuning for reference-based editing.
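To make the filtering step concrete, here is a minimal sketch of the inter-view consistency check using DINOv2 feature similarity. Only the idea of filtering edited pairs by DINOv2 similarity comes from the paper; the model variant (ViT-S/14), the preprocessing, and the threshold value are assumptions, and the complementary edit-success check is omitted.

```python
# Illustrative DINOv2-based consistency filter for edited image pairs.
import torch
import torch.nn.functional as F
from torchvision import transforms
from PIL import Image

# Load a DINOv2 backbone from torch.hub (ViT-S/14 chosen here for speed).
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

preprocess = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def dino_similarity(img_a: Image.Image, img_b: Image.Image) -> float:
    """Cosine similarity between global DINOv2 features of two images."""
    feats = dinov2(torch.stack([preprocess(img_a), preprocess(img_b)]))
    feats = F.normalize(feats, dim=-1)
    return float((feats[0] * feats[1]).sum())

def keep_edited_pair(edited_view_1, edited_view_2, threshold: float = 0.9) -> bool:
    """Keep an edited image pair only if the two edited views remain consistent."""
    return dino_similarity(edited_view_1, edited_view_2) >= threshold
```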
- Any-View-to-Video Synthesizer (Scene Completion Model): To efficiently propagate edits from sparse reference views to a dense set of novel views, Tinker leverages a video diffusion model (WAN2.1) conditioned on depth maps and reference images. The model is trained to reconstruct the original scene from sparse views, with depth serving as a strong geometric constraint that encodes both structure and camera pose. During training, the model receives noisy latent tokens, depth tokens, and reference view tokens, concatenated along the sequence dimension, and is optimized using a flow matching loss.
Figure 4: Scene Completion Model architecture: colored contours indicate shared positional embeddings for reference views and target frames, enabling precise spatial-temporal conditioning.
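The following sketch illustrates the conditioning and training objective described above: noisy video latents, depth tokens, and reference-view tokens are concatenated along the sequence dimension and the model is supervised with a flow matching loss. The tensor shapes, the `transformer` callable, and the rectified-flow interpolation schedule are assumptions; only the token-concatenation and flow-matching ideas come from the paper.

```python
# Schematic training step for the depth- and reference-conditioned scene completion model.
import torch
import torch.nn.functional as F

def flow_matching_step(transformer, x0, depth_tokens, ref_tokens):
    """
    x0:           clean video latents,        shape (B, N_video, D)
    depth_tokens: tokenized depth maps,       shape (B, N_depth, D)
    ref_tokens:   tokenized reference views,  shape (B, N_ref, D)
    """
    B, N, _ = x0.shape
    noise = torch.randn_like(x0)
    t = torch.rand(B, 1, 1, device=x0.device)        # per-sample timestep in [0, 1]

    # Rectified-flow style interpolation between data and noise.
    x_t = (1.0 - t) * x0 + t * noise
    target_velocity = noise - x0

    # Condition by concatenating all token streams along the sequence dimension.
    tokens = torch.cat([x_t, depth_tokens, ref_tokens], dim=1)
    pred = transformer(tokens, t.squeeze(-1).squeeze(-1))  # assumed to predict per-token velocities

    # Supervise only the video-latent positions with the flow matching loss.
    return F.mse_loss(pred[:, :N], target_velocity)
```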
Editing Process
The editing workflow proceeds as follows:
- One or a few input views are edited with the Referring Multi-View Editor, driven by a text prompt and, when available, an edited reference image from another viewpoint.
- The Any-View-to-Video Synthesizer then propagates the sparse edited views to a dense, multi-view consistent set of novel views, conditioned on depth maps of the original scene.
- The dense edited views are finally used to optimize a 3DGS or NeRF representation of the edited scene.
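The sketch below summarizes this workflow at a high level. All function names (`edit_reference_view`, `complete_scene`, `fit_3dgs`) are hypothetical placeholders passed in as callables, not Tinker's actual API.

```python
# High-level sketch of the one-shot editing workflow; the three callables are
# hypothetical stand-ins for the editor, scene completion model, and 3DGS fitter.
def tinker_edit(scene_views, depth_maps, prompt,
                edit_reference_view, complete_scene, fit_3dgs,
                reference_index=0):
    # 1. Edit a single reference view with the Referring Multi-View Editor.
    edited_ref = edit_reference_view(scene_views[reference_index], prompt)

    # 2. Propagate the edit to a dense set of views with the depth-conditioned
    #    Any-View-to-Video Synthesizer (scene completion model).
    edited_views = complete_scene(reference=edited_ref, depths=depth_maps)

    # 3. Optimize a 3DGS (or NeRF) representation from the dense edited views.
    return fit_3dgs(edited_views)
```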
Experimental Results
Comparative Evaluation
Tinker is evaluated against state-of-the-art 3D editing methods (DGE, GaussCtrl, TIP-Editor, EditSplat) on the Mip-NeRF 360 and IN2N datasets. Metrics include CLIP text-image directional similarity (semantic alignment with the edit prompt), DINO similarity (cross-view consistency), and aesthetic score. Tinker achieves superior results in both one-shot and few-shot settings, with higher semantic alignment, cross-view consistency, and rendering quality. Notably, unlike several baselines, Tinker runs efficiently on a single consumer-grade GPU and requires no per-scene fine-tuning.
Figure 6: Qualitative comparisons of novel views across different methods.
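For reference, CLIP text-image directional similarity measures how well the change in image space matches the change described in text space. Below is a minimal, generic reimplementation of this standard metric with Hugging Face CLIP; it is not the paper's evaluation code, and the checkpoint choice is an assumption.

```python
# Generic CLIP text-image directional similarity (standard metric, illustrative only).
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_directional_similarity(img_src, img_edit, caption_src, caption_edit):
    inputs = processor(text=[caption_src, caption_edit],
                       images=[img_src, img_edit],
                       return_tensors="pt", padding=True)
    img_feats = F.normalize(model.get_image_features(pixel_values=inputs["pixel_values"]), dim=-1)
    txt_feats = F.normalize(model.get_text_features(input_ids=inputs["input_ids"],
                                                    attention_mask=inputs["attention_mask"]), dim=-1)
    # Direction of change in image embedding space vs. in text embedding space.
    d_img = F.normalize(img_feats[1] - img_feats[0], dim=-1)
    d_txt = F.normalize(txt_feats[1] - txt_feats[0], dim=-1)
    return float((d_img * d_txt).sum())
```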
Ablation Studies
- Fine-Tuning for Consistency: Fine-tuning the editor with the synthesized dataset substantially improves global consistency (DINO similarity from 0.862 to 0.943) while maintaining text-image alignment and aesthetic quality.
Figure 7: Qualitative comparisons before and after multi-view consistent image editing fine-tuning.
- Image Concatenation: Concatenating more than two images for editing degrades visual quality due to resolution constraints and downsampling. Two-image concatenation yields optimal results.
Figure 8: Effect of the number of horizontally concatenated images on visual quality.
- Depth vs. Ray-Map Conditioning: Conditioning the scene completion model on depth maps yields superior geometric consistency and detail preservation compared to ray-map or frame interpolation approaches.
Figure 9: Qualitative comparisons of different scene completion methods; depth conditioning achieves one-to-one camera pose correspondence and superior geometry.
- Comparison with VACE: Tinker outperforms VACE in both depth-guided video generation and mask-based editing, achieving better multi-view consistency and fine detail preservation.
Figure 10: Comparison with VACE in depth-guided video generation and mask-based editing.
Applications
Beyond editing, the same components enable additional applications, including few-shot video reconstruction and compression, rendering quality refinement, and scalable scene optimization from sparse inputs.
Dataset and Implementation
Tinker introduces the first large-scale multi-view consistent editing dataset, synthesized with the foundation editing model (FLUX Kontext) and covering diverse scenes, weather conditions, lighting, and artistic styles.
Figure 13: Examples from the synthesized multi-view consistent editing dataset.
Editing prompts are generated via a multi-modal LLM.
Figure 14: Input to the multi-modal large language model used to generate editing prompts.
Implementation leverages FLUX Kontext for image editing and WAN2.1 for scene completion, adapted with LoRA fine-tuning; depth maps are estimated with Video Depth Anything. Training is performed on NVIDIA H100 GPUs.
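As an illustration of the LoRA adaptation mentioned above, the sketch below uses the `peft` library. The rank, alpha, and target module names are assumptions (they depend on the exact FLUX Kontext / WAN2.1 implementations) and are not specified in the paper.

```python
# Illustrative LoRA setup with peft; hyperparameters and module names are assumed.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                 # low-rank dimension (assumed)
    lora_alpha=32,        # scaling factor (assumed)
    lora_dropout=0.0,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # attention projections (assumed names)
)

def add_lora_adapters(base_transformer):
    """Wrap a pretrained diffusion transformer with trainable LoRA adapters."""
    return get_peft_model(base_transformer, lora_config)
```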
Limitations and Future Directions
While Tinker significantly lowers the barrier to scalable 3D editing, limitations remain. The synthesized dataset may contain fine detail inconsistencies, and the depth-constrained scene completion model cannot handle large geometric deformations. Future work should address dataset quality and extend the model’s capabilities to more complex edits.
Conclusion
Tinker establishes a general-purpose, scalable framework for multi-view consistent 3D editing from sparse inputs, eliminating the need for per-scene optimization. By bridging advances in 2D diffusion models and 3D scene editing, Tinker enables high-quality object-level and scene-level edits in both one-shot and few-shot settings. The unified approach supports additional tasks such as video reconstruction and compression, and provides a foundation for future research in generalizable, user-friendly 3D content creation.