ObjFiller-3D: Video-Driven 3D Inpainting
- The paper introduces a novel approach that reframes multiview 3D inpainting as a 360° looped video problem to enforce cross-view consistency.
- It leverages pretrained video editing models with LoRA adaptation and utilizes 3D Gaussian Splatting to efficiently reconstruct complete 3D objects.
- The method demonstrates superior reconstruction quality and speed by mitigating texture and geometry inconsistencies compared to traditional 2D inpainting techniques.
ObjFiller-3D is a 3D object inpainting framework that reframes multiview object completion as a video inpainting problem. Given an incomplete 3D object, a 3D masked region, a text description, and optionally a reference image, it seeks to generate a plausible missing part such that the incomplete object and the generated part together form a coherent, realistic 3D object. Its central premise is that the dominant failure mode of prior multiview 3D inpainting pipelines is cross-view inconsistency: independently or weakly coupled 2D inpainting tends to produce contradictory textures and structures across views, which then degrade the reconstructed 3D result into blurred textures, spatial discontinuities, and inaccurate geometry. ObjFiller-3D addresses this by treating ordered renderings of a masked 3D object as a 360° looped video, adapting a pretrained video editing model with LoRA, and reconstructing the completed object with 3D Gaussian Splatting (3DGS) from the resulting temporally coherent frames (Feng et al., 25 Aug 2025).
1. Task formulation and problem setting
The paper defines the task as 3D object inpainting or completion rather than standard text-to-3D generation or simple texture editing. The required output must simultaneously fill missing geometry, restore appearance and texture, maintain cross-view consistency, and remain reconstructable into a faithful 3D representation. The stated practical motivations include cultural heritage restoration, digital reconstruction, and 3D asset editing (Feng et al., 25 Aug 2025).
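Formally, the inputs are an incomplete 3D object $O_{\text{inc}}$, a 3D mask region $\mathcal{M}$, a text prompt $T$, and an optional reference image $I_{\text{ref}}$. The desired completion $O_{\text{miss}}$ satisfies

$$O_{\text{full}} = O_{\text{inc}} \cup O_{\text{miss}},$$

where $O_{\text{full}}$ is a coherent, realistic 3D object.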
The masking protocol follows three types adopted from Instant3dit: Convexhull, where the missing part lies fully inside the mask; Surface, where the mask covers a small surface region; and Volume, where the mask tightly encloses the object (Feng et al., 25 Aug 2025).
The method is motivated by a critique of prior multiview 2D inpainting strategies. Conventional 2D inpainting does not explicitly model cross-view dependencies, treats each image as an isolated completion problem, and can therefore assign different textures, structure, or semantics to the same missing region across views. Even methods that improve consistency through a grid prior still effectively enforce consistency over only four views. ObjFiller-3D argues that this is inadequate for complex objects, dense viewpoint coverage, and non-uniform or detailed textures (Feng et al., 25 Aug 2025). This diagnosis is consistent with NeRFiller’s observation that off-the-shelf 2D inpainting models are usually plausible per view but mutually inconsistent, and with IMFine’s argument that image-level agreement alone does not remove the cross-view drift that later produces blurry or floating artifacts in 3D fusion (Weber et al., 2023, Shi et al., 6 Mar 2025).
The paper’s key conceptual shift is to treat multiview renderings of a static object as a video sequence. Video diffusion and video editing models are trained to maintain inter-frame coherence, shared object identity across frames, appearance persistence, and structure continuity. ObjFiller-3D therefore reinterprets a sequence of rendered views around an object as a 360° looped video and uses a video inpainting prior to enforce multi-view texture coherence, shape continuity, and viewpoint-consistent hallucination (Feng et al., 25 Aug 2025).
2. Representation and end-to-end pipeline
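ObjFiller-3D uses 3D Gaussian Splatting as its final 3D representation. A 3D Gaussian centered at $\mu$ with covariance $\Sigma$ is defined as

$$G(x) = \exp\!\left(-\tfrac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right), \qquad \Sigma = R S S^{\top} R^{\top},$$

where $R$ is the rotation matrix and $S$ is the scaling matrix. Rendering uses volumetric splatting:

$$C = \sum_{i=1}^{N} c_i \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j),$$

where $c_i$ is the color of the $i$-th Gaussian and $\alpha_i$ is its opacity, with the Gaussians ordered front to back along the ray. The paper motivates 3DGS by its efficiency and strong rendering quality (Feng et al., 25 Aug 2025).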
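The compositing step can be illustrated with a minimal sketch in plain Python. It assumes standard front-to-back alpha blending and that the Gaussians covering a pixel have already been projected, sorted by depth, and reduced to one color and one opacity each. The function name `composite_front_to_back` and the sample values are illustrative and are not taken from the paper.

```python
def composite_front_to_back(colors, alphas):
    """Blend per-Gaussian RGB colors for one pixel, nearest Gaussian first."""
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed by nearer Gaussians
    for color, alpha in zip(colors, alphas):
        weight = alpha * transmittance
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha
    return pixel, transmittance


# Illustrative values: a half-transparent red Gaussian in front of an opaque blue one.
pixel, remaining = composite_front_to_back(
    colors=[(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)],
    alphas=[0.5, 1.0],
)
```

Each Gaussian contributes its color weighted by its own opacity and by the transmittance left over from the Gaussians in front of it, which is why ordering matters.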
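The pipeline begins from a masked 3D object and a predefined set of camera poses $\{\pi_i\}_{i=1}^{N}$. The masked object $O_{\text{inc}}$ and the 3D mask $\mathcal{M}$ are rendered into image-mask pairs according to

$$(I_i, M_i) = \mathrm{Render}(O_{\text{inc}}, \mathcal{M};\, \pi_i), \qquad i = 1, \dots, N.$$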
For dataset construction, the paper renders 16 views per object at a fixed resolution, with the elevation fixed at 20° and azimuths uniformly sampled from 0° to 360°. The supplementary Blender setup uses a white background, uniform lighting, the Cycles path tracer, object normalization into a bounded cube, a field of view of 50°, and a camera radius of 2.7 (Feng et al., 25 Aug 2025).
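The camera layout can be sketched as a ring of positions around the object. This is a minimal illustration, assuming a z-up world, cameras looking at the origin, azimuth measured from the +x axis, and the 360° endpoint excluded so that no view is rendered twice. None of these conventions are stated in the source, and `ring_camera_positions` is a hypothetical helper.

```python
import math


def ring_camera_positions(num_views=16, elevation_deg=20.0, radius=2.7):
    """Camera centers on a ring around the origin (assumed z-up convention)."""
    elevation = math.radians(elevation_deg)
    positions = []
    for i in range(num_views):
        # Endpoint excluded: 0 and 360 degrees would be the same viewpoint.
        azimuth = 2.0 * math.pi * i / num_views
        x = radius * math.cos(elevation) * math.cos(azimuth)
        y = radius * math.cos(elevation) * math.sin(azimuth)
        z = radius * math.sin(elevation)
        positions.append((x, y, z))
    return positions


cameras = ring_camera_positions()
```

With 16 views the azimuth step is 22.5°, and every camera sits at the same distance from the object center.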
These 16 views are then converted into a looped video. Because VACE requires $4n+1$ frames, the first image and mask are duplicated as the 17th frame. The resulting sequence is therefore a 360° loop consisting of 16 ordered views plus one duplicate frame, which also enforces continuity at the wrap-around boundary. The sequence of 17 image-mask pairs, together with the text prompt and the optional reference image, is fed into VACE, a pretrained video editing and inpainting model, which outputs a set of inpainted, temporally coherent frames (Feng et al., 25 Aug 2025).
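The loop-closure step amounts to simple sequence bookkeeping, sketched below. The sketch assumes the frame-count constraint has the form 4n + 1, which is consistent with 16 views becoming 17 frames. The file names and the helper `build_looped_sequence` are hypothetical.

```python
def build_looped_sequence(images, masks):
    """Append the first view to the end so the orbit closes on itself."""
    if len(images) != len(masks):
        raise ValueError("each image needs a matching mask")
    frames = list(images) + [images[0]]
    frame_masks = list(masks) + [masks[0]]
    # Assumed constraint: the video backbone accepts 4n + 1 frames.
    if (len(frames) - 1) % 4 != 0:
        raise ValueError(f"{len(frames)} frames is not of the form 4n + 1")
    return frames, frame_masks


views = [f"view_{i:02d}.png" for i in range(16)]
view_masks = [f"mask_{i:02d}.png" for i in range(16)]
frames, frame_masks = build_looped_sequence(views, view_masks)
```

Because the last frame equals the first, the video model sees the wrap-around transition as just another pair of adjacent frames.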
The final stage reconstructs the completed object with 3DGS from these consistent inpainted views. The paper states the reconstruction objective in terms of optimizing a set of Gaussians against the inpainted targets using the standard 3DGS loss function. It does not provide an explicit decomposition of that loss, and it likewise does not introduce a separate explicit cross-view consistency term at reconstruction time; consistency is instead enforced primarily during the inpainting stage (Feng et al., 25 Aug 2025).
3. Adapting video diffusion to multiview 3D inpainting
A central technical contribution is the analysis of the representation gap between natural videos and rendered multiview 3D data. The paper identifies three components of this gap. First, 3D objects and scenes are usually modeled with full 360° coverage, whereas real videos are generally partial and front-facing. Second, video data typically include motion, deformation, occlusion changes, and motion blur, whereas the target 3D inpainting problem concerns mostly static content viewed from different angles. Third, 3D rendering uses uniformly sampled camera viewpoints for structural completeness, while video models are trained on temporally sampled data with non-uniform motion emphasis such as slow-fast encoding (Feng et al., 25 Aug 2025).
ObjFiller-3D addresses this mismatch through lightweight adaptation rather than full retraining. LoRA weights are inserted into every VACE transformer layer; the original transformer parameters are frozen; and only low-rank matrices are learned. The paper specifies
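$$W = W_0 + \Delta W = W_0 + BA, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},$$

where $W_0$ is the frozen pretrained weight and the rank is $r = 32$. This is the paper’s primary mechanism for biasing the pretrained video prior toward the geometry and consistency statistics of rendered multiview object sequences (Feng et al., 25 Aug 2025).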
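A low-rank update of this kind can be sketched in a few lines of plain Python. The example assumes the usual LoRA form, in which a frozen weight is augmented by the product of two small trainable matrices, with the common initialization that sets one factor to zero. The toy dimensions, the `scale` argument, and the helper names are illustrative and do not reflect VACE's actual layer sizes.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w0, b, a, scale=1.0):
    """Return W0 + scale * (B @ A); W0 stays frozen, only A and B are trained."""
    delta = matmul(b, a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]


def trainable_fraction(d_out, d_in, rank):
    """Trainable LoRA parameters relative to the full weight matrix."""
    return rank * (d_out + d_in) / (d_out * d_in)


random.seed(0)
d_out, d_in, rank = 6, 5, 2  # toy sizes for illustration only
w0 = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
b = [[0.0] * rank for _ in range(d_out)]  # zero init: training starts from W0
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
w = lora_effective_weight(w0, b, a)
```

With the zero-initialized factor, the adapted layer reproduces the pretrained layer exactly at the start of training. For a hypothetical 4096 × 4096 layer at rank 32, `trainable_fraction` gives about 1.6% of the full parameter count.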
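The training objective is a flow matching loss. Let $v_\theta(x_t, t \mid F, M, T)$ be the predicted velocity field at time $t$ and state $x_t$, conditioned on frames $F$, masks $M$, and prompt $T$, and let $u_t(x_t)$ denote the target velocity field induced by interpolation between noise and data distributions. The stated loss is

$$\mathcal{L}_{\text{FM}} = \mathbb{E}_{t,\,x_t}\left[\left\| v_\theta(x_t, t \mid F, M, T) - u_t(x_t) \right\|^2\right].$$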
The formatting in the text is imperfect, but the intended training objective is an expected squared error between predicted and target velocity fields (Feng et al., 25 Aug 2025).
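A single-sample version of such an objective can be sketched as follows. The sketch assumes a linear interpolation path between a noise sample and a data sample, for which the target velocity is simply their difference. The source only says that the target is induced by interpolation, so the linear path is an assumption. The conditioning on frames, masks, and prompt is omitted, and the two stand-in "models" are hypothetical.

```python
import random


def flow_matching_loss(predict_velocity, data, noise, t):
    """Mean squared error between predicted and target velocity for one sample."""
    # Assumed linear path from noise (t = 0) to data (t = 1).
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    target = [d - n for n, d in zip(noise, data)]
    predicted = predict_velocity(x_t, t)
    return sum((p - u) ** 2 for p, u in zip(predicted, target)) / len(target)


random.seed(0)
data = [random.gauss(0, 1) for _ in range(8)]
noise = [random.gauss(0, 1) for _ in range(8)]
t = random.random()


def zero_model(x_t, t):
    """Stand-in predictor that always outputs zero velocity."""
    return [0.0] * len(x_t)


def oracle_model(x_t, t):
    """Stand-in predictor that already knows the target velocity."""
    return [d - n for n, d in zip(noise, data)]


loss_zero = flow_matching_loss(zero_model, data, noise, t)
loss_oracle = flow_matching_loss(oracle_model, data, noise, t)
```

In training, the expectation is approximated by averaging this quantity over random times, noise samples, and data samples in a batch.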
The sequence design itself is part of the adaptation. The 16 views are ordered around the object, the first frame is duplicated to close the loop, and reference-based conditioning, when used, is implemented by prepending a reference frame and corresponding zero mask. The paper does not describe additional modules such as explicit cross-frame attention modifications, separate consistency losses, camera trajectory encoders, geometric warping constraints, or latent alignment objectives. Instead, the adaptation is mainly the combination of multiview-to-video sequence construction, loop closure by frame duplication, and LoRA fine-tuning (Feng et al., 25 Aug 2025).
This places ObjFiller-3D in an interesting position relative to earlier multiview diffusion work. NeRFiller exploited the consistency behavior of 2D inpainting diffusion through a $2 \times 2$ grid prior and joint multi-view regrouping over many views, but it did so without fine-tuning the diffusion backbone and still required iterative dataset update for 3D consolidation (Weber et al., 2023). By contrast, ObjFiller-3D adapts a video model directly to 3D inpainting and is explicitly not purely training-free (Feng et al., 25 Aug 2025).
4. Reference-based completion and editing control
The paper introduces a straightforward reference-based 3D inpainting mode. If a reference image is available, it is inserted as the first frame of the video input, and an all-zero mask is inserted as the first mask frame so that this frame is preserved rather than edited. After generation, the auxiliary first frame is discarded. This uses the reference image as an appearance and semantic anchor while leaving the rest of the multiview inpainting setup unchanged (Feng et al., 25 Aug 2025).
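The reference mechanism is mostly sequence bookkeeping, which the following sketch illustrates. It assumes that an all-zero mask means "keep this frame unchanged", as described above. The string placeholders for frames and masks, the helper names, and the echo stand-in for the video model are hypothetical.

```python
def add_reference_frame(frames, masks, reference, keep_mask="all_zero_mask"):
    """Prepend the reference image together with an all-zero (keep) mask."""
    return [reference] + list(frames), [keep_mask] + list(masks)


def drop_reference_frame(generated_frames):
    """Discard the auxiliary first frame after generation."""
    return generated_frames[1:]


frames = ["view_a", "view_b", "view_c"]
masks = ["mask_a", "mask_b", "mask_c"]
video_in, masks_in = add_reference_frame(frames, masks, reference="reference.png")

# Stand-in for the video model: echoes its input so the bookkeeping can be checked.
generated = list(video_in)
completed_views = drop_reference_frame(generated)
```

The rest of the multiview sequence is untouched, so the same inpainting setup works with or without a reference image.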
The paper presents this mechanism as a control interface for cases where unconstrained generation may vary significantly across trials. It is explicitly motivated for historical reconstruction, artifact restoration, and controlled asset editing, where a desired appearance may already be known. The reported benefits are improved alignment with user expectation, appearance control, semantic stability, and consistency with exemplar imagery. The method does not describe retrieval, explicit feature injection, cross-attention from reference features, alignment losses, or identity preservation losses; the reference mechanism is intentionally simple and relies on VACE’s native support for reference-driven video editing (Feng et al., 25 Aug 2025).
The same formulation also supports editing workflows. In Blender, the procedure is to place a 3D geometric mask in the region to edit, provide a prompt, and then use ObjFiller-3D to replace, add, or remove parts. The paper further states that the same basic machinery extends beyond completion to scene inpainting and editing applications (Feng et al., 25 Aug 2025).
A plausible implication is that ObjFiller-3D emphasizes conditioning simplicity over explicit geometric control inside the generative model. That distinguishes it from approaches such as IMFine, which reconstruct missing regions through geometry-guided reference-view completion, warping, and scene-adapted multiview refinement for object removal in reconstructed scenes (Shi et al., 6 Mar 2025). It also distinguishes it from 3DDesigner, whose multiview consistency comes from a NeRF-like coarse prior and two-stream asynchronous diffusion for generation and editing rather than from a video-inpainting formulation (Li et al., 2022).
5. Data, optimization, and implementation
The training data combine the Instant3dit dataset for masked objects and masks, the corresponding complete objects from Objaverse, and captions from Cap3D. The paper states that Instant3dit uses about 7k high-quality Objaverse objects, while ObjFiller-3D trains on 3,000 objects from a reprocessed dataset. For each object, the authors render 16 views at a fixed resolution with the elevation fixed at 20° and azimuths uniformly sampled around the full 360°. The prepared dataset is reported as approximately 18 GB (Feng et al., 25 Aug 2025).
The principal training hyperparameters are explicit. The LoRA rank is 32, the batch size is 4, and training runs for 10 epochs; the paper also specifies a learning rate. For the larger backbone, VACE 14B is trained in half precision, uses around 60 GB of VRAM, and takes about 3 days on a single NVIDIA A800. The supplementary implementation also reports a VACE 1.3B model trained in full precision on an RTX 4090, requiring approximately 20 GB of memory, with 3,000 steps per epoch and roughly 6 hours per epoch (Feng et al., 25 Aug 2025).
At inference time, the method uses the UniPC sampler with 20 inference steps, CFG guidance scale = 4, and LoRA scale = 1. The reported end-to-end reconstruction time is under 10 minutes for ObjFiller-3D, compared with over 40 minutes for NeRFiller. The paper attributes this speed difference largely to the ability to skip NeRFiller’s Iterative Dataset Update (IDU) stage because the video model already produces highly consistent views (Feng et al., 25 Aug 2025, Weber et al., 2023).
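For reference, the reported inference settings can be collected in a small configuration object. This is only a convenience sketch: the class and field names are hypothetical and do not correspond to any actual VACE or ObjFiller-3D interface, while the values are the ones quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceSettings:
    """Reported ObjFiller-3D inference settings (field names are illustrative)."""
    sampler: str = "UniPC"
    num_inference_steps: int = 20
    cfg_guidance_scale: float = 4.0
    lora_scale: float = 1.0
    num_views: int = 16
    elevation_deg: float = 20.0


settings = InferenceSettings()
```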
Prompting is similarly concrete. Object prompts come from Cap3D captions, and the supplementary material lists manual prompts for NeRF synthetic objects and scene experiments, including “A video of a green striped chair,” “A video of drums,” and “A video of LEGO bulldozer model.” For the NeRFiller comparison, the masks are center occlusions in each image (Feng et al., 25 Aug 2025).
6. Empirical results, comparative position, and limitations
The experimental evaluation is organized around two main comparisons. Against Instant3dit, the paper uses 300 held-out objects and evaluates image-grid quality on the same four selected views. Because the original Instant3dit reconstruction code was unavailable, the supplementary material states that InstantMesh was used as a unified reconstruction backend for LPIPS-based consistency evaluation. Against NeRFiller, the paper uses the dataset introduced by NeRFiller, trains with 180 equally spaced images, evaluates on 20 held-out images, and compares final reconstructed objects. The method is also extended to 4 scenes for scene inpainting (Feng et al., 25 Aug 2025, Weber et al., 2023).
| Setting (metrics) | Compared result | ObjFiller-3D result |
|---|---|---|
| Instant3dit comparison (FID ↓ / LPIPS ↓ / CLIP ↑) | Instant3dit: 100.9 / 0.253 / 29.81 | ObjFiller-3D-1.3B: 92.07 / 0.190 / 29.87; ObjFiller-3D-14B: 90.75 / 0.195 / 30.19 |
| NeRFiller comparison (PSNR ↑ / SSIM ↑ / LPIPS ↓) | Masked NeRF: 7.76 / 0.71 / 0.37; SD Image Cond: 14.15 / 0.76 / 0.28; NeRFiller: 15.89 / 0.82 / 0.23 | ObjFiller-3D: 26.62 / 0.93 / 0.07 |
Against Instant3dit, the reported metrics are FID, LPIPS, and CLIP similarity. Both ObjFiller variants improve substantially over Instant3dit in FID and LPIPS, while CLIP is slightly higher. The abstract highlights LPIPS 0.19 vs. Instant3dit 0.25 (Feng et al., 25 Aug 2025).
Against NeRFiller, the paper reports PSNR, SSIM, and LPIPS. On that benchmark, ObjFiller-3D reaches 26.62 PSNR, 0.93 SSIM, and 0.07 LPIPS, compared with 15.89, 0.82, and 0.23 for NeRFiller. The abstract emphasizes PSNR 26.6 vs. 15.9. The supplementary per-object breakdown shows ObjFiller-3D outperforming NeRFiller on all eight NeRF synthetic categories: chair, drums, ficus, hotdog, lego, materials, mic, and ship (Feng et al., 25 Aug 2025).
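To put the PSNR gap in perspective, the standard definition of PSNR can be inverted to compare mean squared errors. The sketch below uses the usual formula, 10·log10(MAX² / MSE), and assumes both methods are scored with the same peak value and that the averaged PSNR figures can be treated as single values, which is only approximately true for a mean over test images. The helper names are illustrative.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB for two equally sized pixel lists."""
    if len(reference) != len(estimate):
        raise ValueError("inputs must have the same length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_from_psnr(psnr_db, max_value=1.0):
    """Invert the PSNR definition to recover the mean squared error."""
    return max_value ** 2 / 10.0 ** (psnr_db / 10.0)


gap_db = 26.62 - 15.89
mse_ratio = mse_from_psnr(15.89) / mse_from_psnr(26.62)
```

Under these assumptions, a gap of roughly 10.7 dB corresponds to a mean squared error that is lower by a factor of about 12.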
The ablations support the method’s principal design claims. Increasing the number of input views from 80 to 140 improves PSNR / SSIM / LPIPS from 22.76 / 0.89 / 0.11 to 26.68 / 0.93 / 0.06, indicating that the method can exploit dense view information rather than being limited by it. The LoRA ablation shows that adaptation matters: for the 1.3B model, removing LoRA changes FID / LPIPS / CLIP from 92.07 / 0.190 / 29.87 to 107.2 / 0.205 / 29.76; for the 14B model, the corresponding change is from 90.75 / 0.195 / 30.19 to 104.8 / 0.219 / 30.19 (Feng et al., 25 Aug 2025).
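The size of the LoRA effect is easier to read as a relative change. The short sketch below recomputes it from the reported numbers; the helper name and the dictionary layout are illustrative, and the values are copied from the ablation described above.

```python
def relative_reduction(without_lora, with_lora):
    """Fractional reduction relative to the no-LoRA value (positive = lower with LoRA)."""
    return (without_lora - with_lora) / without_lora


# (without LoRA, with LoRA) pairs as reported in the ablation.
ablation = {
    "1.3B": {"FID": (107.2, 92.07), "LPIPS": (0.205, 0.190)},
    "14B": {"FID": (104.8, 90.75), "LPIPS": (0.219, 0.195)},
}
reductions = {
    model: {metric: relative_reduction(*pair) for metric, pair in metrics.items()}
    for model, metrics in ablation.items()
}
```

LoRA lowers FID by about 14% for the 1.3B model and about 13% for the 14B model, and lowers LPIPS by about 7% and 11% respectively, while CLIP similarity is essentially unchanged.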
The paper characterizes the main strengths as strong multiview consistency, better reconstruction quality, support for many views rather than only four, faster runtime than iterative 3D editing pipelines, reference-guided control, and extensibility to scene inpainting and object editing. Its explicit limitation is that the method is fundamentally bounded by the capabilities of current video foundation models. It also does not provide a detailed failure-case section, user study, or human preference test, and it does not report training-set generalization splits in depth, explicit robustness to severe out-of-domain objects, quantitative scene-inpainting metrics, or deep analysis of sparse-view settings (Feng et al., 25 Aug 2025).
Several further limitations are stated as implications rather than direct claims. A plausible implication is that reconstruction quality remains dependent on the quality of the generated multiview frames, because the method does not include explicit geometry-aware constraints inside the video model and instead induces consistency through video priors. Another plausible implication is that real-world generalization, while suggested by the paper’s practical deployment claims, remains less thoroughly validated than the rendered object setting on which most experiments are based (Feng et al., 25 Aug 2025). In comparison, IMFine places heavier emphasis on explicit geometry priors, inpainting-mask detection, and scene-specific test-time adaptation for unconstrained scenes, which suggests a complementary rather than identical solution space for 3D inpainting (Shi et al., 6 Mar 2025).
ObjFiller-3D is therefore best understood as a method that relocates the central burden of 3D object completion from independent image inpainting and iterative 3D consolidation to video-consistent multiview generation. Its distinctive contribution is not a new 3D representation, but a cross-domain adaptation of video diffusion priors to multiview 3D object inpainting, combined with a simple reference-based control mechanism and a conventional 3DGS reconstruction backend (Feng et al., 25 Aug 2025).