- The paper introduces a dual-branch diffusion model that leverages explicit 3D geometric guidance through projected point clouds to ensure robust image completion.
- It employs a target-aware masking strategy and joint self-attention to effectively fuse geometric cues with visual content, mitigating challenges like occlusion and viewpoint variation.
- Experimental results demonstrate a 17.1% PSNR improvement and superior geometric consistency compared to existing methods, validating its effectiveness in reference-driven image editing.
GeoComplete: Geometry-Aware Diffusion for Reference-Driven Image Completion
Introduction
GeoComplete introduces a geometry-aware framework for reference-driven image completion that addresses the limitations of prior generative and geometry-based methods. The approach leverages explicit 3D structural guidance via projected point clouds and a dual-branch diffusion architecture, enabling robust synthesis of missing regions with strong geometric consistency. It is designed to overcome viewpoint variation, occlusion, dynamic content, and differing camera settings, all of which often lead to misaligned or implausible completions in existing approaches.
Figure 1: GeoComplete completes missing regions in a target image using reference images, preserving geometric consistency more effectively than Paint-by-Example.
Methodology
Point Cloud Generation
GeoComplete's pipeline begins with point-cloud generation from the reference and target images. The Visual Geometry Grounded Transformer (VGGT) predicts camera parameters and depth maps in a single forward pass, avoiding the error accumulation typical of multi-stage geometry pipelines. To prevent degradation in dynamic scenes, LangSAM segments and removes dynamic regions using text prompts, which can be user-provided or generated by an LLM. This keeps geometry estimation focused on static content, yielding reliable camera poses and depth.
Figure 2: Point cloud generation pipeline: LangSAM segments dynamic objects, VGGT estimates geometry, and the resulting point cloud is projected onto target and reference views.
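A minimal NumPy sketch of this preprocessing stage is given below. It assumes a pinhole camera model and a camera-to-world (R, t) extrinsics convention, and takes VGGT's depth/pose outputs and LangSAM's dynamic masks as precomputed arrays rather than calling their actual APIs.

```python
import numpy as np

def build_static_cloud(images, depths, intrinsics, extrinsics, dynamic_masks):
    """Fuse per-view VGGT depth maps into one world-space point cloud,
    dropping pixels that LangSAM flagged as dynamic.

    Assumed conventions (not specified at this level in the paper):
      depths[i]        -- (H, W) metric depth
      intrinsics[i]    -- 3x3 pinhole matrix K
      extrinsics[i]    -- (R, t) mapping camera -> world coordinates
      dynamic_masks[i] -- (H, W) bool, True where content is dynamic
    """
    points, colors = [], []
    for img, depth, K, (R, t), dyn in zip(images, depths, intrinsics,
                                          extrinsics, dynamic_masks):
        H, W = depth.shape
        u, v = np.meshgrid(np.arange(W), np.arange(H))  # pixel grids, (H, W)
        keep = (~dyn) & (depth > 0)                     # static pixels, valid depth
        z = depth[keep]
        x = (u[keep] - K[0, 2]) * z / K[0, 0]           # pinhole backprojection
        y = (v[keep] - K[1, 2]) * z / K[1, 1]
        cam = np.stack([x, y, z], axis=-1)              # (N, 3) camera-space points
        points.append(cam @ R.T + t)                    # camera -> world
        colors.append(img[keep])                        # carry RGB for reprojection
    return np.concatenate(points), np.concatenate(colors)
```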
Dual-Branch Diffusion Architecture
The core of GeoComplete is a dual-branch diffusion model. The target branch encodes the masked target image, while the cloud branch encodes the projected point cloud. Both branches are processed in parallel, and joint self-attention fuses their latent features, allowing geometric cues to guide synthesis. The attention mask is constructed to ensure that masked tokens in the target branch can attend to corresponding cloud-branch tokens, facilitating direct geometric guidance even when visual information is absent.
Figure 3: GeoComplete framework overview: point cloud construction, target-aware masking, and dual-branch diffusion with joint self-attention for geometry-guided synthesis.
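To make the fusion concrete, here is a minimal sketch of one way such a joint attention mask could be built. The paper does not spell out the construction at this granularity, so the branch-internal rule and the masked-token rule below are assumptions.

```python
import torch

def joint_attention_mask(masked_target: torch.Tensor, n_cloud: int) -> torch.Tensor:
    """Boolean mask for joint self-attention over [target tokens | cloud tokens].

    masked_target: (n_target,) bool, True where a target token falls in the
    missing region. One plausible construction (assumed, not the paper's exact
    recipe): each branch attends within itself, and masked target tokens
    additionally attend to all cloud tokens to receive geometric guidance.
    """
    n_target = masked_target.numel()
    n = n_target + n_cloud
    allow = torch.zeros(n, n, dtype=torch.bool)
    allow[:n_target, :n_target] = True        # target branch attends to itself
    allow[n_target:, n_target:] = True        # cloud branch attends to itself
    allow[:n_target][masked_target] = True    # masked target tokens attend everywhere
    return allow

# Usage inside an attention layer:
#   scores = scores.masked_fill(~allow, float("-inf"))
```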
Target-Aware Masking
GeoComplete introduces a target-aware masking strategy during training. Informative regions—areas visible in references but missing in the target—are identified via 3D projection. Conditional reference masking randomly occludes these regions in the reference images, while conditional cloud masking applies random padding to redundant regions in the projected point cloud. This encourages the model to learn from complementary content and prevents over-reliance on potentially inaccurate geometry.
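Both masking operations can be sketched as stochastic dropouts over precomputed region maps. The masking rate and fill value below are assumptions, not values from the paper.

```python
import torch

def conditional_reference_masking(ref, informative, p=0.5):
    """Randomly occlude informative regions of a reference image (C, H, W).
    `informative` is an (H, W) bool map of regions visible in the reference
    but missing in the target, found via 3D projection; `p` is an assumed rate."""
    drop = (torch.rand_like(informative, dtype=torch.float) < p) & informative
    out = ref.clone()
    out[:, drop] = 0.0        # assumed fill value for occluded pixels
    return out

def conditional_cloud_masking(cloud_render, redundant, p=0.5):
    """Randomly pad redundant regions of the projected point-cloud image so the
    model does not over-rely on possibly inaccurate geometry."""
    pad = (torch.rand_like(redundant, dtype=torch.float) < p) & redundant
    out = cloud_render.clone()
    out[:, pad] = 0.0
    return out
```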
Experimental Results
Quantitative Evaluation
GeoComplete is evaluated on RealBench and QualBench, challenging benchmarks with significant viewpoint and appearance variation. The method improves PSNR by 17.1% over the best existing methods, with consistent improvements across SSIM, LPIPS, DreamSim, DINO, and CLIP metrics. User studies confirm superior realism, geometric consistency, and visual coherence.
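For context on the headline metric, the sketch below gives the textbook PSNR definition; the reported 17.1% is presumably a relative improvement in this value rather than an absolute dB delta.

```python
import numpy as np

def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 1.0) -> float:
    """PSNR in dB: 10 * log10(MAX^2 / MSE). Higher is better."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)
```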
Qualitative Analysis
GeoComplete consistently reconstructs fine details and maintains scene-level alignment, outperforming generative baselines such as RealFill and Paint-by-Example, which often hallucinate or misalign content due to lack of explicit geometry.
Figure 4: Qualitative comparison: GeoComplete synthesizes missing regions with superior geometric consistency compared to TransFill, RealFill, and Paint-by-Example.
Figure 5: Additional qualitative results: GeoComplete preserves spatial alignment and fine details under large viewpoint changes.
Ablation and Robustness
Ablation studies demonstrate that removing geometric guidance, joint self-attention, or target-aware masking leads to significant drops in all metrics. Robustness experiments show that conditional cloud masking and joint self-attention mitigate the impact of noisy or sparse point clouds and segmentation errors, maintaining strong performance even under degraded upstream predictions.
Implementation Considerations
GeoComplete is implemented atop Stable Diffusion 2 Inpainting and fine-tuned per scene using LoRA with rank 8. The pipeline requires VGGT and LangSAM for geometry and segmentation, with preprocessing overhead under 30 seconds per scene. Training for 2,000 steps (72 minutes on four 24GB GPUs) yields the best results, but promising reconstructions emerge within 500 steps (18 minutes), already outperforming RealFill at the same step count.
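A minimal sketch of the per-scene LoRA setup using diffusers and peft follows. The rank follows the paper; the model id, LoRA alpha, target modules, and learning rate are assumptions, and GeoComplete's dual-branch and point-cloud additions are custom components not shown here.

```python
import torch
from diffusers import StableDiffusionInpaintPipeline
from peft import LoraConfig

# Load the base inpainting model (model id assumed to correspond to
# "Stable Diffusion 2 Inpainting" as named in the paper).
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float32
)

pipe.unet.requires_grad_(False)   # freeze the base UNet; only LoRA trains
lora_config = LoraConfig(
    r=8,                          # rank 8, per the paper
    lora_alpha=8,                 # assumed scaling factor
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # assumed attention targets
)
pipe.unet.add_adapter(lora_config)

# Only the injected LoRA parameters remain trainable.
trainable = [p for p in pipe.unet.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)  # assumed learning rate
```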
Resource Requirements
- Four NVIDIA GPUs (24GB each) for training and inference
- VGGT and LangSAM for geometry and segmentation
- Per-scene fine-tuning (batch size 16, 2,000 steps)
- Preprocessing: <30s per scene
Limitations
Completion quality depends on the accuracy of geometry estimation and the visual quality of reference images. Degradations such as rain, haze, or low-light conditions can adversely affect both geometry and appearance guidance. Conditional cloud masking mitigates, but does not eliminate, the impact of inaccurate point clouds.
Implications and Future Directions
GeoComplete demonstrates that explicit 3D geometric priors are critical for spatially consistent image completion in complex scenes. The dual-branch diffusion architecture and target-aware masking provide a unified solution for leveraging both visual and geometric cues. The framework's robustness to upstream errors and its superior performance suggest that geometry-aware generative models will be central to future advances in reference-driven image editing, occlusion removal, and scene understanding.
Potential future developments include:
- Pre-training LoRA parameters on large-scale datasets for faster per-scene adaptation
- Integration of degradation-robust priors or restoration modules to handle adverse conditions
- Extension to video completion and multi-modal scene synthesis
Conclusion
GeoComplete sets a new standard for reference-driven image completion by explicitly conditioning generative diffusion models on projected 3D geometry and employing a dual-branch architecture with joint self-attention. The method achieves substantial improvements in geometric consistency and perceptual quality over prior approaches, with demonstrated robustness to upstream errors. These results underscore the importance of geometry-aware conditioning in generative models and open avenues for further research in spatially consistent image and scene synthesis.