Stable Diffusion Inpainting
- Stable Diffusion Inpainting is a diffusion-based approach that restores or edits missing image regions using mask-guided, text-conditioned, and reference-enhanced sampling.
- It leverages iterative denoising with region-aware techniques like spatially variant noise schedules and auxiliary priors to ensure high fidelity and efficient restoration.
- Recent advancements integrate multimodal inputs—text, shape, and exemplar images—to achieve precise, semantically coherent inpainting for both images and videos.
Stable Diffusion Inpainting refers to a family of methods, models, and algorithmic frameworks for restoring or editing images by synthesizing plausible content within masked, missing, or modified regions, leveraging diffusion-based generative models. Building on the foundational principles of denoising diffusion probabilistic models (DDPMs), Stable Diffusion Inpainting encompasses a wide ecosystem of architectures, training strategies, conditioning mechanisms, and extensions—including text and shape guidance, exemplar and reference conditioning, precise region control, and both image and video domains.
1. Algorithmic Principles and Conditioned Sampling
The central mechanism of Stable Diffusion Inpainting is a conditional generative process: starting from a noisy version of an image with masked regions, the model iteratively denoises while being constrained to preserve the observed (unmasked) areas and plausibly synthesize the missing content.
In the typical DDPM formalism, a forward diffusion process gradually adds Gaussian noise to an image $x_0$, $q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\big)$. At inference, the model reverses this process, predicting at each step either the clean image $x_0$ or the noise $\epsilon$, conditioned on:
- Known pixels (mask-guided inpainting)
- Optional text prompts (text-to-image guidance)
- Instance-specific priors (e.g., reference images, structure maps, exemplar objects)
Exact conditioning on the observed pixels is intractable in high dimensions, leading to a variety of approximations: “mask overwriting,” resampling strategies, Langevin dynamics on the conditional posterior (e.g., LanPaint (2502.03491)), or explicit constraint steering via tractable probabilistic models (TPMs) (2401.03349).
Conditioning by Mask
Let $m$ be the mask (1 for missing pixels, 0 for known). At each step, the forward and reverse processes are masked so that only missing pixels receive noise and are synthesized, while known pixels are kept fixed: $x_{t-1} = m \odot x_{t-1}^{\text{gen}} + (1-m) \odot x_{t-1}^{\text{known}}$, where $x_{t-1}^{\text{gen}}$ is drawn from the learned reverse step and $x_{t-1}^{\text{known}}$ is the known image noised to level $t-1$.
Such conditioning can be implemented exactly via region-aware scheduling (RAD, 2412.09191), or approximately via projection and resampling (RePaint) or MCMC/Langevin refinement (LanPaint, 2502.03491).
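A minimal sketch of this masked conditioning ("mask overwriting", RePaint-style approximate projection) for a pixel-space DDPM is shown below. The linear beta schedule and the `eps_model(x_t, t)` noise-predictor interface are illustrative assumptions; latent-diffusion variants apply the same combination step to latents rather than pixels.

```python
import torch

# Illustrative linear schedule (an assumption; Stable Diffusion uses its own).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
abar = torch.cumprod(alphas, dim=0)  # \bar{alpha}_t

@torch.no_grad()
def inpaint(eps_model, x0_known, mask):
    """x0_known: (B,C,H,W) image (hole contents arbitrary); mask: (B,1,H,W), 1 = missing."""
    x_t = torch.randn_like(x0_known)
    for t in reversed(range(T)):
        t_batch = torch.full((x_t.shape[0],), t, dtype=torch.long)
        eps = eps_model(x_t, t_batch)  # assumed noise-prediction interface
        # Standard DDPM reverse step for the generated (masked) content.
        mean = (x_t - betas[t] / (1 - abar[t]).sqrt() * eps) / alphas[t].sqrt()
        noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
        x_gen = mean + betas[t].sqrt() * noise
        # Re-noise the known image to level t-1 and overwrite the known pixels.
        abar_prev = abar[t - 1] if t > 0 else torch.tensor(1.0)
        x_known = abar_prev.sqrt() * x0_known + (1 - abar_prev).sqrt() * torch.randn_like(x0_known)
        x_t = mask * x_gen + (1 - mask) * x_known
    return x_t
```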
2. Architectural Extensions and Conditioning Modalities
Mask-Guided and Region-Aware Design
Modern methods recognize limitations of the canonical pipeline, such as leakage of generative signals into known contexts or inefficiency due to nested iterative procedures (as in RePaint). Recent advances redesign the process:
- Spatially Variant Noise Schedules: RAD (2412.09191) assigns each pixel its own diffusion schedule, ensuring that only masked regions are synthesized while known regions are preserved, resulting in large efficiency improvements and enabling asynchronous region generation and localized control (see the sketch after this list).
- Structural Guidance: By utilizing local structure tensors or anisotropic splatting (2412.01682), or explicitly modeling sparse structure (edges, feature maps) (2403.19898), models enforce coherence along edges and textures, improving inpainting quality especially for large, complex masks.
- Auxiliary Guidance Inputs: Innovations such as ASUKA (2312.04831) employ masked autoencoders (MAE) as prior modules, aligning inpainting predictions with visible context, and specialized decoders for seamless pixel transitions at mask boundaries.
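Referring to the spatially variant schedule above, the following is a minimal sketch of the per-pixel noising idea under a strong simplifying assumption (known pixels receive no noise at all); it illustrates the general concept rather than RAD's exact formulation.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # illustrative global schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # \bar{alpha}_t

def spatially_variant_noise(x0, mask, t):
    """x0: (B,C,H,W) image; mask: (B,1,H,W), 1 = missing; t: (B,) long tensor."""
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    # Per-pixel effective \bar{alpha}: full schedule inside the mask,
    # identity (no noise) outside, so only the hole is ever diffused.
    abar_pix = mask * abar + (1.0 - mask)
    noise = torch.randn_like(x0)
    x_t = abar_pix.sqrt() * x0 + (1.0 - abar_pix).sqrt() * noise
    return x_t, noise
```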
Multimodal Conditioning: Text, Shape, Reference
- Textual Guidance: Text-conditioned inpainting is enabled via CLIP or Transformer-based text encoders. SmartBrush (2212.05034), Diffree (2407.16982), and Uni-paint (2310.07222) accept prompts describing the desired content, with or without masks (see the usage sketch after this list).
- Shape or Mask Guidance: Explicit masks (binary or soft/precision-controlled) offer pixel-accurate spatial guidance (SmartBrush (2212.05034)), while Diffree (2407.16982) introduces mask-free object placement by learning object localization from text alone.
- Reference or Exemplar Guidance: Reference-based inpainting leverages cross-image attention or correspondence estimation. CorrFill (2501.02355) dynamically extracts patch-level correspondences from attention maps, guiding denoising to ensure geometrically and semantically faithful completions. Probabilistic Circuits (2401.03349) can systematically enforce region-based or semantic constraints.
- Freeform Modalities: Uni-paint (2310.07222) integrates multiple guidance signals (text, strokes, exemplar images) via cross- and self-attention masking, supporting combinations unattainable in single-modal architectures.
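For the text- and mask-guided setting above, a minimal usage sketch with the Hugging Face diffusers `StableDiffusionInpaintPipeline` is shown below; the model ID, file names, and sampler settings are illustrative choices, not tied to any specific paper cited here.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("scene.png").convert("RGB").resize((512, 512))
mask_image = Image.open("mask.png").convert("L").resize((512, 512))  # white = region to inpaint

result = pipe(
    prompt="a red vintage car parked on the street",
    image=init_image,
    mask_image=mask_image,
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
result.save("inpainted.png")
```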
3. Performance, Evaluation, and Limitations
Quantitative and Qualitative Benchmarks
Performance of Stable Diffusion Inpainting variants is measured using:
- Perceptual Quality: FID, LPIPS, NIMA
- Fidelity to Prompt/Reference: CLIP score, GPT-4V score, PSNR, SSIM
- Human Studies: User preference for realism, alignment, spatial consistency
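The fidelity metrics above can be computed with standard open-source packages; a minimal sketch follows (the package and data-format choices are assumptions, not those of any cited benchmark).

```python
import torch
import lpips  # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(pred, target):
    """pred, target: float numpy arrays in [0, 1], shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(target, pred, data_range=1.0)
    ssim = structural_similarity(target, pred, channel_axis=-1, data_range=1.0)
    # LPIPS expects NCHW torch tensors scaled to [-1, 1].
    to_t = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() * 2 - 1
    lp = lpips.LPIPS(net="alex")(to_t(pred), to_t(target)).item()
    return {"PSNR": psnr, "SSIM": ssim, "LPIPS": lp}
```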
Results consistently show that region-aware (2412.09191), structurally guided (2403.19898, 2412.01682), and personalized/fine-tuned models (2502.14940) outperform prior baselines, especially on large or complex masks, and provide improvements in both faithfulness to input guidance and boundary integration.
Efficiency
RAD (2412.09191) achieves orders-of-magnitude speedup over nested-loop approaches (e.g., RePaint), reducing inference to a plain, single-pass reverse process. LanPaint (2502.03491) is compatible with fast ODE-based samplers, requiring only a handful of inner Langevin steps per timestep, and does not require retraining.
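As a rough illustration of inner Langevin refinement at a fixed timestep (a generic sketch under simple assumptions, not the LanPaint algorithm itself), the score derived from the noise predictor can be combined with a data-consistency pull toward the noised known pixels:

```python
import torch

def langevin_refine(x_t, t, eps_model, x_known_t, mask, abar_t,
                    n_steps=5, step_size=1e-3, guidance=1.0):
    """x_t: current sample; mask: 1 = missing; x_known_t: known image noised to level t;
    abar_t: scalar \\bar{alpha}_t; eps_model: assumed noise-prediction interface."""
    sigma_t = (1.0 - abar_t) ** 0.5
    for _ in range(n_steps):
        # Approximate score at level t: grad log p(x_t) ~= -eps / sigma_t.
        score = -eps_model(x_t, t) / sigma_t
        # Data-consistency term: keep known pixels near their observed (noised) values.
        score = score - guidance * (1.0 - mask) * (x_t - x_known_t)
        z = torch.randn_like(x_t)
        x_t = x_t + 0.5 * step_size * score + (step_size ** 0.5) * z
    return x_t
```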
Limitations
- Recursive Instability: Recursive application of inpainting (RIP) leads to progressive degradation and image collapse (2407.09549), with severity depending on mask size, image content, and iteration count. This highlights the lack of long-term stability when repeatedly chaining inpainting in inference.
- Boundary Artifacts and Hallucination: Naive generative processes can introduce color or structural inconsistencies at mask borders, or hallucinated objects unrelated to context—addressed by specialized decoders (2312.04831), auxiliary priors, and explicit alignment modules.
- Semantic Discrepancy: In conventional diffusion, mismatches in semantic density between masked and unmasked regions can cause structural seams (2403.19898). Time-dependent guidance, multi-scale structure conditioning, and resampling strategies counteract this effect.
- Generalization and Faithfulness: Vanilla inpainting models may not accurately reconstruct unseen or domain-specific content. Domain personalization (e.g., FacaDiffy (2502.14940)) and reference-based attention (CorrFill (2501.02355)) are crucial in such cases.
4. Specialized and Emerging Domains
Video Inpainting
Recent methods extend Stable Diffusion Inpainting to the video domain, tackling temporal consistency and propagation:
- Latent and Temporal Attention Integration: DiffuEraser (2501.10018) and FFF-VDI (2408.11402) combine strong spatial and temporal modeling with prior initialization, DDIM inversion, and deformable alignment to ensure per-frame and cross-frame coherence, outperforming flow- or transformer-based video inpainting in both perceptual and temporal metrics.
- Function-Space Models: Warped Diffusion (2410.16152) adapts SDXL to video sequences by leveraging optical flow to warp noise realizations across frames and applying test-time equivariance self-guidance to align output frames, yielding temporally consistent inpainting and super-resolution.
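A minimal sketch of flow-guided noise warping for temporal consistency is given below, assuming a backward optical flow field (current frame to previous frame) is available; it illustrates the general idea rather than the Warped Diffusion method itself.

```python
import torch
import torch.nn.functional as F

def warp_noise(prev_noise, flow):
    """prev_noise: (B,C,H,W) noise of the previous frame;
    flow: (B,2,H,W) backward flow in pixels, channels = (dx, dy)."""
    B, _, H, W = prev_noise.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(prev_noise)  # (2,H,W), order (x, y)
    coords = base.unsqueeze(0) + flow                           # per-pixel sampling locations
    # Normalize to [-1, 1] as required by grid_sample (x first, then y).
    gx = 2.0 * coords[:, 0] / (W - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (H - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                        # (B,H,W,2)
    return F.grid_sample(prev_noise, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```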
Domain-Specific and Structural Inpainting
- Urban/modeling domains: FacaDiffy (2502.14940) demonstrates personalized, domain-specific Stable Diffusion fine-tuned on synthetic conflict maps, enabling facade completion in 3D city modeling pipelines with large downstream impact on semantic reconstruction tasks.
- Image Forensics: InpDiffusion (2501.02816) reframes inpainting localization as mask generation via conditional diffusion, combining semantic and edge supervision to robustly detect tampered regions, outperforming discriminative approaches and maintaining resilience to adversarial distortions.
5. Implementation and Practical Considerations
Model Components and Typical Pipelines
- VQGAN/Autoencoder Backbone: Stable Diffusion operates in latent space, with asymmetric VQGAN decoders (2306.04632) improving preservation of non-masked regions and overall image fidelity.
- Conditioning strategies: Masks, text, reference images, strokes, structure maps, spatial scheduling—fed as separate channels, attention masks, or input latents.
- Loss Functions: Combination of noise prediction (L2), segmentation (Dice), perceptual (VGG/LPIPS), adversarial (GAN), and harmonization (color/latent augmentation) losses, tailored for pixel and semantic fidelity.
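A minimal sketch of how such losses are commonly combined is shown below; the specific terms and weights are illustrative assumptions rather than any cited paper's recipe.

```python
import torch.nn.functional as F
import lpips  # pip install lpips

perceptual = lpips.LPIPS(net="vgg")

def inpainting_loss(eps_pred, eps_true, x0_pred, x0_true, w_perc=0.1):
    l_noise = F.mse_loss(eps_pred, eps_true)      # standard DDPM noise-prediction (L2) term
    l_perc = perceptual(x0_pred, x0_true).mean()  # perceptual term, images scaled to [-1, 1]
    return l_noise + w_perc * l_perc
```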
Computational Resources
- Training: Fine-tuning (personalization, domain adaptation) is feasible with moderate GPU resources (e.g., 8× V100 over several days for VQGAN decoder retraining (2306.04632)).
- Inference: State-of-the-art models (e.g., RAD, LanPaint) support batch and fast ODE-based samplers, yielding inference times orders of magnitude lower than iterative DDPM models; video inpainting models exploit batched spatial-temporal attention and can scale efficiently across frames.
Open-Source and Adoption
Many of the latest architectures, including asymmetric VQGAN (2306.04632), LanPaint (2502.03491), Uni-paint (2310.07222), and FacaDiffy (2502.14940), provide public repositories and implementation scripts, facilitating integration into research and applied workflows.
6. Applications, Limitations, and Prospects
Applications
- Photo and video restoration, editing, object removal/addition, and background extension
- Interactive content creation with complex multimodal (text, sketch, exemplar) input
- Medical imaging, remote sensing, and urban 3D modeling for robust, structure-aware restoration
- Image forensics and tampering localization
Limitations and Current Challenges
- Recursive instability and model collapse under repeated inference scenarios (2407.09549)
- Complex domain adaptation for highly structured or domain-specific content
- Semantic faithfulness and precise spatial control, especially in mask-free text-controlled pipelines
Future Directions
- Unified multimodal & region-aware frameworks: Enabling fast, stable, and controllable inpainting under arbitrary user-specified modalities—text, mask, reference, and structure cues—across both image and video domains.
- Scalable and tractable probabilistic guidance: Developing tractable TPMs and more efficient conditional samplers for pixel- or latent-space control (2401.03349).
- Robustness and theoretical analysis: Quantifying stability under recursive operations and developing principled methodologies to mitigate collapse.
- Broader generalization: Leveraging synthetic data for domain adaptation and combining generative and reconstructive paradigms (cf. ASUKA).
Representative Methods

| Model/Method | Key Innovation | Notable Capability |
|---|---|---|
| RAD (2412.09191) | Per-pixel schedules | 100× speedup, SoTA quality |
| LanPaint (2502.03491) | Bidirectional Langevin | ODE samplers, training-free, exact inference |
| SmartBrush (2212.05034) | Mask precision & text | Shape-aligned object inpainting |
| Uni-paint (2310.07222) | Multimodal guidance | Arbitrary (text, sketch, ref) fusion |
| CorrFill (2501.02355) | Patch correspondence | Faithfulness to reference image |
| FacaDiffy (2502.14940) | Domain personalization | Structured mask inpainting for 3D city modeling |
| Diffree (2407.16982) | Text-only addition, mask prediction | Seamless object addition w/o user masks |
| ASUKA (2312.04831) | MAE prior, inpaint decoder | Context/visual consistency |
| StrDiffusion (2403.19898) | Time-dependent structure | Semantic consistency across masks |
Stable Diffusion Inpainting represents the convergence of generative diffusion models, conditioning strategies, architecture innovation, efficient sampling, and multimodal input handling—enabling practical, high-quality, and flexible image and video editing applicable across scientific, creative, and industrial domains.