Diffusion-Based Masked Inpainting

Updated 6 April 2026

Diffusion-based masked inpainting is a technique that uses diffusion processes, both classical PDE-based methods and modern denoising diffusion probabilistic models (DDPMs), to restore missing or corrupted regions in images and videos.
The approach leverages mask conditioning by integrating binary or soft masks into the restoration process, ensuring known pixels remain unchanged while generating high-fidelity completions for masked areas.
Applications span image, video, and 3D inpainting, with advances achieving significant speedups, enhanced perceptual quality, and reliable structural consistency across restored regions.

Diffusion-based masked inpainting refers to a family of algorithms for reconstructing missing, corrupted, or intentionally masked regions of images, videos, or higher-dimensional signals via diffusion-type processes, in which the restoration is directly conditioned on a user-specified binary or soft mask. Contemporary approaches include both classical linear PDE-based methods and deep generative models, notably Denoising Diffusion Probabilistic Models (DDPMs) and their variants, which have become the dominant paradigm in recent years for high-fidelity, controllable inpainting under arbitrary mask patterns. This article surveys the mathematical frameworks, neural architectures, optimization strategies, and notable advances specific to masked inpainting with diffusion processes.

1. Mathematical Foundations of Diffusion-Based Inpainting

The core principle in diffusion-based inpainting is the formulation of the inpainting task as a conditional generative process, leveraging either physical diffusion (PDE-based) or learned denoising dynamics (probabilistic diffusion).

Linear Homogeneous Diffusion

In classical schemes, an inpainting mask $K \subset \Omega$ and a grayscale image $f: \Omega \rightarrow \mathbb{R}$ are provided. Homogeneous diffusion inpainting seeks a reconstructed image $u$ that solves:

$(1-c(\mathbf{x})) \Delta u(\mathbf{x}) - c(\mathbf{x})(u(\mathbf{x}) - f(\mathbf{x})) = 0 \quad \mathbf{x}\in\Omega, \quad \partial_{\mathbf{n}}u=0 \text{ on } \partial\Omega$

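Here, $c: \Omega\rightarrow\{0,1\}$ is the indicator function of the known set $K$ ($c=1$ on known pixels, $c=0$ on unknown pixels), so the equation pins $u=f$ on $K$ and requires $\Delta u=0$ elsewhere. The homogeneous Neumann condition $\partial_{\mathbf{n}}u=0$ imposes reflecting boundaries on $\partial\Omega$ (Alt et al., 2021). Discretization yields a sparse linear system that can be solved efficiently, but the density of the mask and the spatial placement of its known pixels critically affect reconstruction quality.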
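As a rough illustration of the discretized problem, the sketch below relaxes the unknown pixels with Jacobi iterations while holding the known pixels fixed. It is a minimal example under stated assumptions, not the solver used by Alt et al.: the function name `homogeneous_diffusion_inpaint`, the iteration count, and the use of edge padding to emulate reflecting boundaries are all choices made here for illustration.

```python
import numpy as np

def homogeneous_diffusion_inpaint(f, c, iters=2000):
    """Approximate a solution of (1 - c) * Laplace(u) - c * (u - f) = 0.

    f : 2D float array with image values (only read where c == 1)
    c : 2D array of 0/1 values, 1 marks known pixels
    Reflecting (Neumann) boundaries are emulated with edge padding.
    """
    f = np.asarray(f, dtype=float)
    known = np.asarray(c).astype(bool)
    # Initialise the unknown pixels with the mean of the known ones.
    u = np.where(known, f, f[known].mean())
    for _ in range(iters):
        p = np.pad(u, 1, mode="edge")
        # Four-neighbour average: the Jacobi update for Laplace(u) = 0.
        avg = 0.25 * (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:])
        # Known pixels act as fixed data; unknown pixels are relaxed.
        u = np.where(known, f, avg)
    return u
```

Jacobi iteration is chosen only for brevity. A sparse direct or multigrid solver would be the more usual choice for images of realistic size.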

Denoising Diffusion Probabilistic Models (DDPMs)

Modern masked inpainting with DDPMs (Ho et al., 2020) represents the data $x_0$ in pixel or latent space and applies a forward noising process:

$q(x_t|x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t I)$

for a scheduler $\{\beta_t\}$, and reverse-time denoising:

$p_\theta(x_{t-1}|x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$

where the network $\epsilon_\theta(x_t, t)$, from which the mean $\mu_\theta$ is computed, predicts the added noise. Mask conditioning is enacted at each denoising step; unmasked (known) pixels retain their original values, while masked pixels are updated by the model (Heidari et al., 2023, Gebre et al., 2024, Li et al., 17 Jan 2025, Xie et al., 21 Jan 2025).
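The forward noising and the per-step mask conditioning described above can be sketched as follows. This is a hedged illustration rather than any cited method's implementation: it assumes the standard noise-prediction mean of Ho et al. (2020), a linear schedule, a fixed variance equal to $\beta_t$, and a RePaint-style rule in which known pixels are replaced by a suitably noised copy of the data at every step. The names `make_schedule`, `q_sample`, `masked_reverse_step`, `inpaint`, and the callable `eps_model` are hypothetical.

```python
import numpy as np

def make_schedule(T=50, beta_min=1e-4, beta_max=0.02):
    """Linear beta schedule and the cumulative products of (1 - beta)."""
    betas = np.linspace(beta_min, beta_max, T)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0), the closed form of the composed Gaussians."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

def masked_reverse_step(x_t, t, x0_known, known, eps_model, betas, alpha_bars, rng):
    """One ancestral step; known pixels are overwritten from the data."""
    beta, a_bar = betas[t], alpha_bars[t]
    eps_hat = eps_model(x_t, t)
    mean = (x_t - beta / np.sqrt(1.0 - a_bar) * eps_hat) / np.sqrt(1.0 - beta)
    noise = rng.standard_normal(x_t.shape) if t > 0 else 0.0
    x_prev_model = mean + np.sqrt(beta) * noise
    # Known region: data noised to level t - 1 (clean data at the last step).
    x_prev_known = q_sample(x0_known, t - 1, alpha_bars, rng) if t > 0 else x0_known
    return np.where(known, x_prev_known, x_prev_model)

def inpaint(x0_known, known, eps_model, T=50, seed=0):
    """Run the full reverse chain from pure noise down to step 0."""
    rng = np.random.default_rng(seed)
    betas, alpha_bars = make_schedule(T)
    x = rng.standard_normal(x0_known.shape)
    for t in range(T - 1, -1, -1):
        x = masked_reverse_step(x, t, x0_known, known, eps_model, betas, alpha_bars, rng)
    return x
```

In practice `eps_model` would be a trained network. Any callable that maps `(x_t, t)` to an array of the same shape is enough to exercise the loop.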

2. Mask Conditioning and Architecture Design

Inpainting Mask Encoding

Best practices in mask conditioning involve concatenating the binary mask (or a soft relaxation of it) with the network input at every reverse step. For instance, DiffGANPaint and ESDiff concatenate the mask with the input images as additional channels, maintaining per-pixel data fidelity by re-imposing known pixels after every denoising update (Heidari et al., 2023, Zhang et al., 24 Apr 2025). HarmonPaint and MTADiffusion further integrate masks into attention modules, differentiating structure and style between masked and unmasked regions (Li et al., 22 Jul 2025, Huang et al., 30 Jun 2025).
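A minimal sketch of channel-wise mask conditioning is given below. The channel layout (noisy state, masked image, mask), the convention that mask value 1 marks a known pixel, and the helper names `build_network_input` and `reimpose_known` are assumptions made for illustration; the cited models may order or encode these inputs differently.

```python
import numpy as np

def build_network_input(x_t, image, mask):
    """Stack noisy state, masked image, and mask along the channel axis.

    x_t, image : arrays of shape (C, H, W)
    mask       : array of shape (1, H, W), 1 = known pixel, 0 = hole
    """
    masked_image = image * mask  # hide the content of the hole
    return np.concatenate([x_t, masked_image, mask], axis=0)

def reimpose_known(x_pred, image, mask):
    """Blend the prediction with the data; soft masks interpolate."""
    return mask * image + (1.0 - mask) * x_pred
```

For a three-channel image this yields a seven-channel network input, and `reimpose_known` reduces to a hard copy of the known pixels when the mask is binary.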

Neural Surrogates and Learned Mask Generators

To accelerate mask selection, Alt et al. propose a mask-generation U-Net that predicts soft mask confidences, which are then binarized by Bernoulli sampling at test time. This enables adaptive, efficient mask generation with substantial speed-up compared to stochastic optimization (Alt et al., 2021).
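The binarization step can be illustrated with a short sketch. It assumes the soft confidences are per-pixel probabilities in $[0,1]$; the optional `density` rescaling and the function name `binarize_mask` are hypothetical additions for illustration and are not taken from Alt et al.

```python
import numpy as np

def binarize_mask(confidence, density=None, seed=0):
    """Sample a binary mask with P(pixel = 1) given by the confidences.

    confidence : array of values in [0, 1]
    density    : optional target fraction of ones; the confidences are
                 rescaled so that their mean matches it before sampling
    """
    rng = np.random.default_rng(seed)
    p = np.clip(np.asarray(confidence, dtype=float), 0.0, 1.0)
    if density is not None and p.mean() > 0:
        p = np.clip(p * density / p.mean(), 0.0, 1.0)
    # Independent Bernoulli draw per pixel.
    return (rng.random(p.shape) < p).astype(np.uint8)
```

Because each pixel is drawn independently, the realized density only matches the target in expectation.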

In higher dimensions (e.g., voxel inpainting), explicit mask channels are concatenated with geometry and color features, and diffusion steps are modulated to propagate mask-specific information (Sumuk, 1 Jan 2026).

3. Optimization Objectives and Loss Functions

Masked diffusion inpainting frameworks rely on a composite suite of losses:

Data fidelity (reconstruction) loss: MSE or BCE between inpainted output $\hat{x}_0$ and original image $x_0$, restricted to the known/masked regions (Alt et al., 2021, Sumuk, 1 Jan 2026, Zhang et al., 24 Apr 2025).
PDE-residual or surrogate loss: Used to enforce approximate satisfaction of the underlying diffusion equations when direct solution is intractable or prohibitive (Alt et al., 2021).
Regularization for mask binarity/variance: Prevents degenerate solutions in mask learning pipelines (Alt et al., 2021).
Perceptual/style consistency loss: Gram-matrix losses in VGG-feature space enforce style coherence between masked/unmasked regions, critical in MTADiffusion and HarmonPaint (Huang et al., 30 Jun 2025, Li et al., 22 Jul 2025).
Edge-prediction and structure loss: Auxiliary heads predict edge maps to improve structural alignment in generated completions (Huang et al., 30 Jun 2025).

Classifier-free guidance, adversarial losses, and perceptual metrics may also be included depending on the application (e.g., text-guided inpainting or GAN acceleration (Heidari et al., 2023, Huang et al., 30 Jun 2025)).
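Two of the losses listed above, the region-restricted reconstruction loss and the Gram-matrix style loss, can be sketched as follows. The normalizations and the function names are assumptions for illustration. In the cited methods the style features come from a pretrained VGG network, whereas here any feature array of shape (C, H, W) stands in for them.

```python
import numpy as np

def masked_mse(pred, target, region):
    """Mean squared error averaged over the pixels where region == 1."""
    region = np.broadcast_to(region, pred.shape).astype(float)
    return float(((pred - target) ** 2 * region).sum() / max(region.sum(), 1.0))

def gram_matrix(features):
    """Map (C, H, W) features to a normalized (C, C) Gram matrix."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.T / (c * h * w)

def style_loss(feat_a, feat_b):
    """Mean squared difference between the two Gram matrices."""
    return float(((gram_matrix(feat_a) - gram_matrix(feat_b)) ** 2).mean())
```

A composite objective would typically be a weighted sum of such terms, with weights chosen per application.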

4. Advanced Masking Strategies and Attention Mechanisms

Self-Attention Masking and Style Transfer

HarmonPaint introduces a Self-Attention Masking Strategy (SAMS) in the encoder, partitioning cross- and within-region interactions in self-attention to segregate structure learning between the masked hole and background. In the decoder, mask-adjusted key/value statistics transfer global style cues from the context into the inpainted region, controlled by a tunable scaling parameter (Li et al., 22 Jul 2025).
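One simple way to partition self-attention by region is an additive bias that blocks attention between hole tokens and background tokens. The sketch below shows that generic mechanism only. It is not HarmonPaint's exact rule, and the names `region_attention_bias` and `masked_self_attention` are hypothetical.

```python
import numpy as np

def region_attention_bias(hole):
    """Build an (N, N) additive bias from per-token hole flags.

    Entries are 0 where query and key lie in the same region and
    -inf where they lie in different regions.
    """
    hole = np.asarray(hole, dtype=bool)
    same = hole[:, None] == hole[None, :]
    return np.where(same, 0.0, -np.inf)

def masked_self_attention(q, k, v, bias):
    """Scaled dot-product attention with an additive bias on the scores."""
    scores = q @ k.T / np.sqrt(q.shape[-1]) + bias
    # Each row keeps at least its own token, so the row maximum is finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

With this bias the hole and the background are processed as two independent token sets; softer variants would scale the cross-region scores instead of removing them.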

Actual-Token Attention Loss (ATAL), proposed in PainterNet, penalizes misalignment between the attention maps of prompt tokens and the masked region, encouraging the model to concentrate the prompt's semantics within the inpainted area (Wang et al., 2024).

Mask Sampling and Diversity

Explicit mask diversity during training and inference (e.g., segmentation, box, or scribble masks) enhances generalizability and performance under real-world editing conditions. Sampling strategies and mask rescaling are applied for density and coverage control (Alt et al., 2021, Wang et al., 2024).
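As an example of coverage-controlled mask sampling, the sketch below draws a random box-shaped hole whose area is close to a requested fraction of the image. The aspect-ratio range and the function name `random_box_mask` are illustrative assumptions. Segmentation or scribble masks would need their own samplers.

```python
import numpy as np

def random_box_mask(height, width, coverage, rng):
    """Return a (height, width) uint8 array with 1 inside a random box.

    The box area is approximately coverage * height * width, and 1 marks
    the hole to be inpainted (known pixels are 1 - mask).
    """
    area = coverage * height * width
    aspect = rng.uniform(0.5, 2.0)  # box height / box width
    bh = int(np.clip(round(np.sqrt(area * aspect)), 1, height))
    bw = int(np.clip(round(area / bh), 1, width))
    top = rng.integers(0, height - bh + 1)
    left = rng.integers(0, width - bw + 1)
    mask = np.zeros((height, width), dtype=np.uint8)
    mask[top:top + bh, left:left + bw] = 1
    return mask
```

Passing a seeded `np.random.default_rng` makes the sampled masks reproducible across training runs.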

For temporal/video inpainting, mask sequences are managed per frame and may be propagated or refined using optical flow (Li et al., 17 Jan 2025, Xie et al., 21 Jan 2025).
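Flow-based mask propagation can be illustrated by warping a mask from one frame to the next. The sketch assumes a dense backward flow stored as (dx, dy) offsets per output pixel and uses nearest-neighbour lookup; the function name `warp_mask` and this flow convention are assumptions, and the cited video methods may handle masks differently.

```python
import numpy as np

def warp_mask(mask, flow):
    """Backward-warp a (H, W) mask with a (H, W, 2) flow field.

    Output pixel (y, x) reads the input mask at (y + dy, x + dx),
    rounded to the nearest pixel and clamped to the image bounds.
    """
    h, w = mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return mask[src_y, src_x]
```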

5. Theoretical Analysis and Guarantees

A rigorous analysis of masked diffusion inpainting, particularly the RePaint algorithm, reveals inherent misalignment biases that can prevent exact recovery of ground-truth samples under projection-based masking. The corrected algorithm, RePaint$^+$, employs rescaled drift and noise updates that provably eliminate this bias and achieve linear convergence, at a rate dictated by the spectral properties of the mask-subspace alignment (Rout et al., 2023). This theoretical insight explains empirical failures of uncorrected schemes and establishes mask-agnostic guarantees for certain classes of inpainting problems.

6. Applications, Quantitative Performance, and Practical Acceleration

Diffusion-based masked inpainting has demonstrated competitive or superior performance across a variety of domains:

Sparse image inpainting and compression: Learned mask generators and neural surrogates can match or nearly match combinatorial baselines while accelerating mask selection by up to four orders of magnitude (Alt et al., 2021).
3D and volumetric inpainting: Joint geometry and color completion in low-resolution voxel grids is enabled by explicit mask-conditioning in 3D U-Nets (Sumuk, 1 Jan 2026, Prabhu et al., 2023).
Few-shot and data-augmentation: Virtual mask encoding, mutual channel perturbations, and masked diffusion synthesis improve downstream classifier generalization and segmentation with limited annotation (Zhang et al., 24 Apr 2025, Jin et al., 2024, Hu et al., 28 Jun 2025).
Medical and scientific imaging: Masked diffusion inpainting mitigates spurious correlations by repopulating backgrounds with labels or styles from target domains, confirmed by domain-transfer metrics (Jin et al., 2024).
Video inpainting: Training-free (VipDiff), prior-augmented (DiffuEraser), and transformer/diffusion hybrids (FFF-VDI) outperform flow-based propagation in spatiotemporal coherence and perceptual FID, with diversity and temporal consistency controlled by mask-aware sampling and optimization (Li et al., 17 Jan 2025, Xie et al., 21 Jan 2025, Lee et al., 2024).

7. Limitations, Extensions, and Future Directions

Despite their versatility, diffusion-based masked inpainting methods exhibit known limitations:

Computational cost remains significant, especially for high-resolution and video inpainting, though GAN surrogates and lighter U-Nets can achieve order-of-magnitude acceleration at a moderate cost in quality (Heidari et al., 2023, Li et al., 22 Jul 2025).
Masked region size and content type can challenge semantic coherency, particularly for large unknowns or atypical context, motivating research on adaptive seed initialization (IS-Diff) and dynamic refinement mechanisms (Lyu et al., 15 Sep 2025, Li et al., 22 Jul 2025).
Full provable guarantees are available principally in linear or subspace regimes; empirical generalization to unstructured real-world data often relies on architectural heuristics and is an active area of research (Rout et al., 2023).
Extending text/image alignment (MTADiffusion, PainterNet) and style harmonization (HarmonPaint) is necessary for prompt-based, user-controllable inpainting in editing and creative applications (Huang et al., 30 Jun 2025, Wang et al., 2024).

Further investigation into learning-free harmonization, robust mask encoding, and efficient conditional sampling continues to drive the field toward more general, controllable, and fast diffusion-based masked inpainting.

Key References

Area	Title	Citation
Mask learning	Learning Sparse Masks for Diffusion-based Image Inpainting	(Alt et al., 2021)
3D inpainting	Mask-Conditioned Voxel Diffusion for Joint Geometry ...	(Sumuk, 1 Jan 2026)
GAN-acceleration	DiffGANPaint: Fast Inpainting Using Denoising Diffusion GANs	(Heidari et al., 2023)
Medical transfer	Masked Medical Image Inpainting with Diffusion Models ...	(Jin et al., 2024)
Harmonization	HarmonPaint: Harmonized Training-Free Diffusion Inpainting	(Li et al., 22 Jul 2025)
Seed selection	IS-Diff: Improving Diffusion-Based Inpainting ...	(Lyu et al., 15 Sep 2025)
Few-shot/data aug	ESDiff: Encoding Strategy-inspired Diffusion Model ...	(Zhang et al., 24 Apr 2025)
Theory/convergence	A Theoretical Justification for Image Inpainting ...	(Rout et al., 2023)
Text/mask alignment	MTADiffusion: Mask Text Alignment Diffusion Model ...	(Huang et al., 30 Jun 2025)
Video inpainting	DiffuEraser: A Diffusion Model for Video Inpainting	(Li et al., 17 Jan 2025)
Training-free video	VipDiff: Towards Coherent and Diverse Video Inpainting ...	(Xie et al., 21 Jan 2025)