Training-Free Controllable Inpainting

Updated 22 November 2025
  • Training-free controllable inpainting networks are vision models that leverage fixed, pretrained generative models to manipulate masked image regions with user-specified guidance.
  • They employ methods like Langevin dynamics, latent optimization, and attention modulation to ensure pixel-level fidelity and semantic control without model retraining.
  • These networks enable diverse applications including localized face anonymization, style-aware art inpainting, and prompt-controlled text–image manipulations.

Training-free controllable inpainting networks are a class of vision models that enable precise, user-directed completion or editing of masked regions within images or videos without any additional model retraining or fine-tuning. By directly leveraging pretrained generative models—principally denoising diffusion models (DDPM/DDIM), latent diffusion models (LDM), or mask autoregressive (MAR) transformers—these networks provide pixel-level or semantic controllability through externally applied objectives, attention modulation, or guidance policies. In contrast to conventional finetuning-based inpainting, this approach lets users adapt pretrained networks to new inpainting tasks or domains at inference time, such as localized face anonymization, style-aware art inpainting, prompt-controlled object removal/creation, and multifaceted text–image manipulations.

1. Core Principles and Problem Formulation

Training-free controllable inpainting repurposes frozen generative models to sample from the posterior distribution over missing pixels conditioned on observed data—without updating model weights. The underlying principle is conditional generation: $p(x_\text{miss} \mid x_\text{obs}) \propto p(x_\text{complete})$, where $x_\text{miss}$ denotes the unknown masked pixels, $x_\text{obs}$ the known pixels, and $x_\text{complete}$ the composite image. This conditioning may be enforced through score-based sampling (Langevin dynamics), optimization of initial noise or latent variables, or attention-based control mechanisms. The challenge is to ensure both high-fidelity reconstructions consistent with the context and user-specified control over the semantics, style, or constraints of the inpainted region, all without training-phase access to relevant masks or content (Zheng et al., 5 Feb 2025).
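
As a concrete illustration, the sketch below shows one widely used training-free conditioning scheme for diffusion backbones: at every reverse step, the known region is overwritten with a freshly noised copy of the observed pixels, so only the masked region is actually synthesized. The scheduler interface follows diffusers-style conventions (`set_timesteps`, `add_noise`, `step`); `model` is a placeholder noise predictor, and the papers discussed here each refine this basic scheme rather than use it verbatim.

```python
import torch

def inpaint_reverse_diffusion(model, scheduler, x_obs, mask, steps=50):
    """Training-free conditional sampling sketch: at each reverse step the
    known region is re-imposed with a noised copy of the observed pixels,
    so only the masked region (mask == 1) is generated.
    `model` predicts noise eps(x_t, t); `scheduler` is a DDPM/DDIM-style
    scheduler with diffusers-like methods (placeholder interfaces)."""
    x_t = torch.randn_like(x_obs)                      # start from pure noise
    scheduler.set_timesteps(steps)
    for t in scheduler.timesteps:
        # re-impose the observed pixels at the current noise level
        noise = torch.randn_like(x_obs)
        x_obs_t = scheduler.add_noise(x_obs, noise, t)
        x_t = mask * x_t + (1.0 - mask) * x_obs_t
        # one reverse step on the composite
        eps = model(x_t, t)
        x_t = scheduler.step(eps, t, x_t).prev_sample
    # final composite: keep observed pixels exactly
    return mask * x_t + (1.0 - mask) * x_obs
```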

2. Methodological Taxonomy of Training-Free Approaches

Distinct algorithmic paradigms have coalesced around training-free controllable inpainting, including:

A. Langevin Dynamics and Score-Based Conditional Sampling

LanPaint (Zheng et al., 5 Feb 2025) samples missing regions via Langevin or underdamped Langevin (FLD) dynamics, leveraging the pretrained diffusion model's score function $s_\theta(z, t) = \nabla_z \log p_t(z)$. A bidirectional guided score ("BiG"-score) synthesizes contextual coherence via direct feedback from both observed and masked pixels. FLD accelerates convergence and preserves posterior exactness.
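
The following is a minimal, generic Langevin sketch of this idea: noisy gradient steps on the frozen score, with the observed region clamped after each update. It is not LanPaint's exact BiG-score or FLD formulation; `score_fn`, the step size, and the iteration count are illustrative placeholders.

```python
import torch

def langevin_inpaint_step(score_fn, z, z_obs, mask, t, step_size=1e-3, n_iters=10):
    """Generic (unadjusted) Langevin sketch at one noise level t:
    z <- z + (eta / 2) * score(z, t) + sqrt(eta) * noise, updating only the
    masked entries (mask == 1) and clamping the observed ones. `score_fn(z, t)`
    is the frozen model's score; this is not LanPaint's BiG/FLD update."""
    for _ in range(n_iters):
        noise = torch.randn_like(z)
        z = z + 0.5 * step_size * score_fn(z, t) + (step_size ** 0.5) * noise
        z = mask * z + (1.0 - mask) * z_obs   # keep the observed region fixed
    return z
```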

B. Reverse Diffusion with Noise or Latent Optimization

VipDiff (Xie et al., 21 Jan 2025) for video inpainting and various face anonymization methods (Salar et al., 18 Sep 2025) treat the latent noise initialization $z$ or the per-timestep latent $z_t$ as a variable to be optimized, aligning model outputs with conditionally known pixels (e.g., warped pixels from neighboring frames, or reference appearance embeddings via perceptual or feature losses). These methods perform gradient descent on $z$ in the diffusion reverse pass to enforce fidelity and user-specified control.
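
A minimal sketch of this latent-optimization pattern follows, assuming a differentiable `generate_fn` wrapper around the frozen sampler (a placeholder, not VipDiff's actual interface): the initial noise is updated by gradient descent on a masked reconstruction loss against the known pixels.

```python
import torch

def optimize_initial_noise(generate_fn, z_init, target, mask, n_steps=100, lr=0.05):
    """Sketch of noise/latent optimization: run the frozen reverse process
    from z (via `generate_fn`, a hypothetical differentiable wrapper) and
    descend on a masked reconstruction loss so the output agrees with the
    known pixels `target` wherever mask == 0."""
    z = z_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(n_steps):
        x_hat = generate_fn(z)                          # frozen model, differentiable pass
        loss = (((1.0 - mask) * (x_hat - target)) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()
```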

C. Attention Modulation and Mask-Aware Strategies

Frameworks such as HarmonPaint (Li et al., 22 Jul 2025) and OmniText (Gunawan et al., 28 Oct 2025) modify or partition the self-attention and cross-attention mechanisms within the U-Net. Masked Self-Attention Separation (SAMS) prevents information flow between masked and unmasked regions to preserve structure, while Key-Value Modulation (MAKVS) transfers style statistics from background to masked areas for stylistic harmony. OmniText further enables precise content and style placement via latent optimization augmented by cross-attention content and self-attention style losses.
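
The core attention-separation idea can be sketched as follows: self-attention logits between masked and unmasked tokens are suppressed so the two regions do not exchange content. This is a simplified stand-in for the SAMS/MAKVS mechanisms described above; the tensor shapes and the hard blocking rule are assumptions.

```python
import torch
import torch.nn.functional as F

def separated_self_attention(q, k, v, token_mask):
    """q, k, v: (B, N, d) token features; token_mask: (B, N) bool with True
    for tokens inside the inpainting mask. Attention between masked and
    unmasked tokens is blocked, so structure in each region is preserved."""
    d = q.shape[-1]
    logits = q @ k.transpose(-1, -2) / d ** 0.5               # (B, N, N)
    same_region = token_mask[:, :, None] == token_mask[:, None, :]
    logits = logits.masked_fill(~same_region, float("-inf"))  # block cross-region flow
    return F.softmax(logits, dim=-1) @ v
```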

D. Prompt Embedding and Classifier-Free Guidance

ControlFill (Jeon, 6 Mar 2025) circumvents text encoders by learning task-specific prompt embeddings ("creation" and "removal") and injecting them into the diffusion U-Net using adjustable classifier-free guidance. This allows explicit per-pixel or per-region blending between background extension and object creation, spatially and globally controlled at inference.
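
A hedged sketch of per-region prompt blending under classifier-free guidance is shown below; the embeddings `e_null`/`e_create`/`e_remove`, the guidance scale, and the UNet call signature are placeholders rather than ControlFill's actual interface.

```python
import torch

def spatially_guided_noise(unet, x_t, t, e_null, e_create, e_remove, w_map):
    """Classifier-free-guidance sketch with a per-pixel blend map.
    `e_*` are prompt embeddings (null / creation / removal); `w_map` in [0, 1]
    selects, per pixel, how strongly guidance pulls toward object creation
    versus background extension. Interfaces are illustrative placeholders."""
    eps_null = unet(x_t, t, encoder_hidden_states=e_null)
    eps_create = unet(x_t, t, encoder_hidden_states=e_create)
    eps_remove = unet(x_t, t, encoder_hidden_states=e_remove)
    g_create = eps_create - eps_null               # direction toward creating content
    g_remove = eps_remove - eps_null               # direction toward clean background
    scale = 7.5                                    # global guidance strength (assumed)
    return eps_null + scale * (w_map * g_create + (1.0 - w_map) * g_remove)
```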

E. Mask Autoregressive Models and Token-Level Attention

Token Painter (Jiang et al., 28 Sep 2025) exploits MAR models' ability to inpaint only selected tokens, preserving the context exactly. Dual-Stream Encoder Information Fusion (DEIF) and Adaptive Decoder Attention Score Enhancing (ADAE) fuse text-background and text-only signals, sharpening or relaxing attention over semantic and spatial regions to realize text-faithful, context-coherent completions.
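
A simplified sketch of token-level inpainting with a mask-autoregressive model follows: only tokens inside the inpainting mask are resampled, so context tokens survive exactly. The confidence-based unmasking schedule and Token Painter's DEIF/ADAE components are omitted, and all interfaces are illustrative.

```python
import torch

def mar_token_inpaint(mar_model, tokens, token_mask, n_rounds=8):
    """`tokens`: (B, N) discrete VQ indices of the full image; `token_mask`:
    (B, N) bool, True for tokens to regenerate. Each round the model predicts
    logits for all positions, but only masked positions are resampled, so
    background tokens are preserved exactly. Interfaces are illustrative."""
    tokens = tokens.clone()
    for _ in range(n_rounds):
        logits = mar_model(tokens)                              # (B, N, vocab)
        probs = torch.softmax(logits, dim=-1)
        sampled = torch.distributions.Categorical(probs=probs).sample()
        tokens = torch.where(token_mask, sampled, tokens)       # context kept fixed
    return tokens
```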

3. Control Mechanisms and User Interfaces

All training-free controllable inpainting networks expose explicit user-facing control mechanisms:

  • Spatial masks: Users define the spatial support of inpainting, locally or globally.
  • Prompt embeddings & guidance: ControlFill enables users to set the weight of object creation vs. background extension, both globally and at fine spatial granularity, via guidance scales and spatial maps (Jeon, 6 Mar 2025).
  • Attention hooks and style/structure weighting: HarmonPaint exposes soft-mask smoothing $\tau$ and style blending parameter $\lambda$, while scheduling the application of structure/style control across denoising steps via ratio $\eta$ (Li et al., 22 Jul 2025).
  • Latent optimization objectives: OmniText enables separate weighting of content ($\lambda_C$) and style ($\lambda_S$) losses, with per-character and region submasking for precise text editing (Gunawan et al., 28 Oct 2025).
  • Noise or latent code initialization: VipDiff samples diverse plausible inpaintings by varying initial latent seeds, supporting stochasticity and diversity (Xie et al., 21 Jan 2025).

Many frameworks allow style references, region-specific guidance, or text prompts to specify the content, appearance, or function of the fill operation. In localized face anonymization (Salar et al., 18 Sep 2025), masks can target precise facial features (e.g., leaving eyes unaltered), enabling a tunable privacy–utility continuum.
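
As a small illustration of such region-selective masking, the sketch below composes an inpainting mask from a face-parsing map so that selected features are regenerated while others (e.g., the eyes) are preserved. The label ids and helper function are hypothetical and depend on the parser used.

```python
import numpy as np

def build_anonymization_mask(parsing_map, anonymize_labels, keep_labels):
    """`parsing_map`: (H, W) integer face-parsing labels. Returns a float mask
    with 1 where pixels should be regenerated and 0 where they must be kept,
    letting users trade privacy against utility region by region.
    Label ids are hypothetical and depend on the face parser."""
    mask = np.isin(parsing_map, anonymize_labels).astype(np.float32)
    mask[np.isin(parsing_map, keep_labels)] = 0.0   # e.g., always preserve the eyes
    return mask

# Hypothetical usage: regenerate nose/mouth/cheeks (labels 2, 3, 4), keep eyes (5, 6).
# mask = build_anonymization_mask(parsing_map, anonymize_labels=[2, 3, 4], keep_labels=[5, 6])
```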

4. Technical Implementations

The following table summarizes key mechanisms underlying major training-free controllable inpainting models:

| Framework | Model Class | Control Means | Core Mechanism |
|---|---|---|---|
| LanPaint | DDPM/DDIM | Mask, λ, ODE/SDE sampler | Langevin/FLD conditional sampling (Zheng et al., 5 Feb 2025) |
| VipDiff | DDPM (video) | Optical flow, noise z, loss weighting γ | Noise-optimized reverse diffusion (Xie et al., 21 Jan 2025) |
| ControlFill | LDM (SDXL) | Prompt embeddings, guidance scales, spatial maps | Prompt blending via classifier-free guidance (Jeon, 6 Mar 2025) |
| Token Painter | MAR (VQ-VAE) | Text prompt, attention λ₁/λ₂/λ₃, background | Attention modulation and token fusion (Jiang et al., 28 Sep 2025) |
| HarmonPaint | LDM (SD-inpaint) | Mask M, style λ, mask smooth τ, phase η | SAMS (structure) / MAKVS (style transfer) (Li et al., 22 Jul 2025) |
| OmniText | LDM (TextDiff-2) | Style/content λ_C, λ_S, submask, grid input | Attention manipulation & latent optimization (Gunawan et al., 28 Oct 2025) |

All approaches are compatible with off-the-shelf pretrained diffusion or MAR models. Some methods, notably ControlFill, freeze text encoders and inject only low-dimensional embeddings for efficiency.
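
For context, the snippet below shows how a stock pretrained inpainting backbone might be loaded with all weights frozen, using the Hugging Face diffusers inpainting pipeline as a representative (not prescribed) choice; training-free methods then attach their guidance, attention hooks, or latent optimization around the frozen UNet and scheduler at inference time.

```python
import torch
from diffusers import StableDiffusionInpaintPipeline

# Load a stock pretrained inpainting backbone; no weights are updated.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",   # one common public checkpoint
    torch_dtype=torch.float16,
).to("cuda")

# Freeze everything: training-free methods only steer sampling, never fine-tune.
for module in (pipe.unet, pipe.vae, pipe.text_encoder):
    module.requires_grad_(False)

# Custom guidance, attention hooks, or latent optimization would be attached
# around pipe.unet / pipe.scheduler at inference time.
```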

5. Evaluation Protocols and Empirical Performance

Empirical benchmarks typically quantify performance using metrics such as:

  • Fidelity: PSNR, SSIM (per-frame, per-pixel; see the masked-PSNR sketch after this list)
  • Perceptual realism: FID, VFID, LPIPS, CLIP Score, Image Reward
  • Temporal consistency: E_warp (video), V-DNA (distributional neuron activations)
  • Semantic accuracy: Rendering accuracy (ACC), Normalized Edit Distance (NED) for text inpainting
  • Style coherence: MSE, PSNR, MS-SSIM on stylized masks
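
As a small worked example of the fidelity metrics above, the sketch below computes PSNR restricted to the inpainted region. This region-restricted convention is an assumption; individual papers may report full-image PSNR/SSIM instead.

```python
import numpy as np

def masked_psnr(pred, target, mask, max_val=1.0):
    """PSNR over the inpainted region only (mask == 1). `pred` and `target`
    are float arrays in [0, max_val] with shape (H, W, C); `mask` is (H, W).
    This is one possible convention, not the exact protocol of any paper."""
    m = mask[..., None].astype(np.float64)           # broadcast mask over channels
    mse = ((pred - target) ** 2 * m).sum() / max(m.sum() * pred.shape[-1], 1.0)
    return 10.0 * np.log10(max_val ** 2 / max(mse, 1e-12))
```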

Notable experimental findings:

  • VipDiff exceeds all tested end-to-end and flow-guided video inpainting networks on YouTube-VOS/DAVIS: PSNR ≈ 34.2, SSIM ≈ 0.97, VFID ≈ 0.041/0.102, E_warp ≈ 0.083/0.128 (Xie et al., 21 Jan 2025).
  • LanPaint achieves state-of-the-art LPIPS on ImageNet-256 (0.193, box mask) and demonstrates ≈4× faster convergence than prior training-free methods (Zheng et al., 5 Feb 2025).
  • HarmonPaint produces the highest Aesthetic Score (AS = 6.55) and lowest CLIP-MMD (CMMD = 0.103) across stylized inpainting tasks (Li et al., 22 Jul 2025).
  • OmniText attains best-in-class rendering and style control on text-editing benchmarks, with ACC 78.44%, NED 0.951, PSNR 14.85, and FID 31.69 on ScenePair, matching specialist methods (Gunawan et al., 28 Oct 2025).
  • Token Painter achieves superior background PSNR (26.39) and lowest LPIPS (42.27×10⁻³) on BrushBench compared to other MAR- and diffusion-based baselines (Jiang et al., 28 Sep 2025).
  • Localized face anonymization (Salar et al., 18 Sep 2025) attains Re-ID 0.015 (CelebA-HQ), FID 23.19—outperforming GAN- and fine-tuned diffusion-based anonymizers while permitting region-selective privacy protection.

Ablation studies consistently demonstrate the importance of (i) guided score tuning (λ, noise optimization), (ii) attention modulation for harmonization, and (iii) per-region or per-pixel control to balance fidelity and stylistic/semantic alignment.

6. Applications and Extensions

Training-free controllable inpainting networks support a broad array of tasks:

  • Video inpainting and completion: Spatiotemporal hole-filling with optical flow priors and temporal consistency (Xie et al., 21 Jan 2025).
  • Local/semantically controlled editing: Region-specific face anonymization, text or object removal and creation (Salar et al., 18 Sep 2025, Jeon, 6 Mar 2025, Gunawan et al., 28 Oct 2025).
  • Artistic style transfer and harmonization: Context-aware content completion that preserves both structure and visual style, including localized style transfer (Li et al., 22 Jul 2025).
  • Text–image manipulation: Removal, insertion, repositioning, and style-constrained synthesis of text in real-world images, with evaluation on custom benchmarks (OmniText-Bench) (Gunawan et al., 28 Oct 2025).
  • Interactive and real-time editing scenarios: Parameter sweeps and tuning for live visual feedback during user-driven image manipulation.

Extensions also include pixel-wise conditioning for super-resolution, photorealistic editing, text-conditioned inpainting, and even generalized mask-based autoregressive token completions.

7. Limitations and Outlook

Limitations of existing training-free controllable inpainting networks include the need for robust masks or reference signals (e.g., reliable face parsing or style exemplars), potential artifacts at mask boundaries under extreme parameterization, and compute intensity inherent to latent or noise optimization. Dependency on the representational coverage of the frozen backbone model may restrict performance on out-of-distribution or highly detailed tasks.

Open directions include broader integration with language-driven control (beyond prompt embeddings), further acceleration of reverse-path optimization (e.g., via distilled or low-rank diffusion models), and cross-modal extensions (e.g., audio-visual or 3D inpainting). The field is rapidly advancing as new architectures (e.g., MAR, improved LDMs), control interfaces, and evaluation frameworks emerge.


For comprehensive technical details and benchmarks, refer to LanPaint (Zheng et al., 5 Feb 2025), VipDiff (Xie et al., 21 Jan 2025), ControlFill (Jeon, 6 Mar 2025), HarmonPaint (Li et al., 22 Jul 2025), Token Painter (Jiang et al., 28 Sep 2025), OmniText (Gunawan et al., 28 Oct 2025), and the Localized Face Anonymization framework (Salar et al., 18 Sep 2025).
