Training-Free Controllable Inpainting Network

Updated 17 November 2025
  • Training-free frameworks leverage pre-trained diffusion and autoregressive models to achieve precise spatial, semantic, and style-controlled inpainting.
  • They employ tailored attention mechanisms, mask-driven spatial guidance, and latent optimization to enable multi-region editing without any additional fine-tuning.
  • Quantitative results indicate improved LPIPS, FID, and SSIM metrics across image, video, and 3D tasks, underscoring practical efficiency and deployability.

Training-free controllable inpainting networks encompass a family of methods that enable precise spatial and semantic control over image, video, and 3D inpainting without any task-specific fine-tuning or additional training. These frameworks leverage the architecture and generative priors of pre-trained models—primarily diffusion or autoregressive backbones—supplemented with tailored attention manipulations, prompt-guidance strategies, or optimization objectives. The motivation is to achieve high-fidelity content manipulation, object removal, content creation, text editing, and style harmonization in a purely inference-time regime. This paradigm circumvents the need for extensive retraining while offering granular user control, multi-region editing, and, in select methods, per-pixel spatial guidance or diverse sample generation.

1. Architectures and Conditioning Regimes

Training-free controllable inpainting approaches inherit their generative scaffolding from existing pre-trained models, such as Stable Diffusion and SDXL UNet-VAEs (Jeon, 6 Mar 2025), Mask AutoRegressive (MAR) models (Jiang et al., 28 Sep 2025), latent diffusion frameworks (Salar et al., 18 Sep 2025, Zheng et al., 5 Feb 2025), and NeRF variants (Liu et al., 2022). The control mechanisms vary across methods and are detailed in the sections that follow.

2. Spatial, Semantic, and Style Control

Achieving controllability without retraining hinges on manipulations in the conditioning and attention interfaces of the generative backbone:

  • Classifier-Free Guidance (CFG): ControlFill implements both standard global CFG and spatially varying CFG, where each mask region is assigned an intention (+1 for creation, −1 for removal) and receives per-pixel guidance scaling. Pseudocode formalizes the denoising update (a runnable Python sketch follows this list):

    ε_c = ε_θ(z_t, y_c, t)
    ε_r = ε_θ(z_t, y_r, t)
    ε̂ = w·M ⊙ ε_c + (1 − w·M) ⊙ ε_r
    z_{t−1} = λ_t z_t − β_t ε̂
  • Attention-level manipulations: HarmonPaint (Li et al., 22 Jul 2025) modifies self-attention via masking strategies (SAMS), partitioning encoder attention into background-to-background and hole-to-hole interactions while suppressing cross-region flow. The decoder enforces global style harmonization via the Mask-Adjusted Key-Value Strategy (MAKVS), replacing the keys/values of masked regions with the mean of those in the unmasked background, mixed with a strength parameter λ (see the second sketch after this list).
  • Frequency/Statistical Fusion: Token Painter (Jiang et al., 28 Sep 2025) combines semantic and contextual guidance by fusing text and image-background features in the frequency domain (DEIF), followed by adaptive boosting of cross- and self-attention weights (ADAE) to improve content faithfulness and intra-hole coherence.
  • Attention inversion/reassignment for text: OmniText (Gunawan et al., 28 Oct 2025) applies self-attention inversion and cross-attention reassignment to suppress "text hallucinations" in removal tasks, further controlling content tokens and style via custom loss functions.
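
The spatially varying CFG update lends itself to a compact implementation. Below is a minimal PyTorch-style sketch; the function name, signature, and the simplified scheduler coefficients lam_t/beta_t are illustrative assumptions rather than ControlFill's actual code.

```python
def spatially_varying_cfg_step(eps_model, z_t, y_create, y_remove, t,
                               mask, w, lam_t, beta_t):
    """One denoising step with per-pixel classifier-free guidance.

    mask:  (B, 1, H, W) tensor; 1 where the region carries the
           "creation" intention, 0 where it carries "removal".
    w:     scalar or per-pixel guidance weight.
    lam_t, beta_t: generic scheduler coefficients for the update rule.
    """
    eps_c = eps_model(z_t, y_create, t)  # noise prediction, creation embedding
    eps_r = eps_model(z_t, y_remove, t)  # noise prediction, removal embedding
    # Per-pixel blend: eps_hat = w*M (.) eps_c + (1 - w*M) (.) eps_r
    eps_hat = w * mask * eps_c + (1.0 - w * mask) * eps_r
    # Simplified update z_{t-1} = lam_t * z_t - beta_t * eps_hat
    return lam_t * z_t - beta_t * eps_hat
```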
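
The MAKVS key/value mixing can be sketched similarly. Tensor shapes and the exact blending rule here are assumptions inferred from the description above, not HarmonPaint's verbatim implementation.

```python
def mask_adjusted_kv(k, v, hole_mask, lam):
    """Mask-Adjusted Key-Value (MAKVS) mixing, sketched from the text above.

    k, v:      (B, N, C) self-attention keys/values over N spatial tokens
    hole_mask: (B, N) boolean, True for tokens inside the inpainting hole
    lam:       mixing strength in [0, 1]
    """
    bg = (~hole_mask).float().unsqueeze(-1)             # (B, N, 1) background indicator
    denom = bg.sum(dim=1, keepdim=True).clamp(min=1.0)  # background token count
    k_bg = (k * bg).sum(dim=1, keepdim=True) / denom    # mean background key  (B, 1, C)
    v_bg = (v * bg).sum(dim=1, keepdim=True) / denom    # mean background value (B, 1, C)
    m = hole_mask.float().unsqueeze(-1)                 # (B, N, 1) hole indicator
    # Inside the hole, drift keys/values toward the background mean.
    k_out = k + m * lam * (k_bg - k)
    v_out = v + m * lam * (v_bg - v)
    return k_out, v_out
```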

3. Mathematical Formulation and Optimization Routines

Several frameworks formalize inpainting as exact conditional sampling or latent optimization:

  • Conditional diffusion as Langevin dynamics: LanPaint (Zheng et al., 5 Feb 2025) introduces BiG (Bidirectional Guided) score functions and solves for the exact conditional p(x_0 | y_0) using coupled Langevin SDEs, ensuring statistical correctness and avoiding local likelihood traps. Accelerated Fast Langevin Dynamics (FLD) yield rapid convergence during each ODE/diffusion step, producing high-fidelity, boundary-consistent inpainting (a simplified Langevin sketch follows this list).
  • Latent optimization for content/style: OmniText (Gunawan et al., 28 Oct 2025) computes gradient updates for the latent z_t based on a combined cross-attention content loss L_C and self-attention style loss L_S, modulated by hyperparameters λ_C, λ_S (see the optimization sketch after this list):

\mathcal{L}(z_t) = \lambda_C\,\mathcal{L}_C(z_t) + \lambda_S\,\mathcal{L}_S(z_t), \qquad z_t \leftarrow z_t - \eta\,\mathrm{Adam}\!\big(\nabla_{z_t}\mathcal{L}(z_t)\big)

  • Multi-view volumetric optimization: NeRF-In (Liu et al., 2022) updates the weights of pre-trained NeRF models by minimizing color-guidance and depth-guidance losses across both the user-chosen and auxiliary views, using RGB-D priors as references for plausible geometry and appearance.
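
As a concrete illustration of the Langevin-based formulation, the following is a heavily simplified sketch of Langevin refinement at a fixed noise level with the known region clamped; LanPaint's actual BiG score functions and Fast Langevin Dynamics are more involved.

```python
import torch

def langevin_inpaint_step(z, score_fn, mask, z_known_t, step_size, n_iters=5):
    """Langevin refinement at one noise level with the known region clamped.

    z:         current latent sample at this noise level
    score_fn:  callable returning the (guided) score of z
    mask:      1 in the region to inpaint, 0 where content is known
    z_known_t: forward-diffused latent of the known content at this level
    """
    for _ in range(n_iters):
        noise = torch.randn_like(z)
        # z <- z + (s/2) * score(z) + sqrt(s) * noise
        z = z + 0.5 * step_size * score_fn(z) + (step_size ** 0.5) * noise
        # Re-impose the observed content outside the mask.
        z = mask * z + (1.0 - mask) * z_known_t
    return z
```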
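
The latent optimization above maps directly onto a standard Adam loop over z_t. In this sketch, content_loss and style_loss are hypothetical stand-ins for OmniText's cross-attention content loss L_C and self-attention style loss L_S.

```python
import torch

def refine_latent(z_t, content_loss, style_loss,
                  lam_c=1.0, lam_s=0.5, lr=1e-2, steps=10):
    """Adam updates on the latent, mirroring
    L(z_t) = lam_c * L_C(z_t) + lam_s * L_S(z_t).
    """
    z = z_t.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = lam_c * content_loss(z) + lam_s * style_loss(z)
        loss.backward()
        opt.step()  # z <- z - eta * (Adam-preconditioned gradient)
    return z.detach()
```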

4. Practical Deployment and User Interactivity

Deployment of training-free controllable inpainting offers several practical advantages:

  • Single-pass multi-region editing: ControlFill (Jeon, 6 Mar 2025) and HarmonPaint (Li et al., 22 Jul 2025) allow simultaneous, spatially varying intentions using a mask tensor M and an associated scaling vector w. No additional forward passes are required for multi-region edits.
  • Resource efficiency: All methods forgo any fine-tuning; large text encoders are dropped at inference (ControlFill), and LoRA adapters keep model footprints modest. For MAR architectures (Token Painter), all model weights are frozen, and control is achieved through inference-time feature/attention manipulation.
  • On-device suitability: The elimination of runtime text encoders and compact parameter structures permit real-time or edge deployment scenarios.
  • Localized and attribute-aware inpainting: Fine spatial control is exemplified in face anonymization (Salar et al., 18 Sep 2025), which restricts denoising and guidance to user-selected semantic regions, supporting medical data privacy.
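
In practice, restricting generation to selected regions typically reduces to a latent blending step at every denoising iteration. The sketch below shows the generic RePaint-style blend, with forward_diffuse as an assumed helper; it is not the cited method's exact procedure.

```python
def blend_known_regions(z_gen, x0_latent, mask, forward_diffuse, t):
    """Edit inside the mask; preserve (re-noised) original content outside.

    z_gen:           model's current latent estimate at step t
    x0_latent:       VAE latent of the original image
    mask:            1 in user-selected editable regions, 0 elsewhere
    forward_diffuse: assumed helper mapping (x0_latent, t) to the
                     forward-diffused latent at noise level t
    """
    z_known = forward_diffuse(x0_latent, t)
    return mask * z_gen + (1.0 - mask) * z_known
```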

5. Quantitative Performance and Comparisons

Empirical evaluations across multiple datasets and tasks demonstrate the effectiveness and versatility of these training-free inpainting networks.

| Method | Task | Key Metric(s) | Result |
|--------|------|---------------|--------|
| ControlFill (Jeon, 6 Mar 2025) | Removal (RORD) | LPIPS, FID, Depth_REL | 0.1068, 54.24, 1.069 |
| Token Painter (Jiang et al., 28 Sep 2025) | Image inpainting | PickScore, HPSv2, SSIM | 55.37, 23.00, 9.41 |
| LanPaint (Zheng et al., 5 Feb 2025) | Latent inpainting | LPIPS (CelebA-HQ, N=10) | 0.189 |
| HarmonPaint (Li et al., 22 Jul 2025) | Style harmonization | CLIP Score (Stylized-MSCOCO) | 28.86 |
| VipDiff (Xie et al., 21 Jan 2025) | Video inpainting | PSNR, SSIM, VFID | 34.21, 0.9773, 0.041 |
| OmniText (Gunawan et al., 28 Oct 2025) | Text manipulation | MS-SSIM, PSNR, FID (removal) | 95.71, 29.52, 39.06 |

ControlFill and Token Painter demonstrate strong numerical gains in LPIPS, FID, PickScore, and SSIM, while HarmonPaint registers notable improvements in CLIP Score and style harmony. LanPaint achieves the lowest LPIPS among all baseline inpainting engines. VipDiff sets benchmarks in PSNR, SSIM, and video FID, exhibiting strong spatio-temporal consistency in video inpainting. OmniText delivers state-of-the-art results in both text removal and editing, matching specialist baselines in MS-SSIM and FID.

6. Limitations, Ablations, and Future Directions

While training-free, controllable inpainting has advanced semantic and spatial editability, several unresolved challenges remain:

  • Prompt specificity and semantic granularity: ControlFill’s creation mode currently lacks the ability to specify fine-grained object classes; expansion to multiple, class-specific embeddings is an open area.
  • Extremely large mask regions: Style transfer and harmony (HarmonPaint) may fail if masked regions leave insufficient unmasked context.
  • Temporal diversity vs. speed: VipDiff offers enhanced diversity but at the cost of increased per-frame optimization latency.
  • Model capacity constraints: Token Painter’s performance on ultra-large masks or high-res images is bounded by underlying MAR and VQ-VAE capacities.
  • Transfer to video/3D domains: Extension of FFT-domain fusion, adaptive attention boosting, and spatially varying CFG to spatio-temporal or volumetric representations presents both computational and modeling challenges.

A plausible implication is that further development will include learned, per-image or per-region adapters (HarmonPaint, Token Painter), hybrid latent optimization strategies (OmniText), and augmented auxiliary objectives or losses to better support style, semantics, and spatial consistency—potentially across video and multi-view scenes.

7. Relation to Prior Work and Research Trajectories

The surveyed methods represent a convergence of architectural re-use, conditional inference formalism, and attention-level manipulation. Early approaches (NeRF-In (Liu et al., 2022)) established free-form 3D inpainting by direct optimization of radiance fields with RGB-D priors. Subsequent image-diffusion frameworks (ControlFill (Jeon, 6 Mar 2025), LanPaint (Zheng et al., 5 Feb 2025)) have prioritized “plug-and-play” control through prompt learning and exact Langevin-based inference. Attention-based harmonization (Li et al., 22 Jul 2025) and MAR-guided text-image control (Jiang et al., 28 Sep 2025, Gunawan et al., 28 Oct 2025) have further broadened applicability to multimodal, multi-task inpainting.

Contributing research groups notably include the developers of Stable Diffusion, SDXL/latent DDPMs, and NeRF (UC Berkeley, University of Tübingen, Tencent AI Lab), as well as specialists in fast conditional inference and autoregressive token modeling.

Training-free controllable inpainting networks now serve as a key substrate for interactive and on-device semantic editing, privacy protection in medical/face-centric domains, style harmonization in content creation, and generalist text-image manipulation across print, apparel, and packaging. Continued adoption and refinement in these areas is anticipated.
