FreeInpaint: Tuning-free Prompt Alignment and Visual Rationality Enhancement in Image Inpainting (2512.21104v1)

Published 24 Dec 2025 in cs.CV

Abstract: Text-guided image inpainting endeavors to generate new content within specified regions of images using textual prompts from users. The primary challenge is to accurately align the inpainted areas with the user-provided prompts while maintaining a high degree of visual fidelity. While existing inpainting methods have produced visually convincing results by leveraging the pre-trained text-to-image diffusion models, they still struggle to uphold both prompt alignment and visual rationality simultaneously. In this work, we introduce FreeInpaint, a plug-and-play tuning-free approach that directly optimizes the diffusion latents on the fly during inference to improve the faithfulness of the generated images. Technically, we introduce a prior-guided noise optimization method that steers model attention towards valid inpainting regions by optimizing the initial noise. Furthermore, we meticulously design a composite guidance objective tailored specifically for the inpainting task. This objective efficiently directs the denoising process, enhancing prompt alignment and visual rationality by optimizing intermediate latents at each step. Through extensive experiments involving various inpainting diffusion models and evaluation metrics, we demonstrate the effectiveness and robustness of our proposed FreeInpaint.

Summary

The paper introduces a tuning-free method that directly optimizes diffusion latents to improve prompt alignment in image inpainting.
It employs Prior-Guided Noise Optimization to focus attention on masked regions and Decomposed Training-Free Guidance to adjust the intermediate latents at each denoising step.
Experimental results on datasets like EditBench and MSCOCO demonstrate substantial gains in ImageReward, CLIPScore, and InpaintReward metrics.

Detailed Summary of "FreeInpaint: Tuning-free Prompt Alignment and Visual Rationality Enhancement in Image Inpainting"

Introduction

The paper "FreeInpaint: Tuning-free Prompt Alignment and Visual Rationality Enhancement in Image Inpainting" (2512.21104) addresses the text-guided image inpainting problem. Existing methods typically build on pre-trained text-to-image diffusion models and produce visually convincing results, but they struggle to achieve prompt alignment and visual rationality at the same time. FreeInpaint tackles this by directly optimizing the diffusion latents during inference, with no additional tuning or training.

Methodology

FreeInpaint's methodology is structurally divided into two main components: Prior-Guided Noise Optimization (PriNo) and Decomposed Training-Free Guidance (DeGu).

Prior-Guided Noise Optimization

PriNo addresses prompt misalignment by optimizing the initial noise so that the model's attention is steered towards the regions that need inpainting. Without this step, attention can be scattered or misdirected; PriNo instead concentrates both cross-attention and self-attention on the masked regions. Because the optimization is applied at the start of denoising, prompt-specific features are incorporated early.

Figure 1: Visualization of cross-attention (cols 3-4) and self-attention (cols 5-6). Row 2 shows misdirected attention causing an unaligned result, while row 3 shows our optimized noise concentrates attention for a successful alignment.
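As a rough illustration of the idea behind PriNo, the sketch below nudges an initial noise map so that more of a toy attention map falls inside the mask. Everything here is an assumption made for illustration: the "attention map" is simply a softmax over the noise values, and `masked_attention_mass` and `optimize_noise` are hypothetical names. It is not the paper's loss or implementation; in the actual method the gradient would flow through the attention layers of the diffusion model.

```python
import numpy as np


def softmax(x):
    """Softmax over all entries of x (numerically stabilized)."""
    e = np.exp(x - x.max())
    return e / e.sum()


def masked_attention_mass(z, mask):
    """Fraction of the toy attention map that lands inside the mask."""
    attn = softmax(z)  # toy stand-in for a real attention map
    return float((attn * mask).sum())


def optimize_noise(z, mask, steps=50, lr=5.0):
    """Gradient descent on loss = 1 - (attention mass inside the mask)."""
    z = z.copy()  # leave the caller's noise untouched
    for _ in range(steps):
        attn = softmax(z)
        inside = (attn * mask).sum()
        # Analytic gradient of the loss w.r.t. z for a softmax attention map:
        # d(inside)/dz_i = attn_i * (mask_i - inside), and loss = 1 - inside.
        grad = -attn * (mask - inside)
        z -= lr * grad
    return z


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noise = rng.standard_normal((8, 8))  # toy "initial noise"
    region = np.zeros((8, 8))
    region[2:6, 2:6] = 1.0               # toy inpainting mask
    tuned = optimize_noise(noise, region)
    print("mass before:", masked_attention_mass(noise, region))
    print("mass after: ", masked_attention_mass(tuned, region))
```

The sketch omits anything that keeps the optimized noise close to a Gaussian prior; a practical version would likely need some such constraint so the diffusion model still receives a plausible starting point.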

Decomposed Training-Free Guidance

DeGu further refines inpainting by decomposing the conditioning of the inpainting process into three separate objectives: text alignment, visual rationality, and human preference. Each objective is steered by a differentiable reward model, so the diffusion latents can be adjusted at every step of the denoising process. This decomposition allows task-specific alignment and improves the overall coherence and aesthetic quality of the output.
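The guidance loop described above can be sketched in a few lines. This is a minimal toy, assuming quadratic rewards with closed-form gradients in place of real reward models, and a placeholder `denoise_step`; the names `quadratic_reward`, `composite_reward`, and `guided_denoise`, as well as the weights and step size, are hypothetical and not taken from the paper. With real reward models the gradients would come from automatic differentiation rather than a formula.

```python
import numpy as np


def quadratic_reward(target):
    """Toy differentiable reward: higher when the latent is closer to target."""
    def reward(x):
        return -float(np.sum((x - target) ** 2))

    def grad(x):
        return -2.0 * (x - target)

    return reward, grad


def composite_reward(x, rewards, weights):
    """Weighted sum of the individual rewards."""
    return sum(w * r(x) for w, (r, _) in zip(weights, rewards))


def guided_denoise(x, denoise_step, rewards, weights, num_steps=20, step_size=0.05):
    """Alternate a denoising step with a gradient-ascent step on the composite reward."""
    for t in range(num_steps):
        x = denoise_step(x, t)  # placeholder for the model's denoising update
        g = sum(w * grad(x) for w, (_, grad) in zip(weights, rewards))
        x = x + step_size * g   # nudge the latent towards higher reward
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latent = rng.standard_normal(16)
    # Three toy objectives standing in for text alignment,
    # visual rationality, and human preference.
    objectives = [quadratic_reward(rng.standard_normal(16)) for _ in range(3)]
    weights = (1.0, 0.5, 0.5)

    def shrink(x, t):
        return 0.95 * x  # toy denoiser, not a real diffusion update

    plain = guided_denoise(latent, shrink, objectives, weights, step_size=0.0)
    guided = guided_denoise(latent, shrink, objectives, weights)
    print("reward without guidance:", composite_reward(plain, objectives, weights))
    print("reward with guidance:   ", composite_reward(guided, objectives, weights))
```

Keeping the three objectives as separate terms with their own weights makes it easy to rebalance them, which is the kind of knob the sensitivity analysis later varies.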

Experimental Results

The effectiveness of FreeInpaint is demonstrated through extensive experiments on datasets including EditBench and MSCOCO. The method is evaluated against multiple baselines, including SDI, PPT, BN, SDXLI, and SD3I, and shows substantial improvements in metrics for human preference (ImageReward), prompt alignment (CLIPScore), and visual rationality (InpaintReward).

Figure 2: Comparisons between our FreeInpaint and existing methods. FreeInpaint simultaneously enhances prompt alignment and visual rationality.

Sensitivity Analysis

The paper also investigates the sensitivity of PriNo's self-attention loss weight and DeGu's guidance weights, confirming that FreeInpaint maintains robustness across a wide range of hyperparameter settings. This flexibility underlines its adaptability to different inpainting tasks and architectures, including both U-Net and transformer-based models.

Figure 3: The sensitivity analysis of the $\mathcal{L}_\text{s}$ weight $\lambda_2$.

Conclusion

FreeInpaint offers a practical framework for image inpainting that combines tuning-free latent optimization with decomposed guidance objectives. It improves prompt alignment, visual rationality, and human preference together rather than trading one for another, and it provides a reference point for future tuning-free methods in text-guided image editing.

Because the method needs no additional model tuning or retraining, it can be applied on top of existing inpainting diffusion models, which makes high-quality text-guided image manipulation more practical and accessible across applications.