Guided Sampling & Inpainting Techniques
- Guided sampling and inpainting are advanced methods that reconstruct missing image regions by incorporating external guidance to ensure semantic fidelity and control.
- These approaches leverage architectures like dual-branch encoder-decoders and diffusion models to seamlessly integrate texture, structure, and semantic cues.
- Feature alignment and multi-candidate patch guidance overcome challenges such as semantic drift and detail suppression, improving metrics like PSNR and SSIM.
Guided sampling and inpainting refer to techniques for reconstructing missing regions in images (or other data modalities) by actively conditioning the generative process on auxiliary information—such as reference images, semantic masks, prompts, sketches, or classifier signals—rather than relying solely on unconditional priors. These approaches inject explicit guidance into the network architecture, the loss functions, or the inference strategy to improve controllability and semantic fidelity in the filled regions. Guided sampling can be formalized across multiple frameworks, including deep encoder–decoders, diffusion-based denoisers, feature-alignment modules, and hybrid patch-transfer systems.
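A common way to formalize guidance in diffusion or score-based samplers is Bayes' rule applied to the score function: the conditional score decomposes into the unconditional prior score plus a guidance (likelihood) term,

$$
\nabla_{x_t}\log p(x_t \mid y) \;=\; \nabla_{x_t}\log p(x_t) \;+\; \nabla_{x_t}\log p(y \mid x_t),
$$

where $y$ denotes the auxiliary guidance (a reference image, mask, prompt, or classifier signal) and the likelihood term is approximated in practice by a classifier, a similarity model, or a differentiable loss on the current denoised estimate.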
1. Architectural Principles of Guided Inpainting
Guided inpainting architectures are built to integrate external priors at critical stages of processing. Notably, the dual-branch encoder–decoder introduced by "Reference-Guided Texture and Structure Inference for Image Inpainting" (Liu et al., 2022) separates texture and structure extraction, employing two parallel encoders for the masked input and the reference image: shallow layers focus on texture, while deeper layers capture structure. After feature extraction, multi-scale partial convolution streams fill the hole using only known pixels, promoting spatial consistency; a minimal partial-convolution layer is sketched below.
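The sketch below (PyTorch; layer sizes, bias handling, and module names are illustrative rather than the authors' exact architecture) shows the core partial-convolution operation: outputs are renormalized by the fraction of valid pixels under each kernel window, and the validity mask is updated as it propagates through the encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Module):
    """Convolution over known pixels only, with mask renormalization and update."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
        super().__init__()
        # bias=False keeps the renormalization exact; a bias can be added afterwards.
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, bias=False)
        # Fixed all-ones kernel that counts valid pixels under each window.
        self.register_buffer("mask_kernel",
                             torch.ones(1, 1, kernel_size, kernel_size))
        self.stride, self.padding = stride, padding

    def forward(self, x, mask):
        # mask: 1 for known pixels, 0 for holes (shape N x 1 x H x W).
        with torch.no_grad():
            valid = F.conv2d(mask, self.mask_kernel,
                             stride=self.stride, padding=self.padding)
        out = self.conv(x * mask)
        # Rescale by the fraction of valid inputs; zero out fully-unknown windows.
        scale = self.mask_kernel.numel() / valid.clamp(min=1e-8)
        out = out * scale * (valid > 0).float()
        new_mask = (valid > 0).float()
        return out, new_mask
```

In the dual-branch setting, one stack of such layers would process the masked input while a sibling encoder processes the reference image with an all-ones mask.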
Other works, such as "Interactive Image Inpainting Using Semantic Guidance" (Yu et al., 2022), advocate a two-stage approach: an external spatial attention (ESPA) autoencoder first reconstructs features, which are used to produce a coarse fill. This is followed by a semantic SPADE-based decoder guided by a user-edited semantic mask, allowing disentanglement of low-level color/texture priors from high-level layout control.
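A minimal SPADE-style modulation block, sketched here in PyTorch (hidden sizes are illustrative, not the exact decoder of Yu et al., 2022), shows how a user-edited semantic mask re-injects layout information after normalization has washed it out:

```python
import torch.nn as nn
import torch.nn.functional as F

class SPADEBlock(nn.Module):
    """Spatially-adaptive normalization modulated by a semantic mask."""
    def __init__(self, feat_ch, seg_ch, hidden=128):
        super().__init__()
        self.norm = nn.BatchNorm2d(feat_ch, affine=False)
        self.shared = nn.Sequential(
            nn.Conv2d(seg_ch, hidden, 3, padding=1), nn.ReLU(inplace=True))
        self.to_gamma = nn.Conv2d(hidden, feat_ch, 3, padding=1)
        self.to_beta = nn.Conv2d(hidden, feat_ch, 3, padding=1)

    def forward(self, feat, seg_mask):
        # Resize the (one-hot) semantic mask to the current feature resolution.
        seg = F.interpolate(seg_mask, size=feat.shape[-2:], mode="nearest")
        h = self.shared(seg)
        gamma, beta = self.to_gamma(h), self.to_beta(h)
        # Normalize, then apply a per-pixel, mask-derived scale and shift.
        return self.norm(feat) * (1 + gamma) + beta
```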
Diffusion models increasingly underpin guided inpainting frameworks, enabling fine-grained manipulation of content at each denoising step. SmartBrush (Xie et al., 2022) integrates both text and mask conditioning into each cross-attention block and applies an object-mask prediction head to sharpen background preservation. ControlFill (Jeon, 6 Mar 2025) utilizes learned prompt embeddings for “creation” and “removal,” enabling spatially adjustable guidance at the pixel level without per-inference text encoding.
2. Feature Alignment, Patch Guidance, and Sampling Modules
Feature alignment is a cornerstone of reference-guided inpainting. The Feature Alignment Module (FAM) of (Liu et al., 2022) couples dynamic offset estimation with deformable convolution, warping reference features to match the input’s missing-region geometry. For every hole location, FAM searches for the most similar neighborhood in the reference and transfers features via learned offsets, which are then blended through multi-scale partial convolutions.
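A hedged sketch of offset-based alignment (PyTorch plus torchvision's deformable convolution; the offset predictor and initialization are placeholders, not the exact FAM): offsets are predicted from the concatenation of input and reference features, then used to sample the reference features so they line up with the hole geometry.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class FeatureAlign(nn.Module):
    """Warp reference features toward the masked input via learned offsets."""
    def __init__(self, ch, k=3):
        super().__init__()
        self.k = k
        # Predict an (x, y) offset for each of the k*k kernel taps.
        self.offset_pred = nn.Conv2d(2 * ch, 2 * k * k, 3, padding=1)
        self.weight = nn.Parameter(torch.randn(ch, ch, k, k) * 0.01)

    def forward(self, feat_in, feat_ref):
        offsets = self.offset_pred(torch.cat([feat_in, feat_ref], dim=1))
        # Deformable convolution samples feat_ref at the offset locations,
        # transferring aligned reference content into the missing region.
        return deform_conv2d(feat_ref, offsets, self.weight, padding=self.k // 2)
```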
Patch-guided approaches, exemplified by SuperCAF (Zhang et al., 2022), operate at high resolution by first producing a plausible low-res fill, extracting high-res structure, segmentation, and depth guides, and running multiple PatchMatch synthesizers—each set to use a different combination of guides. A learned curation network aggregates paired comparisons of eight candidates to automatically select the most realistic high-frequency fill. This multi-guide, multi-candidate paradigm addresses the limitations of pure PatchMatch (semantic leaking, broken structures, focus mismatches) and deep models (low-res smoothness).
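As a simplified illustration of multi-guide patch matching (NumPy; the guide channels and weights are placeholders, not SuperCAF's exact cost), the patch distance can augment the usual color term with penalties on guide channels such as segmentation and depth, so candidates that match color but violate structure are penalized:

```python
import numpy as np

def guided_patch_distance(src_patch, tgt_patch, src_guides, tgt_guides,
                          guide_weights=(1.0, 0.5)):
    """Color SSD plus weighted SSD over guide channels (e.g. segmentation, depth)."""
    d = np.sum((src_patch - tgt_patch) ** 2)
    for w, sg, tg in zip(guide_weights, src_guides, tgt_guides):
        d += w * np.sum((sg - tg) ** 2)
    return d
```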
Similarity-based guidance, used in SimDPS for music (Turland et al., 19 Sep 2025), retrieves candidate segments from a corpus and incorporates them as soft targets in the posterior likelihood gradient during diffusion inference. This hybridizes local generative plausibility with global structural anchoring using feature-based similarity search and weighted posterior updates.
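A generic form of such similarity-guided posterior sampling (written here in diffusion-posterior-sampling style; the exact weighting and features used by SimDPS may differ) adds a retrieved-reference term to the score:

$$
\nabla_{x_t}\log p(x_t \mid y) \;\approx\; s_\theta(x_t, t) \;-\; \lambda\,\nabla_{x_t}\big\lVert \phi\!\left(\hat{x}_0(x_t)\right) - \phi\!\left(x^{\text{ref}}\right) \big\rVert^2,
$$

where $\hat{x}_0(x_t)$ is the current denoised estimate, $x^{\text{ref}}$ is the retrieved candidate, $\phi$ is a feature extractor, and $\lambda$ trades off the soft target against the generative prior.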
3. Strategies for Guided Sampling During Inference
Guided sampling employs explicit mechanisms at inference time to steer the generation toward desired semantics or structures. In diffusion models, the process is typically modulated either through gradient-based interventions or sample blending.
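Sample blending, as popularized by RePaint-style inpainting, constrains each reverse step so that known pixels follow the forward-noised observation while the hole is filled by the model; a generic form is

$$
x_{t-1} \;=\; m \odot x^{\text{known}}_{t-1} \;+\; (1-m)\odot x^{\text{gen}}_{t-1},
$$

where $m$ is the known-region mask, $x^{\text{known}}_{t-1}\sim q(x_{t-1}\mid x_0)$ is obtained by noising the observed image, and $x^{\text{gen}}_{t-1}$ is the model's reverse-step sample.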
In deformable alignment-based methods (Liu et al., 2022), the guided sampler leverages learned offsets to transfer features from the reference to the input's missing region, resulting in soft patch sampling rather than pixel copy-paste. In SmartBrush (Xie et al., 2022), classifier-free guidance is used with text and mask conditions, blending unconditional and text-conditional predictions at each step. The mask-prediction head allows dynamic sharpening of the background mask during denoising.
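A minimal sketch of classifier-free guidance at a single denoising step (the `model` call signature and its text/mask conditioning interface are assumptions, not SmartBrush's released API):

```python
import torch

@torch.no_grad()
def cfg_noise_prediction(model, x_t, t, text_emb, null_emb, mask, w=7.5):
    """Blend unconditional and conditional noise predictions with guidance scale w."""
    eps_uncond = model(x_t, t, cond_text=null_emb, cond_mask=mask)
    eps_cond = model(x_t, t, cond_text=text_emb, cond_mask=mask)
    # w > 1 pushes the prediction away from the unconditional branch,
    # i.e. more strongly toward the text-and-mask condition.
    return eps_uncond + w * (eps_cond - eps_uncond)
```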
Several recent methods apply gradient guidance via backpropagation: GradPaint (Grechka et al., 2023) computes a harmonization loss at each denoising step—measuring the coherence of the denoised image with the input—and backpropagates to nudge the latent toward better context harmony, reducing visible seams and artifacts.
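A hedged sketch of one such gradient-guided update (PyTorch; the loss, step size, and `model` interface are illustrative, and the harmonization loss is simplified here to agreement with the known pixels): the clean-image estimate is recovered from the noise prediction, a coherence loss is evaluated, and its gradient nudges the latent before the next reverse step.

```python
import torch

def gradient_guided_step(x_t, t, model, x_known, mask, alpha_bar_t, step_size=0.1):
    """One guided update: penalize disagreement with known pixels, nudge x_t."""
    x_t = x_t.detach().requires_grad_(True)
    eps = model(x_t, t)
    # Denoised estimate from the standard DDPM parameterization.
    x0_hat = (x_t - (1 - alpha_bar_t) ** 0.5 * eps) / alpha_bar_t ** 0.5
    # Simplified harmonization-style loss: coherence with observed (unmasked) content.
    loss = ((mask * (x0_hat - x_known)) ** 2).mean()
    grad = torch.autograd.grad(loss, x_t)[0]
    return (x_t - step_size * grad).detach()
```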
ControlFill (Jeon, 6 Mar 2025) extends classifier-free guidance spatially, interpolating between “creation” and “removal” prompt embeddings at the pixel level according to a ternary mask, producing fine-grained control without requiring per-inference text encoding.
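One plausible realization of spatially varying guidance, sketched under assumptions (ControlFill interpolates the prompt embeddings themselves; this simplification blends the resulting noise predictions with a per-pixel weight map instead):

```python
import torch

@torch.no_grad()
def spatial_prompt_blend(model, x_t, t, emb_create, emb_remove, weight_map):
    """Per-pixel interpolation between 'creation'- and 'removal'-guided predictions."""
    eps_create = model(x_t, t, cond=emb_create)
    eps_remove = model(x_t, t, cond=emb_remove)
    # weight_map in [0, 1]: 1 follows the creation prompt, 0 the removal prompt.
    return weight_map * eps_create + (1 - weight_map) * eps_remove
```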
GuidPaint (Wang et al., 29 Jul 2025) incorporates classifier score gradients during both stochastic and deterministic sampling phases, letting users select candidate fills and refine them with precise semantic guidance in the masked regions. LAR-Gen (Pan et al., 28 Mar 2024) supports multimodal guidance (text and image) via decoupled cross-attention, balancing semantic and identity control with a tunable weight.
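Decoupled cross-attention of the kind LAR-Gen builds on can be sketched as two attention passes over separate key/value sets, combined with a tunable weight (a minimal illustration, not the paper's exact module):

```python
import torch
import torch.nn.functional as F

def decoupled_cross_attention(q, k_text, v_text, k_img, v_img, lam=0.5):
    """Combine text-conditioned and image-conditioned attention outputs."""
    out_text = F.scaled_dot_product_attention(q, k_text, v_text)
    out_img = F.scaled_dot_product_attention(q, k_img, v_img)
    # lam trades semantic (text) control against identity (reference image) control.
    return out_text + lam * out_img
```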
4. Training Objectives and Evaluation Protocols
Training losses in guided inpainting frameworks enforce both pixel-level and semantic consistency. Structural and textural supervision are applied via feature matching (e.g. to RTV-extracted structure maps and ground-truth images) in (Liu et al., 2022). Perceptual and style losses, typically using VGG features and Gram matrices, further regularize textural fidelity.
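A minimal sketch of the perceptual and Gram-matrix style terms (PyTorch/torchvision; the layer choice is illustrative, and input normalization to ImageNet statistics is omitted for brevity):

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class PerceptualStyleLoss(nn.Module):
    """Feature-matching (perceptual) and Gram-matrix (style) losses on VGG features."""
    def __init__(self, layers=(3, 8, 15)):  # relu1_2, relu2_2, relu3_3 (illustrative)
        super().__init__()
        self.vgg = vgg16(weights="IMAGENET1K_V1").features.eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)
        self.layers = set(layers)

    def _features(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layers:
                feats.append(x)
        return feats

    @staticmethod
    def _gram(f):
        b, c, h, w = f.shape
        f = f.view(b, c, h * w)
        return f @ f.transpose(1, 2) / (c * h * w)

    def forward(self, pred, target):
        fp, ft = self._features(pred), self._features(target)
        perceptual = sum(torch.abs(a - b).mean() for a, b in zip(fp, ft))
        style = sum(torch.abs(self._gram(a) - self._gram(b)).mean()
                    for a, b in zip(fp, ft))
        return perceptual, style
```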
Advanced adversarial objectives—such as Relativistic Average Least-Squares in (Liu et al., 2022) or WGAN-GP in (Yoon et al., 2023)—are used to boost realism, often alongside tailored reconstruction and mask consistency losses.
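For reference, the WGAN-GP gradient penalty (a standard formulation, not specific to any single inpainting paper) penalizes deviation of the critic's gradient norm from 1 on interpolates between real and generated samples:

```python
import torch

def wgan_gp_penalty(critic, real, fake, lambda_gp=10.0):
    """Gradient penalty on random interpolates between real and fake images."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    score = critic(interp)
    grad = torch.autograd.grad(score.sum(), interp, create_graph=True)[0]
    grad_norm = grad.flatten(1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()
```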
Quantitative metrics include PSNR, SSIM, FID, LPIPS, CLIP-score, and semantic segmentation accuracy. For example, (Liu et al., 2022) reports PSNR/SSIM improvements across various hole ratios, demonstrating the efficacy of guided reference feature sampling. SuperCAF (Zhang et al., 2022) delivers improvements up to 7.4× over baselines on high-res boundary patches.
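PSNR and SSIM can be computed with scikit-image (a minimal sketch assuming float images in [0, 1]; papers differ in whether metrics are taken over the full image or only the hole region):

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_fill(gt, pred):
    """PSNR and SSIM between ground-truth and inpainted images (H x W x 3, in [0, 1])."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    return psnr, ssim
```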
Qualitative evaluation protocols may involve expert listener panels (for audio inpainting (Turland et al., 19 Sep 2025)), human preference studies for image realism and prompt fidelity (Wang et al., 2022, Manukyan et al., 2023), and ablations demonstrating the contribution of feature, mask, and prompt-based guidance.
5. Modalities of Guidance: Reference Images, Text Prompts, Semantic Masks, and Sketches
Reference-guided methods utilize actual paired images covering the same scene for feature transfer (Liu et al., 2022, Yoon et al., 2023), achieving sharper reconstruction and improved semantic alignment versus single-image priors.
Text-guided inpainting uses high-level descriptions to guide object or attribute generation within masks. CAT-Diffusion (Chen et al., 12 Sep 2024) decomposes object inpainting into semantic feature prediction via a CLIP-aligned Transformer, followed by diffusion denoising guided by visual prompts. PAIntA and RASG (Manukyan et al., 2023) combine prompt-aware attention rescaling and cross-attention energy guidance to enforce alignment between prompt tokens and image regions.
Semantic masks (Yu et al., 2022) and object-masks (Wang et al., 2022) enable precise spatial control over inpainted content, with SPADE-based decoders or mask-aware diffusion heads translating user edits into fine-grained layout constraints.
Sketch-based guidance (Sharma et al., 18 Apr 2024) leverages partial discrete diffusion, fusing transformer embeddings of sketch strokes and masked image tokens to deliver inpainted objects consistent with drawn pose and shape, achieving state-of-the-art FID and sketch-faithfulness.
6. Limitations, Extensions, and Generalization
Guided sampling’s effectiveness depends on the quality and specificity of the auxiliary guidance. Reference-guided methods may falter with poor or unrelated references, while text-prompt approaches can struggle with rare attributes or counting/shaping tasks (Wang et al., 2022). Overly rigid guidance may suppress detail, while weak guidance risks semantic drift.
Extensions include video inpainting with temporal consistency modules, joint attribute-motif learning, and integration of richer vector or stroke embeddings for sketch guidance (Sharma et al., 18 Apr 2024). Similarity-based hybrid methods developed for audio are suggested as transferable to visual modalities, exploiting corpus retrieval for global structure (Turland et al., 19 Sep 2025).
The general paradigm is applicable beyond images and audio, e.g. masked diffusion LLMs (dLLMs), where inpainting assignments steer policy optimization during reinforcement learning, addressing sparse reward scenarios (Zhao et al., 12 Sep 2025).
In sum, guided sampling and inpainting unite explicit, context-aware conditioning with robust generative priors, yielding controllable, high-fidelity reconstruction in complex semantic environments via architectural design, feature alignment, and targeted sampling interventions.