CorrFill: Training-Free Reference Guidance

Updated 4 July 2026

Reference Guidance is a technique where a damaged image is restored by aligning its missing regions with a provided reference, ensuring geometric and content fidelity.
CorrFill extracts correspondence signals from internal diffusion self-attention maps and refines them via filtering and smoothing to overcome patch mismatches.
By integrating attention masking and latent tensor optimization in a cyclic enhancement loop, CorrFill significantly improves inpainting accuracy on benchmarks like RealEstate10K.

CorrFill is a training-free module for reference-based image inpainting that improves how faithfully a diffusion model restores missing content from a provided reference image. In this setting, the input consists of a damaged target image $I_{\text{tar}} \in \mathbb{R}^{h \times w \times 3}$, a binary mask $M \in \{0,1\}^{h \times w}$ marking the missing region, and a reference image $I_{\text{ref}} \in \mathbb{R}^{h \times w \times 3}$ depicting the same scene, often from a nearby viewpoint or another moment. The objective is not merely plausible completion, but recovery of the target’s original content according to the actual geometry and content visible in the reference. CorrFill addresses the observation that diffusion-based inpainting methods often exploit the reference only loosely, producing outputs that are semantically plausible yet geometrically unfaithful. Its central contribution is to extract weak target-reference correspondence cues from diffusion self-attention during denoising and convert them into explicit guidance, without retraining the backbone (Liu et al., 4 Jan 2025).

1. Task definition and failure mode addressed

Reference-based image inpainting differs from generic exemplar-based editing because the missing region is expected to be reconstructed according to the original scene rather than replaced by any semantically compatible content. In the formulation used by CorrFill, the reference is therefore not auxiliary style information; it is an evidential constraint on the geometry and identity of the missing target region (Liu et al., 4 Jan 2025).

The method is motivated by a limitation in existing diffusion-based approaches. Conditioning schemes such as global or spatial image prompts, and stitched-image setups in which reference and target are concatenated side by side, allow information flow from the reference but do not explicitly specify which target patch should attend to which reference patch. The result is a characteristic failure mode: unwanted objects may appear, scene layout may be incorrect, and copied content may originate from semantically related but geometrically wrong regions (Liu et al., 4 Jan 2025).

This makes faithfulness the governing criterion of CorrFill. The method is not primarily concerned with enlarging the expressive capacity of the backbone. Instead, it introduces explicit correspondence constraints during sampling so that the inpainting process becomes aware of geometric correlation already latent in the model’s own attention maps (Liu et al., 4 Jan 2025).

2. Position within reference-guided generation

CorrFill belongs to a broader class of reference-guided generation methods, but its mechanism is specific. In supervised encoder-decoder restoration, reference information can be injected through explicit alignment and fusion modules, as in TransRef’s multi-scale reference embedding for inpainting (Liu et al., 2023). In training-free diffusion guidance, control can be introduced through external gradients or latent-space optimization without changing the pretrained generator, as in Semantic Diffusion Guidance (Liu et al., 2021). CorrFill is closer to the latter direction, but it specializes the idea to reference-based inpainting by deriving online correspondences from self-attention itself rather than from an external similarity model (Liu et al., 4 Jan 2025).

Its implementation is built on Stable Diffusion v2 Inpainting in a latent diffusion setting. Following the stitched-image formulation, the reference and target are concatenated horizontally into

$I_{\text{ref;tar}} \in \mathbb{R}^{h \times 2w \times 3}.$

After VAE encoding,

$\epsilon(I_{\text{ref;tar}}) \in \mathbb{R}^{h' \times 2w' \times d},$

the latent is concatenated with a noise latent $N^\epsilon \in \mathbb{R}^{h' \times 2w' \times d}$ and a downsampled mask $M^\epsilon \in \{0,1\}^{h' \times 2w'}$, producing

$z_T \in \mathbb{R}^{h' \times 2w' \times (2d+1)}.$

At each denoising step, the U-Net receives the current latent together with correspondence information estimated from the previous step, creating a cyclic refinement process rather than a static conditioning pipeline (Liu et al., 4 Jan 2025).
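The channel bookkeeping behind $z_T$ can be checked with a small shape-only sketch. It assumes NumPy arrays in place of real VAE latents; the function name `build_initial_latent`, the toy sizes, and the concatenation order are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def build_initial_latent(encoded, noise, mask):
    """Concatenate encoded stitched image, noise latent, and mask along channels.

    encoded: (h', 2w', d)  stand-in for the VAE encoding of the stitched image
    noise:   (h', 2w', d)  noise latent
    mask:    (h', 2w')     downsampled binary mask (1 = missing)
    returns: (h', 2w', 2d + 1)
    """
    if encoded.shape != noise.shape or encoded.shape[:2] != mask.shape:
        raise ValueError("shape mismatch between latent, noise, and mask")
    mask_channel = mask[..., None].astype(encoded.dtype)
    # Channel order here is an arbitrary choice for illustration.
    return np.concatenate([encoded, noise, mask_channel], axis=-1)

# Toy sizes (assumptions): h' = 8, w' = 8, d = 4.
h_lat, w_lat, d = 8, 8, 4
rng = np.random.default_rng(0)
encoded = rng.standard_normal((h_lat, 2 * w_lat, d))
noise = rng.standard_normal((h_lat, 2 * w_lat, d))
mask = np.zeros((h_lat, 2 * w_lat))
mask[2:6, w_lat + 2 : w_lat + 6] = 1  # hole lies in the target (right) half
z_T = build_initial_latent(encoded, noise, mask)
print(z_T.shape)
```

The last axis has $2d + 1$ channels: $d$ from the encoded image, $d$ from the noise latent, and one from the mask.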

A plausible implication is that CorrFill treats correspondence as an evolving latent variable of the denoising trajectory rather than as a preprocessing artifact. That distinguishes it from reference alignment pipelines that compute matching only once before generation.

3. Online correspondence estimation from self-attention

CorrFill starts from the claim that self-attention maps inside the inpainting diffusion model already contain primitive target-reference correlation signals, including in masked regions. For denoising step $t$, self-attention maps are collected from multiple layers, heads are averaged, all maps are rescaled to a common spatial resolution, and the results are summed to form an aggregated attention tensor

$A_t \in \mathbb{R}^{(h' \times 2w') \times (h' \times 2w')}.$

Only the block corresponding to queries from the target and keys/values from the reference is retained, giving

$A_t^{\text{tar}\to\text{ref}} \in \mathbb{R}^{(h' \times w') \times (h' \times w')}.$

Rather than matching from a single layer or timestep, CorrFill accumulates temporal consensus: $\bar{A}_t = \sum_{\tau=t}^{T} A_\tau^{\text{tar}\to\text{ref}},$ where $T$ is the initial diffusion timestep. The correspondence for target token $i$ is then chosen by nearest-neighbor selection: $C_t[i] = \arg\max_{j} \bar{A}_t[i, j].$ This converts the model’s own self-attention into a dynamic correspondence estimator (Liu et al., 4 Jan 2025).
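A minimal NumPy sketch of block extraction, temporal accumulation, and per-token argmax follows. It assumes the per-step attention maps have already been head-averaged, rescaled, and summed over layers, and that tokens are stored in row-major order with the reference in the left half of the stitched grid. The function names and the synthetic attention maps are hypothetical.

```python
import numpy as np

def target_to_reference_block(attn, h, w):
    """Keep rows for target queries and columns for reference keys.

    attn: (h * 2w, h * 2w) aggregated attention over a row-major h x 2w grid,
          left half = reference, right half = target (assumed layout).
    returns: (h * w, h * w), rows = target tokens, columns = reference tokens.
    """
    grid = np.arange(h * 2 * w).reshape(h, 2 * w)
    ref_idx = grid[:, :w].ravel()
    tar_idx = grid[:, w:].ravel()
    return attn[np.ix_(tar_idx, ref_idx)]

def estimate_correspondence(blocks):
    """Sum blocks over denoising steps, then pick the best reference per target."""
    consensus = np.sum(blocks, axis=0)
    return consensus.argmax(axis=1), consensus

# Synthetic check: plant a known match in noisy attention maps from 3 steps.
h, w = 4, 4
n = h * w
rng = np.random.default_rng(0)
true_match = rng.permutation(n)
grid = np.arange(h * 2 * w).reshape(h, 2 * w)
ref_idx, tar_idx = grid[:, :w].ravel(), grid[:, w:].ravel()
blocks = []
for _ in range(3):
    attn = rng.random((2 * n, 2 * n)) * 0.5      # background noise below 0.5
    attn[tar_idx, ref_idx[true_match]] += 1.0    # planted correspondence peaks
    blocks.append(target_to_reference_block(attn, h, w))
matches, consensus = estimate_correspondence(blocks)
print(np.array_equal(matches, true_match))
```

Summing over steps before taking the argmax means a single noisy step cannot flip a match on its own, which is the point of accumulating consensus.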

Because raw correspondences are noisy, CorrFill refines them by filtering and smoothing. Filtering targets a failure mode in which many target tokens collapse onto a few heavily attended reference tokens, called dominant tokens. If a reference token is selected by more target tokens than an empirically chosen threshold, those assignments are treated as outliers and removed from the valid set. The excluded correspondences are kept as a separate outlier set, because they remain useful later as regions to suppress in attention masking (Liu et al., 4 Jan 2025).
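Dominant-token filtering amounts to counting how often each reference token is selected. The sketch below assumes correspondences are stored as an integer array of reference indices; `filter_dominant` and the threshold value used in the example are illustrative assumptions, not the paper's setting.

```python
import numpy as np

def filter_dominant(matches, num_ref_tokens, threshold):
    """Flag correspondences that point at over-selected reference tokens.

    matches: (n_target,) reference index chosen by each target token
    returns: (inlier, dominant)
        inlier:   (n_target,) True where the correspondence is kept
        dominant: (num_ref_tokens,) True for reference tokens selected
                  by more than `threshold` target tokens
    """
    counts = np.bincount(matches, minlength=num_ref_tokens)
    dominant = counts > threshold
    inlier = ~dominant[matches]
    return inlier, dominant

# Reference token 5 is chosen by five of eight target tokens.
matches = np.array([0, 5, 5, 5, 5, 2, 3, 5])
inlier, dominant = filter_dominant(matches, num_ref_tokens=8, threshold=3)
print(inlier)                    # which target tokens keep their match
print(np.flatnonzero(dominant))  # reference tokens to suppress later
```

The `dominant` array plays the role of the retained outlier set: it is not discarded, because the masking stage can use it to block attention to those reference positions.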

Smoothing is based on the observation that correspondences near the mask boundary are more reliable than those deep inside the hole. CorrFill therefore defines a displacement field

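$D_t[i] = C_t[i] - i,$

where $C_t[i]$ is the reference position currently matched to target token $i$ and both positions are read as 2D grid coordinates, and a consensus-weight matrix

$W_t \in \mathbb{R}_{\ge 0}^{h' \times w'}.$

Outlier correspondences receive zero weight. The displacement is then replaced by a weighted neighborhood average,

$\tilde{D}_t[i] = \frac{1}{Z_t[i]} \sum_{j \in \mathcal{N}(i)} W_t[j] \, D_t[j],$

with

$Z_t[i] = \sum_{j \in \mathcal{N}(i)} W_t[j],$

where $\mathcal{N}(i)$ is a spatial neighborhood of target token $i$, and the refined correspondence becomes

$\tilde{C}_t[i] = i + \tilde{D}_t[i].$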

This suppresses isolated erroneous matches and enforces locally coherent geometry (Liu et al., 4 Jan 2025).
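The smoothing step can be sketched as a weighted box average over displacement vectors. This is a minimal illustration under assumptions: a square neighborhood of hypothetical radius 1, weights supplied by the caller (zero for outliers), and rounding back to the nearest grid cell. The function name `smooth_correspondence` is not from the paper.

```python
import numpy as np

def smooth_correspondence(matches, weights, h, w, radius=1):
    """Replace each displacement by a weighted average over its neighborhood.

    matches: (h*w,) reference index per target token, row-major on an h x w grid
    weights: (h*w,) non-negative reliability per correspondence (0 = outlier)
    returns: (h*w,) refined reference index per target token
    """
    tar_yx = np.stack(np.divmod(np.arange(h * w), w), axis=-1).reshape(h, w, 2)
    ref_yx = np.stack(np.divmod(matches, w), axis=-1).reshape(h, w, 2)
    disp = (ref_yx - tar_yx).astype(float)
    wgt = weights.reshape(h, w).astype(float)
    smoothed = disp.copy()
    for y in range(h):
        for x in range(w):
            ys = slice(max(0, y - radius), min(h, y + radius + 1))
            xs = slice(max(0, x - radius), min(w, x + radius + 1))
            z = wgt[ys, xs].sum()
            if z > 0:  # keep the raw displacement if no neighbor is reliable
                smoothed[y, x] = (wgt[ys, xs, None] * disp[ys, xs]).sum(axis=(0, 1)) / z
    refined = np.rint(tar_yx + smoothed).astype(int)
    refined[..., 0] = np.clip(refined[..., 0], 0, h - 1)
    refined[..., 1] = np.clip(refined[..., 1], 0, w - 1)
    return refined[..., 0].ravel() * w + refined[..., 1].ravel()

# Identity mapping with one bad match at token 5, marked as a zero-weight outlier.
h, w = 4, 4
matches = np.arange(h * w)
matches[5] = 15
weights = np.ones(h * w)
weights[5] = 0.0
refined = smooth_correspondence(matches, weights, h, w)
print(refined[5])
```

Because the bad match carries zero weight, it neither survives at its own position nor contaminates its neighbors, so the refined field returns to the locally consistent mapping.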

4. Guidance mechanisms during denoising

Once correspondences have been estimated at step $t$, CorrFill uses them to guide the next denoising step $t-1$ in two ways: attention masking and latent tensor optimization (Liu et al., 4 Jan 2025).

For attention masking, the method modifies the self-attention affinity matrix. Standard attention uses

$\text{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right),$

where $Q$ and $K$ are query and key matrices and $d$ is the attention embedding dimension. CorrFill replaces this with

$\text{softmax}\left(\frac{QK^\top}{\sqrt{d}} + B_t\right),$

where $B_t$ has the same shape as $QK^\top$ and is a correspondence-derived additive mask. For a target token with an inlier correspondence, the mask boosts affinity within a neighborhood of the matched reference location and sets all other reference positions to $-\infty$, while leaving non-reference parts unchanged. For a target token with an outlier correspondence, the mask only blocks the suspect dominant region, preventing repeated attention to irrelevant reference areas (Liu et al., 4 Jan 2025).
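The effect of an additive mask on a single attention row can be sketched in NumPy. The assumptions are loud here: keys are ordered with all reference tokens first and all remaining tokens after, the neighborhood is a square of hypothetical radius 1, and the boost value is arbitrary. None of the names below come from the paper.

```python
import numpy as np

def masked_attention(q, k, bias):
    """softmax(q k^T / sqrt(d) + bias), computed row-wise and stably."""
    logits = q @ k.T / np.sqrt(q.shape[-1]) + bias
    logits = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(logits)  # exp(-inf) = 0, so blocked keys get zero weight
    return p / p.sum(axis=-1, keepdims=True)

def inlier_bias_row(match, h, w, radius=1, boost=1.0):
    """Bias over reference keys: boost near the match, -inf everywhere else."""
    row = np.full((h, w), -np.inf)
    y, x = divmod(int(match), w)
    row[max(0, y - radius) : y + radius + 1, max(0, x - radius) : x + radius + 1] = boost
    return row.ravel()

def outlier_bias_row(dominant):
    """Bias over reference keys: block only the dominant tokens."""
    return np.where(dominant, -np.inf, 0.0)

# One target query, 9 reference keys followed by 9 non-reference keys.
h, w, d = 3, 3, 4
rng = np.random.default_rng(0)
q = rng.standard_normal((1, d))
k = rng.standard_normal((2 * h * w, d))
# Non-reference keys get zero bias, i.e. they are left unchanged.
bias = np.concatenate([inlier_bias_row(0, h, w), np.zeros(h * w)])[None, :]
attn = masked_attention(q, k, bias)
print(attn[0, :9].round(3))
```

With the match at the top-left corner, only reference tokens 0, 1, 3, and 4 can receive attention; the other reference tokens get exactly zero, while the non-reference keys still compete normally.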

Attention masking alone is not always strong enough, so CorrFill also performs latent tensor optimization. It reshapes and resizes target-to-reference attention maps from all self-attention layers, and optimizes the current latent $z_t$ so that those attention distributions approach a one-hot map centered at the estimated correspondence. The paper states that the objective uses normalized attention, a sigmoid mapping into $(0, 1)$, and a weighted binary cross-entropy against the one-hot encoding of the estimated correspondence. Although the printed equation is truncated typographically, the intended meaning is explicit: target-to-reference attention should be pushed toward the estimated geometric match. Conceptually, the update is of the form

$z_t \leftarrow z_t - \eta \, \nabla_{z_t} \mathcal{L}_{\text{corr}},$

where $\mathcal{L}_{\text{corr}}$ denotes the auxiliary cross-entropy objective and $\eta$ a step size.

This latent update is applied layerwise using gradients from the auxiliary objective (Liu et al., 4 Jan 2025).
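A toy version of this kind of update is sketched below. It is not the paper's objective: the U-Net is replaced by a linear map from one latent vector to attention logits over reference tokens, normalization is skipped, the positive-class weight and step size are hypothetical, and the gradient is written analytically instead of being obtained by backpropagation.

```python
import numpy as np

def weighted_bce_and_grad(z, keys, target, pos_weight):
    """Weighted BCE between sigmoid(keys @ z) and a one-hot target.

    Returns the loss and its gradient with respect to the latent z.
    """
    logits = keys @ z
    sig = 1.0 / (1.0 + np.exp(-logits))
    # -log(sigmoid(a)) = logaddexp(0, -a); -log(1 - sigmoid(a)) = logaddexp(0, a)
    loss = np.sum(pos_weight * target * np.logaddexp(0, -logits)
                  + (1 - target) * np.logaddexp(0, logits))
    dlogits = -pos_weight * target * (1 - sig) + (1 - target) * sig
    return loss, keys.T @ dlogits

rng = np.random.default_rng(0)
n_ref, c = 16, 32
# Orthonormal stand-in keys keep the toy problem well conditioned.
keys = np.linalg.qr(rng.standard_normal((c, n_ref)))[0].T   # (n_ref, c)
target = np.zeros(n_ref)
target[5] = 1.0              # estimated correspondence: reference token 5
z = np.zeros(c)              # toy latent for one target token
step = 0.1                   # hypothetical step size
losses = []
for _ in range(50):
    loss, grad = weighted_bce_and_grad(z, keys, target, pos_weight=4.0)
    losses.append(loss)
    z = z - step * grad      # gradient step on the latent, not on any weights
print(int(np.argmax(keys @ z)), losses[0] > losses[-1])
```

The design point the toy preserves is that only the latent is updated; the map from latent to attention stays fixed, which mirrors the training-free setting.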

These two mechanisms form a cyclic enhancement loop. Correspondences estimated from self-attention at one step constrain the next denoising step through attention masks and latent updates; the resulting denoising pass yields new self-attention maps; these produce better correspondences; and the cycle continues. This suggests that CorrFill is best understood as a self-bootstrapping control mechanism internal to diffusion sampling rather than a static attachment to the U-Net (Liu et al., 4 Jan 2025).

5. Empirical behavior and ablation evidence

CorrFill is evaluated on RealEstate10K and MegaDepth, with emphasis on reference faithfulness. The implementation stitches equally sized target and reference images side by side and uses DDIM sampling with 50 steps. It is attached to multiple baselines, including Paint-by-Example, IP-Adapter-Plus, a side-by-side baseline, and LeftRefill (Liu et al., 4 Jan 2025).

On RealEstate10K, the reported PSNR improvements are:

Method	Baseline PSNR	With CorrFill
Paint-by-Example	20.03	21.57
IP-Adapter-Plus	21.26	25.10
Side-by-side	23.32	25.81
LeftRefill	26.71	26.97

The gains are largest for methods lacking explicit geometric grounding, while even LeftRefill, already the strongest baseline and already based on a stitched formulation with prompt tuning, improves slightly in PSNR and also gains in SSIM and LPIPS (Liu et al., 4 Jan 2025).
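The per-method gains implied by the table can be recomputed directly. The snippet only does arithmetic on the four PSNR pairs listed above; the variable names are illustrative.

```python
# (baseline PSNR, PSNR with CorrFill) on RealEstate10K, copied from the table.
psnr = {
    "Paint-by-Example": (20.03, 21.57),
    "IP-Adapter-Plus": (21.26, 25.10),
    "Side-by-side": (23.32, 25.81),
    "LeftRefill": (26.71, 26.97),
}

gains = {name: round(after - before, 2) for name, (before, after) in psnr.items()}
largest = max(gains, key=gains.get)
smallest = min(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} dB")
print("largest gain:", largest)
print("smallest gain:", smallest)
```

The differences are +1.54, +3.84, +2.49, and +0.26 dB respectively, so the largest gain goes to IP-Adapter-Plus and the smallest to LeftRefill.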

On MegaDepth, improvements are smaller. The authors attribute this to more severe viewpoint changes and dynamic scene variation, which make 2D correspondence estimation harder and sometimes reduce the benefit of strict adherence to the reference. Even so, CorrFill improves Paint-by-Example, IP-Adapter-Plus, and the side-by-side baseline across all metrics, while LeftRefill changes only marginally (Liu et al., 4 Jan 2025).

The ablation study clarifies the role of each component. For the side-by-side baseline on RealEstate10K, PSNR rises from the 23.32 baseline as attention masking, outlier filtering, and correspondence smoothing are added in turn, and reaches 25.81 once latent tensor optimization is included. For LeftRefill, attention masking alone slightly lowers PSNR relative to the 26.71 baseline, but filtering and smoothing recover the loss, and latent tensor optimization brings it to 26.97 (Liu et al., 4 Jan 2025).

This pattern suggests two things. First, latent optimization provides the largest quantitative gain. Second, stronger baselines are more sensitive to noisy correspondence estimates, so refinement through filtering and smoothing is especially important before applying stronger guidance (Liu et al., 4 Jan 2025).

6. Limitations, scope, and significance

CorrFill’s limitations follow directly from its correspondence-based design. It can fail in scenes with repetitive structures or complex geometry because correspondences become ambiguous. It also struggles under large viewpoint changes, where 2D patch correspondences are insufficient. The paper gives the example of a statue with incorrect head orientation, illustrating that even roughly correct 2D matching may be inadequate when the underlying object transformation requires 3D reasoning (Liu et al., 4 Jan 2025).

Accordingly, CorrFill should not be interpreted as a full 3D-aware alignment system. Its strength lies in the regime where reliable 2D geometric correlation exists between target and reference. In that regime, it increases faithfulness without introducing new trainable modules, adapters, retraining, or fine-tuning of the diffusion backbone (Liu et al., 4 Jan 2025).

Its significance is therefore methodological rather than architectural. CorrFill does not propose a new inpainting backbone. It shows that internal diffusion self-attention already contains usable correspondence evidence, and that this evidence can be turned into a practical control signal through temporal consensus, outlier filtering, spatial smoothing, self-attention masking, and latent optimization. In that sense, its main contribution is a training-free mechanism for making existing diffusion inpainting systems more faithful to the reference image (Liu et al., 4 Jan 2025).