
Sketch Filling Fusion for Multimodal Inpainting

Updated 4 December 2025
  • Sketch Filling Fusion (SFF) is a multi-input image composition framework that fuses binary sketches and reference images to guide precise, user-driven inpainting.
  • It employs a structure-aware UNet with dual conditioning branches, using FiLM and CLIP-based cross-attention to enforce both structural and content fidelity.
  • Quantitative evaluations show SFF reduces pixel errors and FID scores, demonstrating superior performance over traditional inpainting methods.

Sketch Filling Fusion (SFF) is a multi-input-conditioned image composition framework designed to enable precise, user-driven image manipulation and inpainting by fusing sketch-based structural guidance with reference image-based content transfer. SFF fine-tunes a pre-trained latent diffusion model, integrating a binary sketch and a reference exemplar image to control the completion of missing regions in images at both the structural and textural levels. This approach achieves enhanced editability and fine-grained control, demonstrated by superior quantitative and qualitative results in targeted inpainting and composition tasks (Kim et al., 2023).

1. Model Inputs, Preprocessing, and Latent Encoding

SFF operates on four distinct inputs: a partially observed source image $x_p \in \mathbb{R}^{3 \times H \times W}$ with a masked region, a binary mask $m \in \{0,1\}^{H \times W}$ designating the region to fill, a binary sketch $s \in \{0,1\}^{H \times W}$ providing edge-level structure, and a reference exemplar image $x_r \in \mathbb{R}^{3 \times H' \times W'}$ offering content fidelity. Sketches are extracted via PiDiNet edge detection and binarized. Masks, sampled as rectangles or free-form shapes, are used for both training and user-driven inference. During training, the exemplar crop is derived from the ground-truth image according to the mask; at inference, users provide custom reference images.

All images are encoded using an autoencoder (taken from Stable Diffusion) into downsampled latent representations. The forward noising process in latent space is defined as:

$$q(z_t \mid z_0) = \mathcal{N}\!\left(z_t;\ \sqrt{\bar{\alpha}_t}\, z_0,\ (1-\bar{\alpha}_t) I\right),$$

where $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$ and $\alpha_i = 1-\beta_i$. At each synthesis step, the mask $m$ and sketch $s$ are concatenated with the noisy latents to form the input to the core model.
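For concreteness, the forward noising step can be written directly as tensor code. The following is a minimal sketch assuming PyTorch; the schedule endpoints and the helper name `sample_z_t` are illustrative choices rather than details taken from the paper.

```python
import torch

# Linear beta schedule with T = 1000 steps, matching the latent diffusion backbone.
T = 1000
betas = torch.linspace(1e-4, 2e-2, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)      # \bar{alpha}_t = prod_{i<=t} alpha_i

def sample_z_t(z0, t):
    """Sample z_t ~ q(z_t | z_0) = N(sqrt(abar_t) z_0, (1 - abar_t) I) and return the noise."""
    eps = torch.randn_like(z0)                 # epsilon ~ N(0, I)
    abar = alpha_bars[t].view(-1, 1, 1, 1)     # broadcast over (B, C, h, w)
    z_t = abar.sqrt() * z0 + (1.0 - abar).sqrt() * eps
    return z_t, eps

# Example: noise a batch of 4-channel latents at random timesteps.
z0 = torch.randn(2, 4, 64, 64)
t = torch.randint(0, T, (2,))
z_t, eps = sample_z_t(z0, t)
# The UNet then receives the channel-wise concatenation [z_t; s; m], with the sketch s
# and mask m resized to the latent resolution.
```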

2. Structure-aware UNet Architecture

The central building block of SFF is a structure-aware UNet $\epsilon_\theta$, which augments standard denoising UNet architectures with dual conditioning branches:

  • Reference Branch: A frozen CLIP encoder (ResNet-50 or ViT) processes $x_r$ into an embedding $h_r$, followed by a 2-layer MLP to yield a conditioning vector $c \in \mathbb{R}^D$. This vector is injected into all UNet blocks via cross-attention, replacing the text cross-attention tokens with $c$.
  • Sketch Branch: A shallow CNN (stacked $3 \times 3$ convolutions with ReLU) lifts the binary sketch $s$ to a feature map $f_s \in \mathbb{R}^{c_s \times h \times w}$. This map modulates convolutional features at every layer via FiLM:

$$F_\text{out} = \gamma(f_s) \circ F_\text{in} + \beta(f_s),$$

where $\gamma$ and $\beta$ are $1 \times 1$ convolutions over $f_s$.

The mask $m$ is concatenated both to the input tensor and to early UNet feature maps, distinctly identifying which regions to fill and which to preserve.
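The sketch branch can be illustrated with a short, hedged code sketch. The code below assumes PyTorch; the module names (`SketchEncoder`, `FiLM`), channel sizes, and strides are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SketchEncoder(nn.Module):
    """Shallow CNN lifting a binary sketch s of shape (B, 1, H, W) to features f_s (B, c_s, h, w)."""
    def __init__(self, c_s: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, c_s, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(c_s, c_s, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(c_s, c_s, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, s):
        return self.net(s)

class FiLM(nn.Module):
    """F_out = gamma(f_s) * F_in + beta(f_s), with gamma and beta as 1x1 convolutions."""
    def __init__(self, c_s: int, c_feat: int):
        super().__init__()
        self.gamma = nn.Conv2d(c_s, c_feat, kernel_size=1)
        self.beta = nn.Conv2d(c_s, c_feat, kernel_size=1)

    def forward(self, feat, f_s):
        # Resize sketch features to the spatial size of the current UNet feature map.
        f_s = F.interpolate(f_s, size=feat.shape[-2:], mode="nearest")
        return self.gamma(f_s) * feat + self.beta(f_s)

# Example: modulate a 320-channel UNet feature map with features from a fake binary sketch.
enc, film = SketchEncoder(c_s=64), FiLM(c_s=64, c_feat=320)
f_s = enc((torch.rand(1, 1, 512, 512) > 0.5).float())
feat = film(torch.randn(1, 320, 64, 64), f_s)
```

In this arrangement, the sketch features are resized to each feature map's resolution, so the same $f_s$ can modulate every UNet stage.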

3. Conditioning Mechanisms and Sampling Schedule

Reference and sketch conditioning act independently and jointly within the UNet, enabling the fusion of high-level appearance cues and low-level structure control:

  • Reference Embedding: Enables pixel-wise content transfer from $x_r$ to the masked region via CLIP-based cross-attention at every UNet block.
  • Sketch Fusion: Enforces local, edge-level fidelity through FiLM modulation at all convolutional layers, ensuring the output respects user-defined structure.
  • Mask Guidance: Guides the model to localize inpainting strictly to the user-specified mask.
  • Sketch Plug-and-Drop: Optionally disables FiLM modulation after a specified timestep $t_0$ during DDPM sampling, which improves naturalness by partially relaxing rigid sketch constraints when the sketch structure is too coarse.

During sampling, the reverse denoising process is conditioned as:

$$p_\theta(z_{t-1} \mid z_t, s, r) = \mathcal{N}\!\left(z_{t-1};\ \mu_\theta(z_t, t, s, r),\ \sigma_t^2 I \right).$$

The input to the UNet is $[z_t; s; m] \in \mathbb{R}^{(c+2) \times h \times w}$.
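The plug-and-drop schedule can be made concrete with a minimal sampling loop. The sketch below assumes a standard DDPM update with $\sigma_t^2 = \beta_t$ and a `unet` callable whose FiLM modulation can be toggled with a `use_sketch` flag; these names and the interface are assumptions, not the paper's code.

```python
import torch

@torch.no_grad()
def sample(unet, z_T, s, m, c, betas, t0: int = 0):
    """DDPM reverse process conditioned on sketch s, mask m, and reference embedding c.

    FiLM-based sketch modulation is disabled once t drops below t0 ("sketch plug-and-drop"),
    relaxing rigid structure constraints in the late, low-noise steps.
    """
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    z_t = z_T
    for t in reversed(range(len(betas))):
        use_sketch = t >= t0                          # sketch guidance active only for t >= t0
        x_in = torch.cat([z_t, s, m], dim=1)          # [z_t; s; m] along channels
        t_batch = torch.full((z_t.shape[0],), t, device=z_t.device, dtype=torch.long)
        eps_hat = unet(x_in, t_batch, context=c, use_sketch=use_sketch)

        # Standard DDPM posterior mean; sigma_t^2 = beta_t is one common choice.
        coef = (1.0 - alphas[t]) / (1.0 - alpha_bars[t]).sqrt()
        mean = (z_t - coef * eps_hat) / alphas[t].sqrt()
        noise = torch.randn_like(z_t) if t > 0 else torch.zeros_like(z_t)
        z_t = mean + betas[t].sqrt() * noise
    return z_t
```

Setting `t0 = 0` keeps sketch guidance active over the whole trajectory, while larger values drop the sketch constraint for a larger portion of the low-noise steps.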

4. Training Protocol and Hyperparameters

SFF is trained on the Danbooru cartoon subset, comprising 55,104 training and 13,775 test samples, where edge maps for sketches are generated using PiDiNet. Masks cover 5–30% of image area per sample, with both rectangular and free-form shapes.

Initialization uses Paint-by-Example weights based on Stable Diffusion. Training is performed for 40 epochs on 4 NVIDIA V100 GPUs (~2 days). The model is optimized using AdamW with a learning rate of $1 \times 10^{-5}$, weight decay $0.01$, and a batch size of 4 at an image size of $512 \times 512$. The number of diffusion steps $T$ is set to 1,000 with a linear $\beta$ schedule.

The training objective is a noise prediction loss:

$$L = \mathbb{E}_{z_0 \sim E(x),\ s,\ m,\ x_r,\ t,\ \epsilon \sim \mathcal{N}(0,I)} \left[ \| \epsilon - \epsilon_\theta(z_t, t, s, m, c) \|^2 \right],$$

with $\epsilon \sim \mathcal{N}(0, I)$.
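In code, this is a standard $\epsilon$-prediction MSE. The sketch below assumes PyTorch and reuses the `sample_z_t` helper from the Section 1 example; `vae_encode`, `ref_embed`, and `unet` are placeholder callables standing in for the Stable Diffusion autoencoder, the frozen CLIP encoder plus MLP, and the structure-aware UNet.

```python
import torch
import torch.nn.functional as F

def training_step(unet, vae_encode, ref_embed, optimizer, batch, T=1000):
    """One epsilon-prediction step; vae_encode, ref_embed, and unet are placeholders."""
    z0 = vae_encode(batch["image"])                  # ground-truth image -> latent z_0
    s, m = batch["sketch"], batch["mask"]            # binary sketch and mask at latent resolution
    c = ref_embed(batch["reference"])                # frozen CLIP + 2-layer MLP -> conditioning c

    t = torch.randint(0, T, (z0.shape[0],), device=z0.device)
    z_t, eps = sample_z_t(z0, t)                     # forward noising q(z_t | z_0), as above

    eps_hat = unet(torch.cat([z_t, s, m], dim=1), t, context=c)
    loss = F.mse_loss(eps_hat, eps)                  # || eps - eps_theta(z_t, t, s, m, c) ||^2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Optimizer settings reported above:
# optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5, weight_decay=0.01)
```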

5. Quantitative and Qualitative Evaluation

SFF demonstrates quantifiable improvements over single-modality and multimodal inpainting baselines. Ablations comparing reference-only (Paint-by-Example), text+sketch (Paint-by-T+S), and reference+sketch (SFF) show significant performance gains:

Metric      Paint-by-E   Paint-by-T+S   SFF (Ref+Sketch)
L₁ error    0.0866       0.0851         0.0680
L₂ error    0.0380       0.0313         0.0239
FID         6.314        6.314          5.716

Sketch guidance reduces pixel error by 20–30% and lowers FID by ~10%. LPIPS scores for SFF (~0.15) also outperform reference-only baselines (~0.20), indicating greater perceptual fidelity.
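For reference, the pixel-error metrics reduce to mean absolute and squared differences. The minimal sketch below computes them over the inpainted (masked) region, which is an assumption about the evaluation protocol rather than a detail stated in the results; the function name `pixel_errors` is illustrative.

```python
import torch

def pixel_errors(pred, target, mask):
    """Mean L1 / L2 error between prediction and ground truth over the masked region."""
    mask = mask.float().expand_as(pred)        # broadcast (B, 1, H, W) -> (B, 3, H, W)
    n = mask.sum().clamp(min=1.0)              # number of evaluated pixel-channel entries
    diff = (pred - target) * mask
    l1 = diff.abs().sum() / n
    l2 = diff.pow(2).sum() / n
    return l1.item(), l2.item()
```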

Qualitative highlights include:

  • Precise edge placement (hair and clothing boundaries) dictated by the sketch.
  • Consistency and transfer of patterns and colors from the reference image.
  • Flexible editing: swapping sketches and reference exemplars enables the synthesis of arbitrary objects or scene modifications.
  • Relaxed sampling (sketch plug-and-drop) yields visually plausible backgrounds even when sketches are coarse.

Use cases demonstrated include background scene extension in Webtoon panels, local object shape editing (hair, beard), and multi-reference object replacement (e.g., swapping shirt patterns).

6. Applications, Extensibility, and Implications

SFF supports a range of use cases for controllable image manipulation—especially in multimodal inpainting scenarios—without sacrificing edge or content fidelity. The fusion of sketch and reference conditioning enables practitioners to reproduce the approach, integrate novel sketch or exemplar encoders, and extend the framework to new tasks in composition and manipulation. A plausible implication is that structure-aware fusion modules like FiLM on sketch features may generalize to other domains requiring strict spatial control. The “plug-and-drop” mechanism offers adaptable structure enforcement, balancing rigidity and realism in composition.

Practitioners can adopt the SFF pipeline for tasks requiring fine-grained user-driven edits, compositional inpainting, or synthesis of new objects and backgrounds via interactive sketch and reference fusion (Kim et al., 2023).
