Categorical Inpainting: Methods & Applications

Updated 4 July 2026

Categorical inpainting is an image completion approach that models missing regions by leveraging semantic categories instead of relying solely on texture synthesis.
Techniques range from explicit semantic segmentation and contour prediction to latent variable manipulation and discrete token inference, ensuring structured and multimodal outputs.
These methods decouple semantic guidance from texture rendering to improve object boundaries, enable user interaction, and tackle the challenges of evaluating pluralistic completions.

Categorical inpainting is the branch of image inpainting in which the missing region is modeled, constrained, or steered at the level of semantic categories rather than treated only as a texture-completion problem. In the literature, the term covers several distinct regimes: explicit or implicit category-aware completion when a hole spans multiple semantic classes; latent semantic control in which users steer the completion through interpretable variables; discrete-token formulations in which missing content is inferred as categorical codebook labels; and coarse semantic mode selection such as foreground versus background, or creation versus removal (Liao et al., 2020, Xu et al., 2018, Chen et al., 2024, Xu et al., 29 Apr 2025, Jeon, 6 Mar 2025). A central theme across these formulations is that the ill-posed conditional distribution of completions is not merely continuous and low-level, but often multimodal at the level of objects, scene parts, boundaries, and semantic layouts (Ballester et al., 2022).

1. Conceptual scope and definitions

In the narrowest sense, categorical inpainting would mean explicit class-conditioned completion: a model receives a semantic label or layout and fills the missing region accordingly. The papers considered here show that much of the literature uses a broader notion. Some methods infer semantic categories internally rather than taking them as user-specified inputs, as in joint segmentation–inpainting systems that predict a semantic map inside the hole and then complete textures in a semantic-wise manner (Liao et al., 2020). Others replace explicit labels with weaker but still categorical control signals, such as learned foreground and background embeddings (Xu et al., 29 Apr 2025) or learned creation and removal prompts (Jeon, 6 Mar 2025). Still others are categorical in a representational sense: they infer missing discrete latent token labels from a learned codebook, so the unknown region is completed as a set of categorical latent assignments before pixel synthesis (Chen et al., 2024).

This broader usage is important because several influential methods do not take supervised class labels as input at inference time. “Controllable Semantic Image Inpainting” is not a class-conditional model with an explicit conditioning variable $y$; instead, it supports semantic steering through unsupervised disentangled latent manipulation (Xu et al., 2018). “ControlFill” similarly does not allow exact object-class specification, but it does distinguish between object-like creation and background extension through two learned intent embeddings (Jeon, 6 Mar 2025). “PixelHacker” uses category information during data construction and training, yet compresses 116 foreground categories and 21 background categories into only two latent embeddings, so its guidance is category-aware but not category-selectable in a fine-grained sense (Xu et al., 29 Apr 2025).

A persistent misconception is therefore that categorical inpainting must always be explicit label-conditioned generation. The literature instead supports a spectrum ranging from strict class supervision to latent, grouped, or inferred semantics. This suggests that “categorical” is best understood as referring to the semantic granularity at which the missing region is represented or constrained, rather than only to the interface exposed to the user.

2. Semantic representations used in categorical inpainting

The field has explored multiple semantic representations, each imposing a different notion of category structure on the missing region.

One representation is the predicted semantic segmentation map. In “Image Inpainting Guided by Coherence Priors of Semantics and Textures,” the model jointly predicts an inpainted image $\hat I$ and a segmentation map $\hat S$, and then uses the predicted categories to guide texture propagation within each semantic class (Liao et al., 2020). This is category-aware at the level of region labels and boundaries. The key claim is that holes involving multiple semantic categories are difficult because of obscure semantic boundaries and mixed textures, so the inpainting process should be decomposed into semantic subregions before texture completion.

A second representation is the foreground contour. “Foreground-aware Image Inpainting” predicts the missing foreground contour first and then uses it to guide content completion (Xiong et al., 2019). This is weaker than full multi-class semantic labeling, but stronger than pure texture synthesis. The contour encodes the foreground/background partition within the hole, which is especially important when the mask overlaps with or touches salient objects.

A third representation is the continuous latent semantic code. In “Controllable Semantic Image Inpainting,” the masked image and binary mask define a context $x_c = m \odot x$ and a target $x_t = (1-m)\odot x$, and a latent vector $z$ is inferred from the context and then manipulated to steer the semantics of the completion (Xu et al., 2018). The method is categorical only in a weak, unsupervised sense: some latent directions correspond to factors such as mouth shape, gender, azimuth, skin tone, hue, brightness, or face width, but these factors are discovered post hoc rather than aligned to explicit labels.
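The context/target split can be made concrete with a toy example. The sketch below is illustrative only: it assumes a grayscale image stored as nested Python lists and a mask that is 1 on visible pixels (the convention implied by $x_c = m \odot x$). The function name `split_context_target` is hypothetical and does not come from the paper.

```python
def split_context_target(x, m):
    """Return (x_c, x_t) with x_c = m * x and x_t = (1 - m) * x, elementwise.

    x is a 2-D list of pixel intensities; m is a 2-D list of 0/1 values,
    with 1 marking visible pixels (assumed convention).
    """
    x_c = [[mv * xv for mv, xv in zip(m_row, x_row)] for m_row, x_row in zip(m, x)]
    x_t = [[(1 - mv) * xv for mv, xv in zip(m_row, x_row)] for m_row, x_row in zip(m, x)]
    return x_c, x_t


# Toy 3x3 image with a 2x2 hole in the lower-right corner.
x = [[0.2, 0.4, 0.6],
     [0.8, 1.0, 0.1],
     [0.3, 0.5, 0.7]]
m = [[1, 1, 1],
     [1, 0, 0],
     [1, 0, 0]]

x_c, x_t = split_context_target(x, m)
print(x_c[1])  # visible part of the middle row
print(x_t[1])  # hidden part of the middle row
```

The two parts are complementary: adding them pixel by pixel recovers the original image, which is why an encoder that sees only the context has to infer everything about the target from semantics.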

A fourth representation is the discrete latent token. “Don’t Look into the Dark: Latent Codes for Pluralistic Image Inpainting” represents images through a VQGAN-style codebook $C = \{z_k\}_{k=1}^{K}$ and token labels $Y \in \mathbb{Z}$ , and formulates inpainting as prediction of missing token categories (Chen et al., 2024). In this setting, categorical inpainting is literal: the model predicts a categorical distribution over discrete codebook labels for missing spatial locations.
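To illustrate what "token labels" means here, the following sketch assigns each feature vector the index of its nearest codebook entry and marks hidden positions with a placeholder. It is a simplified stand-in and not the paper's tokenizer: the two-dimensional codebook, the `MASK` value of -1, and the function names are assumptions made for the example.

```python
MASK = -1  # hypothetical placeholder label for positions inside the hole


def nearest_code(vec, codebook):
    """Index k of the codebook entry z_k closest to vec in squared L2 distance."""
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sqdist(vec, codebook[k]))


def tokenize(features, visible, codebook):
    """Label visible positions by nearest code; leave hidden positions as MASK."""
    return [nearest_code(f, codebook) if v else MASK
            for f, v in zip(features, visible)]


codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # K = 4 toy codes
features = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8], [0.7, 0.9]]
visible = [True, True, False, True]

labels = tokenize(features, visible, codebook)
print(labels)  # [0, 1, -1, 3]
```

Inpainting in this representation amounts to replacing each `MASK` entry with one of the $K$ labels, which is a classification problem per spatial location.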

A fifth representation is the task-intent embedding. “ControlFill” learns two conditional embeddings, $y_c$ for creation and $y_r$ for removal, and uses classifier-free guidance to modulate the denoising trajectory toward object-like insertion or background completion (Jeon, 6 Mar 2025). The categories here are coarse semantic modes rather than object classes. “PixelHacker” adopts a related but distinct design through foreground and background embeddings injected into latent diffusion by cross-attention, again yielding coarse category guidance rather than fine-grained class control (Xu et al., 29 Apr 2025).

These representations differ in explicitness, granularity, and controllability, but they share a common function: they separate high-level semantic uncertainty from low-level texture synthesis.

3. Architectural decompositions: structure, semantics, and texture

A recurrent architectural pattern in categorical inpainting is to factor the problem into a semantic or structural stage and a texture-synthesis stage.

“Foreground-aware Image Inpainting” is an explicit example of this decomposition. Its pipeline comprises an incomplete contour detection module, a contour completion module, and an image completion module (Xiong et al., 2019). The contour model predicts missing foreground boundaries, and the image model then synthesizes appearance conditioned on the completed contour. The practical motivation is that conventional methods often fail when the hole intersects object boundaries because they do not explicitly reason about the foreground/background extent inside the missing region.

“Image Inpainting Guided by Coherence Priors of Semantics and Textures” uses a related but richer decomposition. It employs a multi-scale shared decoder with two task-specific heads: one for inpainting and one for semantic segmentation. Between scales, the Semantic-Wise Attention Propagation (SWAP) module updates the decoder feature using the segmentation predicted at the current scale, so the evolving segmentation prediction refines the next-scale inpainting features (Liao et al., 2020). Within SWAP, patch matching is performed only within the same predicted semantic class, which is intended to prevent cross-category texture borrowing.
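The effect of restricting matching to a single semantic class can be shown with a small sketch. This is a hard nearest-neighbour simplification and not the attention mechanism of the paper: the feature vectors, class names, and the function `semantic_wise_match` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def semantic_wise_match(feats, labels, known):
    """For each hole patch, pick the most similar known patch of the same class.

    Returns {hole_index: known_index or None}; None means no known patch
    shares the predicted class.
    """
    matches = {}
    for i, (f, lab) in enumerate(zip(feats, labels)):
        if known[i]:
            continue
        candidates = [j for j in range(len(feats))
                      if known[j] and labels[j] == lab]
        matches[i] = (max(candidates, key=lambda j: cosine(f, feats[j]))
                      if candidates else None)
    return matches


feats = [[1.0, 0.0], [0.0, 1.0], [0.2, 1.0], [0.9, 0.3], [0.5, 0.5]]
labels = ["sky", "grass", "grass", "grass", "road"]  # predicted classes
known = [True, True, True, False, False]

matches = semantic_wise_match(feats, labels, known)
print(matches)  # {3: 2, 4: None}
```

Patch 3 is closest to the sky patch in raw feature space, but because it is predicted as grass it can only borrow from grass patches. That is the cross-category borrowing the restriction is meant to prevent.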

“Controllable Semantic Image Inpainting” separates global semantics from local consistency in a different way. The latent code $z$ captures high-level semantic intent, while a bidirectional PixelCNN handles local texture continuity. The forward autoregressive model is conditioned not only on previously generated target pixels, but also on reverse-context features and latent-decoder features, and each forward layer combines all three inputs.

The reverse PixelCNN is introduced because a standard forward PixelCNN cannot directly access context below or to the right of a target location, which leads to seams along the lower and right boundaries of the inpainted region (Xu et al., 2018).
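The blind spot of a forward raster-scan model can be enumerated directly. The sketch below lists the known pixels that come after a given target position in raster order and are therefore unavailable to a purely forward model. The 4x4 mask and the function name `unseen_context` are assumptions for illustration.

```python
def unseen_context(target, mask):
    """Known pixels that a forward raster-scan model cannot condition on.

    target is the (row, col) being predicted; mask[r][c] is 1 for known
    pixels and 0 inside the hole. A forward model sees only positions that
    precede the target in raster order (top-to-bottom, left-to-right).
    """
    return sorted(
        (r, c)
        for r, row in enumerate(mask)
        for c, is_known in enumerate(row)
        if is_known and (r, c) > target  # tuple comparison follows raster order
    )


# 4x4 image with a 2x2 hole in the centre.
mask = [[1, 1, 1, 1],
        [1, 0, 0, 1],
        [1, 0, 0, 1],
        [1, 1, 1, 1]]

print(unseen_context((2, 2), mask))
# [(2, 3), (3, 0), (3, 1), (3, 2), (3, 3)]
```

For the bottom-right hole pixel, every known neighbour below and to the right is invisible to the forward pass, which is consistent with seams appearing along those edges when no reverse pass supplies that context.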

“Don’t Look into the Dark” generalizes this factorization into a four-stage pipeline: pretrained VQGAN tokenizer, restrictive partial encoder for visible tokens, bidirectional transformer for missing token inference, and an overview network that combines the token features with partial-image priors (Chen et al., 2024). The restrictive encoder and the decoder use different partial-convolution strategies because their roles differ: the encoder must avoid contamination from the hole, whereas the decoder must propagate visible information into masked regions for seamless composition.

“PixelHacker” moves the same separation into latent diffusion. The model uses a frozen SDXL-VAE, performs denoising in latent space, and intermittently injects latent category guidance through cross-attention after gated linear self-attention blocks in both downsampling and upsampling stages (Xu et al., 29 Apr 2025). Here, semantic grouping is not exposed as an external label map; it is baked into the denoising dynamics through two fixed-size embeddings.

4. Probabilistic and pluralistic formulations

Categorical inpainting is closely related to the more general problem of multiple or pluralistic inpainting. “An Analysis of Generative Methods for Multiple Image Inpainting” formulates the task as learning the multimodal conditional distribution of the complete image given its masked observation, so that the same masked observation may admit several plausible completions (Ballester et al., 2022). This perspective is directly relevant to categorical uncertainty: different completions may correspond to different semantic modes, object configurations, or scene layouts.

Within this probabilistic view, several papers emphasize that high-quality inpainting should not collapse the conditional distribution to a single deterministic estimate. “Controllable Semantic Image Inpainting” avoids this collapse through disentangled latent variables inferred from the visible context. Its training objective differs from a standard VAE because the encoder receives only the context $x_c$ and the mask $m$, while the decoder reconstructs the full image, thereby forcing $z$ to encode likely global semantics of the complete image (Xu et al., 2018). The paper’s comparison among VAE, $\beta$-TCVAE, Info-$\beta$-TCVAE, and InfoVAE-MMD argues that a strong autoregressive decoder tends to ignore latent codes unless the regularization preserves informativeness.

“Don’t Look into the Dark” addresses the same multimodality by replacing continuous latent noise with discrete token prediction. Its transformer learns a categorical distribution over the missing token labels conditioned on the visible tokens, and different samples from this distribution yield different valid token maps and therefore different completed images (Chen et al., 2024). The method further uses adaptive temperature during MaskGIT-style iterative decoding to balance diversity and coherence.
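How temperature trades diversity against coherence can be seen in a minimal sampling sketch. It draws one label per missing position from a temperature-scaled softmax. The logits, the fixed temperatures, and the function names are assumptions, and the paper's adaptive schedule and iterative decoding loop are not reproduced here.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; lower temperature sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_tokens(logits_per_position, temperature, rng):
    """Draw one codebook label per missing position."""
    return [
        rng.choices(range(len(logits)), weights=softmax(logits, temperature))[0]
        for logits in logits_per_position
    ]


# Hypothetical transformer outputs for three missing positions, K = 3 codes.
logits = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3], [1.0, 1.1, 3.0]]

rng = random.Random(0)
sharp = sample_tokens(logits, temperature=0.01, rng=rng)  # near-greedy
diverse = [sample_tokens(logits, temperature=2.0, rng=rng) for _ in range(5)]
print(sharp)    # [0, 1, 2]
print(diverse)  # varies with the seed
```

At a very low temperature the samples coincide with the per-position argmax, while a higher temperature spreads probability over more labels and yields different token maps across draws.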

The survey chapter identifies a broader pattern: the most successful generative strategies tend to sample a coarse multimodal structure first and then render texture in a later stage (Ballester et al., 2022). This suggests that categorical inpainting is often best implemented not by direct RGB-space variation, but by placing uncertainty in a structured latent representation—tokens, contours, semantic maps, or coarse layouts.

5. Controllability and user interaction

The literature distinguishes sharply between internal semantic guidance and user-facing semantic control.

“Controllable Semantic Image Inpainting” is explicitly user-controllable, but the control channel is an unsupervised latent vector rather than a semantic label (Xu et al., 2018). The user modifies selected coordinates of $z$, identified through latent traversal, and the paper reports that five independent factors can each be controlled by a single latent variable and that several mouth-related factors can be manipulated jointly with approximately superimposed effects. This is strong attribute-level steering, but it does not guarantee named or label-aligned categories.

“ControlFill” makes the control signal more operational. At inference time, users steer the model through learned creation and removal embeddings and through classifier-free guidance, with removal mode using $y_r$ as the positive condition and $y_c$ as the negative condition, and creation mode reversing the roles (Jeon, 6 Mar 2025). Its spatially varying guidance further assigns different intentions to different pixels through a ternary mask, enabling mixed creation and removal in one denoising run. Yet the paper is explicit that in creation mode it is not possible to specify the exact object classes to be generated inside the mask.
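The role swap between positive and negative conditions can be sketched with the generic classifier-free guidance extrapolation, applied per pixel. This is an assumed form and not the paper's exact formulation: the encoding of the ternary mask as +1 (creation), -1 (removal), and 0 (neutral), the midpoint used for neutral pixels, the guidance weight `w`, and the function name are all hypothetical.

```python
def guided_noise(eps_c, eps_r, intent, w=3.0):
    """Per-pixel guided noise estimate from two conditional predictions.

    eps_c / eps_r: noise predicted under the creation / removal embedding.
    intent: +1 creation, -1 removal, 0 neutral (assumed ternary encoding).
    Uses the generic form eps_neg + w * (eps_pos - eps_neg).
    """
    out = []
    for ec, er, t in zip(eps_c, eps_r, intent):
        if t == 1:        # creation: y_c is positive, y_r is negative
            out.append(er + w * (ec - er))
        elif t == -1:     # removal: y_r is positive, y_c is negative
            out.append(ec + w * (er - ec))
        else:             # neutral: midpoint, an arbitrary choice for this sketch
            out.append(0.5 * (ec + er))
    return out


eps_c = [1.0, 1.0, 1.0]   # toy predictions for three pixels
eps_r = [0.0, 0.0, 0.0]
intent = [1, -1, 0]

guided = guided_noise(eps_c, eps_r, intent)
print(guided)  # [3.0, -2.0, 0.5]
```

With a weight above 1, creation pixels are pushed past the creation prediction and away from the removal prediction, and removal pixels the other way, so both intents can be served in one denoising step.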

“PixelHacker” occupies an even weaker control regime. The model is category-aware through foreground and background embeddings, but the paper does not describe a user interface in which one chooses a specific foreground class or background class at inference time (Xu et al., 29 Apr 2025). The guidance acts as an internal semantic prior rather than an editable conditioning variable.

By contrast, “Image Inpainting Guided by Coherence Priors of Semantics and Textures” predicts segmentation maps internally and uses them to constrain texture propagation, but the segmentation is not provided by the user (Liao et al., 2020). In this sense, the method supports category-consistent completion, not category-conditioned editing.

A plausible implication is that categorical inpainting currently spans two partially disconnected goals: semantic plausibility under weak or inferred category structure, and explicit user control over target categories. The cited literature is much stronger on the former than on the latter.

6. Evaluation and limitations

Evaluation remains a central difficulty because semantic or categorical correctness is often not equivalent to pixelwise fidelity. “Perceptual Artifacts Localization for Inpainting” argues that metrics such as PSNR, MSE, SSIM, and LPIPS are poor proxies when the task is object removal or content replacement, because a result may differ greatly from the original image and still be perceptually good (Zhang et al., 2022). The paper introduces the Perceptual Artifact Ratio (PAR),

$$\mathrm{PAR} = \frac{\text{objectionable inpainted area}}{\text{total hole area}},$$

and reports that PAR reaches 72.89% agreement with strong human preference on object-removal comparisons. It also localizes only the bad subregions of an inpainted hole and uses the predicted artifact mask for iterative re-inpainting, which improves quality across LaMa, ProFill, CoMod-GAN, and EdgeConnect.
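Given a predicted artifact mask, the ratio is simple to compute. The sketch below assumes binary masks stored as nested lists, with 1 marking hole pixels in `hole_mask` and flagged pixels in `artifact_mask`. Ignoring flagged pixels outside the hole is a choice made for this example, and the function name is hypothetical.

```python
def perceptual_artifact_ratio(artifact_mask, hole_mask):
    """Fraction of hole pixels flagged as objectionable.

    Both masks are 2-D lists of 0/1. Pixels flagged outside the hole are
    ignored (an assumption of this sketch).
    """
    hole_area = sum(1 for row in hole_mask for h in row if h)
    if hole_area == 0:
        raise ValueError("hole_mask contains no hole pixels")
    bad_area = sum(
        1
        for a_row, h_row in zip(artifact_mask, hole_mask)
        for a, h in zip(a_row, h_row)
        if a and h
    )
    return bad_area / hole_area


hole_mask = [[0, 0, 0],
             [0, 1, 1],
             [0, 1, 1]]
artifact_mask = [[1, 0, 0],   # (0, 0) is flagged but lies outside the hole
                 [0, 0, 0],
                 [0, 0, 1]]

par = perceptual_artifact_ratio(artifact_mask, hole_mask)
print(par)  # 0.25
```

Because the score depends only on the inpainted result and not on the original image, it remains meaningful for object removal, where no ground-truth completion exists.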

For categorical inpainting, this evaluation problem is more severe. “Controllable Semantic Image Inpainting” does not provide a quantitative metric specifically for semantic controllability or category correctness, and its evidence for controllability is primarily qualitative and ablation-based (Xu et al., 2018). “ControlFill” evaluates creation and removal through LPIPS, FID, HPSv2, CLIP score, DEPTH_REL, and DEPTH_DIFF, but its semantic categories remain coarse (Jeon, 6 Mar 2025). “PixelHacker” reports strong FID and LPIPS across Places2, CelebA-HQ, and FFHQ, yet its guidance still collapses many source categories into foreground/background groups (Xu et al., 29 Apr 2025). “Image Inpainting Guided by Coherence Priors of Semantics and Textures” adds segmentation mIoU, which is more directly tied to semantic plausibility, but its category predictions are internally inferred and therefore sensitive to segmentation quality (Liao et al., 2020).

Several limitations recur across the literature. Fine-grained class control is generally absent: unsupervised latents are not label-aligned (Xu et al., 2018), creation mode does not specify exact object classes (Jeon, 6 Mar 2025), and foreground/background embeddings remain coarse (Xu et al., 29 Apr 2025). Category-aware systems often depend on auxiliary structure estimators such as saliency masks, contours, or segmentation maps, so their failure can propagate into the final image (Xiong et al., 2019, Liao et al., 2020). Discrete-token systems improve multimodality but add iterative transformer inference and lossy quantization, with reported inference around 314 ms per image and most time spent in token prediction (Chen et al., 2024). Evaluation datasets and demonstrations can also be narrow, as in the low-resolution CelebA setting of controllable semantic inpainting (Xu et al., 2018).

Taken together, these works define categorical inpainting as a family of methods that elevate semantic category structure—whether explicit, inferred, latent, or discrete—to a first-class component of image completion. The field has established that semantic and structural decomposition improves object boundaries, class-consistent textures, pluralistic sampling, and controllability. It has not yet established a single dominant formulation for precise, user-specified class-conditioned inpainting.