RefSel-HQ Dataset for Facial Image Restoration

Updated 4 July 2026

RefSel-HQ is a dataset that provides 10,000 facial triplets (ground-truth, high-quality reference, and binary mask) for supervising conditional reference selection in restoration tasks.
It features a dual-stage mask generation approach combining manual annotation and U-Net propagation to capture precise texture consistency between same-identity images.
The dataset’s annotations focus on dynamic and static facial texture conflicts, enabling optimized transfer decisions for enhanced identity preservation in restoration pipelines.

RefSel-HQ is a dataset introduced alongside the blind facial image restoration method RefSTAR to train an explicit reference selection module, RefSel, for deciding which facial regions should inherit textures from a high-quality reference image. It comprises 10,000 triplets $(I, I^{\rm Ref}, M)$ at $512 \times 512$ resolution, where $I$ is a ground-truth facial image, $I^{\rm Ref}$ is a distinct high-quality image of the same identity, and $M$ is a binary mask indicating the facial regions whose textures should be transferred from the reference. The dataset was designed to support reference-aware restoration under unknown degradations, with supervision focused on identity-relevant texture consistency rather than unconstrained feature borrowing (Yin et al., 14 Jul 2025).

1. Definition and role within RefSTAR

RefSel-HQ serves as the training substrate for the RefSel module in RefSTAR, a method organized around reference selection, transfer, and reconstruction for blind facial image restoration (Yin et al., 14 Jul 2025). Its central purpose is to supervise a predictor that estimates where reference-image textures are appropriate and where they should be suppressed.

In this formulation, the mask $M$ is not a generic segmentation target. It denotes “exactly those facial regions whose textures should be transferred from the reference image.” This makes RefSel-HQ distinct from ordinary face parsing corpora or landmark datasets: its annotations encode inter-image texture agreement and conflict for same-identity image pairs rather than semantic part categories.

Each sample is built from two distinct high-quality images of the same identity under different poses or expressions. The inclusion of a degraded input $I_{LQ}$, synthesized from the ground-truth image, allows RefSel-HQ to be plugged directly into restoration pipelines in which the model must infer transferability from a low-quality observation and a high-quality reference. For full face-restoration training, however, RefSel-HQ is used only to train or pretrain the RefSel module; the backbone restoration model itself uses CelebRef-HQ with separate train/validation splits (Yin et al., 14 Jul 2025).

A plausible implication is that RefSel-HQ is best understood as a task-specific supervisory dataset for conditional reference selection rather than a general-purpose face restoration benchmark.

2. Source data and mask construction pipeline

The dataset is constructed from CelebRef-HQ and other public face collections to form 10,000 “ground-truth $\leftrightarrow$ reference” pairs (Yin et al., 14 Jul 2025). Each pair contains two high-quality images of the same identity, both at $512 \times 512$ resolution, but differing in pose and/or expression.

Mask generation follows a staged pipeline combining manual labeling, model propagation, and human correction. First, 800 image pairs were manually annotated to obtain pixel-wise masks $M_{\rm manual}$. Second, a U-Net segmentation network was trained on these annotated pairs so that its predicted mask matches $M_{\rm manual}$. Third, this network was run on the remaining approximately 9,200 pairs to generate propagated masks. Fourth, the resulting masks underwent human-in-the-loop filtering and correction. About 12,000 masks were generated in total, after which low-quality ones were removed to obtain 10,000 final triplets $(I, I^{\rm Ref}, M)$ (Yin et al., 14 Jul 2025).

The low-quality input $I_{LQ}$ is synthesized from each ground-truth image using the Real-ESRGAN pipeline with motion and defocus blur. An important exception is explicitly defined: if the synthesized degradation is so severe that no reliable textures remain, the mask $M$ is replaced by an all-ones face-only mask, described as “select everything,” so that RefSel does not attempt to match fine details that are no longer recoverable (Yin et al., 14 Jul 2025).
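The fallback rule can be sketched as a small helper. This is an illustration only: the function name, the `face_mask` input, and the `severe` flag are hypothetical, and the source does not say how "too severe" is decided.

```python
import numpy as np

def training_mask(mask, face_mask, severe):
    """Pick the supervision mask for one sample (illustrative sketch).

    mask:      (H, W) binary transfer mask M (1 = use reference texture)
    face_mask: (H, W) binary face region, 0 on hair and background (assumed input)
    severe:    True when the degraded input keeps no reliable texture
               (the decision criterion is an assumption, not given in the source)
    """
    mask = np.asarray(mask, dtype=np.uint8)
    face_mask = np.asarray(face_mask, dtype=np.uint8)
    if severe:
        # "Select everything": all ones, but only inside the face region.
        return face_mask.copy()
    # Otherwise keep the annotated mask, with hair/background forced to zero.
    return mask * face_mask
```

Multiplying by the face mask in the normal branch is redundant if the stored masks already exclude hair and background, but it keeps both branches consistent.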

This construction strategy ties annotation semantics to restoration feasibility. The mask does not merely mark inter-image difference; it also adapts to whether the degraded input preserves enough local evidence for texture-consistent transfer.

3. Annotation protocol and mask semantics

The annotation protocol is organized around a “dual-axis” taxonomy of conflicts. The first axis is dynamic conflicts, including expression changes such as mouth open versus closed, eyes open versus closed, and wrinkles appearing or disappearing. The second axis is static conflicts, including freckles, birthmarks, heavy makeup, glasses, and other accessories (Yin et al., 14 Jul 2025).

All hair and background are always masked out, so supervision is restricted to facial textures. This means the dataset is explicitly face-region-centric and avoids entangling transfer supervision with hairstyle, scene content, or non-facial boundaries.

RefSel-HQ contains two mask types. The first consists of 800 manually drawn masks that precisely follow expression and accessory conflicts. The second consists of 9,200 U-Net-propagated masks that were post-filtered by human review (Yin et al., 14 Jul 2025). In both cases, the masks are binary segmentations with no soft labels.

The mask statistics and characteristics reported for the released dataset are summarized below.

Property	Value
Total triplets	10,000
Manual masks	800
Auto+filtered masks	9,200
Resolution	$512 \times 512$
Average face-area coverage	45%
Coverage variability	[value lost in extraction]
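A coverage figure such as the one in the table could be computed per mask as follows. This is a sketch under an assumed definition: coverage is taken to be the fraction of face pixels marked for transfer, which the source does not spell out, and `face_mask` is a hypothetical input.

```python
import numpy as np

def face_coverage(mask, face_mask):
    """Fraction of face pixels that the transfer mask selects (assumed definition)."""
    mask = np.asarray(mask, dtype=bool)
    face = np.asarray(face_mask, dtype=bool)
    face_px = int(face.sum())
    if face_px == 0:
        return 0.0  # no face pixels: define coverage as zero
    return float((mask & face).sum() / face_px)
```

Averaging this value over all masks would give a dataset-level mean, and its standard deviation would give one possible measure of variability.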

The dataset description states that there are no random shapes or geometric primitives; every mask is data-driven and corresponds to real texture disagreements (Yin et al., 14 Jul 2025). This is significant because it differentiates RefSel-HQ from synthetic masking schemes that inject arbitrary occlusion patterns or patch selections. Here, the mask target is grounded in perceptually meaningful identity- and expression-linked discrepancies between same-identity exemplars.

A plausible implication is that the learned RefSel predictor is optimized for semantically structured transfer decisions rather than generic spatial attention.

4. Dataset organization and released format

The released dataset contains 10,000 triplets $(I, I^{\rm Ref}, M)$, all stored as PNG files with 8-bit, no-compression image encoding (Yin et al., 14 Jul 2025). Ground-truth, reference, and degraded images are stored separately, while masks are divided into train and test subdirectories.

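The official directory layout follows this separation: one folder each for ground-truth, reference, and degraded images, plus a mask folder with train and test subdirectories.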

The split used for RefSel training and testing consists of 9,750 pairs for training and 250 held-out pairs for testing (Yin et al., 14 Jul 2025). The train/test distinction applies to the mask supervision for RefSel rather than to the full restoration backbone.
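The split sizes can be illustrated with a short helper. This is not the official split, which is fixed by the released train and test mask directories; the function name, the seed, and the random assignment are assumptions made only to show the 9,750 / 250 partition.

```python
import random

def split_pairs(n_pairs=10_000, n_test=250, seed=0):
    """Partition pair indices into train and test sets (illustrative only)."""
    ids = list(range(n_pairs))
    random.Random(seed).shuffle(ids)  # seeded, so the partition is reproducible
    test = sorted(ids[:n_test])
    train = sorted(ids[n_test:])
    return train, test
```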

The binary mask format is defined as $0$ = “do not transfer” and $1$ = “use reference feature.” This binary convention is consistent with the dataset’s function as a target for a texture-consistency predictor rather than a soft correspondence estimator.
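When masks are loaded from 8-bit PNG files, they typically need to be mapped to this 0/1 convention. The sketch below assumes the masks are stored as 0/255 images, possibly with three identical channels; the source does not state the stored values, and the threshold of 128 is an arbitrary choice.

```python
import numpy as np

def to_binary_mask(png_array, threshold=128):
    """Map an 8-bit mask image to {0, 1}: 0 = do not transfer, 1 = use reference."""
    arr = np.asarray(png_array)
    if arr.ndim == 3:
        arr = arr[..., 0]  # assume all channels carry the same mask
    return (arr >= threshold).astype(np.uint8)
```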

The release notes also specify that the dataset license is CC BY-SA 4.0 and that official code, data, and pretrained models are available in the RefSTAR repository (Yin et al., 14 Jul 2025).

5. Training usage and optimization target

In practical use, $I_{LQ}$, $I^{\rm Ref}$, and the mask $M$ are read as 3-channel tensors, and pixel values are scaled to either $[0, 1]$ or $[-1, 1]$ (Yin et al., 14 Jul 2025). Random horizontal flips may be applied, but color jitter and heavy geometric distortion are explicitly not recommended.
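A minimal preprocessing sketch along these lines is shown below. The function names and the default flip probability `p` are assumptions for illustration. The one design point worth noting is that a flip should be applied to the degraded image, the reference, and the mask together, so that the mask stays aligned with the images.

```python
import numpy as np

def scale_pixels(img_u8, signed=False):
    """Scale 8-bit pixel values to [0, 1], or to [-1, 1] when signed=True."""
    x = np.asarray(img_u8, dtype=np.float32) / 255.0
    return x * 2.0 - 1.0 if signed else x

def flip_together(arrays, rng, p=0.5):
    """Flip every (H, W[, C]) array horizontally with probability p, or none of them."""
    if rng.random() < p:
        return [a[:, ::-1].copy() for a in arrays]
    return list(arrays)
```

Here `rng` is expected to be a NumPy generator, for example `np.random.default_rng(0)`.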

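For RefSel training, the degraded image and the reference image are concatenated along the channel axis into a six-channel tensor:

$$[I_{LQ}, I^{\rm Ref}] \in \mathbb{R}^{6 \times 512 \times 512},$$

which serves as the input to RefSel. The target is the binary mask

$$M \in \{0, 1\}^{512 \times 512}.$$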
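The six-channel input described above can be assembled as in the following sketch. It assumes a channel-first (C, H, W) layout, which is a common convention but is not confirmed by the source; the function name is hypothetical.

```python
import numpy as np

def refsel_input(lq, ref):
    """Stack a degraded image and a reference image into one 6-channel array.

    Both inputs are assumed to be (3, H, W) arrays with matching shapes.
    """
    lq = np.asarray(lq, dtype=np.float32)
    ref = np.asarray(ref, dtype=np.float32)
    if lq.shape != ref.shape or lq.ndim != 3 or lq.shape[0] != 3:
        raise ValueError("expected two (3, H, W) arrays of identical shape")
    # Channels 0-2 hold the degraded image, channels 3-5 the reference.
    return np.concatenate([lq, ref], axis=0)
```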

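The specified loss is Online Hard Example Mining (OHEM) cross-entropy. Written in the standard binary form, it is

$$\mathcal{L}_{\rm OHEM} = -\frac{1}{|\Omega|} \sum_{i \in \Omega} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right],$$

where $p_i$ are the predicted probabilities, $y_i$ are the mask labels, and $\Omega$ is the set of the hardest 30% of pixels per batch (Yin et al., 14 Jul 2025).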
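A NumPy sketch of this kind of loss follows. It is a generic OHEM binary cross-entropy, not the authors' implementation: "hardest" is interpreted as the pixels with the largest per-pixel loss, and the function name, `keep_ratio`, and `eps` are assumed parameters.

```python
import numpy as np

def ohem_bce(probs, labels, keep_ratio=0.3, eps=1e-7):
    """Mean binary cross-entropy over the hardest keep_ratio of pixels in a batch."""
    p = np.clip(np.asarray(probs, dtype=np.float64).ravel(), eps, 1.0 - eps)
    y = np.asarray(labels, dtype=np.float64).ravel()
    per_pixel = -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    k = max(1, int(round(keep_ratio * per_pixel.size)))
    hardest = np.sort(per_pixel)[-k:]  # the k largest per-pixel losses
    return float(hardest.mean())
```

Because easy pixels are dropped from the average, the result is never smaller than the plain mean cross-entropy over the same batch.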

This training setup indicates that RefSel-HQ is used to learn a dense binary transferability field from degraded-reference pairs. The use of OHEM suggests particular emphasis on ambiguous or failure-prone pixels, especially around dynamic and static conflict regions. That interpretation follows directly from the loss definition and the mask design, though the dataset description does not provide additional ablation detail within the summary.

6. Evaluation protocols and relation to restoration objectives

For RefSel accuracy, the standard evaluation protocol reports pixel-wise accuracy, precision, and recall on the 250-pair test set. The paper reports an average accuracy of 0.88 across five conflict scenarios (Yin et al., 14 Jul 2025).
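These three pixel-wise metrics can be computed from a predicted and a target binary mask as sketched below. The function name is hypothetical, and returning 0.0 when a denominator is empty is a convention chosen here, not one taken from the source.

```python
import numpy as np

def mask_metrics(pred, target):
    """Return (accuracy, precision, recall) for two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    tp = int((pred & target).sum())    # predicted transfer, labeled transfer
    fp = int((pred & ~target).sum())   # predicted transfer, labeled no transfer
    fn = int((~pred & target).sum())   # missed transfer pixels
    tn = int((~pred & ~target).sum())
    accuracy = (tp + tn) / pred.size
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall
```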

When a RefSel-HQ–trained RefSel module is inserted into the full RefSTAR pipeline, restoration performance is evaluated with a broader metric suite. These metrics include PSNR in dB and LPIPS for low-level fidelity; FID, using FFHQ as real data for real-world tests; ArcFace-based cosine similarity to ground truth (ID-GT) and to reference (ID-Ref) for identity preservation; and MUSIQ for no-reference perceptual quality (Yin et al., 14 Jul 2025).
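Two of these metrics have simple closed forms, sketched below. PSNR is computed from the mean squared error, and the identity scores are cosine similarities between embedding vectors. The embeddings would come from an ArcFace model, which is not included here; the function names and the `max_val` parameter are assumptions.

```python
import numpy as np

def psnr(a, b, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images scaled to [0, max_val]."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    mse = float(np.mean((a - b) ** 2))
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors (e.g. face embeddings)."""
    u = np.asarray(u, dtype=np.float64).ravel()
    v = np.asarray(v, dtype=np.float64).ravel()
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

ID-GT would compare the restored image's embedding with that of the ground truth, and ID-Ref would compare it with that of the reference.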

The restoration network in RefSTAR is trained with a weighted multi-term objective whose final term is a mask-compatible cycle-consistency loss; the individual terms and their weights are specified in the paper (Yin et al., 14 Jul 2025).

Within this objective, RefSel-HQ contributes the mask signal that gates cycle-consistent reconstruction of reference-derived regions. This ties the dataset directly to the paper’s broader claim that identity preservation problems arise from improper feature introduction on detailed textures. The mask is therefore not ancillary metadata; it is a supervisory mechanism for constraining when reference features should influence the restored output.

7. Significance, limitations, and interpretive context

RefSel-HQ occupies a specific position in reference-based face restoration. It does not provide generic facial labels, unconstrained pairwise correspondences, or arbitrary style-transfer regions. Instead, it operationalizes a narrower question: for two same-identity images under differing conditions, which facial textures are appropriate to transfer to improve restoration while preserving identity (Yin et al., 14 Jul 2025)?

Several design choices clarify the intended scope. First, all hair and background are excluded. Second, masks are binary rather than soft. Third, degraded inputs are synthetically generated with the Real-ESRGAN pipeline using motion and defocus blur. Fourth, in cases of extreme degradation, the mask is replaced by an all-ones face-only mask. Together, these choices indicate that the dataset is tailored for reference selection under the specific restoration assumptions of RefSTAR rather than as a universal benchmark for face correspondence or semantic transfer.

A common misconception would be to treat RefSel-HQ as a dataset of identity labels or facial parsing masks. The dataset summary does not define it that way. Its targets encode transfer eligibility between paired same-identity images, with emphasis on dynamic and static texture conflicts (Yin et al., 14 Jul 2025).

Another possible misconception is that the dataset trains the full restoration backbone. RefSel-HQ is used only to train or pretrain the RefSel module, while the backbone restoration model uses CelebRef-HQ with separate train/validation splits (Yin et al., 14 Jul 2025).

The official citation information attributes the work to Z. Yin, J. Chen, M. Liu, and W. Zuo under the title “RefSTAR: Blind Facial Image Restoration with Reference Selection, Transfer, and Reconstruction,” and the official repository provides the code, dataset, and pretrained models (Yin et al., 14 Jul 2025).


References (1)

Yin, Z., Chen, J., Liu, M., and Zuo, W. RefSTAR: Blind Facial Image Restoration with Reference Selection, Transfer, and Reconstruction (2025)
