
HeadSwapBench: Benchmark Dataset for Head Swapping

Updated 17 December 2025
  • The dataset offers image and video variants designed for the analysis of seamless head swapping, with fine-grained semantic annotations.
  • It employs region-specific metrics and mask-free supervision to overcome artifacts and preserve key motion, expression, and pose cues.
  • HeadSwapBench provides a robust framework for benchmarking head swapping algorithms through detailed evaluation of identity, appearance, and continuity.

HeadSwapBench is a benchmark dataset for evaluating and training image-based and video-based head swapping algorithms. It was introduced in two variants: image-based in "HS-Diffusion: Semantic-Mixing Diffusion for Head Swapping" (Wang et al., 2022) and video-based in "DirectSwap: Mask-Free Cross-Identity Training and Benchmarking for Expression-Consistent Video Head Swapping" (Wang et al., 10 Dec 2025). Together they address the lack of paired, high-fidelity head swapping data and of rigorous region-aware evaluation protocols, enabling fine-grained analysis, mask-free training, and robust benchmarking for models that must seamlessly fuse head and body components, preserve geometric and semantic continuity, and transfer motion and expression across identities.

1. Motivation and Dataset Design Rationale

HeadSwapBench was created in response to two major gaps recognized in head swapping research. First, prior datasets either paired frames of the same individual (which did not test true cross-identity transfer) or lacked ground-truth outputs for swapped identities, obstructing reliable evaluation and training of generative head swapping models. Second, mask-based training and handcrafted inpainting led to boundary artifacts, discontinuities, and loss of expression and pose cues, particularly at the hairline–neck junction and under dynamic motion scenarios (Wang et al., 10 Dec 2025, Wang et al., 2022).

Image-based HeadSwapBench leverages human parsing and semantic segmentation to provide detailed region annotations (head, body, transition), while video-based HeadSwapBench introduces frame-synchronized cross-identity swaps, capturing the full complexity of facial poses and motion dynamics. Both variants are designed to support rigorous region-specific benchmarking and mask-free, direct supervision in training.

2. Image-Based HeadSwapBench: Construction and Annotations

The image-based HeadSwapBench (Wang et al., 2022) is constructed from the SHHQ-1.0 dataset, comprising 39,942 high-resolution, fashion-centric full-body images:

  • Preprocessing includes:
    • Face alignment (Kazemi & Sullivan, 2014) applied to each full-body photograph.
    • Cropping half-body regions (head + torso) at resolutions 256×256 (SHHQ256) and 512×512 (SHHQ512).
    • Human parsing using SCHP yields 20 semantic categories.
  • Data splits:
    • 35,942 images for training.
    • 4,000 images for testing.
  • Semantic annotation protocol:
    • Head region: $m^H = \{\text{hat} \cup \text{hair} \cup \text{sunglasses} \cup \text{face}\}$.
    • Body region: $m^B$ = the union of garment and limb classes (gloves, upper-clothes, dress, coat, socks, pants, etc.).
    • Transition region: $m^r = 1 - (m^H + m^B)$, capturing the neck and overlapping hair patches.
  • File formats and organization:
    • Images and one-hot or indexed PNG semantic masks share base filenames.
    • Masks (head, body) can be generated on-the-fly from semantic files.

This design provides both pixel-wise label fidelity and explicit partitioning for targeted inpainting and evaluation.
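
The semantic annotation protocol above can be reproduced mechanically from the parsing maps. A minimal sketch in Python, assuming the 20-class LIP label order that SCHP commonly outputs (the exact indices should be verified against the released annotation files):

```python
import numpy as np

# Assumed LIP label indices (verify against the released masks):
# 1 = hat, 2 = hair, 4 = sunglasses, 13 = face; the remaining
# non-background labels are treated as body garments and limbs.
HEAD_LABELS = [1, 2, 4, 13]
BODY_LABELS = [3, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17, 18, 19]

def region_masks(parsing: np.ndarray):
    """Split an (H, W) SCHP label map into head, body, and transition masks."""
    m_head = np.isin(parsing, HEAD_LABELS).astype(np.float32)   # m^H
    m_body = np.isin(parsing, BODY_LABELS).astype(np.float32)   # m^B
    m_trans = 1.0 - (m_head + m_body)                           # m^r = 1 - (m^H + m^B)
    return m_head, m_body, m_trans
```

Because the three masks partition the image by construction, any pixel labeled neither head nor body (including the unlabeled neck) falls into the transition region, which is exactly what Mask-FID later isolates.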

3. Video-Based HeadSwapBench: Construction and Paired Supervision

The video-based HeadSwapBench (Wang et al., 10 Dec 2025) comprises genuine frame-synchronized pairs, enabling cross-identity training and benchmarking in video head swapping scenarios:

  • Composition:
    • 8,066 training and 500 test videos, sourced from HDTF and VFHQ (CVPR’21, CVPRW’22) and post-processed.
    • Clips are HD (512×512 to 1024×1024), 25–30 fps, 2–5 s in duration.
    • Each data point contains: a reference image $I_b$ (sampled from each $V_a$), the original video $V_a$, and a synthetic driven video $V_d$ exhibiting the swapped identity.
  • Pipeline includes:
  1. Qwen2.5-VL-72B-Instruct extracts scene attributes.
  2. PIPNet computes facial landmark trajectories.
  3. VACE configures appearance (identity change) and enforces motion fidelity by landmark matching (NME < 0.3).
  4. Head regions are fused using MediaPipe segmentations, followed by VACE background-aware inpainting.
  5. Deepfake filtering (XceptionNet, real-confidence ≥ 0.7) and manual vetting create the final splits; a sketch of these thresholds follows the annotation list below.
  • Annotations per video:
    • Per-frame facial landmarks (68/98 points), head segmentation masks.
    • Pose parameters (Euler angles from SynergyNet).
    • Expression consistency (NME between VdV_d and VaV_a).
    • Scene attribute metadata (background category, lighting, skin tone).
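
The two quantitative gates in this pipeline reduce to simple threshold checks. A toy sketch (function name hypothetical; XceptionNet confidence and landmark NME are assumed to be computed upstream):

```python
def keep_clip(real_confidence: float, nme_score: float) -> bool:
    """Accept a candidate clip only if motion fidelity (NME < 0.3) and
    deepfake screening (real-confidence >= 0.7) both pass; surviving
    clips still undergo manual vetting before entering the splits."""
    return nme_score < 0.3 and real_confidence >= 0.7
```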

The resulting dataset provides explicit paired references: $V_a$ is the ground truth for dynamic motion, while $V_d$ is the output of the swapping pipeline.
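
For the expression-consistency annotation, the NME between corresponding landmark sets of $V_d$ and $V_a$ can be computed as below; the inter-ocular normalization and 68-point indexing are common conventions, not details fixed by the paper:

```python
import numpy as np

def nme(pred: np.ndarray, gt: np.ndarray) -> float:
    """Normalized Mean Error between two (N, 2) landmark arrays,
    normalized by inter-ocular distance (68-point convention:
    indices 36 and 45 are the outer eye corners)."""
    inter_ocular = np.linalg.norm(gt[36] - gt[45])
    per_point = np.linalg.norm(pred - gt, axis=1)
    return float(per_point.mean() / inter_ocular)

# Per-frame expression consistency between V_d and V_a:
# scores = [nme(lm_d, lm_a) for lm_d, lm_a in zip(landmarks_vd, landmarks_va)]
```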

4. Benchmark Pairing Protocols and Evaluation Regimes

Image-based Benchmarking

HeadSwapBench’s test split is halved into source-head and source-body pools (2,000 images each), enabling exhaustive $2{,}000 \times 2{,}000$ head–body combinations. Synthetic head-swapped images $A \rightarrow B$ are constructed, with metrics referencing the individual source regions rather than a true paired ground truth (which does not exist for natural images).
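
The pairing protocol itself is purely combinatorial. A sketch, with the pool filenames hypothetical:

```python
from itertools import product

# Two disjoint halves of the 4,000-image test split (filenames hypothetical).
head_pool = [f"head_{i:04d}.png" for i in range(2000)]
body_pool = [f"body_{i:04d}.png" for i in range(2000)]

# Exhaustive 2,000 x 2,000 grid: 4,000,000 head-body swap tasks.
for head_src, body_src in product(head_pool, body_pool):
    pass  # run the head-swapping model on (head_src, body_src)
```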

Video-based Benchmarking

HeadSwapBench enables direct full-reference benchmarks: every synthetic output $V_d$ has its real paired $V_a$ for pixel-level, semantic, and temporal-consistency evaluation. This supports quantitative assessment of identity fidelity, expression/pose transfer, motion accuracy, and per-frame quality in dynamic settings.
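
With a paired $V_a$ available, full-reference scoring reduces to a per-frame loop. The sketch below uses PSNR and SSIM from scikit-image as stand-in quality measures; the benchmark's exact metric suite may differ:

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def full_reference_scores(frames_d, frames_a):
    """Mean per-frame PSNR/SSIM between a swapped clip V_d and its
    paired ground-truth clip V_a (lists of (H, W, 3) uint8 frames)."""
    psnr, ssim = [], []
    for fd, fa in zip(frames_d, frames_a):
        psnr.append(peak_signal_noise_ratio(fa, fd))
        ssim.append(structural_similarity(fa, fd, channel_axis=-1))
    return float(np.mean(psnr)), float(np.mean(ssim))
```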

5. Region-Specific Evaluation Metrics

HeadSwapBench introduces specialized metrics for rigorous algorithmic evaluation:

| Metric | Definition | Focus |
| --- | --- | --- |
| FID | $\lVert\mu_R - \mu_G\rVert_2^2 + \operatorname{Tr}\big(\Sigma_R + \Sigma_G - 2(\Sigma_R \Sigma_G)^{1/2}\big)$ | Distributional similarity, full image |
| Mask-FID | $\text{FID}\big((1 - m^H - m^B) \odot R,\ (1 - m^H - m^B) \odot G\big)$ | Transition and background region |
| Focal-FID | $\text{FID}\big(\text{Crop}^{1/2}_{\text{ctr}}(R),\ \text{Crop}^{1/2}_{\text{ctr}}(G)\big)$ | Central neck/transition area |
  • In all cases, features are extracted from the Inception-V3 “pool3” layer; generated samples are produced with 50 DDIM sampling steps, and no post-processing is applied beyond the specified crops and masks (Wang et al., 2022).
  • Video HeadSwapBench supports further metrics:
    • Expression Consistency (per-frame NME scores).
    • Motion-aware loss re-weighting via MEAR: $A_{\text{MEAR}} = D + \alpha \cdot L \circ (1 - D)$, which re-weights the diffusion loss toward regions sensitive to motion and expression (Wang et al., 10 Dec 2025).

This enables fine-grained, region-wise, and temporal assessment beyond global image statistics.
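
The region-specific inputs and the MEAR weighting map are straightforward to assemble from the masks defined earlier. A sketch, interpreting $\circ$ as an elementwise product and assuming a generic FID scorer (e.g., the pytorch-fid package) handles the actual distance computation; helper names are hypothetical:

```python
import numpy as np

def mask_fid_input(img: np.ndarray, m_head: np.ndarray, m_body: np.ndarray):
    """(1 - m^H - m^B) ⊙ img: keep only the transition/background region."""
    m_trans = 1.0 - m_head - m_body
    return img * m_trans[..., None]

def focal_fid_input(img: np.ndarray):
    """Crop_ctr^{1/2}(img): central crop at half the side length."""
    h, w = img.shape[:2]
    return img[h // 4 : h // 4 + h // 2, w // 4 : w // 4 + w // 2]

def mear_weight(d: np.ndarray, l: np.ndarray, alpha: float) -> np.ndarray:
    """A_MEAR = D + alpha * L ∘ (1 - D): raise the diffusion-loss weight
    where motion/expression cues (L) fall outside the detected region D."""
    return d + alpha * l * (1.0 - d)
```

Feeding the masked or cropped real/generated sets to a standard FID implementation then yields Mask-FID and Focal-FID directly.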

6. Baseline Performance and Comparative Results

On SHHQ256 HeadSwapBench, extensive quantitative benchmarks report the effectiveness of diverse head swapping models:

| Method | IDs↑ | Head SSIM↑ | Head LPIPS↓ | Body SSIM↑ | Body LPIPS↓ | FID↓ | Mask-FID↓ | Focal-FID↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cut-and-Paste | – | – | – | – | – | 26.17 | 31.18 | – |
| PDGAN | .9885 | .9941 | .0081 | .9697 | .0422 | 23.72 | 57.15 | 38.66 |
| MAT | .9899 | .9979 | .0007 | .9713 | .0383 | 16.11 | 33.28 | 18.74 |
| StyleMapGAN | .7567 | .8992 | .0606 | .8166 | .1278 | 31.51 | 24.44 | 31.94 |
| InsetGAN | .8227 | .8673 | .0962 | .8097 | .1144 | 28.39 | 48.46 | 25.78 |
| HS-Diffusion | .9812 | .9689 | .0233 | .9310 | .0517 | 11.24¹ | 18.57¹ | 11.80¹ |

¹ Best result in each column (Wang et al., 2022). "–" marks metrics not reported for that method.

This tabulation highlights where new semantic-mixing diffusion and mask-free approaches deliver improvements in realism, continuity, and region consistency over legacy GAN and cut-and-paste pipelines.

7. Accessibility, Licensing, and Usage

Both image and video variants of HeadSwapBench are announced for public release alongside corresponding codebases (HS-Diffusion: https://github.com/qinghew/HS-Diffusion; DirectSwap: https://github.com/MBZUAI-Group/DirectSwap) under a CC-BY-NC-SA 4.0 license for non-commercial research (Wang et al., 2022, Wang et al., 10 Dec 2025). Citation of the respective source papers is required for any use.

A plausible implication is that the dataset’s comprehensive annotation, paired reference design, and region-aware metrics will facilitate not only direct head-swapping method benchmarking but also research into neck/transition inpainting, motion-consistent synthesis, and algorithmic bias analysis in facial attribute transfer.
