
HeadSwapBench: Benchmark Dataset for Head Swapping

Updated 17 December 2025
  • The dataset offers image and video variants designed for the analysis of seamless head swapping, with fine-grained semantic annotations.
  • It employs region-specific metrics and mask-free supervision to overcome artifacts and preserve key motion, expression, and pose cues.
  • HeadSwapBench provides a robust framework for benchmarking head swapping algorithms through detailed evaluation of identity, appearance, and continuity.

HeadSwapBench is a benchmark dataset for evaluating and training image-based and video-based head swapping algorithms. It was introduced in two variants: image-based in "HS-Diffusion: Semantic-Mixing Diffusion for Head Swapping" (Wang et al., 2022) and video-based in "DirectSwap: Mask-Free Cross-Identity Training and Benchmarking for Expression-Consistent Video Head Swapping" (Wang et al., 10 Dec 2025). Together they address the lack of paired, high-fidelity head swapping data and of rigorous region-aware evaluation protocols, enabling fine-grained analysis, mask-free training, and robust benchmarking for models that must seamlessly fuse head and body components, preserve geometric and semantic continuity, and transfer motion and expression across identities.

1. Motivation and Dataset Design Rationale

HeadSwapBench was created in response to two major gaps recognized in head swapping research. First, prior datasets either paired frames of the same individual (which did not test true cross-identity transfer) or lacked ground-truth outputs for swapped identities, obstructing reliable evaluation and training of generative head swapping models. Second, mask-based training and handcrafted inpainting led to boundary artifacts, discontinuities, and loss of expression and pose cues, particularly at the hairline–neck junction and under dynamic motion scenarios (Wang et al., 10 Dec 2025, Wang et al., 2022).

Image-based HeadSwapBench leverages human parsing and semantic segmentation to provide detailed region annotations (head, body, transition), while video-based HeadSwapBench introduces frame-synchronized cross-identity swaps, capturing the full complexity of facial poses and motion dynamics. Both variants are designed to support rigorous region-specific benchmarking and mask-free, direct supervision in training.

2. Image-Based HeadSwapBench: Construction and Annotations

The image-based HeadSwapBench (Wang et al., 2022) is constructed from the SHHQ-1.0 dataset, comprising 39,942 high-resolution, fashion-centric full-body images:

  • Preprocessing includes:
    • Face alignment (Kazemi & Sullivan, 2014) applied to each full-body photograph.
    • Cropping half-body regions (head + torso) at resolutions 256×256 (SHHQ256) and 512×512 (SHHQ512).
    • Human parsing using SCHP yields 20 semantic categories.
  • Data splits:
    • 35,942 images for training.
    • 4,000 images for testing.
  • Semantic annotation protocol:
    • Head region: $m^H = \{\text{hat} \cup \text{hair} \cup \text{sunglasses} \cup \text{face}\}$.
    • Body region: $m^B$ = the union of garment and limb classes (gloves, upper-clothes, dress, coat, socks, pants, etc.).
    • Transition region: $m^r = 1 - (m^H + m^B)$, capturing the neck and overlapping hair patches.
  • File formats and organization:
    • Images and one-hot or indexed PNG semantic masks share base filenames.
    • Masks (head, body) can be generated on-the-fly from semantic files.

This design provides both pixel-wise label fidelity and explicit partitioning for targeted inpainting and evaluation.
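
The semantic annotation protocol above can be reproduced mechanically from the parsing maps. A minimal sketch in Python, assuming the 20-class LIP label order that SCHP commonly outputs (the exact indices should be verified against the released annotation files):

```python
import numpy as np

# Assumed LIP label indices (verify against the released masks):
# 1 = hat, 2 = hair, 4 = sunglasses, 13 = face; the remaining
# non-background labels are treated as body garments and limbs.
HEAD_LABELS = [1, 2, 4, 13]
BODY_LABELS = [3, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17, 18, 19]

def region_masks(parsing: np.ndarray):
    """Split an (H, W) SCHP label map into head, body, and transition masks."""
    m_head = np.isin(parsing, HEAD_LABELS).astype(np.float32)   # m^H
    m_body = np.isin(parsing, BODY_LABELS).astype(np.float32)   # m^B
    m_trans = 1.0 - (m_head + m_body)                           # m^r = 1 - (m^H + m^B)
    return m_head, m_body, m_trans
```

Because the three masks partition the image by construction, any pixel labeled neither head nor body (including the unlabeled neck) falls into the transition region, which is exactly what Mask-FID later isolates.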

3. Video-Based HeadSwapBench: Construction and Paired Supervision

The video-based HeadSwapBench (Wang et al., 10 Dec 2025) comprises genuine frame-synchronized pairs, enabling cross-identity training and benchmarking in video head swapping scenarios:

  • Composition:
    • 8,066 training and 500 test videos, sourced from HDTF and VFHQ (CVPR’21, CVPRW’22) and post-processed.
    • Clips are HD (512×512 to 1024×1024), 25–30 fps, 2–5 s in duration.
    • Each data point contains: a reference image $I_b$ (sampled from each $V_a$), the original video $V_a$, and a synthetic driven video $V_d$ exhibiting the swapped identity.
  • Pipeline includes:
  1. Qwen2.5-VL-72B-Instruct extracts scene attributes.
  2. PIPNet computes facial landmark trajectories.
  3. VACE configures appearance (identity change) and enforces motion fidelity by landmark matching (NME < 0.3).
  4. Head regions are fused using MediaPipe segmentations, followed by VACE background-aware inpainting.
  5. Deepfake filtering (XceptionNet, real-confidence ≥ 0.7) and manual vetting create the final splits; a sketch of these thresholds follows the annotation list below.
  • Annotations per video:
    • Per-frame facial landmarks (68/98 points), head segmentation masks.
    • Pose parameters (Euler angles from SynergyNet).
    • Expression consistency (NME between VdV_d and VaV_a).
    • Scene attribute metadata (background category, lighting, skin tone).
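
The two quantitative gates in this pipeline reduce to simple threshold checks. A toy sketch (function name hypothetical; XceptionNet confidence and landmark NME are assumed to be computed upstream):

```python
def keep_clip(real_confidence: float, nme_score: float) -> bool:
    """Accept a candidate clip only if motion fidelity (NME < 0.3) and
    deepfake screening (real-confidence >= 0.7) both pass; surviving
    clips still undergo manual vetting before entering the splits."""
    return nme_score < 0.3 and real_confidence >= 0.7
```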

The resulting dataset provides explicit paired references: $V_a$ is the ground truth for dynamic motion, while $V_d$ is the output of the swapping pipeline.
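
For the expression-consistency annotation, the NME between corresponding landmark sets of $V_d$ and $V_a$ can be computed as below; the inter-ocular normalization and 68-point indexing are common conventions, not details fixed by the paper:

```python
import numpy as np

def nme(pred: np.ndarray, gt: np.ndarray) -> float:
    """Normalized Mean Error between two (N, 2) landmark arrays,
    normalized by inter-ocular distance (68-point convention:
    indices 36 and 45 are the outer eye corners)."""
    inter_ocular = np.linalg.norm(gt[36] - gt[45])
    per_point = np.linalg.norm(pred - gt, axis=1)
    return float(per_point.mean() / inter_ocular)

# Per-frame expression consistency between V_d and V_a:
# scores = [nme(lm_d, lm_a) for lm_d, lm_a in zip(landmarks_vd, landmarks_va)]
```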

4. Benchmark Pairing Protocols and Evaluation Regimes

Image-based Benchmarking

HeadSwapBench’s test split is halved into source-head and source-body pools (2,000 images each), enabling exhaustive $2{,}000 \times 2{,}000$ head–body combinations. Synthetic head-swapped images $A \rightarrow B$ are constructed, with metrics referencing the individual source regions rather than a true paired ground truth (which does not exist for natural images).
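
The pairing protocol itself is purely combinatorial. A sketch, with the pool filenames hypothetical:

```python
from itertools import product

# Two disjoint halves of the 4,000-image test split (filenames hypothetical).
head_pool = [f"head_{i:04d}.png" for i in range(2000)]
body_pool = [f"body_{i:04d}.png" for i in range(2000)]

# Exhaustive 2,000 x 2,000 grid: 4,000,000 head-body swap tasks.
for head_src, body_src in product(head_pool, body_pool):
    pass  # run the head-swapping model on (head_src, body_src)
```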

Video-based Benchmarking

HeadSwapBench enables direct full-reference benchmarks: every synthetic output $V_d$ has its real paired $V_a$ for pixel-level, semantic, and temporal-consistency evaluation. This supports quantitative assessment of identity fidelity, expression/pose transfer, motion accuracy, and per-frame quality in dynamic settings.
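
With a paired $V_a$ available, full-reference scoring reduces to a per-frame loop. The sketch below uses PSNR and SSIM from scikit-image as stand-in quality measures; the benchmark's exact metric suite may differ:

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def full_reference_scores(frames_d, frames_a):
    """Mean per-frame PSNR/SSIM between a swapped clip V_d and its
    paired ground-truth clip V_a (lists of (H, W, 3) uint8 frames)."""
    psnr, ssim = [], []
    for fd, fa in zip(frames_d, frames_a):
        psnr.append(peak_signal_noise_ratio(fa, fd))
        ssim.append(structural_similarity(fa, fd, channel_axis=-1))
    return float(np.mean(psnr)), float(np.mean(ssim))
```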

5. Region-Specific Evaluation Metrics

HeadSwapBench introduces specialized metrics for rigorous algorithmic evaluation:

| Metric | Definition | Focus |
| --- | --- | --- |
| FID | $\lVert\mu_R - \mu_G\rVert_2^2 + \operatorname{Tr}\big(\Sigma_R + \Sigma_G - 2(\Sigma_R \Sigma_G)^{1/2}\big)$ | Distributional similarity, full image |
| Mask-FID | $\text{FID}\big((1 - m^H - m^B) \odot R,\ (1 - m^H - m^B) \odot G\big)$ | Transition and background region |
| Focal-FID | $\text{FID}\big(\text{Crop}^{1/2}_{\text{ctr}}(R),\ \text{Crop}^{1/2}_{\text{ctr}}(G)\big)$ | Central neck/transition area |
  • In all cases, features are extracted from the Inception-V3 “pool3” layer; generated samples are produced with 50 DDIM sampling steps, and no post-processing is applied beyond the specified crops and masks (Wang et al., 2022).
  • Video HeadSwapBench supports further metrics:
    • Expression Consistency (per-frame NME scores).
    • Motion-aware loss re-weighting via MEAR: $A_{\text{MEAR}} = D + \alpha \cdot L \circ (1 - D)$, which re-weights the diffusion loss toward regions sensitive to motion and expression (Wang et al., 10 Dec 2025).

This enables fine-grained, region-wise, and temporal assessment beyond global image statistics.
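
The region-specific inputs and the MEAR weighting map are straightforward to assemble from the masks defined earlier. A sketch, interpreting $\circ$ as an elementwise product and assuming a generic FID scorer (e.g., the pytorch-fid package) handles the actual distance computation; helper names are hypothetical:

```python
import numpy as np

def mask_fid_input(img: np.ndarray, m_head: np.ndarray, m_body: np.ndarray):
    """(1 - m^H - m^B) ⊙ img: keep only the transition/background region."""
    m_trans = 1.0 - m_head - m_body
    return img * m_trans[..., None]

def focal_fid_input(img: np.ndarray):
    """Crop_ctr^{1/2}(img): central crop at half the side length."""
    h, w = img.shape[:2]
    return img[h // 4 : h // 4 + h // 2, w // 4 : w // 4 + w // 2]

def mear_weight(d: np.ndarray, l: np.ndarray, alpha: float) -> np.ndarray:
    """A_MEAR = D + alpha * L ∘ (1 - D): raise the diffusion-loss weight
    where motion/expression cues (L) fall outside the detected region D."""
    return d + alpha * l * (1.0 - d)
```

Feeding the masked or cropped real/generated sets to a standard FID implementation then yields Mask-FID and Focal-FID directly.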

6. Baseline Performance and Comparative Results

On SHHQ256 HeadSwapBench, extensive quantitative benchmarks report the effectiveness of diverse head swapping models:

| Method | IDs↑ | Head SSIM↑ | Head LPIPS↓ | Body SSIM↑ | Body LPIPS↓ | FID↓ | Mask-FID↓ | Focal-FID↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cut-and-Paste | – | – | – | – | – | 26.17 | 31.18 | – |
| PDGAN | .9885 | .9941 | .0081 | .9697 | .0422 | 23.72 | 57.15 | 38.66 |
| MAT | .9899 | .9979 | .0007 | .9713 | .0383 | 16.11 | 33.28 | 18.74 |
| StyleMapGAN | .7567 | .8992 | .0606 | .8166 | .1278 | 31.51 | 24.44 | 31.94 |
| InsetGAN | .8227 | .8673 | .0962 | .8097 | .1144 | 28.39 | 48.46 | 25.78 |
| HS-Diffusion | .9812 | .9689 | .0233 | .9310 | .0517 | 11.24¹ | 18.57¹ | 11.80¹ |

¹ Best result in each column (Wang et al., 2022). "–" marks metrics not reported for that method.

This tabulation highlights where new semantic-mixing diffusion and mask-free approaches deliver improvements in realism, continuity, and region consistency over legacy GAN and cut-and-paste pipelines.

7. Accessibility, Licensing, and Usage

Both image and video variants of HeadSwapBench are announced for public release alongside corresponding codebases (HS-Diffusion: https://github.com/qinghew/HS-Diffusion; DirectSwap: https://github.com/MBZUAI-Group/DirectSwap) under a CC-BY-NC-SA 4.0 license for non-commercial research (Wang et al., 2022, Wang et al., 10 Dec 2025). Citation of the respective source papers is required for any use.

A plausible implication is that the dataset’s comprehensive annotation, paired reference design, and region-aware metrics will facilitate not only direct head-swapping method benchmarking but also research into neck/transition inpainting, motion-consistent synthesis, and algorithmic bias analysis in facial attribute transfer.
