IDBench-V: Video Face Swapping Benchmark
- The paper introduces IDBench-V as a comprehensive benchmark assessing identity fidelity, attribute preservation, and video quality in real-world video face swapping.
- IDBench-V consists of 200 real-world source video–target image pairs designed to probe challenges like small faces, extreme poses, occlusions, and dynamic expressions.
- The evaluation framework integrates automatic metrics (e.g., ID similarity, pose error, FVD) with human evaluations to ensure robust measurement of temporal stability and authenticity.
IDBench-V is a benchmark for video face swapping (VFS) introduced in “DreamID-V: Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer” (Guo et al., 4 Jan 2026). It is presented as a comprehensive benchmark encompassing diverse scenes, intended to evaluate whether a method can produce a swapped video that preserves identity fidelity, target-video attributes such as pose and expression, and temporal coherence under real-world conditions. In the reported setting, the benchmark consists of 200 real-world source video–target image pairs, and its protocol evaluates identity consistency, attribute preservation, and video quality / temporal stability (Guo et al., 4 Jan 2026).
1. Motivation and benchmark scope
IDBench-V was introduced to address what the paper describes as a “lack of a benchmark for Video Face Swapping,” and more specifically to remedy evaluation resources that were too limited to rigorously test real-world VFS systems. The benchmark is framed around the central VFS requirement of transferring identity while preserving motion, expression, lighting, background, and temporal stability. The paper positions it as a holistic real-world testbed rather than a narrowly controlled face-swapping dataset (Guo et al., 4 Jan 2026).
The benchmark’s difficulty is defined by scene conditions that matter in practical deployment. The paper identifies small faces, extreme head poses, severe occlusion, complex and dynamic facial expressions, cluttered multi-person scenes, and challenging lighting as the principal hard cases that prior resources did not sufficiently cover. This design also reflects the paper’s broader claim that image face swapping methods are often stronger at identity fidelity than VFS methods, whereas VFS must additionally preserve temporal dynamics; IDBench-V is intended to reveal whether a method actually bridges that gap (Guo et al., 4 Jan 2026).
A key aspect of the benchmark’s task formulation is that it is built around real-world source video–target image pairs. In the evaluation setup, the source video provides motion, pose, expression, background, and related dynamic content, the target image provides the reference identity, and the output is a swapped video with target identity and source-video attributes. The paper explicitly notes that the benchmark is not presented as an image-to-video benchmark in the usual sense (Guo et al., 4 Jan 2026).
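The input/output roles described above can be sketched as a small data structure. This is only an illustration: the class name `SwapPair`, its fields, the tag strings, and the file paths are hypothetical and do not come from the paper or any released code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class SwapPair:
    """One benchmark item: a driving video plus a reference identity image."""
    source_video: str  # supplies motion, pose, expression, background
    target_image: str  # supplies the identity to transfer
    tags: List[str] = field(default_factory=list)  # hypothetical scene labels


def describe(pair: SwapPair) -> str:
    """Summarize which input contributes which property to the swapped video."""
    return (f"identity <- {pair.target_image}; "
            f"attributes <- {pair.source_video}")


# Hypothetical paths, used only to show the pairing.
pair = SwapPair("clips/0001.mp4", "ids/0001.png", ["extreme_pose"])
print(describe(pair))
```

Keeping the two inputs as separate, named fields makes the paper's convention explicit: the video drives the attributes, and the image supplies the identity.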
2. Dataset composition and pairing regime
IDBench-V contains 200 real-world source video–target image pairs. The appendix wording clarifies this as 200 videos paired with meticulously selected ID images. The benchmark is described as spanning several challenging categories: small faces, extreme head poses, severe occlusions, complex and dynamic expressions, cluttered multi-person scenes, and, more generally, varying head poses, facial expressions, and lighting conditions (Guo et al., 4 Jan 2026).
The paper provides only partial collection details for the benchmark itself. It states that IDBench-V uses real-world source videos and that each video is paired with a carefully selected identity image, with the selected pairs chosen to expose difficult conditions. For IDBench-V specifically, the paper does not provide a detailed annotation protocol such as how identity images were annotated, whether identities were manually verified, whether videos were manually filtered for quality, or whether multiple candidate pairs were scored before selection. The phrase “meticulously selected ID images” nevertheless implies human curation (Guo et al., 4 Jan 2026).
The benchmark is presented as an evaluation benchmark rather than a training set. The provided text does not describe formal train/val/test splits for IDBench-V, and the benchmark appears to be used for testing only. This absence of an explicit split protocol is one of the most consequential procedural properties of the benchmark because it constrains how results can be standardized across future studies (Guo et al., 4 Jan 2026).
3. Evaluation axes, metrics, and formalization
IDBench-V uses a multi-axis evaluation protocol. The reported metrics are organized around identity consistency, attribute preservation, and video quality / temporal stability, with an additional human evaluation component.
| Evaluation axis | Metrics and tools | Direction |
|---|---|---|
| Identity consistency | ID-Arc, ID-Ins, ID-Cur from ArcFace, InsightFace, CurricularFace; Variance | Higher similarity is better; lower variance is better |
| Attribute preservation | Pose via HopeNet; Expression via Deep3DFaceRecon; Background consistency; Subject consistency; Motion smoothness | Lower pose/expression error is better; higher consistency/smoothness is better |
| Video quality | FVD using a ResNeXt feature extractor | Lower is better |
| Human evaluation | 19 volunteers; scores from 1 to 5 on identity similarity, attribute preservation, video quality | Higher is better |
For identity consistency, the appendix states that the methods compute cosine similarity with the target identity image using ArcFace, InsightFace, and CurricularFace. The standard form given in the description is
$$\mathrm{ID}(x, y) = \cos\big(f(x), f(y)\big) = \frac{f(x)^{\top} f(y)}{\lVert f(x) \rVert \, \lVert f(y) \rVert},$$
where $x$ is a generated frame, $y$ is the target identity image, and $f$ is the embedding network. In addition to mean identity scores, the paper reports Variance of frame-wise identity similarity as a temporal-consistency indicator, with lower variance interpreted as better stability over time (Guo et al., 4 Jan 2026).
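The frame-wise identity score and its variance can be sketched in a few lines. This is a minimal illustration rather than the paper's evaluation code: the embeddings here are toy 2-D vectors standing in for ArcFace, InsightFace, or CurricularFace features, the function names are hypothetical, and the use of population variance is an assumption, since the text does not say which estimator is used.

```python
import math
from statistics import mean, pvariance
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identity_scores(frame_embs: List[Sequence[float]],
                    target_emb: Sequence[float]) -> Tuple[float, float]:
    """Return (mean similarity, variance of similarity) over all frames."""
    sims = [cosine_similarity(e, target_emb) for e in frame_embs]
    # Population variance is an assumption; the source does not specify it.
    return mean(sims), pvariance(sims)


# Toy embeddings: one frame matches the target, one is orthogonal, one is between.
target = [1.0, 0.0]
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mean_sim, var_sim = identity_scores(frames, target)
print(f"mean similarity={mean_sim:.3f}, variance={var_sim:.3f}")
```

A video whose frames all match the target equally well yields zero variance, which is why the paper reads low variance as stable identity over time.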
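For attribute preservation, pose and expression are compared against the driving video. Head pose is estimated by HopeNet and expression coefficients are extracted using Deep3DFaceRecon. The metric is the distance between generated and driving-frame attributes:
$$\mathrm{Pose}(x, x^{\mathrm{drv}}) = \big\lVert h_{\mathrm{pose}}(x) - h_{\mathrm{pose}}(x^{\mathrm{drv}}) \big\rVert$$
and
$$\mathrm{Exp}(x, x^{\mathrm{drv}}) = \big\lVert h_{\mathrm{exp}}(x) - h_{\mathrm{exp}}(x^{\mathrm{drv}}) \big\rVert,$$
where $x$ is a generated frame, $x^{\mathrm{drv}}$ is the corresponding driving/reference frame, $h_{\mathrm{pose}}$ denotes the HopeNet pose estimate, and $h_{\mathrm{exp}}$ denotes the Deep3DFaceRecon expression coefficients. The benchmark also reports VBench-style metrics (Background consistency, Subject consistency, and Motion smoothness), with higher values preferred (Guo et al., 4 Jan 2026).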
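The per-frame attribute error can be illustrated as follows. This sketch assumes a Euclidean distance averaged over aligned frames, which the text does not specify; the pose vectors are made-up (yaw, pitch, roll) triples rather than real HopeNet output, and the function names are hypothetical.

```python
import math
from statistics import mean
from typing import List, Sequence


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two attribute vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_attribute_error(generated: List[Sequence[float]],
                         driving: List[Sequence[float]]) -> float:
    """Average per-frame distance between generated and driving attributes."""
    if len(generated) != len(driving):
        raise ValueError("generated and driving videos must have equal length")
    return mean([l2_distance(g, d) for g, d in zip(generated, driving)])


# Toy (yaw, pitch, roll) values for two frames; not real estimator output.
generated_pose = [[10.0, 0.0, 0.0], [13.0, 4.0, 0.0]]
driving_pose = [[10.0, 0.0, 0.0], [10.0, 0.0, 0.0]]
pose_error = mean_attribute_error(generated_pose, driving_pose)
print(pose_error)
```

The same function applies to expression coefficients, since only the attribute extractor changes between the two metrics.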
For video quality, the benchmark reports FVD (Fréchet Video Distance) using a ResNeXt feature extractor. Human evaluation complements these automatic metrics: 19 volunteers rated samples on identity similarity, attribute preservation, and video quality, using scores from 1 to 5 (Guo et al., 4 Jan 2026).
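For reference, the Fréchet distance on which FVD is based is conventionally defined as below. This is the standard definition rather than a formula quoted from the paper, and the symbols are introduced here for illustration: $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ are the mean and covariance of features extracted from real and generated videos, respectively.

$$\mathrm{FVD} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\!\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)$$

The distance is zero when the two feature distributions share the same mean and covariance, which is why lower values are read as better video quality.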
4. Baseline methods and benchmark outcomes
The benchmark is used to compare both image face swapping and video face swapping methods. The image face swapping baselines, applied frame-by-frame to videos, are FSGAN, REFace, Face-Adapter, and DreamID. The video face swapping baselines are Stand-In and CanonSwap. The paper also mentions VividFace and DynamicFace qualitatively, but states that open-source code was unavailable (Guo et al., 4 Jan 2026).
The main quantitative result is that DreamID-V (“Ours”) attains the strongest overall performance on IDBench-V. For identity consistency, it reports ID-Arc 0.659, ID-Ins 0.713, ID-Cur 0.688, and Variance 0.0029. These are the best identity results in the table, and the lowest reported variance indicates the most stable identity over time. For comparison, DreamID reports ID-Arc 0.616, ID-Ins 0.702, ID-Cur 0.664, and Variance 0.0058, while CanonSwap reports ID-Arc 0.397, ID-Ins 0.431, ID-Cur 0.407, and Variance 0.0030 (Guo et al., 4 Jan 2026).
On attribute preservation, DreamID-V is near-best or best on almost all metrics, with Pose 2.446, Expression 2.430, Background 0.951, and Subject 0.966. The paper notes a slight inferiority to CanonSwap on pose alone, where CanonSwap reports 2.430 versus 2.446 for DreamID-V, but it simultaneously emphasizes that CanonSwap has very weak identity transfer. On video quality, DreamID-V reports FVD 2.243 and Smoothness 0.992, outperforming the image-based baselines and remaining strong among all compared methods (Guo et al., 4 Jan 2026).
The benchmark also supports human preference analysis. In the user study, DreamID-V receives the highest scores on all three criteria: 3.85 for identity similarity, 4.22 for attribute preservation, and 4.15 for video quality. The corresponding scores for DreamID are 3.78, 3.89, and 3.06; for CanonSwap, 1.99, 3.91, and 3.42; for Stand-In, 2.45, 1.60, and 2.91; for Face-Adapter, 2.17, 2.93, and 1.14; and for REFace, 1.45, 2.15, and 1.11. The benchmark thus pairs automatic metrics with human judgments to assess identity fidelity, attribute preservation, and realism (Guo et al., 4 Jan 2026).
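The reported user-study scores can be checked with a short script that finds the leading method and its margin over the runner-up for each criterion. The numbers are the ones quoted above; the variable and function names are hypothetical, and the margin is a derived quantity that the paper itself does not report.

```python
from typing import Dict, Tuple

CRITERIA = ("identity similarity", "attribute preservation", "video quality")

# Scores as quoted in the text, in the order given by CRITERIA.
USER_STUDY: Dict[str, Tuple[float, float, float]] = {
    "DreamID-V": (3.85, 4.22, 4.15),
    "DreamID": (3.78, 3.89, 3.06),
    "CanonSwap": (1.99, 3.91, 3.42),
    "Stand-In": (2.45, 1.60, 2.91),
    "Face-Adapter": (2.17, 2.93, 1.14),
    "REFace": (1.45, 2.15, 1.11),
}


def leader_and_margin(index: int) -> Tuple[str, float]:
    """Return the top-rated method for one criterion and its lead over second place."""
    ranked = sorted(USER_STUDY.items(), key=lambda kv: kv[1][index], reverse=True)
    (best, best_scores), (_, second_scores) = ranked[0], ranked[1]
    return best, best_scores[index] - second_scores[index]


for i, name in enumerate(CRITERIA):
    method, margin = leader_and_margin(i)
    print(f"{name}: {method} leads by {margin:.2f}")
```

Read this way, the identity-similarity lead over DreamID is narrow, while the leads on attribute preservation and video quality are considerably larger.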
5. Ablation studies and what the benchmark reveals
The paper includes an ablation study over four design choices, comparing the variants w/o Quadruplet, w/o ST, w/o RAT, and w/o IRL against the full model (Ours). The reported metrics align with the benchmark’s main evaluation axes (Guo et al., 4 Jan 2026):
| Variant | ID-Arc | Variance | Pose | Exp | FVD |
|---|---|---|---|---|---|
| w/o Quadruplet | 0.510 | 0.0036 | 2.468 | 2.432 | 2.242 |
| w/o ST | 0.604 | 0.0035 | 2.742 | 2.445 | 2.145 |
| w/o RAT | 0.657 | 0.0042 | 2.557 | 2.443 | 3.845 |
| w/o IRL | 0.631 | 0.0041 | 2.687 | 2.488 | 2.206 |
| Ours | 0.659 | 0.0029 | 2.446 | 2.430 | 2.243 |
These ablations are used to justify both the method design and the benchmark design. The paper states that w/o Quadruplet performs worst in identity similarity, showing that explicit paired supervision is crucial. It states that w/o ST gives better realism but worse identity, showing the synthetic stage helps identity transfer, while w/o RAT gives strong identity but worse realism, showing the real-augmentation stage is needed to restore visual fidelity. It also states that w/o IRL reduces temporal consistency and performs worse on difficult motion cases, showing IRL helps stabilize identity under large pose changes (Guo et al., 4 Jan 2026).
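The ablation claims can be cross-checked against the reported numbers by finding, for each metric, which variant scores best and which scores worst. The values are those quoted in this section; the dictionary layout and function names are hypothetical, and the direction flags follow the "higher/lower is better" conventions stated for the benchmark.

```python
from typing import Dict

ABLATION: Dict[str, Dict[str, float]] = {
    "w/o Quadruplet": {"ID-Arc": 0.510, "Variance": 0.0036, "Pose": 2.468, "Exp": 2.432, "FVD": 2.242},
    "w/o ST": {"ID-Arc": 0.604, "Variance": 0.0035, "Pose": 2.742, "Exp": 2.445, "FVD": 2.145},
    "w/o RAT": {"ID-Arc": 0.657, "Variance": 0.0042, "Pose": 2.557, "Exp": 2.443, "FVD": 3.845},
    "w/o IRL": {"ID-Arc": 0.631, "Variance": 0.0041, "Pose": 2.687, "Exp": 2.488, "FVD": 2.206},
    "Ours": {"ID-Arc": 0.659, "Variance": 0.0029, "Pose": 2.446, "Exp": 2.430, "FVD": 2.243},
}

# Only identity similarity is higher-is-better; the rest are errors or distances.
HIGHER_IS_BETTER = {"ID-Arc": True, "Variance": False, "Pose": False, "Exp": False, "FVD": False}


def best_variant(metric: str) -> str:
    """Variant with the most favorable value for the given metric."""
    pick = max if HIGHER_IS_BETTER[metric] else min
    return pick(ABLATION, key=lambda name: ABLATION[name][metric])


def worst_variant(metric: str) -> str:
    """Variant with the least favorable value for the given metric."""
    pick = min if HIGHER_IS_BETTER[metric] else max
    return pick(ABLATION, key=lambda name: ABLATION[name][metric])


for metric in HIGHER_IS_BETTER:
    print(f"{metric}: best={best_variant(metric)}, worst={worst_variant(metric)}")
```

The output matches the narrative: removing the quadruplet data hurts identity most, removing RAT hurts FVD most, and removing ST gives the lowest FVD at the cost of identity similarity.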
The benchmark’s significance becomes clearest in this ablation setting because the paper argues that these improvements are only visible when evaluation simultaneously measures identity, attribute fidelity, realism, and temporal stability. This suggests that IDBench-V is not merely a dataset for reporting identity similarity, but an evaluation regime in which trade-offs between facial identity transfer and video fidelity are structurally exposed (Guo et al., 4 Jan 2026).
6. Limitations and prospective extensions
The paper does not present a long dedicated limitations section for IDBench-V, but several constraints are explicitly or implicitly identified. First, benchmark size is modest: it contains only 200 pairs. Second, the benchmark is centered on the video-plus-target-image setting, so it does not explore all possible VFS deployment configurations. Third, no formal train/val/test split is described, which limits the benchmark’s suitability for standardized leaderboards unless additional protocol details are provided elsewhere. Fourth, the benchmark is specialized to face swapping rather than broader identity-transfer problems (Guo et al., 4 Jan 2026).
The benchmark’s collection process is also only partially documented in the provided text. While the paper states that videos are real-world and ID images are meticulously selected, it does not provide a full annotation or filtering protocol for the benchmark. For research usage, this means that the benchmark’s strengths lie more in its task definition and evaluation breadth than in exhaustive procedural transparency (Guo et al., 4 Jan 2026).
The broader DreamID-V paper suggests future expansion beyond facial identity transfer. Its versatility discussion indicates that the same pipeline could be adapted to accessory, outfit, headphone, and hairstyle swapping. A plausible implication is that IDBench-V could serve as the nucleus of a broader benchmark family for temporally coherent identity- or appearance-transfer tasks, although the benchmark as reported remains specialized to VFS (Guo et al., 4 Jan 2026).