ContextDubBench Benchmark Suite
- ContextDubBench is a publicly available audiovisual benchmark suite consisting of 440 meticulously curated video–audio pairs that evaluate audio-driven dubbing systems across diverse languages, subjects, and environments.
- It employs standardized metrics such as FID, FVD, SyncNet confidence, and landmark distance to rigorously assess lip-sync accuracy, visual fidelity, temporal consistency, and identity preservation.
- The dataset features varied content including real humans, stylized characters, and non-human subjects captured under challenging conditions like occlusions, extreme poses, dynamic lighting, and distracting backgrounds.
ContextDubBench is a publicly released, in-the-wild audiovisual benchmark suite containing 440 video–audio pairs rigorously curated for the evaluation of audio-driven visual dubbing models under real-world, diverse, and challenging conditions. It encompasses controlled and uncontrolled footage, multiple language domains, and a variety of subject and environmental factors, serving as a standardized testbed for systematic comparison in areas such as lip-sync accuracy, visual fidelity, identity preservation, and general robustness across practical scenarios (He et al., 31 Dec 2025).
1. Dataset Scope and Structure
ContextDubBench is specifically constructed to support robust evaluation of audio-driven dubbing systems—models tasked with realigning lip movements in video frames to match target speech or singing audio. All samples are provided as paired source video and target audio inputs, along with supporting metadata, ground-truth facial landmarks, and context labels.
Key characteristics:
- Size: 440 video–audio pairs.
- Resolution and Frame Rate: All videos are center-cropped to facial regions, resized to 512 × 512 pixels, and standardized at 25 fps; audio is provided at a 16 kHz sampling rate (see the format check sketched after this list).
- Mean Duration: ~8 seconds per clip, for an aggregate duration of approximately 1 hour.
- Split Policy: Entire set is released for evaluation; no official train/validation/test splits.
- Source Diversity: Clips are drawn from public datasets (Civitai, Mixkit, Pexels), augmented with 3D MetaHuman renderings.
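As a concrete illustration of this format, the following minimal sketch (using OpenCV and soundfile; file names are hypothetical) checks that a downloaded pair matches the 512 × 512, 25 fps, 16 kHz specification.

```python
import cv2              # pip install opencv-python
import soundfile as sf  # pip install soundfile

def check_pair(video_path: str, audio_path: str) -> None:
    """Verify that a benchmark pair matches the documented format."""
    cap = cv2.VideoCapture(video_path)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fps = cap.get(cv2.CAP_PROP_FPS)
    cap.release()

    _, sample_rate = sf.read(audio_path)

    assert (width, height) == (512, 512), f"unexpected resolution {width}x{height}"
    assert round(fps) == 25, f"unexpected frame rate {fps}"
    assert sample_rate == 16000, f"unexpected sample rate {sample_rate}"

# Hypothetical file names; the real names come from meta.json.
check_pair("videos/clip_0001.mp4", "audios/clip_0001.wav")
```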
2. Data Composition and Scenario Diversity
The benchmark is intentionally composed to expose models to a broad spectrum of dubbing challenges:
- Linguistic Diversity:
- Speech clips (n=350) sourced from Common Voice across six languages/dialects—English (170), Mandarin (60), Cantonese (30), Japanese (30), Russian (30), and French (30).
- Singing clips (n=90), with English (60, NUS-48E) and Mandarin (30, OpenCpop) performances.
- Subject Types:
- Real humans: 291
- Stylized characters: 108
- Non-human/humanoid subjects: 41
- Challenging Visual Conditions:
- Controlled/uncontrolled and dynamic lighting (static, time-varying relighting)
- Partial occlusions (hats, hands, props)
- Extreme poses and full-profile views
- Stylized backgrounds and distracting visual elements
- Preprocessing and Synthesis:
- All input faces are center-cropped and resized, then normalized to [-1, 1] (a minimal preprocessing sketch follows this list).
- 3D Morphable Model (3DMM)–based face masking was used during synthetic data generation (in generator training, not in evaluation data).
- Benchmark clips themselves are not altered with additional masking or inpainting.
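A minimal sketch of the crop-resize-normalize step described above, assuming a face bounding box is already available from an off-the-shelf detector (the detector itself is not part of the benchmark):

```python
import cv2
import numpy as np

def preprocess_frame(frame: np.ndarray, box: tuple) -> np.ndarray:
    """Crop the facial region, resize to 512x512, and scale pixels to [-1, 1].

    frame: HxWx3 uint8 image; box: (x1, y1, x2, y2) face bounding box
    (assumed to come from an external face detector).
    """
    x1, y1, x2, y2 = box
    face = frame[y1:y2, x1:x2]
    face = cv2.resize(face, (512, 512), interpolation=cv2.INTER_LINEAR)
    return face.astype(np.float32) / 127.5 - 1.0  # [0, 255] -> [-1, 1]
```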
3. Tasks and Evaluation Protocol
ContextDubBench focuses on four core evaluation axes, each corresponding to critical capabilities of modern dubbing systems:
- Lip-Sync Accuracy: Degree of alignment between generated lip motion and target audio.
- Visual Fidelity & Temporal Consistency: Realism and smoothness of generated frames.
- Identity Preservation: Consistency of identity features compared to input reference, especially under occlusions and pose variation.
- Success Rate: Fraction of samples for which models output plausible, properly synchronized results (i.e., excluding total failures).
Evaluation paradigm:
- Inputs: (a) Source video clip; (b) target audio.
- Outputs: Dubbed video, same duration/resolution/framerate, with target audio and synthesized, audio-aligned facial motion.
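In code terms, this amounts to a single per-clip mapping; the stub below is a hypothetical interface (names and array conventions are illustrative, not taken from the released scripts):

```python
import numpy as np

def dub(frames: np.ndarray, audio: np.ndarray) -> np.ndarray:
    """Hypothetical per-clip interface a candidate system would expose.

    frames: (T, 512, 512, 3) uint8 source frames at 25 fps.
    audio:  (N,) float waveform at 16 kHz (the target speech or singing).
    Returns dubbed frames of identical shape, duration, and frame rate,
    with mouth motion re-aligned to `audio`.
    """
    # Identity placeholder: a real system replaces this body.
    return frames
```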
Official evaluation scripts (Python, PyTorch) and SyncNet checkpoints are provided for objective, reproducible evaluation, using:
- Fréchet Inception Distance (FID), Fréchet Video Distance (FVD) for reference-based realism;
- NIQE, BRISQUE, HyperIQA for no-reference perceptual quality;
- Landmark Distance (LMD) and SyncNet confidence (Sync-C) for lip-sync;
- ArcFace cosine similarity (CSIM), CLIP similarity (CLIPS), and LPIPS for identity preservation.
Table: Core Metrics
| Category | Metric (direction) | Calculation/Notes |
|---|---|---|
| Visual Quality | FID (↓), FVD (↓) | Reference-based; frame-level and video-level distributional realism |
| Perceptual Quality | NIQE (↓), BRISQUE (↓), HyperIQA (↑) | No-reference image quality / naturalness |
| Lip-Sync | LMD (↓), Sync-C (↑) | Landmark distance to ground truth; SyncNet confidence |
| Identity | CSIM (↑), CLIPS (↑), LPIPS (↓) | ArcFace/CLIP similarity; perceptual distance |
| Success Rate | Success % (↑) | Fraction of samples with plausible, properly synchronized output |
Metrics are computed as implemented in the provided evaluation scripts; two of the simpler ones are sketched below.
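For concreteness, here is a minimal sketch of landmark distance (LMD) and cosine identity similarity (CSIM), assuming per-frame landmarks and identity embeddings (e.g., from ArcFace) have already been extracted; reported numbers should come from the official scripts.

```python
import numpy as np

def landmark_distance(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean Euclidean distance between predicted and ground-truth landmarks.

    pred, gt: (T, K, 2) arrays of K 2D landmarks over T frames.
    """
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def csim(emb_gen: np.ndarray, emb_ref: np.ndarray) -> float:
    """Cosine similarity between identity embeddings (e.g., ArcFace features)
    of the generated clip and the reference subject.
    """
    emb_gen = emb_gen / np.linalg.norm(emb_gen)
    emb_ref = emb_ref / np.linalg.norm(emb_ref)
    return float(np.dot(emb_gen, emb_ref))
```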
4. Baseline Comparisons and Benchmark Results
ContextDubBench includes benchmarking of several canonical and state-of-the-art audio-driven dubbing models, grouped as follows:
- GAN-based: Wav2Lip, VideoReTalking, TalkLip, IP-LAP
- Diffusion-based: Diff2Lip, MuseTalk, LatentSync
- Self-Bootstrapped Diffusion Video Editing (labelled "Ours-editor" in results)
The table below summarizes the quantitative performance of these methods on the benchmark, as reported in (He et al., 31 Dec 2025):
| Method | FID↓ | FVD↓ | NIQE↓ | BRISQUE↓ | HyperIQA↑ | Sync-C↑ | CSIM↑ | Success %↑ |
|---|---|---|---|---|---|---|---|---|
| Wav2Lip | 19.33 | 631.59 | 6.91 | 48.40 | 35.67 | 5.09 | 0.738 | 62.95 |
| VideoReTalking | 17.54 | 341.95 | 6.39 | 43.11 | 44.83 | 5.13 | 0.684 | 59.09 |
| TalkLip | 21.26 | 550.66 | 6.28 | 38.99 | 34.31 | 3.21 | 0.739 | 70.45 |
| IP-LAP | 14.89 | 328.73 | 6.58 | 44.88 | 38.06 | 2.29 | 0.797 | 57.73 |
| Diff2Lip | 17.13 | 378.53 | 6.55 | 44.06 | 36.87 | 4.70 | 0.705 | 71.82 |
| MuseTalk | 17.52 | 294.31 | 6.55 | 43.78 | 42.34 | 2.21 | 0.672 | 60.00 |
| LatentSync | 13.60 | 265.06 | 6.11 | 39.15 | 41.65 | 6.28 | 0.801 | 59.77 |
| Ours-editor | 9.35 | 214.30 | 5.78 | 29.87 | 51.96 | 7.28 | 0.850 | 96.36 |
Arrows in the column headers indicate whether lower (↓) or higher (↑) values are better.
Qualitative failure modes for mask-based/inpainting approaches include visible lip-leakage during silence, profile distortion, and artifacts from occlusion; in contrast, approaches leveraging full visual context (as with X-Dub's editor) demonstrate improved lip shape accuracy, identity preservation, and robustness to challenging subject conditions (He et al., 31 Dec 2025).
5. Data Distribution, Licensing, and Usage Guidelines
- Availability: Publicly released under MIT License.
- Access: Download links, metadata, and scripts are hosted at: https://hjrphoebus.github.io/X-Dub/
- Structure:
- videos/: 440 .mp4 files (512×512, 25 fps)
- audios/: corresponding .wav files (16 kHz)
- meta.json: maps video–audio pairs; annotates language, subject type, duration, and landmark data
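A minimal sketch for iterating over the release, assuming meta.json contains one record per pair with the fields listed above (the key names below are hypothetical and should be checked against the released file):

```python
import json
from pathlib import Path

root = Path("ContextDubBench")  # hypothetical local download path
meta = json.loads((root / "meta.json").read_text())

# Assumed layout: a list of records, one per pair; key names are illustrative.
for record in meta:
    video_path = root / "videos" / record["video"]
    audio_path = root / "audios" / record["audio"]
    language = record.get("language")
    subject_type = record.get("subject_type")
    # ... run a dubbing model on (video_path, audio_path) here
```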
Recommended usage practices:
- Evaluate candidate models on all 440 examples, reporting each metric listed in Table 2 of (He et al., 31 Dec 2025).
- Any sample for which the model produces a runtime error or fails lip synchronization completely counts against the success rate and should be excluded when averaging the remaining metrics (see the sketch after this list).
- For reproducibility, utilize official scripts and models for FID/FVD, SyncNet-based sync metrics, and landmark detection.
- When publishing results, cite ContextDubBench explicitly.
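The sketch below illustrates one way to aggregate per-sample results under these guidelines: failed samples count against the success rate and are skipped when averaging the other per-sample metrics (this is an interpretation of the guideline; the official scripts remain authoritative).

```python
def aggregate(results: list) -> dict:
    """results: one dict per sample, e.g.
    {"failed": bool, "lmd": float, "sync_c": float, "csim": float}.
    Metric keys here are illustrative; failed samples lower the success
    rate and are excluded from the per-sample metric averages.
    """
    ok = [r for r in results if not r["failed"]]
    summary = {"success_rate": 100.0 * len(ok) / len(results)}
    for key in ("lmd", "sync_c", "csim"):
        summary[key] = sum(r[key] for r in ok) / len(ok) if ok else float("nan")
    return summary
```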
6. Positioning Relative to Prior Benchmarks
Previous benchmarks for audiovisual generation have typically focused on short, phrase-level speech clips, single-language domains, or limited subject variability; most do not assess performance under real-world occlusions, stylization, or extreme poses. ContextDubBench is distinguished by:
- Inclusion of both real and synthetic (stylized/non-human) subjects.
- Language diversity encompassing both speech and singing, across six languages.
- Extensive variance in lighting, pose, occlusion, and background.
- Comprehensive and reproducible metric definitions and scripts.
A plausible implication is that use of ContextDubBench will yield a more realistic assessment of system robustness and a better understanding of error modes under operational conditions than prior controlled-setting benchmarks.
7. Significance and Future Directions
ContextDubBench provides a rigorous, unified foundation for researchers developing and comparing audio-driven dubbing systems. Its construction methodology, focus on scenario diversity, and standardized evaluation protocol address key limitations of earlier datasets. As the field advances, expanding ContextDubBench to include longer-form dialogues, additional subject types, or more extreme adverse conditions could further enhance its utility for the development of context-aware, generalizable dubbing technologies (He et al., 31 Dec 2025).