FacEDiTBench: Talking-Face Editing Benchmark
- FacEDiTBench is the first standardized benchmark for talking-face editing, establishing precise evaluation protocols for localized speech-driven facial motion edits.
- It comprises 250 held-out examples from diverse, high-definition datasets, featuring substitution, insertion, and deletion edits across varied word spans.
- The benchmark introduces rigorous metrics—such as lip-synchronization, identity preservation, and temporal continuity—to drive innovations in dynamic face editing.
FacEDiTBench is the first standardized benchmark designed for the task of talking-face editing, introduced in conjunction with the FacEDiT model to rigorously evaluate fine-grained edits—such as substitution, insertion, and deletion—on speech-driven facial motion sequences. The dataset, annotation protocols, evaluation metrics, and experimental frameworks collectively establish a unified and reproducible standard for assessing dynamic talking-face editing under localized manipulation scenarios, and they provide quantitative as well as qualitative insights into model behavior beyond canonical generation settings (Sung-Bin et al., 16 Dec 2025).
1. Dataset Composition and Characteristics
FacEDiTBench comprises 250 held-out editing examples drawn from three publicly available, high-diversity talking-face video collections: 100 samples from HDTF, 100 from Hallo3, and 50 from CelebV-Dub. Each example consists of a contiguous utterance spanning an edited segment of approximately 1–5 seconds, covering roughly 150–200 unique speaker identities with a balanced mix of genders. The language of the clips is primarily English, with CelebV-Dub contributing a minor proportion of dubbed multilingual samples.
All video material is sourced at HD (720p–1080p) with frame rates between 25 and 30 fps; facial motion frames are center-cropped and re-rendered at the original source rate during editing.
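For concreteness, one benchmark example can be pictured as a record that pairs the source clip with its edit specification. The sketch below is a plausible layout only; the field names and values are illustrative assumptions, not FacEDiTBench's actual schema.

```python
# Hypothetical layout of a single FacEDiTBench example.
# Field names and values are illustrative, not the benchmark's actual schema.
example = {
    "source_dataset": "HDTF",                 # HDTF, Hallo3, or CelebV-Dub
    "video_path": "clips/example_0001.mp4",   # HD clip, 25-30 fps
    "original_transcript": "the weather was lovely this morning",
    "edited_transcript": "the weather was dreadful this morning",
    "edit_type": "substitution",              # substitution | insertion | deletion
    "edit_span_words": 1,                     # short (1-3), medium (4-6), long (7-10)
    "edit_start_sec": 1.92,                   # boundaries found by forced alignment
    "edit_end_sec": 2.48,
}
```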
FacEDiTBench includes three distinct edit types:
- Substitution: replacement of a phrase in the transcript/speech, potentially with a sequence of different duration.
- Insertion: addition of a new phrase at a specified temporal location, retaining all original speech before and after.
- Deletion: excision of an existing phrase, followed by temporal stitching of the remaining content.
Edit spans fall into three categories by word count: short (1–3 words), medium (4–6 words), and long (7–10 words). The distribution of types and lengths is detailed below; a transcript-level sketch of the three edit operations follows the table.
| Edit span | Insertion | Substitution | Deletion | Total |
|---|---|---|---|---|
| 1–3 words | 9 | 20 | 5 | 34 |
| 4–6 words | 43 | 70 | 11 | 124 |
| 7–10 words | 29 | 59 | 4 | 92 |
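At the transcript level, the three edit types are simple word-span operations. The minimal sketch below applies each operation to a tokenized transcript; it covers only the text side and leaves speech resynthesis and motion infilling to the protocol in the next section.

```python
def apply_edit(words, edit_type, start, end, new_words=None):
    """Apply a word-span edit to a tokenized transcript.

    words:      list of words in the original transcript
    start, end: word indices of the edited span (end exclusive)
    new_words:  replacement/inserted words (unused for deletion)
    """
    if edit_type == "substitution":
        return words[:start] + list(new_words) + words[end:]
    if edit_type == "insertion":
        # Insertion keeps all original words and adds new ones at `start`.
        return words[:start] + list(new_words) + words[start:]
    if edit_type == "deletion":
        # Deletion removes the span; the remaining speech is later stitched in time.
        return words[:start] + words[end:]
    raise ValueError(f"unknown edit type: {edit_type}")

original = "the weather was lovely this morning".split()
print(apply_edit(original, "substitution", 3, 4, ["dreadful"]))
```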
2. Task Definitions and Benchmark Protocol
FacEDiTBench is intended strictly as a held-out evaluation set; no model training utilizes its samples. FacEDiT itself is trained on approximately 200 hours of data spanning Hallo3 (130 h), CelebV-Dub (60 h), and HDTF (10 h), with full talking-face generation evaluated separately on the standard HDTF test split.
The canonical editing task protocol consists of the following steps (a sketch of the latent-masking step follows the list):
- Inputs: original video and transcript, edited transcript, and edited speech synthesized using VoiceCraft.
- Temporal Alignment: automatic detection of start/end timestamps via forced aligners such as WhisperX to localize editable phrases.
- Latent Masking: construction of a binary mask over the facial-motion latent sequence marking the segment targeted for editing; the mask is expanded or contracted to synchronize with the new speech duration.
- Infilling Model Input: provision of noisy plus masked motion latents (as in flow-matching infilling training), concatenated with edited-speech features.
- Synthesis and Integration: the model predicts the motion latents for the edited segment, which are stitched with the preserved original latents, decoded via LivePortrait, and re-timed to render the edited frames.
No manual frame selection occurs; all intermediate steps leverage automated alignment and processing tools.
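Of these steps, the latent-masking step is the most mechanical and can be sketched directly. The helper below is an illustrative reimplementation under the assumption that motion latents are produced at the video frame rate; it is not the paper's code, and the infilling model and LivePortrait decoding are out of scope.

```python
import numpy as np

def build_edit_mask(num_latents, fps, start_sec, end_sec, new_duration_sec):
    """Sketch of the latent-masking step: mark the edited segment and resize the
    sequence so the masked region matches the new speech length.
    Illustrative reimplementation, not the paper's code."""
    start = int(round(start_sec * fps))
    end = int(round(end_sec * fps))
    new_len = int(round(new_duration_sec * fps))
    # Total length changes when the edited speech is shorter or longer.
    total = num_latents - (end - start) + new_len
    mask = np.zeros(total, dtype=bool)
    mask[start:start + new_len] = True       # region the model must infill
    keep_before = np.arange(0, start)        # preserved original latents
    keep_after = np.arange(end, num_latents)
    return mask, keep_before, keep_after

# Example: a 4 s clip at 25 fps where a 1.0 s phrase is replaced by a 1.4 s one.
mask, before, after = build_edit_mask(num_latents=100, fps=25,
                                      start_sec=1.0, end_sec=2.0,
                                      new_duration_sec=1.4)
print(mask.sum(), len(before), len(after))   # 35 masked, 25 + 50 preserved
```

In this example, replacing a 1.0 s phrase with a 1.4 s one lengthens the latent sequence from 100 to 110 positions, 35 of which must be infilled.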
3. Evaluation Metrics
Performance is assessed exclusively on the edited span for editing tasks; for generation, metrics are computed across the full clip.
Lip-Synchronization
- LSE-D: Lip Sync Error Distance (lower is better); computed as the distance between SyncNet embeddings of audio and video.
- LSE-C: Lip Sync Confidence (higher is better); measures cosine similarity between the same embeddings (a simplified sketch of both scores follows).
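Taken at face value, both scores reduce to distance/similarity statistics over paired per-frame SyncNet embeddings. The sketch below assumes such embeddings have already been extracted and is a simplification: the official SyncNet evaluation also searches over temporal offsets.

```python
import numpy as np

def lse_scores(audio_emb, video_emb):
    """Illustrative lip-sync scores from paired SyncNet-style embeddings,
    each of shape (num_frames, dim). A simplification of the official
    SyncNet evaluation, which additionally searches over temporal offsets."""
    # LSE-D: mean Euclidean distance between paired embeddings (lower is better).
    lse_d = np.linalg.norm(audio_emb - video_emb, axis=1).mean()
    # LSE-C: mean cosine similarity between the same pairs (higher is better).
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    lse_c = (a * v).sum(axis=1).mean()
    return lse_d, lse_c
```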
Identity Preservation
- IDSIM: Identity similarity (higher is better), calculated as the cosine similarity between ArcFace embeddings averaged over the edited frames (illustrated in the sketch below).
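A minimal sketch of this averaging, assuming ArcFace embeddings are already available for a reference frame from the original clip and for each edited frame:

```python
import numpy as np

def idsim(ref_emb, edited_embs):
    """Mean cosine similarity between a reference ArcFace embedding (dim,)
    and ArcFace embeddings of the edited frames (num_frames, dim).
    Illustrative only; embedding extraction itself is not shown."""
    ref = ref_emb / np.linalg.norm(ref_emb)
    frames = edited_embs / np.linalg.norm(edited_embs, axis=1, keepdims=True)
    return float((frames @ ref).mean())  # higher means better identity preservation
```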
Temporal Continuity (Boundary Quality)
- P_cont (Photometric Continuity, lower is better): the photometric difference between the last unedited frame and the first edited frame at the edit boundary.
- M_cont (Motion Continuity, lower is better): the optical-flow discrepancy across the same boundary, with optical flow computed via RAFT (a sketch of both boundary metrics follows).
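A sketch of the two boundary metrics is given below. The use of mean absolute pixel difference and mean flow magnitude is an assumption, since the exact formulas are not reproduced here; the optical-flow computation is left as an injected callable (in practice, RAFT).

```python
import numpy as np

def boundary_continuity(last_unedited, first_edited, flow_fn):
    """Illustrative boundary-continuity scores (both lower is better).

    last_unedited, first_edited: HxWx3 float frames around the edit boundary.
    flow_fn: callable returning an HxWx2 optical-flow field between two frames
             (e.g. a RAFT wrapper); kept abstract here.
    The mean-absolute formulation is an assumption, not the paper's exact definition.
    """
    # Photometric continuity: pixel difference across the boundary.
    p_cont = np.abs(first_edited - last_unedited).mean()
    # Motion continuity: magnitude of the flow across the boundary.
    flow = flow_fn(last_unedited, first_edited)
    m_cont = np.linalg.norm(flow, axis=-1).mean()
    return p_cont, m_cont
```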
Other Common Metrics
- FVD: Frechet Video Distance (lower is better), quantifies video-level realism.
- LPIPS: Learned Perceptual Image Patch Similarity (lower is better), evaluated only in the generation setting.
Implementation utilizes official PyTorch SyncNet, publicly available RAFT, and ArcFace (pre-trained on MS1MV2) for embeddings. All continuity metrics are calculated on center crops, with boundaries established strictly by transcript alignment rather than manual selection.
4. Baseline Models and Comparative Results
Multiple state-of-the-art talking-face generation models are repurposed for the editing task by stitching generated spans into the original video. These include V-Express, AniPortrait, EchoMimic, EchoMimicV2, SadTalker, Hallo, Hallo2, Hallo3, and KeyFace.
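Because these baselines are pure generation models, editing is emulated by regenerating only the edited span and splicing it back into the source frames. A minimal frame-level splice, with the generator itself out of scope, looks like this:

```python
def splice_edited_span(original_frames, generated_frames, start_idx, end_idx):
    """Replace frames [start_idx, end_idx) of the original clip with frames
    regenerated by a talking-face generation model for the edited speech.
    Frame counts may differ when the edited speech has a new duration."""
    return (original_frames[:start_idx]
            + list(generated_frames)
            + original_frames[end_idx:])
```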
Quantitative Results on Editing (FacEDiTBench)
| Method | LSE-D ↓ | LSE-C ↑ | IDSIM ↑ | FVD ↓ | P_cont ↓ | M_cont ↓ |
|---|---|---|---|---|---|---|
| V-Express | 7.719 | 6.381 | 0.845 | 201.39 | 40.67 | 11.47 |
| AniPortrait | 10.601 | 0.904 | 0.905 | 141.57 | 8.28 | 8.28 |
| EchoMimic | 9.313 | 4.289 | 0.885 | 158.63 | 12.92 | 17.89 |
| EchoMimicV2 | 9.698 | 3.559 | 0.866 | 142.93 | 12.36 | 16.51 |
| SadTalker | 7.644 | 6.202 | 0.851 | 112.95 | 8.24 | 7.44 |
| Hallo | 8.244 | 5.664 | 0.909 | 99.45 | 7.24 | 6.43 |
| Hallo2 | 8.205 | 5.724 | 0.918 | 95.26 | 7.60 | 6.00 |
| Hallo3 | 8.754 | 5.559 | 0.880 | 106.53 | 7.65 | 7.31 |
| KeyFace | 9.487 | 4.309 | 0.885 | 110.62 | 7.71 | 7.22 |
| FacEDiT | 7.135 | 6.670 | 0.966 | 61.93 | 2.42 | 0.80 |
FacEDiT achieves the best scores across all reported metrics, demonstrating substantially improved boundary continuity, identity preservation, and lip-synchronization.
Results on Full-Sequence Generation (HDTF Test Split)
| Method | LSE-D ↓ | LSE-C ↑ | IDSIM ↑ | FVD ↓ | LPIPS ↓ |
|---|---|---|---|---|---|
| V-Express | 7.470 | 7.842 | 0.906 | 65.16 | 0.341 |
| AniPortrait | 11.168 | 2.866 | 0.914 | 102.41 | 0.310 |
| EchoMimic | 9.085 | 5.555 | 0.929 | 89.13 | 0.320 |
| EchoMimicV2 | 9.966 | 4.051 | 0.890 | 131.60 | 0.322 |
| SadTalker | 7.557 | 7.421 | 0.848 | 81.57 | 0.309 |
| Hallo | 7.824 | 7.080 | 0.925 | 42.83 | 0.283 |
| Hallo2 | 7.815 | 7.038 | 0.927 | 43.13 | 0.281 |
| Hallo3 | 8.583 | 6.717 | 0.900 | 38.64 | 0.315 |
| KeyFace | 8.905 | 5.795 | 0.926 | 49.04 | 0.267 |
| FacEDiT | 6.950 | 7.960 | 0.930 | 31.66 | 0.289 |
FacEDiT consistently outperforms baselines in both the editing and generation contexts.
5. Key Empirical Observations
FacEDiTBench experiments reveal several critical findings:
- The unified infilling formulation in FacEDiT yields markedly stronger results in the editing setting than repurposed generation models.
- Boundary artifacts constitute the principal failure mode for previous methods, manifesting as abrupt photometric or motion discontinuities at edit boundaries. FacEDiT achieves reductions in P_cont of approximately 70–80% and in M_cont of 85–90% relative to baselines.
- Identity drift is mitigated in FacEDiT, which maintains IDSIM ≥ 0.96 throughout, whereas other models often fall below 0.92.
- Lip-synchronization deteriorates for edits of longer spans (7–10 words), with all methods displaying an increase in LSE-D of 5–10% versus short spans. FacEDiT, however, maintains a stable performance margin.
- Insertion edits pose particular challenges due to the need for precise temporal retiming; FacEDiT incorporates a specialized frame-resampling module to address this (a generic retiming illustration follows this list).
- Qualitative analysis highlights persistent baseline issues such as “front-facing bias,” phonetically “mumbled lips,” and “identity washes,” all of which FacEDiT’s architecture and losses are designed to ameliorate.
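The insertion bullet above references temporal retiming; FacEDiT's frame-resampling module is not detailed here. As a generic illustration of retiming only, the sketch below resamples a frame sequence to a new length by nearest-index lookup, which is not FacEDiT's actual module.

```python
import numpy as np

def retime_frames(frames, new_len):
    """Resample a frame sequence to `new_len` frames by nearest-index lookup.
    A generic stand-in for temporal retiming, not FacEDiT's resampling module."""
    old_len = len(frames)
    idx = np.round(np.linspace(0, old_len - 1, new_len)).astype(int)
    return [frames[i] for i in idx]

# Example: stretch a 25-frame (1 s @ 25 fps) segment to 35 frames (1.4 s).
stretched = retime_frames(list(range(25)), 35)
print(len(stretched))  # 35
```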
A plausible implication is that FacEDiTBench's design and metrics offer a foundation for rigorous evaluation of localized speech-driven face editing, distinguishing subtle model behaviors that are ill-captured by generation-only protocols.
6. Significance and Research Impact
FacEDiTBench establishes a domain-standard resource for benchmarking talking-face editing, with rigorous protocols, diverse editing scenarios, and comprehensive, diagnosis-oriented metrics. It highlights the importance of boundary continuity and identity consistency, expanding the evaluation space beyond traditional end-to-end or concatenation-based paradigms.
FacEDiTBench enables systematic comparison of models under localized edit conditions and acts as a catalyst for innovations in speech-driven face editing frameworks, model architectures, and evaluation strategies. Its introduction is directly associated with a significant leap in both task definition and performance for the field of talking-face editing and generation (Sung-Bin et al., 16 Dec 2025).