Lip-Sync Error-Distance (LSE-D) Metrics
- LSE-D is a metric defined as the mean L2 distance between SyncNet-generated audio and video embeddings, capturing lip-sync quality.
- It is computed over sliding windows of frames; when the audio and video streams differ in rate, resampling or dynamic time warping is used to align them in time.
- It serves as a de facto benchmark for evaluating generative models in audio-driven facial animation, dubbing, and cross-modal translation.
Lip-Sync Error-Distance (LSE-D) quantifies the alignment of mouth movements in talking face videos with driving speech audio. It is grounded in the computation of Euclidean distances between paired audio and video embeddings produced by a pretrained audiovisual synchrony network—typically SyncNet. As a de facto standard in lip-synchronization evaluation, LSE-D has been widely adopted to benchmark generative models for audio-driven facial animation, dubbed video, and cross-modal translation tasks.
1. Mathematical Formulation and Computation
LSE-D is formally defined as the mean distance between SyncNet audio and video embeddings for temporally aligned windows of a target clip. For a test sequence of $N$ audio–video-aligned frames or windows:

$$\mathrm{LSE\text{-}D} = \frac{1}{N} \sum_{i=1}^{N} \lVert v_i - a_i \rVert_2$$

where $v_i$ and $a_i$ are the $d$-dimensional (e.g., $256$) L2-normalized visual and audio embeddings from SyncNet for frame (or window) $i$ (Prajwal et al., 2020, Wang et al., 11 Feb 2026, Goncalves et al., 2024).
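As a minimal sketch of this definition, the snippet below averages per-window Euclidean distances over already-aligned embedding pairs. It assumes the embeddings have been extracted elsewhere and uses them exactly as given; the function names `l2_distance` and `lse_d` and the toy 2-dimensional vectors are illustrative, not part of any SyncNet release.

```python
import math


def l2_distance(v, a):
    """Euclidean distance between two equal-length embedding vectors."""
    if len(v) != len(a):
        raise ValueError("embeddings must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(v, a)))


def lse_d(video_embs, audio_embs):
    """Mean per-window L2 distance over temporally aligned embedding pairs."""
    if len(video_embs) != len(audio_embs) or not video_embs:
        raise ValueError("need the same non-zero number of video and audio windows")
    dists = [l2_distance(v, a) for v, a in zip(video_embs, audio_embs)]
    return sum(dists) / len(dists)


# Toy 2-dimensional embeddings (hypothetical; real embeddings are much larger).
video = [[0.0, 0.0], [1.0, 1.0]]
audio = [[3.0, 4.0], [1.0, 1.0]]
print(lse_d(video, audio))  # per-window distances are 5.0 and 0.0, mean 2.5
```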
For practical computation, video frames are aggregated into sliding windows (e.g., 5 or 16 contiguous frames), and the corresponding synchronous audio segment (e.g., 200–640 ms) is converted into a Mel-spectrogram. Embeddings are computed for each window, and the per-window distances are then averaged. No further normalization is required beyond SyncNet's internal normalization. When sampling rates or frame rates differ, up- or down-sampling or dynamic time warping (DTW) aligns the embedding streams (Goncalves et al., 2024).
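The windowing and alignment steps can be sketched as follows. This is a textbook DTW over two embedding streams of different lengths, offered as an assumption about how such alignment might be done rather than the exact procedure of any cited paper; `sliding_windows` and `dtw_mean_distance` are hypothetical helper names.

```python
import math


def sliding_windows(num_frames, window, stride=1):
    """Return (start, end) frame index pairs for every full window."""
    return [(s, s + window) for s in range(0, num_frames - window + 1, stride)]


def l2(v, a):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(v, a)))


def dtw_mean_distance(video_embs, audio_embs):
    """Mean L2 distance along the minimum-cost DTW warping path."""
    n, m = len(video_embs), len(audio_embs)
    inf = float("inf")
    # cost[i][j] = minimal accumulated distance aligning the first i video
    # embeddings with the first j audio embeddings.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = l2(video_embs[i - 1], audio_embs[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack to count how many pairs lie on the optimal path.
    i, j, steps = n, m, 0
    while i > 0 and j > 0:
        steps += 1
        best = min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
        if best == cost[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif best == cost[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return cost[n][m] / steps


# Ten frames grouped into 5-frame windows with stride 1 gives six windows.
print(sliding_windows(10, 5))

# Toy 1-dimensional streams: the audio stream repeats its first embedding.
video = [[0.0], [1.0], [2.0]]
audio = [[0.0], [0.0], [1.0], [2.0]]
print(dtw_mean_distance(video, audio))  # the streams warp onto each other exactly
```

Dividing the accumulated cost by the path length keeps the result comparable to a plain per-window mean; other normalizations are possible and would change the absolute values.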
2. Underlying Principles and Rationale
LSE-D exploits the cross-modal embedding space learned by SyncNet, which is trained to co-locate embeddings of correctly synchronized audio-video pairs and to separate asynchronous pairs. A lower LSE-D thus empirically corresponds to tighter phoneme-to-viseme alignment. The metric is robust to illumination and pose variations thanks to SyncNet’s in-the-wild training, and because it is grounded in phonetic–visual correspondence it reflects lip sync better than generic image-similarity metrics (SSIM, PSNR) computed on the mouth region (Prajwal et al., 2020, Wang et al., 2022).
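To make the "co-locate synchronized pairs, separate asynchronous pairs" idea concrete, here is a sketch of a generic margin-based contrastive loss over embedding distances. The text above does not specify SyncNet's exact objective, so treat the loss form, the `margin` value, and the function name as assumptions chosen for illustration.

```python
def contrastive_loss(distances, labels, margin=1.0):
    """Margin-based contrastive loss over audio-video embedding distances.

    distances: L2 distance for each audio-video pair
    labels:    1 for a synchronized pair, 0 for an asynchronous pair
    """
    if len(distances) != len(labels) or not distances:
        raise ValueError("need one label per distance")
    total = 0.0
    for d, y in zip(distances, labels):
        if y == 1:
            # Synchronized pairs are pulled together: any distance is penalized.
            total += d ** 2
        else:
            # Asynchronous pairs are pushed apart until they clear the margin.
            total += max(margin - d, 0.0) ** 2
    return total / (2 * len(distances))


# A synced pair at distance 0 and an unsynced pair beyond the margin cost nothing.
print(contrastive_loss([0.0, 2.0], [1, 0]))
# A synced pair that is far apart and an unsynced pair that coincides are both penalized.
print(contrastive_loss([1.0, 0.0], [1, 0]))
```

Under an objective of this kind, small distances become evidence of synchrony, which is what lets a mean distance serve as an evaluation metric.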
3. Implementation Protocols
The canonical LSE-D evaluation protocol entails:
- Preprocessing: Each video frame undergoes mouth- or face-cropping to SyncNet’s input resolution, with frames grouped into fixed-length windows.
- Audio: Corresponding audio windows are extracted (0.2–0.64 s), converted to spectrogram form, and aligned to video windows.
- Embedding Extraction: Pretrained SyncNet encoders generate normalized $d$-dimensional feature vectors for each modality.
- Distance Calculation: For each window, compute the Euclidean (L2) distance between the video and audio embeddings.
- Aggregation: Per-video means are averaged over all test clips for dataset-level LSE-D.
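The aggregation step can be sketched as a mean of per-video means. The clip names and distance values below are made up for illustration, and the helper names are hypothetical.

```python
def per_video_lse_d(window_distances):
    """Mean of the per-window distances for a single clip."""
    if not window_distances:
        raise ValueError("clip has no windows")
    return sum(window_distances) / len(window_distances)


def dataset_lse_d(clips):
    """Average the per-video means over all clips (each clip weighs the same)."""
    if not clips:
        raise ValueError("dataset has no clips")
    means = [per_video_lse_d(d) for d in clips.values()]
    return sum(means) / len(means)


# Hypothetical per-window distances for two clips of different lengths.
clips = {
    "clip_a": [6.0, 8.0],         # per-video mean 7.0
    "clip_b": [9.0, 9.0, 12.0],   # per-video mean 10.0
}
print(dataset_lse_d(clips))  # mean of 7.0 and 10.0 is 8.5
```

Averaging per-video means weights every clip equally. Pooling all windows first would instead weight longer clips more heavily (8.8 for the toy data above), so the two conventions should not be mixed when comparing reported numbers.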
Some studies use the whole face crop, others a tight mouth region; for 3D talking head synthesis, frames are rendered from predicted meshes prior to cropping (Wang et al., 11 Feb 2026). LSE-D is lowest for correctly synchronized content and increases with audio–video misalignment or generative artifacts.
4. Applications in Evaluation and Training
LSE-D is primarily used for quantitative evaluation and comparison of systems. Reported values enable benchmarking on standardized datasets. For example:
| Method | Dataset | LSE-D |
|---|---|---|
| Wav2Lip | LRS2 | 7.521 |
| AttnWav2Lip | LRS2 | 7.339 |
| MuseTalk | HDTF | 8.30 |
| JoyGen | HDTF | 7.19 |
| 3DXTalker | Multi-data | 13.33 |
| FaceDiffuser | Multi-data | 12.73 |
Lower LSE-D correlates with tighter lip–speech alignment (Prajwal et al., 2020, Wang et al., 11 Feb 2026, Wang et al., 2022).
In some settings, lip-sync metrics directly inform model development: Goncalves et al. (2024) incorporated an explicit lip-synchrony loss based on the SyncNet confidence score (the quantity reported as LSE-C), yielding a 9.2% reduction in LSE-D for cross-lingual speech-to-speech translation without degrading perceptual quality.
Ablation studies have shown that embedding additional amplitude or emotion cues into generative models (e.g., 3DXTalker) can meaningfully reduce LSE-D, indicating an improvement in phoneme timing and nuanced lip articulation (Wang et al., 11 Feb 2026).
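Relative reductions of the kind quoted in this section are computed against a baseline value. As a small worked sketch, the snippet below applies the usual percentage formula to the Wav2Lip and AttnWav2Lip LRS2 entries from the table; the function name is hypothetical.

```python
def relative_reduction_pct(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline * 100.0


# LRS2 values from the table: Wav2Lip 7.521, AttnWav2Lip 7.339.
reduction = relative_reduction_pct(7.521, 7.339)
print(f"{reduction:.2f}%")  # roughly a 2.4% reduction
```

Because LSE-D on real, synchronized video is not zero, a percentage reduction measured from zero understates how much of the achievable gap a method closes; reporting the absolute values alongside avoids ambiguity.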
5. Interpretation and Comparative Benchmarks
Interpretation of LSE-D values depends on context:
- LSE-D $< 8$: imperceptible sync error
- $8 \le$ LSE-D $\le 10$: minor, potentially visible slip
- LSE-D $> 10$: significant perceptual misalignment
On real, perfectly synced video, LSE-D typically falls in the 6.7–7.0 range (Prajwal et al., 2020, Wang et al., 2022, Goncalves et al., 2024). Randomly mismatched audio–video pairs yield LSE-D in the 12–15 range.
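These rule-of-thumb bands can be encoded as a small helper. The thresholds 8 and 10 come from the list above; how the exact boundary values are assigned is an assumption of this sketch, as is the function name.

```python
def interpret_lse_d(value):
    """Map an LSE-D value to the rough interpretation bands listed above."""
    if value < 8.0:
        return "imperceptible sync error"
    if value <= 10.0:
        # Assumption: values of exactly 8 or 10 fall into the middle band.
        return "minor, potentially visible slip"
    return "significant perceptual misalignment"


# Values taken from the benchmark table earlier in the article.
for name, score in [("Wav2Lip", 7.521), ("MuseTalk", 8.30), ("3DXTalker", 13.33)]:
    print(name, score, interpret_lse_d(score))
```

Since the bands depend on context (dataset, crop, and SyncNet checkpoint), a helper like this is best used for quick triage rather than as a pass/fail criterion.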
LSE-D’s strengths lie in sensitivity to phoneme–viseme misalignment and robustness to appearance variations. Its limitations include dependence on SyncNet’s domain coverage, potential insensitivity to extreme poses or video compression artifacts, and incomplete capture of full 3D geometric correctness for mesh-based avatars (Wang et al., 11 Feb 2026).
6. Relationship to Other Metrics and Usage in Recent Literature
LSE-D is almost always paired with LSE-C (Lip-Sync Error–Confidence), which measures normalized cosine similarity between audio and video embeddings (higher is better, unlike LSE-D). LSE-D is insensitive to overall image quality (SSIM, PSNR), complementary to geometric errors (e.g., Lip Vertex Error for 3D synthesis), and less susceptible than landmark-based distances to tracking or detection errors (Prajwal et al., 2020, Wang et al., 2022, Wang et al., 11 Feb 2026).
Recent works, including JOLT3D (Park et al., 28 Jul 2025), AttnWav2Lip (Wang et al., 2022), 3DXTalker (Wang et al., 11 Feb 2026), and AVS2S translation (Goncalves et al., 2024), report LSE-D as a primary or secondary metric. Recent approaches in the style of “MyTalk” (Yu et al., 2024) do not introduce a variant named “LSE-D” in landmark or 3D space; instead, they adopt SyncNet-based distances (sometimes called $\mathrm{Sync}_{\mathrm{dist}}$), with experimental protocols and formulas matching the LSE-D family.
7. Summary Table: Standardization of LSE-D Across Studies
| Paper | LSE-D Definition | Audio-Visual Model | Window Size |
|---|---|---|---|
| Wav2Lip (Prajwal et al., 2020) | Mean L2 distance | SyncNet | 5 frames |
| AttnWav2Lip (Wang et al., 2022) | Mean L2 distance | SyncNet | 5 frames |
| JOLT3D (Park et al., 28 Jul 2025) | Mean L2 distance | SyncNet | 16 frames |
| 3DXTalker (Wang et al., 11 Feb 2026) | Mean L2 distance | SyncNet | 1 frame |
| AVS2S Translation (Goncalves et al., 2024) | Mean L2 distance, with DTW alignment | SyncNet | flexible |
All implementations adhere to the core principle: evaluate the Euclidean separation of SyncNet audio–visual embeddings, averaged across a series of aligned, fixed-length windows or frames. Lower values universally reflect improved speech–lip synchronization.