Lip-Sync Error-Distance (LSE-D) Metrics

Updated 10 March 2026
  • LSE-D is a metric defined as the mean L2 distance between SyncNet-generated audio and video embeddings, capturing lip-sync quality.
  • It computes distances over sliding windows with techniques like dynamic time warping to ensure accurate temporal alignment.
  • It serves as a de facto benchmark for evaluating generative models in audio-driven facial animation, dubbing, and cross-modal translation.

Lip-Sync Error-Distance (LSE-D) quantifies the alignment of mouth movements in talking face videos with driving speech audio. It is grounded in the computation of Euclidean distances between paired audio and video embeddings produced by a pretrained audiovisual synchrony network—typically SyncNet. As a de facto standard in lip-synchronization evaluation, LSE-D has been widely adopted to benchmark generative models for audio-driven facial animation, dubbed video, and cross-modal translation tasks.

1. Mathematical Formulation and Computation

LSE-D is formally defined as the mean $\ell_2$ distance between SyncNet audio and video embeddings for temporally aligned windows of a target clip. For a test sequence of $T$ audio–video-aligned frames or windows:

$$\mathrm{LSE\text{-}D} = \frac{1}{T} \sum_{t=1}^{T} \big\| \phi_{\rm vid}(V_t) - \phi_{\rm aud}(A_t) \big\|_2$$

where $\phi_{\rm vid}(V_t)$ and $\phi_{\rm aud}(A_t)$ are the $d$-dimensional (typically $d=128$ or $256$) L2-normalized visual and audio embeddings from SyncNet for frame (or window) $t$ (Prajwal et al., 2020, Wang et al., 11 Feb 2026, Goncalves et al., 2024).
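As a minimal sketch of the definition above, the following NumPy function averages the per-window Euclidean distances between two already-extracted embedding arrays. The function name `lse_d` and the `(T, d)` array layout are assumptions made for illustration; the embeddings are taken as given, and the sketch applies no normalization of its own.

```python
import numpy as np

def lse_d(video_emb, audio_emb):
    """Mean Euclidean distance between paired (T, d) embedding arrays."""
    video_emb = np.asarray(video_emb, dtype=np.float64)
    audio_emb = np.asarray(audio_emb, dtype=np.float64)
    if video_emb.ndim != 2 or video_emb.shape != audio_emb.shape:
        raise ValueError("expected two arrays of identical shape (T, d)")
    # d_t = ||phi_vid(V_t) - phi_aud(A_t)||_2 for every window t
    per_window = np.linalg.norm(video_emb - audio_emb, axis=1)
    # LSE-D = (1/T) * sum_t d_t
    return float(per_window.mean())
```

For example, two windows with distances 0 and 5 give `lse_d([[0, 0], [3, 4]], [[0, 0], [0, 0]]) == 2.5`.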

For practical computation, video frames are aggregated into sliding windows (e.g., 5 or 16 contiguous frames), and the corresponding synchronous audio segment (e.g., 200–640 ms) is processed into a Mel-spectrogram. Embeddings are computed for each window, followed by aggregation of per-window distances $d_t = \|\phi_{\rm vid}(V_t) - \phi_{\rm aud}(A_t)\|_2$. No further normalization is required beyond internal SyncNet normalization. When sampling rates or frame rates differ, up- or down-sampling or dynamic time warping (DTW) aligns the embedding streams (Goncalves et al., 2024).
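When the two embedding streams have different lengths, a textbook DTW can pair them before averaging. The sketch below is one possible formulation, not necessarily the one used in the cited work: the function name `dtw_mean_distance`, the three-way step pattern, and the choice to divide the accumulated cost by the warping-path length are all assumptions made for illustration.

```python
import numpy as np

def dtw_mean_distance(video_emb, audio_emb):
    """Mean Euclidean distance along the minimum-cost DTW path."""
    v = np.asarray(video_emb, dtype=np.float64)  # shape (N, d)
    a = np.asarray(audio_emb, dtype=np.float64)  # shape (M, d)
    n, m = len(v), len(a)
    # Pairwise Euclidean distances between every video and audio embedding.
    cost = np.linalg.norm(v[:, None, :] - a[None, :, :], axis=2)
    acc = np.full((n + 1, m + 1), np.inf)        # accumulated path cost
    steps = np.zeros((n + 1, m + 1), dtype=int)  # number of pairs on the path
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Predecessors: diagonal match, repeat audio, repeat video.
            best_cost, best_steps = min(
                (acc[i - 1, j - 1], steps[i - 1, j - 1]),
                (acc[i - 1, j], steps[i - 1, j]),
                (acc[i, j - 1], steps[i, j - 1]),
            )
            acc[i, j] = cost[i - 1, j - 1] + best_cost
            steps[i, j] = best_steps + 1
    return float(acc[n, m] / steps[n, m])
```

With streams that differ only by a repeated element, such as `[[0], [1], [2]]` against `[[0], [1], [1], [2]]`, the warping path absorbs the repetition and the result is 0.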

2. Underlying Principles and Rationale

LSE-D exploits the cross-modal embedding space learned by SyncNet, which is trained to co-locate embeddings of correctly synchronized audio-video pairs and separate asynchronous pairs. Minimizing LSE-D thus empirically corresponds to tighter phoneme-to-viseme alignment. Because SyncNet is trained on in-the-wild video, the metric is robust to variations in illumination and pose. It is also grounded in phonetic–visual correspondence, so it reflects sync quality in the mouth region better than generic image-similarity metrics such as SSIM and PSNR (Prajwal et al., 2020, Wang et al., 2022).
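The "co-locate synchronized pairs, separate asynchronous pairs" objective can be illustrated with a standard contrastive loss over pairwise distances. This is a hedged sketch of one common form of such a loss; the text above does not specify the exact training objective, and the function name `contrastive_loss` and the `margin=1.0` default are illustrative assumptions.

```python
import numpy as np

def contrastive_loss(distances, is_synced, margin=1.0):
    """Contrastive loss over audio-video embedding distances.

    distances: Euclidean distance of each audio-video pair.
    is_synced: 1 for a synchronized pair, 0 for an off-sync pair.
    margin:    illustrative value, not taken from any cited paper.
    """
    d = np.asarray(distances, dtype=np.float64)
    y = np.asarray(is_synced, dtype=np.float64)
    # Synchronized pairs are penalized for being far apart.
    pull = y * d ** 2
    # Off-sync pairs are penalized only while closer than the margin.
    push = (1.0 - y) * np.maximum(margin - d, 0.0) ** 2
    return float(0.5 * np.mean(pull + push))
```

A network trained this way yields small distances for in-sync pairs, which is the property LSE-D relies on at evaluation time.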

3. Implementation Protocols

The canonical LSE-D evaluation protocol entails:

  • Preprocessing: Each video frame undergoes mouth- or face-cropping to SyncNet’s input resolution (commonly $96\times96$ or $112\times112$), with frames grouped into $T$-length windows.
  • Audio: Corresponding audio windows are extracted (0.2–0.64 s), converted to spectrogram form, and aligned to video windows.
  • Embedding Extraction: Pretrained SyncNet encoders generate normalized $d$-dimensional feature vectors for each modality.
  • Distance Calculation: For each window, compute the $\ell_2$ distance between video and audio embeddings.
  • Aggregation: Per-video means are averaged over all test clips for dataset-level LSE-D.
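The steps above can be sketched end to end as follows. This is an outline under stated assumptions, not a reference implementation: `video_encoder` and `audio_encoder` are hypothetical callables standing in for the pretrained SyncNet encoders (cropping and spectrogram conversion are assumed to happen inside them), and the defaults of 25 fps, 16 kHz audio, 5-frame windows, and stride 1 are illustrative choices.

```python
import numpy as np

def window_pairs(num_frames, fps=25, sample_rate=16000, window=5, stride=1):
    """Yield (frame_start, frame_end, sample_start, sample_end) per window."""
    for start in range(0, num_frames - window + 1, stride):
        end = start + window
        # Map frame indices to audio sample indices covering the same time span.
        yield (start, end,
               round(start * sample_rate / fps),
               round(end * sample_rate / fps))

def clip_lse_d(frames, audio, video_encoder, audio_encoder, **window_args):
    """Per-clip LSE-D: mean distance over all aligned windows."""
    dists = []
    for f0, f1, s0, s1 in window_pairs(len(frames), **window_args):
        v = np.asarray(video_encoder(frames[f0:f1]), dtype=np.float64)
        a = np.asarray(audio_encoder(audio[s0:s1]), dtype=np.float64)
        dists.append(np.linalg.norm(v - a))
    if not dists:
        raise ValueError("clip is shorter than one window")
    return float(np.mean(dists))

def dataset_lse_d(clips, video_encoder, audio_encoder, **window_args):
    """Dataset-level LSE-D: mean of per-clip means over (frames, audio) pairs."""
    return float(np.mean([
        clip_lse_d(frames, audio, video_encoder, audio_encoder, **window_args)
        for frames, audio in clips
    ]))
```

At 25 fps, a 5-frame window spans 200 ms, the lower end of the audio window range given above; a 16-frame window spans 640 ms.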

Some studies use the whole face crop, others a tight mouth region; for 3D talking head synthesis, frames are rendered from predicted meshes prior to cropping (Wang et al., 11 Feb 2026). LSE-D is expected to be lowest for correctly synchronized content and to increase with audio–video misalignment or generative artifacts.

4. Applications in Evaluation and Training

LSE-D is primarily used for quantitative evaluation and comparison of systems. Reported values enable benchmarking on standardized datasets. For example:

| Method | Dataset | LSE-D |
|---|---|---|
| Wav2Lip | LRS2 | 7.521 |
| AttnWav2Lip | LRS2 | 7.339 |
| MuseTalk | HDTF | 8.30 |
| JoyGen | HDTF | 7.19 |
| 3DXTalker | Multi-data | 13.33 |
| FaceDiffuser | Multi-data | 12.73 |

Lower LSE-D correlates with tighter lip–speech alignment (Prajwal et al., 2020, Wang et al., 11 Feb 2026, Wang et al., 2022).

In some settings, lip-sync metrics also inform model development directly: Goncalves et al. (2024) incorporated an explicit lip-synchrony loss based on the SyncNet confidence score (the quantity associated with LSE-C rather than LSE-D), yielding a 9.2% reduction in LSE-D for cross-lingual speech-to-speech translation without degrading perceptual quality.

Ablation studies have shown that embedding additional amplitude or emotion cues into generative models (e.g., 3DXTalker) can meaningfully reduce LSE-D, indicating an improvement in phoneme timing and nuanced lip articulation (Wang et al., 11 Feb 2026).

5. Interpretation and Comparative Benchmarks

Interpretation of LSE-D values depends on context:

  • LSE-D $\leq 8$: imperceptible sync error
  • $8 <$ LSE-D $\leq 10$: minor, potentially visible slip
  • LSE-D $> 10$: significant perceptual misalignment

On real, perfectly synced video, LSE-D typically falls in the 6.7–7.0 range (Prajwal et al., 2020, Wang et al., 2022, Goncalves et al., 2024). Randomly mismatched audio–video pairs yield LSE-D in the 12–15 range.
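The interpretation bands listed above map directly onto a small lookup helper. The function name `interpret_lse_d` is hypothetical; the thresholds of 8 and 10 and the labels are the ones stated in this section, and they should be read as rough guidance rather than fixed perceptual limits.

```python
def interpret_lse_d(value):
    """Map an LSE-D value to the qualitative band given in this section."""
    if value <= 8:
        return "imperceptible sync error"
    if value <= 10:
        return "minor, potentially visible slip"
    return "significant perceptual misalignment"

# Values taken from the benchmark table in Section 4.
for method, score in [("Wav2Lip", 7.521), ("MuseTalk", 8.30), ("3DXTalker", 13.33)]:
    print(f"{method}: {score} -> {interpret_lse_d(score)}")
```

Because the tabulated scores come from different datasets and protocols, such labels are best compared only within a single evaluation setup.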

LSE-D’s strengths lie in sensitivity to phoneme–viseme misalignment and robustness to appearance variations. Its limitations include dependence on SyncNet’s domain coverage, potential insensitivity to extreme poses or video compression artifacts, and incomplete capture of full 3D geometric correctness for mesh-based avatars (Wang et al., 11 Feb 2026).

6. Relationship to Other Metrics and Usage in Recent Literature

LSE-D is almost always paired with LSE-C (Lip-Sync Error–Confidence), which reports the SyncNet confidence score averaged over the clip (higher is better, unlike LSE-D). LSE-D is insensitive to overall image quality as measured by SSIM or PSNR, complementary to geometric errors (e.g., Lip Vertex Error for 3D synthesis), and less susceptible than landmark-based distances to tracking or detection errors (Prajwal et al., 2020, Wang et al., 2022, Wang et al., 11 Feb 2026).

Recent works, including JOLT3D (Park et al., 28 Jul 2025), AttnWav2Lip (Wang et al., 2022), 3DXTalker (Wang et al., 11 Feb 2026), and AVS2S translation (Goncalves et al., 2024), report LSE-D as a primary or secondary metric. None of the recent “MyTalk” (Yu et al., 2024)–style approaches introduce a variant named “LSE-D” in landmark or 3D space; instead, SyncNet-based distances (sometimes called $\mathrm{Sync}_{\mathrm{dist}}$) are universally adopted, with experimental protocols and formulas matching the LSE-D family.

7. Summary Table: Standardization of LSE-D Across Studies

| Paper | LSE-D Definition | Audio-Visual Model | Window Size |
|---|---|---|---|
| Wav2Lip (Prajwal et al., 2020) | Mean $\ell_2$ | SyncNet | 5 frames |
| AttnWav2Lip (Wang et al., 2022) | Mean $\ell_2$ | SyncNet | 5 frames |
| JOLT3D (Park et al., 28 Jul 2025) | Mean $\ell_2$ | SyncNet | 16 frames |
| 3DXTalker (Wang et al., 11 Feb 2026) | Mean $\ell_2$ | SyncNet | 1 frame (per frame) |
| AVS2S Translation (Goncalves et al., 2024) | Mean $\ell_2$, DTW | SyncNet | flexible |

All implementations adhere to the core principle: evaluate the Euclidean separation of SyncNet audio–visual embeddings, averaged across a series of aligned, fixed-length windows or frames. Lower values universally reflect improved speech–lip synchronization.
