Audio-Visual Target Speaker Extraction
- Audio-Visual Target Speaker Extraction (AV-TSE) is defined as isolating a target speaker’s voice from multi-speaker audio by leveraging synchronized visual cues, most commonly lip movements, together with auxiliary information.
- AV-TSE systems integrate deep audio and visual encoders with advanced fusion and masking techniques to optimize signal fidelity and intelligibility across varied acoustic environments.
- Recent approaches employ cross-modal attention, memory mechanisms, and linguistic guidance to robustly address issues like occluded visuals, multiple speakers, and dynamic noise conditions.
Audio-Visual Target Speaker Extraction (AV-TSE) is the problem of isolating a specific speaker’s voice from a multi-speaker audio mixture by leveraging time-aligned visual cues—most commonly, the target speaker’s lip movements—and, increasingly, other auxiliary information (linguistic, contextual, or scene-level). AV-TSE forms a foundational module in robust speech separation, audio-visual speech recognition, hearing assistive devices, and biometrics. State-of-the-art AV-TSE systems typically integrate deep speech and visual feature encoders with sophisticated multi-modal fusion and separation architectures, optimized for both signal fidelity and intelligibility across a range of practical acoustic and visual environments.
1. Formal Problem Definition and Task Taxonomy
The AV-TSE task is formally posed as follows: given a single- or multi-channel mixture waveform containing the target speech plus interfering speakers and possibly non-speech noise, together with synchronized video of the target speaker (usually cropped to the mouth region), the objective is to produce an estimate of the target signal that closely matches the clean reference.
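Under the common single-channel formulation this can be written as follows; the notation is illustrative, chosen to match the description above rather than any single cited paper:

$$
y(t) = s(t) + \sum_{k=1}^{K} b_k(t) + n(t), \qquad \hat{s}(t) = f_{\theta}\big(y(t), V\big),
$$

where $s(t)$ is the target speech, $b_k(t)$ are the interfering speakers, $n(t)$ is non-speech noise, $V$ is the target speaker’s lip-region video stream, and $f_{\theta}$ is the extraction network trained so that $\hat{s}(t) \approx s(t)$.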
The dominant system architecture follows a multi-stage pipeline (a minimal code sketch follows the list):
- Audio Encoder: maps the raw mixture waveform to a frame-level latent representation via 1D convolution (e.g., 256 filters with a short kernel and stride).
- Visual Encoder: extracts frame-level lip embeddings using a pre-trained spatio-temporal CNN with temporal adaptation (e.g., 3D-Conv + ResNet-18 + a stack of visual temporal convolution blocks).
- Cross-Modal Fusion and Separation: concatenation, cross-attention, or more elaborate fusion injects the visual embeddings into the masking/decoder network (TCN, DPRNN, Transformer, etc.), which estimates a separation mask over the encoded mixture.
- Waveform Reconstruction: applying the estimated mask to the encoded mixture and decoding (e.g., with a transposed 1D convolution) reconstructs the target waveform.
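A minimal PyTorch sketch of this mask-based pipeline is given below. The module sizes, the simple concatenation fusion, and the nearest-neighbor upsampling of visual features to the audio frame rate are illustrative assumptions, not a specific published architecture; a real system would plug in a pre-trained visual front-end and a TCN/DPRNN/Transformer separator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVTSEPipeline(nn.Module):
    """Illustrative mask-based AV-TSE pipeline (sizes are assumptions)."""

    def __init__(self, n_filters=256, kernel=16, stride=8, visual_dim=512):
        super().__init__()
        # Audio encoder: 1D conv turns the waveform into latent frames.
        self.audio_enc = nn.Conv1d(1, n_filters, kernel, stride=stride, bias=False)
        # Projection for visual embeddings (a 3D-Conv + ResNet-18 front-end in practice).
        self.visual_proj = nn.Linear(visual_dim, n_filters)
        # Separator: stands in for a TCN / DPRNN / Transformer mask estimator.
        self.separator = nn.Sequential(
            nn.Conv1d(2 * n_filters, n_filters, 1),
            nn.PReLU(),
            nn.Conv1d(n_filters, n_filters, 1),
        )
        # Audio decoder: transposed conv maps masked latents back to a waveform.
        self.audio_dec = nn.ConvTranspose1d(n_filters, 1, kernel, stride=stride, bias=False)

    def forward(self, mixture, visual_emb):
        # mixture: (B, T_samples); visual_emb: (B, T_video, visual_dim)
        a = self.audio_enc(mixture.unsqueeze(1))              # (B, N, T_frames)
        v = self.visual_proj(visual_emb).transpose(1, 2)      # (B, N, T_video)
        # Upsample visual features to the audio frame rate before fusion.
        v = F.interpolate(v, size=a.shape[-1], mode="nearest")
        fused = torch.cat([a, v], dim=1)                      # simple concatenation fusion
        mask = torch.sigmoid(self.separator(fused))           # (B, N, T_frames)
        est = self.audio_dec(a * mask)                        # (B, 1, ~T_samples)
        return est.squeeze(1)

# Usage: 1 s of 16 kHz audio and 25 fps visual embeddings (hypothetical shapes).
model = AVTSEPipeline()
est = model(torch.randn(2, 16000), torch.randn(2, 25, 512))
```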
A crucial refinement is the identification and handling of heterogeneous mixture segments: (a) target-present (fully or partially overlapped), (b) target-absent (silent), and (c) dynamic temporal “regimes” (e.g., QQ, SQ, SS, QS in (Pan et al., 2021)). State-of-the-art frameworks must address all such regimes.
2. Architectural Innovations and Fusion Strategies
Advanced AV-TSE architectures integrate a number of specialized modules designed to exploit and compensate for the unique characteristics of audio and visual modalities:
Audio and Visual Encoders
- Audio Encoder: Typically a shallow or moderate-depth Conv1D pipeline (kernel sizes in the 16–40 sample range, commonly around 256 channels). Some approaches utilize time-frequency STFT encoders for frequency-domain modeling (Li et al., 28 May 2025).
- Visual Encoder: The standard choice is a pre-trained 3D-Conv + ResNet-18 trunk for lip ROI embedding, with further adaptation via a V-TCN stack or other temporal smoothing/upsampling to match audio frame granularity (Pan et al., 2021, Lin et al., 2023, Wu et al., 24 Mar 2024). These are usually frozen after initial training on a large lip-reading task (a schematic sketch follows this list).
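A schematic sketch of such a visual front-end is shown below, assuming a torchvision ResNet-18 trunk and a depthwise temporal-convolution stack as a stand-in for the V-TCN; layer sizes and the stem configuration are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class VisualFrontEnd(nn.Module):
    """Sketch of a lip-ROI encoder: 3D conv stem + 2D ResNet-18 trunk + temporal convs."""

    def __init__(self, emb_dim=512, n_tcn_blocks=5):
        super().__init__()
        # 3D conv stem captures short-range lip motion from grayscale mouth crops.
        self.stem = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3), bias=False),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # 2D ResNet-18 trunk applied per frame; its own stem and classifier are bypassed.
        trunk = resnet18(weights=None)
        self.trunk = nn.Sequential(*list(trunk.children())[4:-1])  # layer1..layer4 + avgpool
        # Temporal convolution stack ("V-TCN"-style) smooths per-frame embeddings.
        self.tcn = nn.Sequential(*[
            nn.Sequential(
                nn.Conv1d(emb_dim, emb_dim, kernel_size=3, padding=1, groups=emb_dim),
                nn.Conv1d(emb_dim, emb_dim, kernel_size=1),
                nn.ReLU(inplace=True),
            )
            for _ in range(n_tcn_blocks)
        ])

    def forward(self, frames):
        # frames: (B, T, H, W) grayscale mouth crops.
        b, t, h, w = frames.shape
        x = self.stem(frames.unsqueeze(1))                          # (B, 64, T, H', W')
        x = x.transpose(1, 2).reshape(b * t, 64, x.shape[-2], x.shape[-1])
        x = self.trunk(x).flatten(1)                                # (B*T, 512)
        x = x.reshape(b, t, -1).transpose(1, 2)                     # (B, 512, T)
        return self.tcn(x).transpose(1, 2)                          # (B, T, 512)

# Usage with hypothetical 112x112 mouth crops at 25 fps.
vfe = VisualFrontEnd()
emb = vfe(torch.randn(2, 25, 112, 112))   # (2, 25, 512)
```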
Fusion and Mask Estimation
- Cross-Modal Fusion: Modalities are combined via simple concatenation, learned weighted addition, or cross-modal attention. Advanced methods apply mechanisms such as dual-path RNNs (DPRNN (Pan et al., 2021)), multi-scale Transformer blocks (AV-SepFormer (Lin et al., 2023)), 2D positional encodings, or speaker-cooccurrence attention heads (Pan et al., 27 May 2025).
- Attention-based Modality Weighting: Several works propose reliability-sensitive attention fusion that dynamically weights modalities based on their estimated quality, e.g., normalized attention in (Sato et al., 2021), or context- and confidence-aware modeling (Wu et al., 1 Apr 2025). This provides resilience when visual or audio cues are occluded or degraded (a simplified gating sketch follows this list).
- Noise Suppression and Exclusivity: Reverse selective auditory attention and subtraction branches (SEANet (Tao et al., 29 Apr 2024)) explicitly model and suppress non-target speech and noise by learning mutually exclusive representations.
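A simplified stand-in for attention-based modality weighting (see the bullet above) is sketched below: cross-attention from audio to visual features, gated per frame by a learned reliability score. The gating form is an assumption for illustration, not the normalized-attention scheme of any cited paper.

```python
import torch
import torch.nn as nn

class GatedCrossModalFusion(nn.Module):
    """Simplified reliability-weighted fusion: cross-attention from audio to visual
    features, gated per frame by a learned reliability score (illustrative only)."""

    def __init__(self, dim=256, n_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Per-frame reliability gate estimated from both modalities.
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1), nn.Sigmoid()
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (B, T, D); visual_feats: (B, T, D), already time-aligned.
        attended, _ = self.cross_attn(audio_feats, visual_feats, visual_feats)
        g = self.gate(torch.cat([audio_feats, attended], dim=-1))   # (B, T, 1) in [0, 1]
        # When visual cues look unreliable, g -> 0 and fusion falls back to audio alone.
        return self.norm(audio_feats + g * attended)

fusion = GatedCrossModalFusion()
out = fusion(torch.randn(2, 100, 256), torch.randn(2, 100, 256))   # (2, 100, 256)
```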
Memory and Momentum Mechanisms
- Temporal Memory for Robustness: MeMo (Li et al., 21 Jul 2025) and related “momentum” designs introduce explicit external memory banks (speaker and contextual embeddings) that allow the model to keep attending to the target, preserving performance when visual cues are missing for an extended period (sketched below).
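A toy sketch of such an external memory is shown below, assuming a fixed-size bank of speaker embeddings queried by scaled dot-product attention; the slot count, momentum update, and write policy are illustrative assumptions, not the MeMo design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerMemoryBank(nn.Module):
    """Toy external memory: stores recent target-speaker embeddings and retrieves
    an attention-weighted summary when visual cues drop out (illustrative only)."""

    def __init__(self, dim=256, slots=8, momentum=0.9):
        super().__init__()
        self.momentum = momentum
        self.register_buffer("bank", torch.zeros(slots, dim))
        self.register_buffer("ptr", torch.zeros(1, dtype=torch.long))
        self.query_proj = nn.Linear(dim, dim)

    @torch.no_grad()
    def write(self, emb):
        # emb: (D,) a confident target embedding; exponential-moving-average update.
        i = int(self.ptr.item())
        self.bank[i] = self.momentum * self.bank[i] + (1 - self.momentum) * emb
        self.ptr[0] = (i + 1) % self.bank.shape[0]

    def read(self, query):
        # query: (B, D) current (possibly unreliable) target representation.
        q = self.query_proj(query)                                        # (B, D)
        attn = F.softmax(q @ self.bank.t() / q.shape[-1] ** 0.5, dim=-1)  # (B, slots)
        return attn @ self.bank                                           # (B, D) retrieved cue

mem = SpeakerMemoryBank()
mem.write(torch.randn(256))
cue = mem.read(torch.randn(2, 256))   # (2, 256)
```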
Table 1. Key Network Components in Recent AV-TSE systems
| Reference | Audio Encoder | Visual Encoder | Fusion Mechanism |
|---|---|---|---|
| (Pan et al., 2021) | Conv1D (N=256) | 3DConv+ResNet18+V-TCN | DPRNN mask estimator |
| (Lin et al., 2023) | Conv1D, chunked | Frozen lipnet + TCN | Dual-scale Transformer |
| (Wu et al., 1 Apr 2025) | Backbone-agnostic | Backbone-agnostic | MAR, FCS; Adapter blocks |
| (Li et al., 21 Jul 2025) | Conv1D | Visual, Speaker, Context banks | Momentum memory retrieval |
| (Tao et al., 29 Apr 2024) | Conv1D (256) | 3DConv+ResNet18+V-TCN | Dual-path RNN + subtraction |
3. Exploitation of Context, Synchronization, and Linguistic Knowledge
Early AV-TSE approaches relied mainly on local synchronization between lip movement and speech energy, limiting their ability to “fill in” missing cues and resolve ambiguities. Contemporary models incorporate broader contextual and linguistic information to enhance generalization and intelligibility:
- Contextual Mask-and-Recover (MAR): The MAR framework (Wu et al., 24 Mar 2024, Wu et al., 1 Apr 2025) randomly masks contiguous segments of the input or latent embeddings; the system is then trained to recover the masked frames by leveraging both intra-modality (long-range speech context) and inter-modality (visual) cues. This strategy forces the separator network to draw on global information, making extraction more robust in visually or acoustically adverse segments (a masking sketch follows this list).
- Fine-grained Confidence Score (FCS): Confidence prediction (Wu et al., 1 Apr 2025) identifies locally low-quality segments (e.g., high leakage or suppression error), guiding an auxiliary loss to focus learning on these most challenging regions.
- Linguistic Constraints from Pre-trained LMs: Multiple systems (ELEGANCE (Wu et al., 9 Nov 2025, Wu et al., 11 Jun 2025)) train with auxiliary losses that enforce consistency between the separated speech and high-level linguistic or semantic representations (e.g., RoBERTa or WavLM embeddings), or even next-token prediction. This knowledge transfer is applied only during training and is discarded at inference, yielding gains without extra runtime cost.
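As referenced in the MAR bullet above, a minimal sketch of the mask-and-recover input corruption is shown below; the span length, masking ratio, and learned mask token are illustrative assumptions rather than the exact MAR configuration.

```python
import torch
import torch.nn as nn

def mask_contiguous_segments(feats, mask_token, span=20, ratio=0.2):
    """Randomly replace contiguous spans of latent frames with a learned mask token;
    returns masked features and a boolean recovery mask (illustrative only)."""
    b, t, d = feats.shape
    recover = torch.zeros(b, t, dtype=torch.bool, device=feats.device)
    n_spans = max(1, int(ratio * t / span))
    for i in range(b):
        starts = torch.randint(0, max(1, t - span), (n_spans,))
        for s in starts.tolist():
            recover[i, s:s + span] = True
    masked = torch.where(recover.unsqueeze(-1), mask_token.expand_as(feats), feats)
    return masked, recover

# Training-time usage: the separator must reconstruct masked frames from context + visual cues.
mask_token = nn.Parameter(torch.zeros(256))
feats = torch.randn(2, 100, 256)
masked, recover = mask_contiguous_segments(feats, mask_token)
# recovery_loss = F.mse_loss(separator_output[recover], clean_latents[recover])  # sketch only
```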
Table 2. Auxiliary Information and Training-only Guidance Strategies
| Reference | Knowledge Base | Strategy | Impact |
|---|---|---|---|
| (Wu et al., 9 Nov 2025) | RoBERTa, Qwen | Output/Intermediate/Input | +0.6–1.1 dB SI-SDR, OOD/language robust |
| (Wu et al., 11 Jun 2025) | RoBERTa, WavLM, HuBERT | PSLM/PLM Embedding Loss | +1.0–1.4 dB SI-SDRi, cross-domain gains |
| (Wu et al., 24 Mar 2024) | AV-HuBERT | MAR block | +0.6 dB SI-SDR, tighter AV synchronization |
The integration of deep, context-aware linguistic priors addresses scenarios in which visual cues are absent, occluded, or unreliable, and supports cross-lingual, speaker-switching, and multi-talker challenges.
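As a concrete illustration of such training-only guidance, the sketch below shows an auxiliary embedding-consistency loss; the frozen `speech_lm` encoder (e.g., a WavLM-style model) and the cosine-distance formulation are assumptions for illustration, and the term is simply dropped at inference.

```python
import torch
import torch.nn.functional as F

def linguistic_consistency_loss(est_wave, ref_wave, speech_lm):
    """Training-only auxiliary loss: pull the separated speech toward the clean
    reference in the embedding space of a frozen speech/language model."""
    with torch.no_grad():
        ref_emb = speech_lm(ref_wave)        # (B, T', D) teacher embeddings, no gradient
    # speech_lm parameters are assumed frozen; gradients reach the separator via est_wave.
    est_emb = speech_lm(est_wave)
    return 1.0 - F.cosine_similarity(est_emb, ref_emb, dim=-1).mean()

# total_loss = si_sdr_loss + lambda_ling * linguistic_consistency_loss(est, ref, frozen_lm)
```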
4. Real-World Robustness: Visual Degradation, Co-occurrence, and Edge Constraints
A core challenge for AV-TSE is operating reliably in unconstrained, real-world conditions, including visual impairments, multiple simultaneous faces, far-field/reverberant environments, and computational/resource limits:
- Visual Impairment and Degradation: Models such as USEV with differentiated loss (Pan et al., 2021), AV-SepFormer (Lin et al., 2023), and MAR-based variants (Wu et al., 1 Apr 2025) demonstrate moderate robustness to occluded or missing visemes, low-resolution frames, and even long segments of video loss—maintaining competitive SI-SDR as impairment ratio rises.
- Co-occurring Faces and Activity Cues: The Inter-Speaker Attention Module (ISAM) (Pan et al., 27 May 2025) processes any number of co-occurring faces, dynamically weighing their embeddings via self-attention to prevent confusion during speaker overlap and improve extraction accuracy (+1–2 dB SI-SNRi in multi-face settings); a simplified sketch follows this list.
- Edge Deployment and Causal Processing: Two-stage cascades (Li et al., 28 May 2025) decouple visually-guided voice activity detection (VVAD) from actual separation, yielding ultra-compact pipelines with GMac/s-scale compute, small parameter counts, and roughly 3 ms latency per frame, suitable for real-time deployment on mobile/edge hardware.
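A simplified sketch of inter-face attention in the spirit of ISAM (referenced in the bullet above) is given below; the per-frame attention layout and target-stream selection are illustrative assumptions rather than the published module.

```python
import torch
import torch.nn as nn

class InterFaceAttention(nn.Module):
    """Self-attention over co-occurring face/lip embeddings (simplified sketch)."""

    def __init__(self, dim=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, face_embs, target_idx):
        # face_embs: (B, F, T, D) embeddings of F co-occurring faces; target_idx: (B,)
        b, f, t, d = face_embs.shape
        x = face_embs.permute(0, 2, 1, 3).reshape(b * t, f, d)   # attend across faces per frame
        x, _ = self.attn(x, x, x)
        x = self.norm(x).reshape(b, t, f, d)
        # Keep the contextualized target-face stream for downstream fusion.
        return x[torch.arange(b), :, target_idx]                  # (B, T, D)

isam = InterFaceAttention()
out = isam(torch.randn(2, 3, 100, 256), torch.tensor([0, 2]))     # (2, 100, 256)
```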
Table 3. Robustness Mechanisms and Practical Extensions
| Reference | Condition | Mechanism | Empirical Result |
|---|---|---|---|
| (Li et al., 21 Jul 2025) | Vis. missing | Momentum memory bank | +2 dB SI-SNR, 0.5 dB loss on occlusion |
| (Pan et al., 27 May 2025) | Multiple faces | ISAM | +1.5 dB SI-SNRi (multi-face) |
| (Li et al., 28 May 2025) | Edge device | Two-stage, VVAD | 7.1 dB SI-SNR gain at GMac/s-scale compute |
5. Training Objectives, Dataset Protocols, and Quantitative Benchmarks
Multi-objective loss functions balance raw signal fidelity, energy suppression in inactive segments, and auxiliary priors (e.g., scenario-aware losses (Pan et al., 2021), MAR-recovery (Wu et al., 24 Mar 2024), linguistic constraints (Wu et al., 11 Jun 2025)). Common component objectives include (i) SI-SDR (scale-invariant SDR), (ii) frame/segment energy penalty in target-absent regions, (iii) auxiliary cross-entropy or MSE to enforce alignment in semantic-linguistic embedding space.
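A compact sketch of two of these component objectives is shown below: negative SI-SDR on target-present utterances plus an energy penalty on target-absent ones. The per-utterance weighting is a deliberate simplification of the scenario-aware losses cited above.

```python
import torch

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant SDR in dB for (B, T) waveforms."""
    ref = ref - ref.mean(dim=-1, keepdim=True)
    est = est - est.mean(dim=-1, keepdim=True)
    proj = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps) * ref
    noise = est - proj
    return 10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)

def tse_loss(est, ref, target_present, alpha=0.1):
    """Negative SI-SDR where the target speaks, energy penalty where it is silent.
    target_present: (B,) boolean flags per utterance (real systems weight segments
    within each utterance)."""
    loss_sig = -si_sdr(est[target_present], ref[target_present]).mean() if target_present.any() else 0.0
    absent = ~target_present
    loss_energy = est[absent].pow(2).mean() if absent.any() else 0.0
    return loss_sig + alpha * loss_energy

loss = tse_loss(torch.randn(4, 16000), torch.randn(4, 16000),
                torch.tensor([True, True, True, False]))
```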
Key datasets for benchmark evaluation include:
- VoxCeleb2-mix: Highly overlapped, unconstrained, hundreds of speakers (Lin et al., 2023, Li et al., 21 Jul 2025).
- IEMOCAP-mix: General mixtures across all overlap ratios, used for scenario-aware performance breakdown (Pan et al., 2021).
- MISP 2023: Realistic multi-channel, far-field, multi-room, and TV noise (Wu et al., 2023).
- LRS2, LRS3, TCD-TIMIT, Grid: Cross-domain and linguistically diverse evaluation.
Notable performance results:
- AV-SepFormer: 12.13 dB SI-SDR within-domain, up to 13.8 dB SI-SDR cross-domain (Lin et al., 2023).
- USEV (D): 13.3 dB SI-SDR on IEMOCAP-mix, strong suppression of false extractions (Pan et al., 2021).
- AVHuMAR-TSE: 12.3 dB SI-SDR, outperforming prior models by 0.9–1 dB (Wu et al., 24 Mar 2024).
- C²AV-TSE: Gains of up to 1.8 dB in SI-SDR in under-performing regimes via MAR and FCS (Wu et al., 1 Apr 2025).
- SEANet: 13.1 dB SI-SDR and 0.5–1.0 dB gain over strong AV-SepFormer baselines, robust across five datasets (Tao et al., 29 Apr 2024).
6. Limitations, Open Challenges, and Directions for Future Research
Despite consistent progress, substantial open research challenges remain:
- Generalization—Visual and Acoustic Mismatch: Most approaches are still evaluated on curated, English-centric, or well-posed datasets, and show varying performance drop-offs on spontaneous speech, reverberant/noisy/far-field audio, and severe visual occlusions.
- Multi-Talker/Scene Complexity: Handling more than two speakers, dynamic speaker-switching, rapid face-tracking, and non-verbal/TV background interference pose ongoing difficulties.
- Linguistic Adaptation: Training-time only linguistic guidance has advanced cross-domain and cross-language robustness, but methods require transcript availability, and may not generalize to code-switching, low-resource, or unseen languages without further adaptation (Wu et al., 9 Nov 2025, Wu et al., 11 Jun 2025).
- Efficiency—Training and Inference Cost: Top-performing systems often employ large, deep backbones and/or frozen LMs, raising barriers for large-scale or real-time deployment.
- Evaluation Metrics: Traditional SI-SDR and PESQ are partially predictive of downstream ASR performance and perceptual quality, but do not fully capture intelligibility or semantic correctness in adverse conditions.
Active research directions focus on (a) dynamic or learnable context reasoning, (b) parameter- and computation-efficient architectures (e.g., Mamba-based, adapter tuning), (c) explicit modeling of speaker turn-taking, co-occurrence, and scene context, and (d) leveraging unlabeled video via self-supervision and cross-modal consistency.
7. Summary Table: Representative AV-TSE Frameworks
| Framework | Feature Modality | Fusion/Extractor | Key Robustness Mechanism | Task-specific Innovations |
|---|---|---|---|---|
| USEV (Pan et al., 2021) | Lip, waveform | Conv1D+V-TCN+DPRNN | Scenario-aware diff. loss | Handles all overlap, target-absent |
| AV-SepFormer (Lin et al., 2023) | Lip, waveform | Dual-scale Transformer | 2D PE, cross-modal attn | Time-sync alignment, chunking |
| ELEGANCE (Wu et al., 9 Nov 2025) | Lip, LM embeddings | Any backbone | LLM guidance (training only) | Output/intermediate/input linguistic transfer |
| MeMo (Li et al., 21 Jul 2025) | Lip, context mem. | Any streaming backbone | Attentional momentum memory | Streaming visual-impaired robustness |
| C²AV-TSE (Wu et al., 1 Apr 2025) | Any backbone | Adapter + MAR, FCS | Context recov., confidence loss | Error mining, plug-and-play fine-tuning |
| Plug&Play Face-Attn (Pan et al., 27 May 2025) | Multi-face lip | ISAM in AV-DPRNN/TFGridNet | Inter-face attention | Robust to complex scene with multiple faces |
| SEANet (Tao et al., 29 Apr 2024) | Lip, waveform | Dual-path RNN, subtraction | Reverse attention, noise branch | Explicit noise suppression by exclusivity |
The AV-TSE field continues to evolve rapidly, shaped by advances in multi-modal pre-training, context-aware modeling, memory and attention mechanisms, and task-driven robustness objectives. Benchmarks and deployment scenarios increasingly emphasize generalization to realistic, dynamic conversational environments and computationally constrained edge systems.