Audio-Visual Target Speaker Extraction
- Audio-Visual Target Speaker Extraction (AV-TSE) is defined as isolating a target speaker’s voice from multi-speaker audio by leveraging synchronized visual cues, most commonly lip movements, together with auxiliary information.
- AV-TSE systems integrate deep audio and visual encoders with advanced fusion and masking techniques to optimize signal fidelity and intelligibility across varied acoustic environments.
- Recent approaches employ cross-modal attention, memory mechanisms, and linguistic guidance to robustly address issues like occluded visuals, multiple speakers, and dynamic noise conditions.
Audio-Visual Target Speaker Extraction (AV-TSE) is the problem of isolating a specific speaker’s voice from a multi-speaker audio mixture by leveraging time-aligned visual cues—most commonly, the target speaker’s lip movements—and, increasingly, other auxiliary information (linguistic, contextual, or scene-level). AV-TSE forms a foundational module in robust speech separation, audio-visual speech recognition, hearing assistive devices, and biometrics. State-of-the-art AV-TSE systems typically integrate deep speech and visual feature encoders with sophisticated multi-modal fusion and separation architectures, optimized for both signal fidelity and intelligibility across a range of practical acoustic and visual environments.
1. Formal Problem Definition and Task Taxonomy
The AV-TSE task is formally posed as follows: given a single- or multi-channel mixture waveform containing the target speech plus interfering speakers and possibly non-speech noise, together with synchronized video of the target speaker (usually cropped to the mouth region), the objective is to produce an estimate of the target signal that closely matches the clean reference.
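Under the common single-channel formulation this can be written as follows; the notation is illustrative, chosen to match the description above rather than any single cited paper:

$$
y(t) = s(t) + \sum_{k=1}^{K} b_k(t) + n(t), \qquad \hat{s}(t) = f_{\theta}\big(y(t), V\big),
$$

where $s(t)$ is the target speech, $b_k(t)$ are the interfering speakers, $n(t)$ is non-speech noise, $V$ is the target speaker’s lip-region video stream, and $f_{\theta}$ is the extraction network trained so that $\hat{s}(t) \approx s(t)$.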
The dominant system architecture follows a multi-stage pipeline (a minimal code sketch follows the list):
- Audio Encoder: maps the raw mixture waveform to a frame-level latent representation via 1D convolution (e.g., 256 filters with a short kernel and stride).
- Visual Encoder: extracts frame-level lip embeddings using a pre-trained spatio-temporal CNN with temporal adaptation (e.g., 3D-Conv + ResNet-18 + a stack of visual temporal convolution blocks).
- Cross-Modal Fusion and Separation: concatenation, cross-attention, or more elaborate fusion injects the visual embeddings into the masking/decoder network (TCN, DPRNN, Transformer, etc.), which estimates a separation mask over the encoded mixture.
- Waveform Reconstruction: applying the estimated mask to the encoded mixture and decoding (e.g., with a transposed 1D convolution) reconstructs the target waveform.
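A minimal PyTorch sketch of this mask-based pipeline is given below. The module sizes, the simple concatenation fusion, and the nearest-neighbor upsampling of visual features to the audio frame rate are illustrative assumptions, not a specific published architecture; a real system would plug in a pre-trained visual front-end and a TCN/DPRNN/Transformer separator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVTSEPipeline(nn.Module):
    """Illustrative mask-based AV-TSE pipeline (sizes are assumptions)."""

    def __init__(self, n_filters=256, kernel=16, stride=8, visual_dim=512):
        super().__init__()
        # Audio encoder: 1D conv turns the waveform into latent frames.
        self.audio_enc = nn.Conv1d(1, n_filters, kernel, stride=stride, bias=False)
        # Projection for visual embeddings (a 3D-Conv + ResNet-18 front-end in practice).
        self.visual_proj = nn.Linear(visual_dim, n_filters)
        # Separator: stands in for a TCN / DPRNN / Transformer mask estimator.
        self.separator = nn.Sequential(
            nn.Conv1d(2 * n_filters, n_filters, 1),
            nn.PReLU(),
            nn.Conv1d(n_filters, n_filters, 1),
        )
        # Audio decoder: transposed conv maps masked latents back to a waveform.
        self.audio_dec = nn.ConvTranspose1d(n_filters, 1, kernel, stride=stride, bias=False)

    def forward(self, mixture, visual_emb):
        # mixture: (B, T_samples); visual_emb: (B, T_video, visual_dim)
        a = self.audio_enc(mixture.unsqueeze(1))              # (B, N, T_frames)
        v = self.visual_proj(visual_emb).transpose(1, 2)      # (B, N, T_video)
        # Upsample visual features to the audio frame rate before fusion.
        v = F.interpolate(v, size=a.shape[-1], mode="nearest")
        fused = torch.cat([a, v], dim=1)                      # simple concatenation fusion
        mask = torch.sigmoid(self.separator(fused))           # (B, N, T_frames)
        est = self.audio_dec(a * mask)                        # (B, 1, ~T_samples)
        return est.squeeze(1)

# Usage: 1 s of 16 kHz audio and 25 fps visual embeddings (hypothetical shapes).
model = AVTSEPipeline()
est = model(torch.randn(2, 16000), torch.randn(2, 25, 512))
```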
A crucial refinement is the identification and handling of heterogeneous mixture segments: (a) target-present (fully or partially overlapped), (b) target-absent (silent), and (c) dynamic temporal “regimes” (e.g., QQ, SQ, SS, QS in (Pan et al., 2021)). State-of-the-art frameworks must address all such regimes.
2. Architectural Innovations and Fusion Strategies
Advanced AV-TSE architectures integrate a number of specialized modules designed to exploit and compensate for the unique characteristics of audio and visual modalities:
Audio and Visual Encoders
- Audio Encoder: Typically a shallow or moderate-depth Conv1D pipeline (kernel sizes in the 16–40 sample range, commonly around 256 channels). Some approaches utilize time-frequency STFT encoders for frequency-domain modeling (Li et al., 28 May 2025).
- Visual Encoder: The standard choice is a pre-trained 3D-Conv + ResNet-18 trunk for lip ROI embedding, with further adaptation via a V-TCN stack or other temporal smoothing/upsampling to match audio frame granularity (Pan et al., 2021, Lin et al., 2023, Wu et al., 24 Mar 2024). These are usually frozen after initial training on a large lip-reading task (a schematic sketch follows this list).
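A schematic sketch of such a visual front-end is shown below, assuming a torchvision ResNet-18 trunk and a depthwise temporal-convolution stack as a stand-in for the V-TCN; layer sizes and the stem configuration are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class VisualFrontEnd(nn.Module):
    """Sketch of a lip-ROI encoder: 3D conv stem + 2D ResNet-18 trunk + temporal convs."""

    def __init__(self, emb_dim=512, n_tcn_blocks=5):
        super().__init__()
        # 3D conv stem captures short-range lip motion from grayscale mouth crops.
        self.stem = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3), bias=False),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # 2D ResNet-18 trunk applied per frame; its own stem and classifier are bypassed.
        trunk = resnet18(weights=None)
        self.trunk = nn.Sequential(*list(trunk.children())[4:-1])  # layer1..layer4 + avgpool
        # Temporal convolution stack ("V-TCN"-style) smooths per-frame embeddings.
        self.tcn = nn.Sequential(*[
            nn.Sequential(
                nn.Conv1d(emb_dim, emb_dim, kernel_size=3, padding=1, groups=emb_dim),
                nn.Conv1d(emb_dim, emb_dim, kernel_size=1),
                nn.ReLU(inplace=True),
            )
            for _ in range(n_tcn_blocks)
        ])

    def forward(self, frames):
        # frames: (B, T, H, W) grayscale mouth crops.
        b, t, h, w = frames.shape
        x = self.stem(frames.unsqueeze(1))                          # (B, 64, T, H', W')
        x = x.transpose(1, 2).reshape(b * t, 64, x.shape[-2], x.shape[-1])
        x = self.trunk(x).flatten(1)                                # (B*T, 512)
        x = x.reshape(b, t, -1).transpose(1, 2)                     # (B, 512, T)
        return self.tcn(x).transpose(1, 2)                          # (B, T, 512)

# Usage with hypothetical 112x112 mouth crops at 25 fps.
vfe = VisualFrontEnd()
emb = vfe(torch.randn(2, 25, 112, 112))   # (2, 25, 512)
```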
Fusion and Mask Estimation
- Cross-Modal Fusion: Modalities are combined via simple concatenation, learned weighted addition, or cross-modal attention. Advanced methods apply mechanisms such as dual-path RNNs (DPRNN (Pan et al., 2021)), multi-scale Transformer blocks (AV-SepFormer (Lin et al., 2023)), 2D positional encodings, or speaker-cooccurrence attention heads (Pan et al., 27 May 2025).
- Attention-based Modality Weighting: Several works propose reliability-sensitive attention fusion that dynamically weights modalities based on their estimated quality, e.g., normalized attention in (Sato et al., 2021), or context- and confidence-aware modeling (Wu et al., 1 Apr 2025). This provides resilience when visual or audio cues are occluded or degraded (a simplified gating sketch follows this list).
- Noise Suppression and Exclusivity: Reverse selective auditory attention and subtraction branches (SEANet (Tao et al., 29 Apr 2024)) explicitly model and suppress non-target speech and noise by learning mutually exclusive representations.
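A simplified stand-in for attention-based modality weighting (see the bullet above) is sketched below: cross-attention from audio to visual features, gated per frame by a learned reliability score. The gating form is an assumption for illustration, not the normalized-attention scheme of any cited paper.

```python
import torch
import torch.nn as nn

class GatedCrossModalFusion(nn.Module):
    """Simplified reliability-weighted fusion: cross-attention from audio to visual
    features, gated per frame by a learned reliability score (illustrative only)."""

    def __init__(self, dim=256, n_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Per-frame reliability gate estimated from both modalities.
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1), nn.Sigmoid()
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (B, T, D); visual_feats: (B, T, D), already time-aligned.
        attended, _ = self.cross_attn(audio_feats, visual_feats, visual_feats)
        g = self.gate(torch.cat([audio_feats, attended], dim=-1))   # (B, T, 1) in [0, 1]
        # When visual cues look unreliable, g -> 0 and fusion falls back to audio alone.
        return self.norm(audio_feats + g * attended)

fusion = GatedCrossModalFusion()
out = fusion(torch.randn(2, 100, 256), torch.randn(2, 100, 256))   # (2, 100, 256)
```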
Memory and Momentum Mechanisms
- Temporal Memory for Robustness: MeMo (Li et al., 21 Jul 2025) and related “momentum” designs introduce explicit external memory banks (speaker and contextual embeddings) that allow the model to keep attending to the target, preserving performance when visual cues are missing for an extended period (sketched below).
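A toy sketch of such an external memory is shown below, assuming a fixed-size bank of speaker embeddings queried by scaled dot-product attention; the slot count, momentum update, and write policy are illustrative assumptions, not the MeMo design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerMemoryBank(nn.Module):
    """Toy external memory: stores recent target-speaker embeddings and retrieves
    an attention-weighted summary when visual cues drop out (illustrative only)."""

    def __init__(self, dim=256, slots=8, momentum=0.9):
        super().__init__()
        self.momentum = momentum
        self.register_buffer("bank", torch.zeros(slots, dim))
        self.register_buffer("ptr", torch.zeros(1, dtype=torch.long))
        self.query_proj = nn.Linear(dim, dim)

    @torch.no_grad()
    def write(self, emb):
        # emb: (D,) a confident target embedding; exponential-moving-average update.
        i = int(self.ptr.item())
        self.bank[i] = self.momentum * self.bank[i] + (1 - self.momentum) * emb
        self.ptr[0] = (i + 1) % self.bank.shape[0]

    def read(self, query):
        # query: (B, D) current (possibly unreliable) target representation.
        q = self.query_proj(query)                                        # (B, D)
        attn = F.softmax(q @ self.bank.t() / q.shape[-1] ** 0.5, dim=-1)  # (B, slots)
        return attn @ self.bank                                           # (B, D) retrieved cue

mem = SpeakerMemoryBank()
mem.write(torch.randn(256))
cue = mem.read(torch.randn(2, 256))   # (2, 256)
```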
Table 1. Key Network Components in Recent AV-TSE systems
| Reference | Audio Encoder | Visual Encoder | Fusion Mechanism |
|---|---|---|---|
| (Pan et al., 2021) | Conv1D (N=256) | 3DConv+ResNet18+V-TCN | DPRNN mask estimator |
| (Lin et al., 2023) | Conv1D, chunked | Frozen lipnet + TCN | Dual-scale Transformer |
| (Wu et al., 1 Apr 2025) | Backbone-agnostic | Backbone-agnostic | MAR, FCS; Adapter blocks |
| (Li et al., 21 Jul 2025) | Conv1D | Visual, Speaker, Context banks | Momentum memory retrieval |
| (Tao et al., 29 Apr 2024) | Conv1D (256) | 3DConv+ResNet18+V-TCN | Dual-path RNN + subtraction |
3. Exploitation of Context, Synchronization, and Linguistic Knowledge
Early AV-TSE approaches relied mainly on local synchronization between lip movement and speech energy, limiting their ability to “fill in” missing cues and resolve ambiguities. Contemporary models incorporate broader contextual and linguistic information to enhance generalization and intelligibility:
- Contextual Mask-and-Recover (MAR): The MAR framework (Wu et al., 24 Mar 2024, Wu et al., 1 Apr 2025) randomly masks contiguous segments of the input or latent embeddings; the system is then trained to recover the masked frames by leveraging both intra-modality (long-range speech context) and inter-modality (visual) cues. This strategy forces the separator network to draw on global information, making extraction more robust in visually or acoustically adverse segments (a masking sketch follows this list).
- Fine-grained Confidence Score (FCS): Confidence prediction (Wu et al., 1 Apr 2025) identifies locally low-quality segments (e.g., high leakage or suppression error), guiding an auxiliary loss to focus learning on these most challenging regions.
- Linguistic Constraints from Pre-trained LMs: Multiple systems (ELEGANCE (Wu et al., 9 Nov 2025, Wu et al., 11 Jun 2025)) train with auxiliary losses that enforce consistency between the separated speech and high-level linguistic or semantic representations (e.g., RoBERTa or WavLM embeddings), or even next-token prediction. This knowledge transfer is applied only during training and is discarded at inference, yielding gains without extra runtime cost.
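As referenced in the MAR bullet above, a minimal sketch of the mask-and-recover input corruption is shown below; the span length, masking ratio, and learned mask token are illustrative assumptions rather than the exact MAR configuration.

```python
import torch
import torch.nn as nn

def mask_contiguous_segments(feats, mask_token, span=20, ratio=0.2):
    """Randomly replace contiguous spans of latent frames with a learned mask token;
    returns masked features and a boolean recovery mask (illustrative only)."""
    b, t, d = feats.shape
    recover = torch.zeros(b, t, dtype=torch.bool, device=feats.device)
    n_spans = max(1, int(ratio * t / span))
    for i in range(b):
        starts = torch.randint(0, max(1, t - span), (n_spans,))
        for s in starts.tolist():
            recover[i, s:s + span] = True
    masked = torch.where(recover.unsqueeze(-1), mask_token.expand_as(feats), feats)
    return masked, recover

# Training-time usage: the separator must reconstruct masked frames from context + visual cues.
mask_token = nn.Parameter(torch.zeros(256))
feats = torch.randn(2, 100, 256)
masked, recover = mask_contiguous_segments(feats, mask_token)
# recovery_loss = F.mse_loss(separator_output[recover], clean_latents[recover])  # sketch only
```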
Table 2. Auxiliary Information and Training-only Guidance Strategies
| Reference | Knowledge Base | Strategy | Impact |
|---|---|---|---|
| (Wu et al., 9 Nov 2025) | RoBERTa, Qwen | Output/Intermediate/Input | +0.6–1.1 dB SI-SDR, OOD/language robust |
| (Wu et al., 11 Jun 2025) | RoBERTa, WavLM, HuBERT | PSLM/PLM Embedding Loss | +1.0–1.4 dB SI-SDRi, cross-domain gains |
| (Wu et al., 24 Mar 2024) | AV-HuBERT | MAR block | +0.6 dB SI-SDR, tighter AV synchronization |
The integration of deep, context-aware linguistic priors addresses scenarios in which visual cues are absent, occluded, or unreliable, and supports cross-lingual, speaker-switching, and multi-talker challenges.
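As a concrete illustration of such training-only guidance, the sketch below shows an auxiliary embedding-consistency loss; the frozen `speech_lm` encoder (e.g., a WavLM-style model) and the cosine-distance formulation are assumptions for illustration, and the term is simply dropped at inference.

```python
import torch
import torch.nn.functional as F

def linguistic_consistency_loss(est_wave, ref_wave, speech_lm):
    """Training-only auxiliary loss: pull the separated speech toward the clean
    reference in the embedding space of a frozen speech/language model."""
    with torch.no_grad():
        ref_emb = speech_lm(ref_wave)        # (B, T', D) teacher embeddings, no gradient
    # speech_lm parameters are assumed frozen; gradients reach the separator via est_wave.
    est_emb = speech_lm(est_wave)
    return 1.0 - F.cosine_similarity(est_emb, ref_emb, dim=-1).mean()

# total_loss = si_sdr_loss + lambda_ling * linguistic_consistency_loss(est, ref, frozen_lm)
```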
4. Real-World Robustness: Visual Degradation, Co-occurrence, and Edge Constraints
A core challenge for AV-TSE is operating reliably in unconstrained, real-world conditions, including visual impairments, multiple simultaneous faces, far-field/reverberant environments, and computational/resource limits:
- Visual Impairment and Degradation: Models such as USEV with differentiated loss (Pan et al., 2021), AV-SepFormer (Lin et al., 2023), and MAR-based variants (Wu et al., 1 Apr 2025) demonstrate moderate robustness to occluded or missing visemes, low-resolution frames, and even long segments of video loss—maintaining competitive SI-SDR as impairment ratio rises.
- Co-occurring Faces and Activity Cues: The Inter-Speaker Attention Module (ISAM) (Pan et al., 27 May 2025) processes any number of co-occurring faces, dynamically weighing their embeddings via self-attention to prevent confusion during speaker overlap and improve extraction accuracy (+1–2 dB SI-SNRi in multi-face settings); a simplified sketch follows this list.
- Edge Deployment and Causal Processing: Two-stage cascades (Li et al., 28 May 2025) decouple visually-guided voice activity detection (VVAD) from actual separation, yielding ultra-compact pipelines with GMac/s-scale compute, small parameter counts, and roughly 3 ms latency per frame, suitable for real-time deployment on mobile/edge hardware.
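A simplified sketch of inter-face attention in the spirit of ISAM (referenced in the bullet above) is given below; the per-frame attention layout and target-stream selection are illustrative assumptions rather than the published module.

```python
import torch
import torch.nn as nn

class InterFaceAttention(nn.Module):
    """Self-attention over co-occurring face/lip embeddings (simplified sketch)."""

    def __init__(self, dim=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, face_embs, target_idx):
        # face_embs: (B, F, T, D) embeddings of F co-occurring faces; target_idx: (B,)
        b, f, t, d = face_embs.shape
        x = face_embs.permute(0, 2, 1, 3).reshape(b * t, f, d)   # attend across faces per frame
        x, _ = self.attn(x, x, x)
        x = self.norm(x).reshape(b, t, f, d)
        # Keep the contextualized target-face stream for downstream fusion.
        return x[torch.arange(b), :, target_idx]                  # (B, T, D)

isam = InterFaceAttention()
out = isam(torch.randn(2, 3, 100, 256), torch.tensor([0, 2]))     # (2, 100, 256)
```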
Table 3. Robustness Mechanisms and Practical Extensions
| Reference | Condition | Mechanism | Empirical Result |
|---|---|---|---|
| (Li et al., 21 Jul 2025) | Vis. missing | Momentum memory bank | +2 dB SI-SNR, 0.5 dB loss on occlusion |
| (Pan et al., 27 May 2025) | Multiple faces | ISAM | +1.5 dB SI-SNRi (multi-face) |
| (Li et al., 28 May 2025) | Edge device | Two-stage, VVAD | 7.1 dB SI-SNR gain at GMac/s-scale compute |
5. Training Objectives, Dataset Protocols, and Quantitative Benchmarks
Multi-objective loss functions balance raw signal fidelity, energy suppression in inactive segments, and auxiliary priors (e.g., scenario-aware losses (Pan et al., 2021), MAR-recovery (Wu et al., 24 Mar 2024), linguistic constraints (Wu et al., 11 Jun 2025)). Common component objectives include (i) SI-SDR (scale-invariant SDR), (ii) frame/segment energy penalty in target-absent regions, (iii) auxiliary cross-entropy or MSE to enforce alignment in semantic-linguistic embedding space.
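A compact sketch of two of these component objectives is shown below: negative SI-SDR on target-present utterances plus an energy penalty on target-absent ones. The per-utterance weighting is a deliberate simplification of the scenario-aware losses cited above.

```python
import torch

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant SDR in dB for (B, T) waveforms."""
    ref = ref - ref.mean(dim=-1, keepdim=True)
    est = est - est.mean(dim=-1, keepdim=True)
    proj = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps) * ref
    noise = est - proj
    return 10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)

def tse_loss(est, ref, target_present, alpha=0.1):
    """Negative SI-SDR where the target speaks, energy penalty where it is silent.
    target_present: (B,) boolean flags per utterance (real systems weight segments
    within each utterance)."""
    loss_sig = -si_sdr(est[target_present], ref[target_present]).mean() if target_present.any() else 0.0
    absent = ~target_present
    loss_energy = est[absent].pow(2).mean() if absent.any() else 0.0
    return loss_sig + alpha * loss_energy

loss = tse_loss(torch.randn(4, 16000), torch.randn(4, 16000),
                torch.tensor([True, True, True, False]))
```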
Key datasets for benchmark evaluation include:
- VoxCeleb2-mix: Highly overlapped, unconstrained, hundreds of speakers (Lin et al., 2023, Li et al., 21 Jul 2025).
- IEMOCAP-mix: General mixtures across all overlap ratios, used for scenario-aware performance breakdown (Pan et al., 2021).
- MISP 2023: Realistic multi-channel, far-field, multi-room, and TV noise (Wu et al., 2023).
- LRS2, LRS3, TCD-TIMIT, Grid: Cross-domain and linguistically diverse evaluation.
Notable performance results:
- AV-SepFormer: 12.13 dB SI-SDR within-domain, up to 13.8 dB SI-SDR cross-domain (Lin et al., 2023).
- USEV (D): 13.3 dB SI-SDR on IEMOCAP-mix, strong suppression of false extractions (Pan et al., 2021).
- AVHuMAR-TSE: 12.3 dB SI-SDR, outperforming prior models by 0.9–1 dB (Wu et al., 24 Mar 2024).
- C²AV-TSE: Gains of up to 1.8 dB in SI-SDR in under-performing regimes via MAR and FCS (Wu et al., 1 Apr 2025).
- SEANet: 13.1 dB SI-SDR and 0.5–1.0 dB gain over strong AV-SepFormer baselines, robust across five datasets (Tao et al., 29 Apr 2024).
6. Limitations, Open Challenges, and Directions for Future Research
Despite consistent progress, substantial open research challenges remain:
- Generalization—Visual and Acoustic Mismatch: Most approaches are still evaluated on curated, English-centric, or well-posed datasets, and show varying performance drop-offs on spontaneous speech, reverberant/noisy/far-field audio, and severe visual occlusions.
- Multi-Talker/Scene Complexity: Handling more than two speakers, dynamic speaker-switching, rapid face-tracking, and non-verbal/TV background interference pose ongoing difficulties.
- Linguistic Adaptation: Training-time only linguistic guidance has advanced cross-domain and cross-language robustness, but methods require transcript availability, and may not generalize to code-switching, low-resource, or unseen languages without further adaptation (Wu et al., 9 Nov 2025, Wu et al., 11 Jun 2025).
- Efficiency—Training and Inference Cost: Top-performing systems often employ large, deep backbones and/or frozen LMs, raising barriers for large-scale or real-time deployment.
- Evaluation Metrics: Traditional SI-SDR and PESQ are partially predictive of downstream ASR performance and perceptual quality, but do not fully capture intelligibility or semantic correctness in adverse conditions.
Active research directions focus on (a) dynamic or learnable context reasoning, (b) parameter- and computation-efficient architectures (e.g., Mamba-based, adapter tuning), (c) explicit modeling of speaker turn-taking, co-occurrence, and scene context, and (d) leveraging unlabeled video via self-supervision and cross-modal consistency.
7. Summary Table: Representative AV-TSE Frameworks
| Framework | Feature Modality | Fusion/Extractor | Key Robustness Mechanism | Task-specific Innovations |
|---|---|---|---|---|
| USEV (Pan et al., 2021) | Lip, waveform | Conv1D+V-TCN+DPRNN | Scenario-aware diff. loss | Handles all overlap, target-absent |
| AV-SepFormer (Lin et al., 2023) | Lip, waveform | Dual-scale Transformer | 2D PE, cross-modal attn | Time-sync alignment, chunking |
| ELEGANCE (Wu et al., 9 Nov 2025) | Lip, LM embeddings | Any backbone | LLM guidance (training only) | Output/intermediate/input linguistic transfer |
| MeMo (Li et al., 21 Jul 2025) | Lip, context mem. | Any streaming backbone | Attentional momentum memory | Streaming visual-impaired robustness |
| C²AV-TSE (Wu et al., 1 Apr 2025) | Any backbone | Adapter + MAR, FCS | Context recov., confidence loss | Error mining, plug-and-play fine-tuning |
| Plug&Play Face-Attn (Pan et al., 27 May 2025) | Multi-face lip | ISAM in AV-DPRNN/TFGridNet | Inter-face attention | Robust to complex scene with multiple faces |
| SEANet (Tao et al., 29 Apr 2024) | Lip, waveform | Dual-path RNN, subtraction | Reverse attention, noise branch | Explicit noise suppression by exclusivity |
The AV-TSE field continues to evolve rapidly, shaped by advances in multi-modal pre-training, context-aware modeling, memory and attention mechanisms, and task-driven robustness objectives. Benchmarks and deployment scenarios increasingly emphasize generalization to realistic, dynamic conversational environments and computationally constrained edge systems.