
Real-Time Audio-Visual Speech Enhancement

Updated 28 September 2025
  • Real-time audio-visual speech enhancement is a technique that isolates a target speaker's voice from noise using synchronized audio signals and lip-region cues under strict latency constraints.
  • Modern systems employ deep neural architectures with causal convolution, LSTM, and attention-based fusion to maintain robust performance even in adverse SNR conditions.
  • Key practical insights include low algorithmic latency (under 120 ms end-to-end, with per-frame inference of roughly 40 ms or less), the use of phase re-synthesis for natural sound quality, and resilience to audio-visual asynchrony and sensor dropouts.

Real-time audio-visual speech enhancement (AVSE) refers to the isolation and restoration of a target speaker’s voice from noisy mixtures using synchronized audio and visual (typically lip region) cues, processed under low-latency computational constraints. Modern AVSE systems leverage deep neural architectures to exploit the complementary strengths of both modalities, achieving robust speech intelligibility and quality even for previously unseen speakers, unconstrained environments, and adverse signal-to-noise ratios (SNRs). This entry surveys the key methods and principles, technical innovations, empirical results, and challenges underlying this domain, with a focus on systems that achieve real-time, low-latency operation.

1. Architectural Principles for Real-time AVSE

Real-time AVSE architectures consistently pursue causality, frame-synchronous inference, and rapid fusion of modalities. Canonical pipelines (e.g., (Afouras et al., 2018, Ma et al., 29 Jul 2025, Chen et al., 10 Jul 2024)) feature:

  • Visual front-ends: Extraction of temporally-synchronized embeddings from cropped lip-region images, typically via 3D convolutional or residual (ResNet) backbones. In real-time designs (e.g., RT-LA-VocE (Chen et al., 10 Jul 2024), RAVEN (Ma et al., 29 Jul 2025)), all components are re-engineered for strict causality—visual convolutions are padded only on past frames and lookahead is minimized or eliminated.
  • Audio front-ends: Conversion of raw speech to log-magnitude spectrograms or raw-feature encodings (STFT, log1p, or causal 1D ResNet (Chen et al., 10 Jul 2024)) with parameters tuned for minimal windowing delay (window sizes as short as 40 ms).
  • Fusion mechanisms: Temporal convolution (TCN), LSTM, or lightweight attention modules merge the audio and visual streams. Some systems employ gating (AV-E3Net (Zhu et al., 2023)), cross-modality attention (AVSEC3 (Saleem et al., 26 Aug 2025)), or late-stage LSTM/FC fusion (RAVEN (Ma et al., 29 Jul 2025, Ma et al., 25 Sep 2025)).
  • Spectrogram masking or re-synthesis: Most real-time systems estimate a soft mask $\mathbf{M}$ applied to the noisy magnitude spectrogram, enhancing the corresponding frequency bins for the target speaker:

$$\hat{M} = \sigma\!\left(\mathbf{W}_m^\top f_{\text{AV}}\right) \odot M_n$$

where $f_{\text{AV}}$ denotes the fused features and $M_n$ the noisy magnitude (Afouras et al., 2018). Recent systems increasingly pursue re-synthesis via causal neural vocoders (C-HiFi-GAN (Chen et al., 10 Jul 2024)) or diffusion models (Chou et al., 2023, Lin et al., 23 Jan 2025); a minimal sketch of the mask-based variant appears after this list.

  • Phase estimation/refinement: Phase sub-networks predict a residual that refines the noisy phase, e.g.,

$$\hat{\Phi} = \frac{\mathbf{W}_\phi^\top \phi_{6} + \Phi_n}{\left\lVert \mathbf{W}_\phi^\top \phi_{6} + \Phi_n \right\rVert_2}$$

for improved perception (Afouras et al., 2018), though most causal pipelines still employ noisy phase for tractability.

  • Strict causality and latency tuning: All encoders, temporal models (Emformer (Chen et al., 10 Jul 2024), GRU/LSTM (Zhu et al., 2023, Ma et al., 29 Jul 2025)), and vocoders are re-implemented as strictly causal, single-frame, or short-context modules, yielding end-to-end inference times of at most 40 ms per frame and overall system latencies below 120 ms.
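
To make the masking and phase-refinement steps above concrete, below is a minimal PyTorch sketch of a strictly causal, mask-based AVSE module in the spirit of the pipelines cited in this section. It is illustrative only: the layer sizes, the 4:1 visual upsampling, and names such as CausalConv1d and CausalMaskAVSE are assumptions for the example, not a reproduction of any cited architecture.

```python
# Minimal, illustrative sketch of a causal mask-based AVSE model (PyTorch).
# Layer sizes, names, and the 4:1 visual-to-audio upsampling are assumptions,
# not a reproduction of any specific published system.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalConv1d(nn.Module):
    """1D convolution padded only on past frames (no lookahead)."""

    def __init__(self, in_ch, out_ch, kernel_size):
        super().__init__()
        self.pad = kernel_size - 1                    # left (past-only) padding
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size)

    def forward(self, x):                             # x: (B, C, T)
        return self.conv(F.pad(x, (self.pad, 0)))


class CausalMaskAVSE(nn.Module):
    def __init__(self, n_freq=321, vis_dim=512, hidden=256):
        super().__init__()
        self.audio_enc = CausalConv1d(n_freq, hidden, kernel_size=3)
        self.video_enc = CausalConv1d(vis_dim, hidden, kernel_size=3)
        # A unidirectional LSTM keeps the temporal model strictly causal.
        self.fusion = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.mask_head = nn.Linear(hidden, n_freq)        # plays the role of W_m
        self.phase_head = nn.Linear(hidden, 2 * n_freq)   # (cos, sin) residual per bin

    def forward(self, noisy_mag, noisy_phase, vis_feat):
        # noisy_mag:   (B, T, F)      magnitude frames at the audio frame rate
        # noisy_phase: (B, T, F, 2)   noisy phase as unit (cos, sin) vectors
        # vis_feat:    (B, T // 4, D) 25 Hz lip embeddings (T assumed divisible by 4)
        vis_up = vis_feat.repeat_interleave(4, dim=1)     # repeat to the audio rate
        a = self.audio_enc(noisy_mag.transpose(1, 2)).transpose(1, 2)
        v = self.video_enc(vis_up.transpose(1, 2)).transpose(1, 2)
        f_av, _ = self.fusion(torch.cat([a, v], dim=-1))

        # Soft mask applied to the noisy magnitude: sigma(W_m^T f_AV) * M_n
        mask = torch.sigmoid(self.mask_head(f_av))
        enhanced_mag = mask * noisy_mag

        # Phase refinement: add a predicted residual to the noisy phase, renormalize.
        residual = self.phase_head(f_av).view(*noisy_phase.shape)
        phase = noisy_phase + residual
        enhanced_phase = phase / phase.norm(dim=-1, keepdim=True).clamp_min(1e-8)
        return enhanced_mag, enhanced_phase
```

Because no layer consumes future frames, audio frames can be pushed through one at a time in a streaming loop, which keeps the per-frame algorithmic latency bounded by the STFT window plus any explicit lookahead.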

2. Audio-Visual Feature Extraction and Synchronization

Robust extraction and synchronization of temporally aligned audio-visual cues are vital for real-time operation. Approaches include:

  • Visual features: Pre-trained AVSR models (such as AV-HuBERT or VSRiW (Ma et al., 29 Jul 2025, Ma et al., 25 Sep 2025)) generate embeddings at 25 Hz. Real-time systems upsample these to match the audio frame rate (typically by repetition or nearest-neighbor assignment, e.g., 4:1) or align them via interpolation.
  • Audio features: Log-magnitude STFTs (window lengths 40–64 ms, hop 10–16 ms) are universal; some newer systems utilize causal raw waveform encoders (Chen et al., 10 Jul 2024, Zhu et al., 2023) to further reduce latency.
  • Feature alignment: Lightweight cross-attention modules (e.g., (Saleem et al., 26 Aug 2025)) and dynamic gating align and synchronize the features, compensating for natural audio-visual misalignments (tolerance up to ~5 frames). Attention modules are formulated as (see the sketch after this list):

$$S = \frac{Q K^\top}{\sqrt{d_h}}, \qquad \tilde{S} = S + V_{\text{bias}}, \qquad A = \operatorname{softmax}(\tilde{S},\ \text{dim}=-1), \qquad O = A V$$

  • Asynchrony and missing data: Causal frame buffers, data augmentation with time shifts, and "zero-out" training (where visual features are replaced with zeros for slices of the sequence (Chuang et al., 2020)) impart resilience to real-world sensor dropouts and sensor lag.
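
The sketch below implements the biased cross-attention written above, with audio frames attending to repetition-upsampled lip features. The dimensions, the form of the bias tensor, and the 4:1 repetition factor are assumptions for the example rather than any specific paper's implementation.

```python
# Illustrative biased cross-modal attention: audio frames attend to lip features.
# Dimensions and the learnable/additive bias are assumptions for the example.
import math
import torch
import torch.nn as nn


class BiasedCrossAttention(nn.Module):
    def __init__(self, audio_dim=256, vis_dim=512, d_h=64):
        super().__init__()
        self.q = nn.Linear(audio_dim, d_h)   # queries from audio frames
        self.k = nn.Linear(vis_dim, d_h)     # keys from visual frames
        self.v = nn.Linear(vis_dim, d_h)     # values from visual frames
        self.d_h = d_h

    def forward(self, audio, visual, bias=None):
        # audio:  (B, T_a, audio_dim), visual: (B, T_v, vis_dim)
        # bias:   optional (B, T_a, T_v) additive term V_bias (e.g., to penalize
        #         large audio-visual offsets); a zero bias recovers plain attention.
        Q, K, V = self.q(audio), self.k(visual), self.v(visual)
        S = Q @ K.transpose(-2, -1) / math.sqrt(self.d_h)   # S = QK^T / sqrt(d_h)
        if bias is not None:
            S = S + bias                                     # S~ = S + V_bias
        A = S.softmax(dim=-1)                                # A = softmax(S~)
        return A @ V                                         # O = A V


# Usage: align 25 Hz lip embeddings with 100 Hz audio frames by repetition,
# then fuse (assumes the audio length is a multiple of 4).
attn = BiasedCrossAttention()
audio = torch.randn(1, 100, 256)                 # 1 s of 100 Hz audio frames
visual = torch.randn(1, 25, 512)                 # 1 s of 25 Hz lip embeddings
visual_up = visual.repeat_interleave(4, dim=1)   # 25 Hz -> 100 Hz
fused = attn(audio, visual_up)                   # (1, 100, 64)
```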

3. Algorithmic Latency, Streaming, and Hardware Considerations

Table: Latency and Causality in Recent Real-time AVSE Systems

| System | Algorithmic Latency | Processing Platform | Causal Modules |
|---|---|---|---|
| RT-LA-VocE (Chen et al., 10 Jul 2024) | 28.15 ms/frame | Standard GPU/CPU | Yes (full pipeline) |
| RAVEN (Ma et al., 29 Jul 2025) | 120 ms (5f, 2f LA) | Apple M3 Max / CPU | Yes (except 2f LA) |
| AVSEC3 (Saleem et al., 26 Aug 2025) | 36 ms (full cycle) | Low-power CPUs | Yes |
| AV-E3Net (Zhu et al., 2023) | <40 ms (RTF = 0.143) | CPU | Yes |
| Real-time Glasses (Kealey et al., 2023) | 120 ms | NVIDIA Xavier NX | Yes (mask + beamform) |

Abbreviations: LA=Lookahead; RTF=Real-Time Factor

Systems achieve low latency by minimizing context windows, using efficient encoders/decoders (e.g., causal ResNets, GRUs), and, especially for embedded or hearing-aid use, emphasizing lightweight, low-parameter architectures (e.g., AVSEC3: 5.9M params, 23.54 MB (Saleem et al., 26 Aug 2025)).

Hardware implementations range from embedded Jetson boards (Kealey et al., 2023) and ARM CPUs (Ma et al., 29 Jul 2025), to standard desktop CPUs running Python-based pipelines (Ma et al., 25 Sep 2025). Some designs offload critical pre-processing (visual cropping/encoding (Chuang et al., 2020, Chuang et al., 2020)) to the sensor front-end to further reduce system-level latency.
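
As a back-of-the-envelope illustration of how these budgets compose, the sketch below derives algorithmic latency and the real-time factor (RTF, compute time divided by the audio duration it covers) from a system's STFT window, hop, lookahead, and measured per-frame compute time. The decomposition and the example numbers are generic assumptions, not figures from the cited papers.

```python
# Back-of-the-envelope latency/RTF accounting for a streaming AVSE system.
# The decomposition and the example numbers are illustrative assumptions.


def algorithmic_latency_ms(window_ms: float, hop_ms: float,
                           lookahead_frames: int) -> float:
    """Minimum delay imposed by the algorithm itself (no compute time):
    one analysis window plus any future frames the model waits for."""
    return window_ms + lookahead_frames * hop_ms


def real_time_factor(compute_ms_per_frame: float, hop_ms: float) -> float:
    """RTF = compute time per frame / audio duration per frame.
    RTF < 1 means the pipeline keeps up with the incoming stream."""
    return compute_ms_per_frame / hop_ms


if __name__ == "__main__":
    # Hypothetical configuration: 40 ms window, 10 ms hop, 2 frames of lookahead,
    # and 6 ms of compute per frame on the target CPU.
    alg = algorithmic_latency_ms(window_ms=40, hop_ms=10, lookahead_frames=2)
    rtf = real_time_factor(compute_ms_per_frame=6, hop_ms=10)
    total = alg + 6  # total per-frame latency: algorithmic delay + compute time
    print(f"algorithmic latency: {alg:.0f} ms, RTF: {rtf:.2f}, total: {total:.0f} ms")
```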

4. Key Technical Innovations and Algorithms

Several technical innovations distinguish state-of-the-art real-time AVSE:

  • Phase and spectrogram re-synthesis: Multi-stage networks predict not only magnitude but also phase residuals (Afouras et al., 2018), refining phase via convolutional subnets for reduced "robotic" artifacts. Advanced systems employ causal neural vocoders (C-HiFi-GAN) (Chen et al., 10 Jul 2024), or diffusion-based wave synthesis conditioned on continuous AV-HuBERT features (Chou et al., 2023, Lin et al., 23 Jan 2025) instead of classical mask-based enhancement.
  • Gating and attention-based fusion: Approaches such as multi-stage gating-and-summation (GS) (Zhu et al., 2023) and cross-modal attention (AVSEC3 (Saleem et al., 26 Aug 2025), DLAV-SE (Lin et al., 23 Jan 2025)) allow selective, context-dependent fusion of visual and audio features, dynamically suppressing interfering speakers or adapting to background noise conditions (a simplified gated-fusion sketch follows this list).
  • Contextual and adaptive switching: Context-aware fusion can switch between visual, audio, or joint AV branches depending on SNR, as in the contextual AV switching module (Adeel et al., 2018), which adaptively exploits modality reliability without external SNR estimates.
  • Multimodal/language/semantic integration: Some systems (DLAV-SE (Lin et al., 23 Jan 2025)) integrate an auxiliary linguistic pathway during training, injecting LLM-derived embeddings (e.g., BERT) via cross-modal knowledge transfer (CMKT). This improves phonetic discrimination and reduces generative artifacts without requiring language features at inference.
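
To illustrate the gating idea in general terms, the following is a single-stage simplification: a learned sigmoid gate decides, per frame and channel, how much visual evidence is summed into the audio stream. The layer sizes and names are assumptions, and this does not reproduce the multi-stage GS design of AV-E3Net.

```python
# Single-stage gated audio-visual fusion, a simplified illustration of the
# gating-and-summation idea; sizes and names are assumptions.
import torch
import torch.nn as nn


class GatedAVFusion(nn.Module):
    def __init__(self, audio_dim=256, vis_dim=256):
        super().__init__()
        self.proj_v = nn.Linear(vis_dim, audio_dim)
        # The gate sees both streams and decides, per frame and per channel,
        # how much of the visual evidence to let through.
        self.gate = nn.Sequential(nn.Linear(audio_dim + vis_dim, audio_dim),
                                  nn.Sigmoid())

    def forward(self, audio, visual):
        # audio, visual: (B, T, D) frame-aligned features
        g = self.gate(torch.cat([audio, visual], dim=-1))
        # Summation: pass the audio stream through, add gated visual evidence.
        return audio + g * self.proj_v(visual)
```

When the lip stream becomes unreliable (occlusion, dropouts), such a gate can drive its contribution toward zero, so fusion degrades gracefully toward audio-only enhancement.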

5. Evaluation Methodologies, Metrics, and Empirical Results

AVSE systems are evaluated on both synthetic laboratory corpora and real-world noisy datasets, typically scored with objective speech quality and intelligibility metrics.
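
As a generic illustration of such scoring, the snippet below computes two widely used intrusive metrics, PESQ and ESTOI, with the pesq and pystoi packages. The choice of metrics and toolkits here is an assumption for the example; individual papers may report different or additional measures.

```python
# Generic illustration of objective scoring with PESQ and (E)STOI
# (pip install pesq pystoi soundfile). Metric/toolkit choice is an assumption.
import soundfile as sf
from pesq import pesq      # ITU-T P.862 wideband/narrowband PESQ
from pystoi import stoi    # (extended) short-time objective intelligibility

clean, fs = sf.read("clean.wav")       # reference target speech, 16 kHz mono
enhanced, _ = sf.read("enhanced.wav")  # output of the AVSE system

n = min(len(clean), len(enhanced))     # crude length alignment for the example
clean, enhanced = clean[:n], enhanced[:n]

print("PESQ (wb):", pesq(fs, clean, enhanced, "wb"))          # roughly 1.0 to 4.5
print("ESTOI:   ", stoi(clean, enhanced, fs, extended=True))  # 0 to 1
```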

6. Practical Limitations, Open Challenges, and Future Directions

Despite recent progress, the field faces several open issues:

  • Phase reconstruction: While advanced models (e.g., with phase subnets or neural vocoders) reduce phase artifacts, a gap to upper-bound quality with ground-truth phase remains (Afouras et al., 2018, Chen et al., 10 Jul 2024).
  • Synchronization and robustness: Performance is highly sensitive to temporal alignment between lip movements and audio (Afouras et al., 2018, Saleem et al., 26 Aug 2025). Real-world asynchrony and visual dropouts require robust fusion and adaptive handling (data augmentation, zero-out training (Chuang et al., 2020)).
  • Generalization to real-world conditions: Systems must handle a spectrum of unseen speakers, languages, dynamic acoustic environments, and low-quality sensors without retraining (Gogate et al., 2019, Ma et al., 25 Sep 2025).
  • Computational and energy efficiency: For hearing aid and mobile deployment, highly efficient architectures (AVSEC3: 36 ms latency, 23.54 MB (Saleem et al., 26 Aug 2025)) and offloaded or compressed visual encoding (autoencoder + quantization (Chuang et al., 2020, Chuang et al., 2020)) are crucial.
  • Multimodal semantics and transfer: Future AVSE frameworks are expected to integrate broader multimodal cues (linguistic, emotional, and scene context) using principled cross-modal adaptation to further mitigate “bleed-through” noise and unnatural artifacts (Lin et al., 23 Jan 2025, Hussain et al., 26 Feb 2024).

References to key papers:

(Afouras et al., 2018, Adeel et al., 2018, Gogate et al., 2019, Chuang et al., 2020, Chuang et al., 2020, Gogate et al., 2021, Yang et al., 2022, Kealey et al., 2023, Zhu et al., 2023, Chou et al., 2023, Hussain et al., 26 Feb 2024, Chen et al., 10 Jul 2024, Jain et al., 3 Sep 2024, Lin et al., 23 Jan 2025, Ma et al., 29 Jul 2025, Saleem et al., 26 Aug 2025, Ma et al., 25 Sep 2025).
