Real-Time Audio-Visual Speech Enhancement
- Real-time audio-visual speech enhancement is a technique that isolates a target speaker's voice from noise using synchronized audio signals and lip-region cues under strict latency constraints.
- Modern systems employ deep neural architectures with causal convolution, LSTM, and attention-based fusion to maintain robust performance even in adverse SNR conditions.
- Key practical insights include low algorithmic latency (roughly 120 ms or less end-to-end), the use of phase re-synthesis for natural sound quality, and resilience to audio-visual asynchrony and sensor dropouts.
Real-time audio-visual speech enhancement (AVSE) refers to the isolation and restoration of a target speaker’s voice from noisy mixtures using synchronized audio and visual (typically lip region) cues, processed under low-latency computational constraints. Modern AVSE systems leverage deep neural architectures to exploit the complementary strengths of both modalities, achieving robust speech intelligibility and quality even for previously unseen speakers, unconstrained environments, and adverse signal-to-noise ratios (SNRs). This entry surveys the key methods and principles, technical innovations, empirical results, and challenges underlying this domain, with a focus on systems that achieve real-time, low-latency operation.
1. Architectural Principles for Real-time AVSE
Real-time AVSE architectures consistently pursue causality, frame-synchronous inference, and rapid fusion of modalities. Canonical pipelines (e.g., Afouras et al., 2018; Ma et al., 29 Jul 2025; Chen et al., 10 Jul 2024) feature:
- Visual front-ends: Extraction of temporally-synchronized embeddings from cropped lip-region images, typically via 3D convolutional or residual (ResNet) backbones. In real-time designs (e.g., RT-LA-VocE (Chen et al., 10 Jul 2024), RAVEN (Ma et al., 29 Jul 2025)), all components are re-engineered for strict causality—visual convolutions are padded only on past frames and lookahead is minimized or eliminated.
- Audio front-ends: Conversion of raw speech to log-magnitude spectrograms or raw-feature encodings (STFT, log1p, or causal 1D ResNet (Chen et al., 10 Jul 2024)) with parameters tuned for minimal windowing delay (window sizes as short as 40 ms).
- Fusion mechanisms: Temporal convolution (TCN), LSTM, or lightweight attention modules merge the audio and visual streams. Some systems employ gating (AV-E3Net (Zhu et al., 2023)), cross-modality attention (AVSEC3 (Saleem et al., 26 Aug 2025)), or late-stage LSTM/FC fusion (RAVEN (Ma et al., 29 Jul 2025, Ma et al., 25 Sep 2025)).
- Spectrogram masking or re-synthesis: Most real-time systems estimate a soft mask applied to the noisy magnitude spectrogram, enhancing the corresponding frequency bins for the target speaker (see the sketch after this list):
$$\hat{M} = \sigma\big(g(Z)\big), \qquad \hat{S} = \hat{M} \odot |X|,$$
where $Z$ denotes the fused features and $|X|$ the noisy magnitude (Afouras et al., 2018). Recent systems increasingly pursue re-synthesis via causal neural vocoders (C-HiFi-GAN (Chen et al., 10 Jul 2024)) or diffusion models (Chou et al., 2023, Lin et al., 23 Jan 2025).
- Phase estimation/refinement: Phase sub-networks predict a residual that refines the noisy phase, e.g.,
$$\hat{\phi} = \phi_{\mathrm{noisy}} + \Delta\phi,$$
for improved perception (Afouras et al., 2018), though most causal pipelines still employ the noisy phase for tractability.
- Strict causality and latency tuning: All encoders, temporal models (Emformer (Chen et al., 10 Jul 2024), GRU/LSTM (Zhu et al., 2023, Ma et al., 29 Jul 2025)), and vocoders are re-implemented as strictly causal, single-frame, or short-context modules, yielding end-to-end inference times below roughly 40 ms per frame and overall system latencies of roughly 120 ms or less.
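To make the masking pipeline concrete, the following PyTorch sketch combines the ingredients listed above: a 1D convolution padded only on past frames, frame-aligned visual embeddings, unidirectional LSTM fusion, and a sigmoid mask over the noisy log-magnitude. The layer sizes, module names (CausalConv1d, CausalMaskAVSE), and exact layout are illustrative assumptions, not a reproduction of any cited system.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1D convolution padded only on past frames (strict causality)."""
    def __init__(self, in_ch, out_ch, kernel_size):
        super().__init__()
        self.pad = kernel_size - 1          # left-only padding
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size)

    def forward(self, x):                   # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))

class CausalMaskAVSE(nn.Module):
    """Illustrative causal AVSE masker: fuse audio/visual frames, predict a soft mask."""
    def __init__(self, n_freq=321, vis_dim=512, hidden=256):
        super().__init__()
        self.audio_enc = CausalConv1d(n_freq, hidden, kernel_size=3)
        self.video_enc = CausalConv1d(vis_dim, hidden, kernel_size=3)
        self.fusion = nn.LSTM(2 * hidden, hidden, batch_first=True)  # unidirectional => causal
        self.mask_head = nn.Linear(hidden, n_freq)

    def forward(self, noisy_logmag, vis_emb):
        # noisy_logmag: (B, T, n_freq) log-magnitude frames from a short-window STFT
        # vis_emb:      (B, T, vis_dim) lip-region embeddings, already frame-aligned
        a = self.audio_enc(noisy_logmag.transpose(1, 2)).transpose(1, 2)
        v = self.video_enc(vis_emb.transpose(1, 2)).transpose(1, 2)
        fused, _ = self.fusion(torch.cat([a, v], dim=-1))
        return torch.sigmoid(self.mask_head(fused))   # soft mask in [0, 1]

# Enhancement multiplies the noisy magnitude by the mask and reuses the noisy phase.
model = CausalMaskAVSE()
mask = model(torch.randn(1, 100, 321), torch.randn(1, 100, 512))  # (1, 100, 321)
```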
2. Audio-Visual Feature Extraction and Synchronization
Robust extraction and synchronization of temporally aligned audio-visual cues are vital for real-time operation. Approaches include:
- Visual features: Pre-trained AVSR models (such as AV-HuBERT or VSRiW (Ma et al., 29 Jul 2025, Ma et al., 25 Sep 2025)) generate embeddings at 25 Hz. Real-time systems upsample these to match the audio frame rate, typically via nearest-neighbor repetition (e.g., 4:1), or align them via interpolation (see the sketch after this list).
- Audio features: Log-magnitude STFTs (window lengths 40–64 ms, hop 10–16 ms) are universal; some newer systems utilize causal raw waveform encoders (Chen et al., 10 Jul 2024, Zhu et al., 2023) to further reduce latency.
- Feature alignment: Lightweight cross-attention modules (e.g., Saleem et al., 26 Aug 2025) and dynamic gating align and synchronize the features, compensating for natural audio-visual misalignments (tolerance up to 5 frames). The attention modules follow the standard scaled dot-product form, $\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\big(QK^{\top}/\sqrt{d_k}\big)V$, with queries drawn from one modality and keys/values from the other.
- Asynchrony and missing data: Causal frame buffers, data augmentation with time shifts, and "zero-out" training (where visual features are replaced with zeros for slices of the sequence (Chuang et al., 2020)) impart resilience to real-world sensor dropouts and lag.
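The sketch below illustrates the rate matching and dropout robustness described above, assuming 25 Hz visual embeddings repeated to a 100 Hz audio frame rate (4:1) and a simple "zero-out" span augmentation; the ratio, span length, and probability are illustrative values, not settings from the cited work.

```python
import torch

def upsample_visual(vis_emb, ratio=4):
    """Repeat 25 Hz visual embeddings to the audio frame rate (e.g., 100 Hz -> ratio 4)."""
    # vis_emb: (B, T_video, D) -> (B, T_video * ratio, D)
    return torch.repeat_interleave(vis_emb, ratio, dim=1)

def zero_out_augment(vis_emb, max_span=25, p=0.5):
    """Randomly zero a contiguous span of visual frames to simulate camera dropouts."""
    vis_emb = vis_emb.clone()
    if torch.rand(()) < p:
        T = vis_emb.shape[1]
        span = int(torch.randint(1, max_span + 1, ()))
        start = int(torch.randint(0, max(T - span, 1), ()))
        vis_emb[:, start:start + span] = 0.0
    return vis_emb

# Example: 2 s of video at 25 Hz with 512-d embeddings, matched to 100 Hz audio frames.
vis = torch.randn(1, 50, 512)
vis = zero_out_augment(vis)          # training-time robustness to sensor dropouts
vis = upsample_visual(vis, ratio=4)  # (1, 200, 512), one embedding per audio frame
```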
3. Algorithmic Latency, Streaming, and Hardware Considerations
Table: Latency and Causality in Recent Real-time AVSE Systems
| System | Algorithmic Latency | Processing Platform | Causal Modules |
|---|---|---|---|
| RT-LA-VocE (Chen et al., 10 Jul 2024) | 28.15 ms/frame | Standard GPU/CPU | Yes (full pipeline) |
| RAVEN (Ma et al., 29 Jul 2025) | 120 ms (5f, 2f LA) | Apple M3 Max/CPU | Yes (except 2f LA) |
| AVSEC3 (Saleem et al., 26 Aug 2025) | 36 ms (full cycle) | Low-power CPUs | Yes |
| AV-E3Net (Zhu et al., 2023) | <40 ms (RTF = 0.143) | CPU | Yes |
| Real-time Glasses (Kealey et al., 2023) | 120 ms | NVIDIA Xavier NX | Yes (mask + beamform) |
Abbreviations: LA=Lookahead; RTF=Real-Time Factor
Systems achieve low latency by minimizing context windows, using efficient encoders/decoders (e.g., causal ResNets, GRUs), and—especially for embedded/hearing aid use—emphasizing low-parameter lightweight architectures (e.g., AVSEC3: 5.9M params, 23.54 MB (Saleem et al., 26 Aug 2025)).
Hardware implementations range from embedded Jetson boards (Kealey et al., 2023) and ARM CPUs (Ma et al., 29 Jul 2025), to standard desktop CPUs running Python-based pipelines (Ma et al., 25 Sep 2025). Some designs offload critical pre-processing (visual cropping/encoding (Chuang et al., 2020, Chuang et al., 2020)) to the sensor front-end to further reduce system-level latency.
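The latency figures above decompose into STFT windowing delay, lookahead buffering, and per-frame compute, while the real-time factor compares compute time to hop duration. The helper functions below show this arithmetic; the numeric inputs are illustrative, not measurements from any listed system.

```python
def algorithmic_latency_ms(window_ms, hop_ms, lookahead_frames, per_frame_compute_ms):
    """Approximate end-to-end latency of a frame-synchronous streaming enhancer."""
    buffering = window_ms + lookahead_frames * hop_ms   # audio that must arrive before a frame can be emitted
    return buffering + per_frame_compute_ms

def real_time_factor(per_frame_compute_ms, hop_ms):
    """RTF < 1 means each hop is processed faster than it arrives."""
    return per_frame_compute_ms / hop_ms

# Illustrative numbers: 40 ms window, 10 ms hop, 2 frames of lookahead, 8 ms compute per frame.
print(algorithmic_latency_ms(40, 10, 2, 8))   # 68 ms total latency
print(real_time_factor(8, 10))                # 0.8 -> real-time capable
```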
4. Key Technical Innovations and Algorithms
Several technical innovations distinguish state-of-the-art real-time AVSE:
- Phase and spectrogram re-synthesis: Multi-stage networks predict not only magnitude but also phase residuals (Afouras et al., 2018), refining phase via convolutional subnets for reduced "robotic" artifacts. Advanced systems employ causal neural vocoders (C-HiFi-GAN) (Chen et al., 10 Jul 2024), or diffusion-based wave synthesis conditioned on continuous AV-HuBERT features (Chou et al., 2023, Lin et al., 23 Jan 2025) instead of classical mask-based enhancement.
- Gating and attention-based fusion: Approaches such as multi-stage gating-and-summation (GS) (Zhu et al., 2023) and cross-modal attention (AVSEC3 (Saleem et al., 26 Aug 2025), DLAV-SE (Lin et al., 23 Jan 2025)) allow selective, context-dependent fusion of visual and audio features, dynamically suppressing interfering speakers or adapting to background noise conditions (a simple gated-fusion sketch follows this list).
- Contextual and adaptive switching: Context-aware fusion can switch between visual, audio, or joint AV branches depending on SNR, as in the contextual AV switching module (Adeel et al., 2018), which adaptively exploits modality reliability without external SNR estimates.
- Multimodal/language/semantic integration: Some systems (DLAV-SE (Lin et al., 23 Jan 2025)) integrate an auxiliary linguistic pathway during training, injecting LLM-derived embeddings (e.g., BERT) via cross-modal knowledge transfer (CMKT). This improves phonetic discrimination and reduces generative artifacts without requiring language features at inference.
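As a concrete illustration of gating-based fusion, the sketch below gates each modality with a learned per-frame sigmoid weight and sums the results, in the spirit of the gating-and-summation modules cited above; the layer layout and dimensions are assumptions, not the published AV-E3Net design.

```python
import torch
import torch.nn as nn

class GatedSumFusion(nn.Module):
    """Gate each modality by a learned sigmoid weight, then sum (gating-and-summation)."""
    def __init__(self, dim):
        super().__init__()
        self.gate_a = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.gate_v = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, audio_feat, video_feat):
        # audio_feat, video_feat: (B, T, dim), frame-aligned
        joint = torch.cat([audio_feat, video_feat], dim=-1)
        g_a = self.gate_a(joint)       # per-frame trust in the audio stream
        g_v = self.gate_v(joint)       # per-frame trust in the visual stream
        return g_a * audio_feat + g_v * video_feat

fusion = GatedSumFusion(dim=256)
out = fusion(torch.randn(2, 100, 256), torch.randn(2, 100, 256))  # (2, 100, 256)
```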
5. Evaluation Methodologies, Metrics, and Empirical Results
AVSE systems are evaluated on both synthetic laboratory corpora and real-world noisy datasets:
- Objective metrics: Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI), Scale-Invariant SDR (SI-SDR), and Word Error Rate (WER) are standard for quantifying restored quality and intelligibility (Afouras et al., 2018, Ma et al., 29 Jul 2025, Jain et al., 3 Sep 2024). SI-SDR in particular is a key fidelity index for modern models (a minimal implementation is sketched after this list).
- Latency and computational cost: Reports of real-time throughput include RTF (ratio of processing to real duration), frame-by-frame latency (ms), and end-to-end CPU performance (Chen et al., 10 Jul 2024, Zhu et al., 2023, Ma et al., 29 Jul 2025).
- Subjective studies: Mean Opinion Score (MOS), MUSHRA-style paired comparisons, and CMOS are used for listening evaluations (Gogate et al., 2019, Chou et al., 2023, Ma et al., 25 Sep 2025).
- Empirical findings:
- AVSE systems consistently outperform audio-only counterparts, with the strongest gains at low SNRs and in multi-speaker mixtures (Afouras et al., 2018, Zhu et al., 2023, Ma et al., 29 Jul 2025).
- Visual front-ends trained for AVSR outperform generic face/lip detectors for suppression of competing talkers (Ma et al., 29 Jul 2025).
- Multimodal fusion with semantic or emotional context (e.g., emotion-aware AVSE (Hussain et al., 26 Feb 2024), linguistic transfer (Lin et al., 23 Jan 2025)) yields further performance gains, particularly for intelligibility.
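Since SI-SDR is the central fidelity metric above, a minimal NumPy implementation of its standard definition is sketched below; the example signals and seed are arbitrary.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-Invariant SDR (dB): project the estimate onto the reference, compare energies."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Scaled copy of the reference that best explains the estimate
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    noise = estimate - target
    return 10.0 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps))

# Example: a slightly noisy copy of a reference signal
rng = np.random.default_rng(0)
ref = rng.standard_normal(16000)
est = ref + 0.1 * rng.standard_normal(16000)
print(round(si_sdr(est, ref), 1))  # roughly 20 dB
```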
6. Practical Limitations, Open Challenges, and Future Directions
Despite recent progress, the field faces several open issues:
- Phase reconstruction: While advanced models (e.g., with phase subnets or neural vocoders) reduce phase artifacts, a gap to upper-bound quality with ground-truth phase remains (Afouras et al., 2018, Chen et al., 10 Jul 2024).
- Synchronization and robustness: Performance is highly sensitive to temporal alignment between lip movements and audio (Afouras et al., 2018, Saleem et al., 26 Aug 2025). Real-world asynchrony and visual dropouts require robust fusion and adaptive handling (data augmentation, zero-out training (Chuang et al., 2020)).
- Generalization to real-world conditions: Systems must handle a spectrum of unseen speakers, languages, dynamic acoustic environments, and low-quality sensors without retraining (Gogate et al., 2019, Ma et al., 25 Sep 2025).
- Computational and energy efficiency: For hearing aid and mobile deployment, highly efficient architectures (AVSEC3: 36 ms latency, 23.54 MB (Saleem et al., 26 Aug 2025)) and offloaded or compressed visual encoding (autoencoder + quantization (Chuang et al., 2020, Chuang et al., 2020)) are crucial (see the compression sketch after this list).
- Multimodal semantics and transfer: Future AVSE frameworks are expected to integrate broader multimodal cues (linguistic, emotional, and scene context) using principled cross-modal adaptation to further mitigate "bleed-through" noise and unnatural artifacts (Lin et al., 23 Jan 2025, Hussain et al., 26 Feb 2024).
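As a rough sketch of sensor-side compression of visual embeddings, the code below pairs a small linear autoencoder bottleneck with uniform 8-bit quantization; the dimensions, class name (VisualBottleneck), and quantization scheme are illustrative assumptions, not the method of the cited papers.

```python
import torch
import torch.nn as nn

class VisualBottleneck(nn.Module):
    """Compress lip-region embeddings before transmission; decode on the processing device."""
    def __init__(self, dim=512, code_dim=64):
        super().__init__()
        self.encoder = nn.Linear(dim, code_dim)
        self.decoder = nn.Linear(code_dim, dim)

    def compress(self, emb):
        code = self.encoder(emb)
        # Uniform 8-bit quantization of the bottleneck code
        scale = code.abs().max().clamp(min=1e-8) / 127.0
        q = torch.round(code / scale).clamp(-127, 127).to(torch.int8)
        return q, scale

    def decompress(self, q, scale):
        return self.decoder(q.float() * scale)

model = VisualBottleneck()
emb = torch.randn(1, 25, 512)      # one second of 25 Hz visual embeddings
q, scale = model.compress(emb)     # int8 payload: 64 values per frame instead of 512 floats
recon = model.decompress(q, scale) # (1, 25, 512) approximation on the receiver side
```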
References to key papers:
Afouras et al., 2018; Adeel et al., 2018; Gogate et al., 2019; Chuang et al., 2020; Chuang et al., 2020; Gogate et al., 2021; Yang et al., 2022; Kealey et al., 2023; Zhu et al., 2023; Chou et al., 2023; Hussain et al., 26 Feb 2024; Chen et al., 10 Jul 2024; Jain et al., 3 Sep 2024; Lin et al., 23 Jan 2025; Ma et al., 29 Jul 2025; Saleem et al., 26 Aug 2025; Ma et al., 25 Sep 2025.