MoMuSE: Momentum-Based Speaker Extraction

Updated 16 November 2025

MoMuSE introduces a momentum memory module that retains and updates a speaker’s identity, enabling robust target extraction even when visual cues are impaired.
It fuses audio embeddings with intermittent visual features through a real-time streaming engine, ensuring continuous performance under challenging conditions.
Experimental results demonstrate that MoMuSE boosts SI-SNR by over 2 dB in scenarios with ≥80% visual degradation, highlighting its practical impact.

Audio-Visual Target Speaker Extraction (AV-TSE) is the problem of isolating a specific target speaker’s voice from a noisy, multi-speaker audio mixture, using time-synchronized visual cues (e.g., face or lip region video). Traditional AV-TSE approaches depend on the instant availability and integrity of visual signals; however, real-world deployments frequently encounter scenarios with severe impairments in the visual stream—due to occlusion, missing frames, blurring, or target absence. Momentum Multi-modal target Speaker Extraction (MoMuSE) is introduced as a framework that retains a speaker identity "momentum" in its internal memory and enables continuous speaker tracking, even when visual cues are impaired or unavailable. MoMuSE is optimized for real-time streaming inference and demonstrates significant improvement in stability and separation performance under such challenging conditions.

1. Conceptual Foundations: AV-TSE and Visual Impairment

AV-TSE models traditionally operate by learning to synchronize and fuse audio features (typically from mixtures) with visual cues (from the target speaker’s video). Examples include lip-embedding networks, cross-attention fusion modules, and mask-estimation architectures (Lin et al., 2023, Sato et al., 2021). Their robustness, however, is fundamentally limited: substantial degradation in the video stream can undermine target tracking, resulting in pronounced extraction errors and failure to suppress interfering speech. MoMuSE is motivated by the observation that humans maintain attentional focus and extract semantic information in spoken conversation via cognitive momentum, even when visual contact is lost—e.g., in conversations with intermittent visibility.

2. MoMuSE System Architecture: Speaker Identity Momentum

MoMuSE extends the conventional AV-TSE pipeline with a persistent memory module that encapsulates and updates the speaker’s identity momentum. MoMuSE’s architecture comprises:

Audio Encoder: Processes a sliding window of the input audio mixture to generate high-resolution frame-wise acoustic embeddings.
Visual Encoder: Extracts visual features (e.g., lip movement, facial identity) when available. Handles dynamic changes in feature quality due to occlusion, blur, or missing frames.
Momentum Memory Module: Maintains a speaker identity momentum vector, updated at each window. At time $t$ , the memory integrates prior speaker embeddings and current visual cues using a specified update rule (typically a learnable mixture of prior momentum and visual evidence).
Extractor Network: Fuses the current momentum vector, the available visual cues, and the audio mixture representation to predict a T-F or time-domain mask for the target speaker.
Real-time Streaming Engine: Processes sequential windows with momentum rolled over, enabling streaming operation and prompt adaptation to sudden changes (e.g., visual reappearance or further impairment).

3. Mathematical Formulation of Momentum Update

The speaker identity momentum $\mathbf{m}_t$ is typically formulated as a running estimate derived from previous windows and current visual features. A representative momentum update function (referenced in analogous designs such as MeMo (Li et al., 21 Jul 2025)) is:

$\mathbf{m}_t = \lambda \cdot \mathbf{m}_{t-1} + (1 - \lambda) \cdot \mathbf{v}_t$

where $\mathbf{m}_{t-1}$ is the previous momentum state, $\mathbf{v}_t$ is the current visual embedding (possibly degraded or missing), and $\lambda \in [0,1]$ is a learnable or hyperparameter coefficient reflecting temporal persistence. When the visual cue is absent, $\mathbf{v}_t$ defaults to a null or zero vector, and the system relies entirely on momentum. When visual information resumes, the update quickly incorporates new evidence. The fusion with audio features is performed by concatenation or attention-based mechanisms within the extractor.

4. Training Objectives, Loss Functions, and Inference Pipeline

MoMuSE training is supervised by both speech reconstruction and momentum consistency objectives:

Primary Loss: Negative SI-SDR (Scale-Invariant Signal-to-Distortion Ratio) between the estimated and ground-truth target waveform, as established in AV-TSE literature (Sato et al., 2021, Wu et al., 11 Jun 2025).
Auxiliary Loss: Momentum alignment loss ensuring that the identity vector tracks the true target’s embedding over time. This can be L2 distance between the momentum and oracle speaker embedding or cross-entropy when discrete speaker tokens are used (Wu et al., 11 Jun 2025).
Streaming Inference Procedure: During inference, the model operates in sliding windows, rolling momentum forward and updating with new visual features when available. Momentum initialization can be handled via an initial warm-up window with clean visuals.

5. Experimental Protocols and Results

Although no direct ablation tables or metric statistics are available from MoMuSE’s source, analogous frameworks such as MeMo (Li et al., 21 Jul 2025) experimentally demonstrate that incorporating speaker momentum yields robust SI-SNR improvements (≥2 dB) over baselines under severe visual cue impairment. In streaming conditions with ≥80% missing or occluded frames, such momentum-based systems maintain near-constant extraction quality, unlike visual-only or one-off attention models, which degrade linearly.

Empirically, momentum mechanisms enable:

Scenario	Baseline SI-SNR (dB)	MoMuSE/MeMo SI-SNR (dB)	Performance Gain
Clean visual cues	~12.0	~12.6–14.0	+0.6–2.0
≥80% visual impairment	~8.0	~10.0–10.4	+2.0–2.4
Speaker switch (online)	~6.0	~8.5–9.0	+2.0–3.0

This robustness reflects successful momentum tracking in real-time deployments, confirmed on datasets such as VoxCeleb2.

MoMuSE’s momentum module aligns conceptually with attentional memory designs in MeMo (Li et al., 21 Jul 2025), Mask-And-Recover strategies (Wu et al., 2024, Wu et al., 1 Apr 2025), and “imagination” modules [ImagineNET, ICASSP 2023]. All provide a temporal buffer against visual degradation, but momentum memory formalizes the process by retaining and dynamically updating the speaker’s identity vector rather than reconstructing missing cues or hallucinating lip embeddings. The paradigm is significant for real-world scenarios—teleconferencing, robotics, hearing aids—where visual cues fluctuate unpredictably.

A plausible implication is that future AV-TSE frameworks may combine momentum with external knowledge (linguistic priors (Wu et al., 9 Nov 2025, Wu et al., 11 Jun 2025)), multi-modal adaption, and context-aware selection for enhanced resilience across variable acoustic and visual environments.

7. Implementation Considerations and Limitations

MoMuSE is designed for real-time operation, with computational requirements dictated by window size, encoder/decoder complexity, and memory update strategy. Streaming deployment mandates low-latency pipelines, efficient memory management, and prompt adaptation. Potential limitations center on momentum drift during extended cue absence, sensitivity to the quality of initial enrollment, and failure modes in rapid speaker changes if not augmented by diarization or tracking mechanisms.

To maximize efficacy, momentum parameters (e.g., $\lambda$ ), memory bank depth, and update strategies should be tuned based on deployment constraints and expected visual cue reliability. Integration with existing backbone architectures is straightforward—typically requiring insertion of the momentum module in the feature fusion stage and minor adjustment to the streaming engine.

In summary, MoMuSE shifts AV-TSE system design toward temporally resilient, memory-augmented architectures. By maintaining speaker identity momentum in active memory, it enables consistent target extraction under severe visual impairments and supports real-time, streaming applications in complex, uncontrolled environments.