
MeanFlow-TSE: One-Step Target Speaker Extraction

Updated 24 December 2025
  • The paper introduces MeanFlow-TSE, a one-step generative TSE framework that efficiently extracts target speech using a single network evaluation and achieves state-of-the-art perceptual metrics.
  • It employs a mixing-ratio-driven linear trajectory in the spectral domain and a neural mean-flow map to directly map mixtures to clean speech, yielding a +1.31 dB SI-SDR gain on Libri2Mix.
  • MeanFlow-TSE’s curriculum-based training and efficient architecture enable practical real-time deployment in streaming and edge applications with minimal computational overhead.

MeanFlow-TSE is a one-step generative target speaker extraction (TSE) framework based on mean-flow objectives. It is designed to efficiently and accurately extract a desired speaker’s voice from a multi-speaker mixture, while circumventing the computational burdens typical of diffusion and flow-matching approaches that require many iterative function evaluations. MeanFlow-TSE leverages a mixing-ratio-driven linear trajectory between background and target sources in the spectral domain and learns a neural mean-flow map, enabling direct, single-pass extraction of high-quality target speech. The system achieves real-time performance with competitive, state-of-the-art perceptual and fidelity metrics, as demonstrated on the Libri2Mix benchmark (Shimizu et al., 21 Dec 2025).

1. Problem Formulation and Background

In the TSE task, the observed signal is a mixture

$$y(t) = s(t) + b(t)$$

where $s(t)$ is the target speaker's signal and $b(t)$ comprises background noise and interfering speakers. Traditional discriminative approaches (e.g., Conv-TasNet with speaker embedding, SepFormer) estimate a mask or mapping $f_\theta(y, e) \to \hat{s}$ that minimizes waveform losses (e.g., SI-SNR, $\|\hat{s} - s\|^2$). While fast, these can introduce artifacts and generalize poorly. Generative paradigms, including diffusion and flow-matching models, learn a conditional density $p(s \mid y, e)$ but require multi-step sampling (typically ≥10 network function evaluations, NFEs), limiting deployment in low-latency or real-time scenarios.

MeanFlow-TSE extends the "AD-FlowTSE" paradigm, which is anchored in modeling flows between background and target in the STFT (spectral) domain. It introduces a mixing ratio $\lambda \in [0, 1]$ representing the balance between target and background:

$$Y = \lambda S + (1-\lambda) B,$$

where $Y = \mathrm{STFT}(y)$, $S = \mathrm{STFT}(s)$, and $B = \mathrm{STFT}(b)$.
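Because the STFT is linear, the spectral-domain mixing identity $Y = \lambda S + (1-\lambda) B$ follows directly from mixing in the time domain and can be checked numerically. A minimal sketch with scipy, using random stand-in waveforms and illustrative STFT parameters (not the paper's):

```python
import numpy as np
from scipy.signal import stft

rng = np.random.default_rng(0)
s = rng.standard_normal(16000)      # stand-in "target" waveform, 1 s at 16 kHz
b = rng.standard_normal(16000)      # stand-in "background" waveform
lam = 0.7                           # mixing ratio lambda
y = lam * s + (1 - lam) * b         # time-domain mixture

# The STFT is linear, so Y = lam*S + (1-lam)*B holds up to float error
_, _, Y = stft(y, fs=16000, nperseg=512)
_, _, S = stft(s, fs=16000, nperseg=512)
_, _, B = stft(b, fs=16000, nperseg=512)
assert np.allclose(Y, lam * S + (1 - lam) * B)
```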

2. One-Step Mean-Flow Objective

The method parameterizes the extraction path as a convex linear interpolation in the spectral space:

$$z_t = t S + (1-t) B, \quad t \in [0, 1],$$

where $t$ governs the transition from background to target. The instantaneous velocity is $u = S - B$, constant with respect to $t$, reflecting the straight-line nature of the mixing path.
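The constancy of the velocity along this straight path is easy to confirm numerically: the finite-difference slope between any two points on the path equals $S - B$. A toy check (random spectrogram-shaped arrays, not real data):

```python
import numpy as np

rng = np.random.default_rng(1)
# toy complex "spectrograms" standing in for S and B
S = rng.standard_normal((4, 5)) + 1j * rng.standard_normal((4, 5))
B = rng.standard_normal((4, 5)) + 1j * rng.standard_normal((4, 5))

def z(t):
    # point on the linear path z_t = t*S + (1-t)*B
    return t * S + (1 - t) * B

# the finite-difference velocity between any two points equals u = S - B
for t, r in [(0.0, 1.0), (0.2, 0.9), (0.4, 0.5)]:
    v = (z(r) - z(t)) / (r - t)
    assert np.allclose(v, S - B)
```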

Instead of numerically integrating instantaneous velocities, MeanFlow-TSE learns the average velocity over the interval $[t, r]$:

$$z_r = z_t + (r - t)\, v_\text{avg}(z_t, t, r, e).$$

For the TSE problem at inference, $t = \hat{\lambda}$ (the estimated mixing ratio), $r = 1$, and $e$ is the enrollment (reference speaker) embedding. The predicted target spectrogram is

$$\hat{S} = Y + (1 - \hat{\lambda})\, v_\theta(Y, \hat{\lambda}, 1, e)$$

with $\hat{\lambda}$ output by a learned mixing-ratio predictor. This single-step update realizes direct source extraction without iterative refinement.
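In the idealized case where both the mixing ratio and the average velocity are known exactly, the one-step update recovers the target spectrogram identically, since $Y + (1-\lambda)(S - B) = \lambda S + (1-\lambda)B + (1-\lambda)(S - B) = S$. A toy sanity check of this identity (oracle quantities stand in for the learned network and predictor):

```python
import numpy as np

rng = np.random.default_rng(2)
S = rng.standard_normal((4, 5))    # toy target spectrogram (real-valued for simplicity)
B = rng.standard_normal((4, 5))    # toy background spectrogram
lam = 0.3
Y = lam * S + (1 - lam) * B        # observed mixture spectrogram

v_oracle = S - B                   # oracle average velocity along the linear path
S_hat = Y + (1 - lam) * v_oracle   # one-step update with t = lambda, r = 1
assert np.allclose(S_hat, S)       # exact recovery under oracle velocity and ratio
```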

3. Training Protocol and Model Architecture

The framework employs the "α-Flow" training regime, which interpolates between rectified flow matching (α=1) and mean-flow self-consistency (α→0), introducing a curriculum for stabilized learning. The hybrid target velocity is

$$v_{t, r}^\alpha = \alpha u + (1-\alpha)\, v_\theta(z_\tau, \tau, r, e)$$

with $\tau = \alpha r + (1-\alpha) t$. The per-sample adaptive-weighted loss is

$$\mathcal{L}_\text{adaptive}(\theta) = \mathbb{E}_{t, r, S, B, e}\left[ w \cdot \| v_\theta(z_t, t, r, e) - v_{t, r}^\alpha \|^2 \right]$$

where $w = \alpha / (\| \Delta \|^2 + c)$, $\Delta = v_\theta(z_t, t, r, e) - v_{t, r}^\alpha$, and $c = 10^{-3}$.
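A minimal numpy sketch of the adaptive-weighted loss for a single sample. Since this is plain numpy with no autograd, the stop-gradient treatment of $\Delta$ inside the weight (standard in mean-flow training) does not arise here; the point is the bounded, self-normalizing form of the weighted objective, $w \cdot \|\Delta\|^2 \le \alpha$:

```python
import numpy as np

def alpha_flow_loss(v_pred, u, v_tau, alpha, c=1e-3):
    """Per-sample adaptive-weighted alpha-Flow loss (illustrative sketch)."""
    # hybrid target velocity: alpha-blend of true velocity and network self-target
    v_target = alpha * u + (1 - alpha) * v_tau
    delta = v_pred - v_target
    sq_err = float(np.sum(np.abs(delta) ** 2))
    w = alpha / (sq_err + c)     # adaptive weight w = alpha / (||delta||^2 + c)
    return w * sq_err            # bounded above by alpha

rng = np.random.default_rng(3)
v_pred = rng.standard_normal(100)
u = rng.standard_normal(100)
# with alpha = 1 this reduces to pure rectified flow matching against u
loss = alpha_flow_loss(v_pred, u, v_tau=np.zeros(100), alpha=1.0)
assert 0.0 <= loss <= 1.0        # the weighting caps the loss at alpha
```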

Architecturally, the backbone is a U-Net-style Diffusion Transformer (UDiT) with 16 transformer layers (hidden dimension 768) operating on frequency × time inputs (512 × 500). Speaker conditioning is incorporated via cross-attention to an enrollment-utterance embedding (ECAPA-TDNN), fused throughout the UDiT. The mixing-ratio predictor $g_\phi$ is a small MLP acting on concatenated ECAPA embeddings of the mixture and enrollment. Optimization uses AdamW with cosine annealing and mixed precision; gradient clipping ensures numerical stability.
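The mixing-ratio predictor can be sketched as follows. The layer sizes, the ReLU hidden layer, the sigmoid output head, and the 192-dimensional embeddings are all illustrative assumptions; the paper only states that $g_\phi$ is a small MLP on concatenated ECAPA embeddings:

```python
import numpy as np

def mlp_ratio_predictor(e_mix, e_enroll, params):
    """Hypothetical mixing-ratio predictor g_phi: concat embeddings -> MLP -> sigmoid.

    Layer sizes and the sigmoid head are assumptions for illustration only.
    """
    x = np.concatenate([e_mix, e_enroll])
    h = np.maximum(0.0, params["W1"] @ x + params["b1"])   # ReLU hidden layer
    logit = params["W2"] @ h + params["b2"]
    return 1.0 / (1.0 + np.exp(-logit))                    # squash into (0, 1)

rng = np.random.default_rng(4)
d = 192                                                    # assumed embedding dimension
params = {"W1": rng.standard_normal((64, 2 * d)) * 0.05,
          "b1": np.zeros(64),
          "W2": rng.standard_normal(64) * 0.05,
          "b2": 0.0}
lam_hat = mlp_ratio_predictor(rng.standard_normal(d), rng.standard_normal(d), params)
assert 0.0 < lam_hat < 1.0                                 # valid mixing ratio
```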

4. Inference, Efficiency, and Complexity

Inference proceeds as follows:

  1. The short-time Fourier transform (STFT) is applied to $y$, and the enrollment embedding is computed.
  2. The mixing-ratio predictor outputs $\hat{\lambda}$.
  3. The one-step update computes $\hat{S} = Y + (1-\hat{\lambda})\, v_\theta(Y, \hat{\lambda}, 1, e)$.
  4. The inverse STFT reconstructs the waveform.
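The steps above can be sketched end-to-end with scipy's STFT/iSTFT, substituting oracle quantities for the learned components: the true mixing ratio stands in for the predictor, and $S - B$ stands in for $v_\theta$. The STFT parameters are illustrative, not the paper's:

```python
import numpy as np
from scipy.signal import stft, istft

fs, nperseg = 16000, 512
rng = np.random.default_rng(5)
s = rng.standard_normal(fs)                  # stand-in target waveform (1 s)
b = rng.standard_normal(fs)                  # stand-in background waveform
lam = 0.6
y = lam * s + (1 - lam) * b                  # mixture consistent with Y = lam*S + (1-lam)*B

# 1) STFT of the mixture (enrollment embedding omitted in this oracle sketch)
_, _, Y = stft(y, fs=fs, nperseg=nperseg)
_, _, S = stft(s, fs=fs, nperseg=nperseg)
_, _, B = stft(b, fs=fs, nperseg=nperseg)

# 2) mixing ratio: oracle value stands in for the learned predictor
lam_hat = lam
# 3) one-step update; S - B stands in for the learned mean-flow network v_theta
S_hat = Y + (1 - lam_hat) * (S - B)
# 4) inverse STFT back to the waveform
_, s_hat = istft(S_hat, fs=fs, nperseg=nperseg)

n = min(len(s), len(s_hat))                  # iSTFT output may be zero-padded slightly
err = np.max(np.abs(s_hat[:n] - s[:n]))
assert err < 1e-6                            # near-perfect recovery in the oracle setting
```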

The framework requires only NFE=1 (one network evaluation per utterance), yielding real-time factor (RTF) ≈0.018 for 3 s audio on an NVIDIA L40 GPU. The model size is ≈359M parameters, with peak GPU memory ≈1.5 GB. Compared to diffusion and flow-matching baselines (e.g., NFE≥50, RTF≈0.75), computational overhead is negligible at similar or higher quality levels.

5. Empirical Performance and Ablation Studies

Evaluation on Libri2Mix employs intrusive metrics (SI-SDR, PESQ, ESTOI), non-intrusive measures (DNSMOS, OVRL), and speaker similarity (cosine-SIM).

Performance Comparison

Model          NFE   SI-SDR (dB)   PESQ   ESTOI
AD-FlowTSE      1      17.49       2.89    0.90
MeanFlow-TSE    1      18.80       3.26    0.93

MeanFlow-TSE achieves a +1.31 dB SI-SDR gain and similarly leads on perceptual metrics, both in clean and noisy settings.

Ablation Results

  • SI-SDR and PESQ peak at NFE=1; extra steps add only discretization error.
  • Removing the α curriculum (fixing α=1) reduces SI-SDR by ~0.7 dB—curriculum is necessary for stability.
  • The predicted mixing ratio $\hat{\lambda}$ approaches oracle performance, with a deficit of <0.2 dB.

6. Relationship to Other MeanFlow and Flow-Matching Methods

MeanFlow-TSE applies the central mean-flow principle: learning the average (not instantaneous) velocity of flow trajectories. This principle aligns with recent advances in one-step generative modeling for both image and audio domains. Comparable frameworks in speech enhancement (MeanFlowSE, MeanSE) show analogous efficiency–quality tradeoffs, requiring only a single network evaluation and yielding strong performance versus ODE/diffusion-based models (Li et al., 18 Sep 2025, Wang et al., 25 Sep 2025). In MeanFlow-TSE, the mixing-ratio-driven trajectory and curriculum-based training are specifically tailored to the TSE setting, directly mapping mixtures to clean target speech in a single pass.

7. Real-Time Applicability and Future Directions

MeanFlow-TSE is state-of-the-art in test-set SI-SDR, PESQ, ESTOI, and real-time factor among generative TSE frameworks. Its design enables deployment scenarios including streaming, hearing aids, and edge devices, due to minimal forward-pass latency and memory requirements. Future research aims to:

  • Extend the method to multi-channel and reverberant conditions (e.g. by conditioning flows on beamforming features).
  • Integrate metric-based fine-tuning, such as direct SI-SDR optimization.
  • Develop lighter-weight model variants for cost-constrained environments.

MeanFlow-TSE thus represents a substantial advance in efficient, high-fidelity, and practical generative target speaker extraction (Shimizu et al., 21 Dec 2025).
