
MeanFlow-TSE: One-Step Target Speaker Extraction

Updated 24 December 2025
  • The paper introduces MeanFlow-TSE, a one-step generative TSE framework that efficiently extracts target speech using a single network evaluation and achieves state-of-the-art perceptual metrics.
  • It employs a mixing-ratio-driven linear trajectory in the spectral domain and a neural mean-flow map to directly map mixtures to clean speech, yielding a +1.31 dB SI-SDR gain on Libri2Mix.
  • MeanFlow-TSE’s curriculum-based training and efficient architecture enable practical real-time deployment in streaming and edge applications with minimal computational overhead.

MeanFlow-TSE is a one-step generative target speaker extraction (TSE) framework based on mean-flow objectives. It is designed to efficiently and accurately extract a desired speaker’s voice from a multi-speaker mixture, while circumventing the computational burdens typical of diffusion and flow-matching approaches that require many iterative function evaluations. MeanFlow-TSE leverages a mixing-ratio-driven linear trajectory between background and target sources in the spectral domain and learns a neural mean-flow map, enabling direct, single-pass extraction of high-quality target speech. The system achieves real-time performance with competitive, state-of-the-art perceptual and fidelity metrics, as demonstrated on the Libri2Mix benchmark (Shimizu et al., 21 Dec 2025).

1. Problem Formulation and Background

In the TSE task, the observed signal is a mixture

$$y(t) = s(t) + b(t)$$

where $s(t)$ is the target speaker's signal and $b(t)$ comprises background noise and interfering speakers. Traditional discriminative approaches (e.g., Conv-TasNet with speaker embedding, SepFormer) estimate a mask or mapping $f_\theta(y, e) \to \hat{s}$ that minimizes waveform losses (e.g., SI-SNR, $\|\hat{s} - s\|^2$). While fast, these can introduce artifacts and generalize poorly. Generative paradigms, including diffusion and flow-matching models, learn a conditional density $p(s \mid y, e)$ but require multi-step sampling (typically ≥10 network function evaluations, NFEs), limiting deployment in low-latency or real-time scenarios.

MeanFlow-TSE extends the "AD-FlowTSE" paradigm, which is anchored in modeling flows between background and target in the STFT (spectral) domain. It introduces a mixing ratio $\lambda \in [0, 1]$ representing the balance between target and background:

$$Y = \lambda S + (1-\lambda) B,$$

where $Y = \mathrm{STFT}(y)$, $S = \mathrm{STFT}(s)$, and $B = \mathrm{STFT}(b)$.
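Because the STFT is linear, the spectral-domain mixing identity $Y = \lambda S + (1-\lambda) B$ follows directly from mixing in the time domain and can be checked numerically. A minimal sketch with scipy, using random stand-in waveforms and illustrative STFT parameters (not the paper's):

```python
import numpy as np
from scipy.signal import stft

rng = np.random.default_rng(0)
s = rng.standard_normal(16000)      # stand-in "target" waveform, 1 s at 16 kHz
b = rng.standard_normal(16000)      # stand-in "background" waveform
lam = 0.7                           # mixing ratio lambda
y = lam * s + (1 - lam) * b         # time-domain mixture

# The STFT is linear, so Y = lam*S + (1-lam)*B holds up to float error
_, _, Y = stft(y, fs=16000, nperseg=512)
_, _, S = stft(s, fs=16000, nperseg=512)
_, _, B = stft(b, fs=16000, nperseg=512)
assert np.allclose(Y, lam * S + (1 - lam) * B)
```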

2. One-Step Mean-Flow Objective

The method parameterizes the extraction path as a convex linear interpolation in the spectral space:

$$z_t = t S + (1-t) B, \quad t \in [0, 1],$$

where $t$ governs the transition from background to target. The instantaneous velocity is $u = S - B$, constant with respect to $t$, reflecting the straight-line nature of the mixing path.
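The constancy of the velocity along this straight path is easy to confirm numerically: the finite-difference slope between any two points on the path equals $S - B$. A toy check (random spectrogram-shaped arrays, not real data):

```python
import numpy as np

rng = np.random.default_rng(1)
# toy complex "spectrograms" standing in for S and B
S = rng.standard_normal((4, 5)) + 1j * rng.standard_normal((4, 5))
B = rng.standard_normal((4, 5)) + 1j * rng.standard_normal((4, 5))

def z(t):
    # point on the linear path z_t = t*S + (1-t)*B
    return t * S + (1 - t) * B

# the finite-difference velocity between any two points equals u = S - B
for t, r in [(0.0, 1.0), (0.2, 0.9), (0.4, 0.5)]:
    v = (z(r) - z(t)) / (r - t)
    assert np.allclose(v, S - B)
```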

Instead of numerically integrating instantaneous velocities, MeanFlow-TSE learns the average velocity over the interval $[t, r]$:

$$z_r = z_t + (r - t)\, v_\text{avg}(z_t, t, r, e).$$

For the TSE problem at inference, $t = \hat{\lambda}$ (the estimated mixing ratio), $r = 1$, and $e$ is the enrollment (reference speaker) embedding. The predicted target spectrogram is

$$\hat{S} = Y + (1 - \hat{\lambda})\, v_\theta(Y, \hat{\lambda}, 1, e)$$

with $\hat{\lambda}$ output by a learned mixing-ratio predictor. This single-step update realizes direct source extraction without iterative refinement.
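In the idealized case where both the mixing ratio and the average velocity are known exactly, the one-step update recovers the target spectrogram identically, since $Y + (1-\lambda)(S - B) = \lambda S + (1-\lambda)B + (1-\lambda)(S - B) = S$. A toy sanity check of this identity (oracle quantities stand in for the learned network and predictor):

```python
import numpy as np

rng = np.random.default_rng(2)
S = rng.standard_normal((4, 5))    # toy target spectrogram (real-valued for simplicity)
B = rng.standard_normal((4, 5))    # toy background spectrogram
lam = 0.3
Y = lam * S + (1 - lam) * B        # observed mixture spectrogram

v_oracle = S - B                   # oracle average velocity along the linear path
S_hat = Y + (1 - lam) * v_oracle   # one-step update with t = lambda, r = 1
assert np.allclose(S_hat, S)       # exact recovery under oracle velocity and ratio
```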

3. Training Protocol and Model Architecture

The framework employs the "α-Flow" training regime, which interpolates between rectified flow matching (α=1) and mean-flow self-consistency (α→0), introducing a curriculum for stabilized learning. The hybrid target velocity is

$$v_{t, r}^\alpha = \alpha u + (1-\alpha)\, v_\theta(z_\tau, \tau, r, e)$$

with $\tau = \alpha r + (1-\alpha) t$. The per-sample adaptive-weighted loss is

$$\mathcal{L}_\text{adaptive}(\theta) = \mathbb{E}_{t, r, S, B, e}\left[ w \cdot \| v_\theta(z_t, t, r, e) - v_{t, r}^\alpha \|^2 \right]$$

where $w = \alpha / (\| \Delta \|^2 + c)$, $\Delta = v_\theta(z_t, t, r, e) - v_{t, r}^\alpha$, and $c = 10^{-3}$.
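A minimal numpy sketch of the adaptive-weighted loss for a single sample. Since this is plain numpy with no autograd, the stop-gradient treatment of $\Delta$ inside the weight (standard in mean-flow training) does not arise here; the point is the bounded, self-normalizing form of the weighted objective, $w \cdot \|\Delta\|^2 \le \alpha$:

```python
import numpy as np

def alpha_flow_loss(v_pred, u, v_tau, alpha, c=1e-3):
    """Per-sample adaptive-weighted alpha-Flow loss (illustrative sketch)."""
    # hybrid target velocity: alpha-blend of true velocity and network self-target
    v_target = alpha * u + (1 - alpha) * v_tau
    delta = v_pred - v_target
    sq_err = float(np.sum(np.abs(delta) ** 2))
    w = alpha / (sq_err + c)     # adaptive weight w = alpha / (||delta||^2 + c)
    return w * sq_err            # bounded above by alpha

rng = np.random.default_rng(3)
v_pred = rng.standard_normal(100)
u = rng.standard_normal(100)
# with alpha = 1 this reduces to pure rectified flow matching against u
loss = alpha_flow_loss(v_pred, u, v_tau=np.zeros(100), alpha=1.0)
assert 0.0 <= loss <= 1.0        # the weighting caps the loss at alpha
```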

Architecturally, the backbone is a U-Net-style Diffusion Transformer (UDiT) with 16 transformer layers (hidden dimension 768) operating on frequency × time inputs (512 × 500). Speaker conditioning is incorporated via cross-attention to an enrollment-utterance embedding (ECAPA-TDNN), fused throughout the UDiT. The mixing-ratio predictor $g_\phi$ is a small MLP acting on concatenated ECAPA embeddings of the mixture and enrollment. Optimization uses AdamW with cosine annealing and mixed precision; gradient clipping ensures numerical stability.
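The mixing-ratio predictor can be sketched as follows. The layer sizes, the ReLU hidden layer, the sigmoid output head, and the 192-dimensional embeddings are all illustrative assumptions; the paper only states that $g_\phi$ is a small MLP on concatenated ECAPA embeddings:

```python
import numpy as np

def mlp_ratio_predictor(e_mix, e_enroll, params):
    """Hypothetical mixing-ratio predictor g_phi: concat embeddings -> MLP -> sigmoid.

    Layer sizes and the sigmoid head are assumptions for illustration only.
    """
    x = np.concatenate([e_mix, e_enroll])
    h = np.maximum(0.0, params["W1"] @ x + params["b1"])   # ReLU hidden layer
    logit = params["W2"] @ h + params["b2"]
    return 1.0 / (1.0 + np.exp(-logit))                    # squash into (0, 1)

rng = np.random.default_rng(4)
d = 192                                                    # assumed embedding dimension
params = {"W1": rng.standard_normal((64, 2 * d)) * 0.05,
          "b1": np.zeros(64),
          "W2": rng.standard_normal(64) * 0.05,
          "b2": 0.0}
lam_hat = mlp_ratio_predictor(rng.standard_normal(d), rng.standard_normal(d), params)
assert 0.0 < lam_hat < 1.0                                 # valid mixing ratio
```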

4. Inference, Efficiency, and Complexity

Inference proceeds as follows:

  1. The short-time Fourier transform (STFT) is applied to $y$, and the enrollment embedding is computed.
  2. The mixing-ratio predictor outputs $\hat{\lambda}$.
  3. The one-step update computes $\hat{S} = Y + (1-\hat{\lambda})\, v_\theta(Y, \hat{\lambda}, 1, e)$.
  4. The inverse STFT reconstructs the waveform.
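The steps above can be sketched end-to-end with scipy's STFT/iSTFT, substituting oracle quantities for the learned components: the true mixing ratio stands in for the predictor, and $S - B$ stands in for $v_\theta$. The STFT parameters are illustrative, not the paper's:

```python
import numpy as np
from scipy.signal import stft, istft

fs, nperseg = 16000, 512
rng = np.random.default_rng(5)
s = rng.standard_normal(fs)                  # stand-in target waveform (1 s)
b = rng.standard_normal(fs)                  # stand-in background waveform
lam = 0.6
y = lam * s + (1 - lam) * b                  # mixture consistent with Y = lam*S + (1-lam)*B

# 1) STFT of the mixture (enrollment embedding omitted in this oracle sketch)
_, _, Y = stft(y, fs=fs, nperseg=nperseg)
_, _, S = stft(s, fs=fs, nperseg=nperseg)
_, _, B = stft(b, fs=fs, nperseg=nperseg)

# 2) mixing ratio: oracle value stands in for the learned predictor
lam_hat = lam
# 3) one-step update; S - B stands in for the learned mean-flow network v_theta
S_hat = Y + (1 - lam_hat) * (S - B)
# 4) inverse STFT back to the waveform
_, s_hat = istft(S_hat, fs=fs, nperseg=nperseg)

n = min(len(s), len(s_hat))                  # iSTFT output may be zero-padded slightly
err = np.max(np.abs(s_hat[:n] - s[:n]))
assert err < 1e-6                            # near-perfect recovery in the oracle setting
```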

The framework requires only NFE=1 (one network evaluation per utterance), yielding real-time factor (RTF) ≈0.018 for 3 s audio on an NVIDIA L40 GPU. The model size is ≈359M parameters, with peak GPU memory ≈1.5 GB. Compared to diffusion and flow-matching baselines (e.g., NFE≥50, RTF≈0.75), computational overhead is negligible at similar or higher quality levels.

5. Empirical Performance and Ablation Studies

Evaluation on Libri2Mix employs intrusive metrics (SI-SDR, PESQ, ESTOI), non-intrusive measures (DNSMOS, OVRL), and speaker similarity (cosine-SIM).

Performance Comparison

Model          NFE   SI-SDR (dB)   PESQ   ESTOI
AD-FlowTSE      1      17.49       2.89    0.90
MeanFlow-TSE    1      18.80       3.26    0.93

MeanFlow-TSE achieves a +1.31 dB SI-SDR gain and similarly leads on perceptual metrics, both in clean and noisy settings.

Ablation Results

  • SI-SDR and PESQ peak at NFE=1; extra steps add only discretization error.
  • Removing the α curriculum (fixing α=1) reduces SI-SDR by ~0.7 dB—curriculum is necessary for stability.
  • The predicted mixing ratio $\hat{\lambda}$ approaches oracle performance, with a deficit of <0.2 dB.

6. Relationship to Other MeanFlow and Flow-Matching Methods

MeanFlow-TSE applies the central mean-flow principle: learning the average (not instantaneous) velocity of flow trajectories. This principle aligns with recent advances in one-step generative modeling for both image and audio domains. Comparable frameworks in speech enhancement (MeanFlowSE, MeanSE) show analogous efficiency–quality tradeoffs, requiring only a single network evaluation and yielding strong performance versus ODE/diffusion-based models (Li et al., 18 Sep 2025, Wang et al., 25 Sep 2025). In MeanFlow-TSE, the mixing-ratio-driven trajectory and curriculum-based training are specifically tailored to the TSE setting, directly mapping mixtures to clean target speech in a single pass.

7. Real-Time Applicability and Future Directions

MeanFlow-TSE is state-of-the-art in test-set SI-SDR, PESQ, ESTOI, and real-time factor among generative TSE frameworks. Its design enables deployment scenarios including streaming, hearing aids, and edge devices, due to minimal forward-pass latency and memory requirements. Future research aims to:

  • Extend the method to multi-channel and reverberant conditions (e.g. by conditioning flows on beamforming features).
  • Integrate metric-based fine-tuning, such as direct SI-SDR optimization.
  • Develop lighter-weight model variants for cost-constrained environments.

MeanFlow-TSE thus represents a substantial advance in efficient, high-fidelity, and practical generative target speaker extraction (Shimizu et al., 21 Dec 2025).
