MeanFlow-TSE: One-Step Target Speaker Extraction
- The paper introduces MeanFlow-TSE, a one-step generative TSE framework that efficiently extracts target speech using a single network evaluation and achieves state-of-the-art perceptual metrics.
- It employs a mixing-ratio-driven linear trajectory in the spectral domain and a neural mean-flow map to directly map mixtures to clean speech, yielding a +1.31 dB SI-SDR gain on Libri2Mix.
- MeanFlow-TSE’s curriculum-based training and efficient architecture enable practical real-time deployment in streaming and edge applications with minimal computational overhead.
MeanFlow-TSE is a one-step generative target speaker extraction (TSE) framework based on mean-flow objectives. It is designed to extract a desired speaker's voice from a multi-speaker mixture efficiently and accurately, while circumventing the computational burden of diffusion and flow-matching approaches that require many iterative function evaluations. MeanFlow-TSE leverages a mixing-ratio-driven linear trajectory between background and target sources in the spectral domain and learns a neural mean-flow map, enabling direct, single-pass extraction of high-quality target speech. The system achieves real-time performance with state-of-the-art perceptual and fidelity metrics, as demonstrated on the Libri2Mix benchmark (Shimizu et al., 21 Dec 2025).
1. Problem Formulation and Background
In the TSE task, the observed signal is a mixture

$$y = x + b,$$

where $x$ is the target speaker's signal and $b$ includes background noise and interfering speakers. Traditional discriminative approaches (e.g., Conv-TasNet with speaker embeddings, SepFormer) estimate a mask or mapping that minimizes waveform losses such as SI-SNR. While fast, these can introduce artifacts and generalize poorly. Generative paradigms, including diffusion and flow-matching models, learn a conditional density $p(x \mid y, c)$, conditioned on the mixture and a speaker cue $c$, but require multi-step sampling (typically ≥10 network function evaluations, NFEs), limiting deployment in low-latency or real-time scenarios.
MeanFlow-TSE extends the "AD-FlowTSE" paradigm, which is anchored in modeling flows between background and target in the STFT (spectral) domain. It introduces a mixing ratio $\tau$ representing the balance between target and background, placing the observed mixture on the interpolation path between the two sources:

$$X_\tau = (1 - \tau)\, X_b + \tau\, X_1,$$

where $X_1$ is the target spectrogram, $X_b$ is the background spectrogram, and $\tau \in (0, 1)$.
2. One-Step Mean-Flow Objective
The method parameterizes the extraction path as a convex linear interpolation in the spectral space:

$$X_t = (1 - t)\, X_b + t\, X_1, \qquad t \in [0, 1],$$

where $t$ governs the transition from background ($t = 0$) to target ($t = 1$). The instantaneous velocity is

$$v(X_t, t) = \frac{dX_t}{dt} = X_1 - X_b,$$

constant with respect to $t$, reflecting the straight-line nature of the mixing path.
Instead of numerically integrating instantaneous velocities, MeanFlow-TSE learns the average velocity over the interval $[r, t]$:

$$u(X_t, r, t) = \frac{1}{t - r} \int_r^t v(X_s, s)\, ds.$$

For the TSE problem at inference, $r = \hat{\tau}$ (the estimated mixing ratio), $t = 1$, and the network is conditioned on $c$, the enrollment (reference speaker) embedding. The predicted target spectrogram is

$$\hat{X}_1 = X_{\hat{\tau}} + (1 - \hat{\tau})\, u_\theta(X_{\hat{\tau}}, \hat{\tau}, 1, c),$$

with $\hat{\tau}$ output by a learned mixing-ratio predictor. This single-step update realizes direct source extraction without iterative refinement.
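Because the path is a straight line, the average velocity coincides with the instantaneous velocity, so a single step with the true velocity recovers the target exactly; the network's job is to predict that average velocity from the mixture. A minimal pure-Python sketch of the one-step update on toy scalar "spectrogram" bins (all names and values illustrative, not the paper's implementation):

```python
def one_step_extract(x_mix, tau, u):
    """One-step mean-flow update: x_hat = x_mix + (1 - tau) * u."""
    return [m + (1.0 - tau) * v for m, v in zip(x_mix, u)]

# Toy "spectrogram" bins for target and background sources.
x_tgt = [0.8, -0.3, 1.2]
x_bg = [0.1, 0.5, -0.4]
tau = 0.6  # mixing ratio: mixture sits at t = tau on the linear path

# Mixture on the linear path: X_tau = (1 - tau) * X_b + tau * X_1
x_mix = [(1.0 - tau) * b + tau * t for b, t in zip(x_bg, x_tgt)]

# For a straight path the average velocity over [tau, 1] is simply
# X_1 - X_b; in MeanFlow-TSE this quantity is predicted by u_theta.
u_avg = [t - b for b, t in zip(x_bg, x_tgt)]

x_hat = one_step_extract(x_mix, tau, u_avg)
print(x_hat)  # recovers x_tgt exactly
```

With an exact average velocity the update is lossless; in practice the quality of the single step depends on how well $u_\theta$ approximates it.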
3. Training Protocol and Model Architecture
The framework employs the "α-Flow" training regime, which interpolates between rectified flow matching (α = 1) and mean-flow self-consistency (α → 0), introducing a curriculum for stabilized learning. The hybrid target velocity is

$$u_{\text{tgt}} = v(X_t, t) - (1 - \alpha)(t - r)\, \frac{d}{dt} u_\theta(X_t, r, t, c),$$

with the total derivative $\frac{d}{dt} u_\theta = v\, \partial_X u_\theta + \partial_t u_\theta$ computed via a Jacobian-vector product: α = 1 recovers the flow-matching target $v$, while α → 0 recovers the mean-flow identity. The per-sample adaptive-weighted loss is

$$\mathcal{L} = \mathrm{sg}(w)\, \|\Delta\|_2^2, \qquad \Delta = u_\theta(X_t, r, t, c) - \mathrm{sg}(u_{\text{tgt}}),$$

where $w = (\|\Delta\|_2^2 + c_0)^{-p}$ is the adaptive weight, $c_0$ and $p$ are fixed constants, and $\mathrm{sg}(\cdot)$ denotes stop-gradient.
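The effect of the adaptive weight can be illustrated numerically: it bounds the per-sample loss so that large residuals cannot dominate training. A minimal sketch with scalar residuals and hypothetical constants ($c_0 = 10^{-3}$, $p = 1$); the stop-gradient is a no-op here since no autograd is involved:

```python
def adaptive_loss(delta, c0=1e-3, p=1.0):
    """Adaptive-weighted squared error, MeanFlow style.

    The weight w = (||delta||^2 + c0)^(-p) is treated as a constant
    (stop-gradient), so the loss w * ||delta||^2 saturates toward 1
    for large residuals instead of growing without bound.
    """
    sq = sum(d * d for d in delta)  # ||delta||_2^2
    w = (sq + c0) ** (-p)           # adaptive weight, sg(w)
    return w * sq

small = adaptive_loss([0.01, -0.02])  # tiny residual -> small loss
large = adaptive_loss([3.0, -4.0])    # huge residual -> loss capped near 1
print(small, large)
```

With $p = 1$ the loss reduces to $\|\Delta\|^2 / (\|\Delta\|^2 + c_0)$, which is monotone in the residual but bounded above by 1, a common trick for taming outlier samples in consistency-style objectives.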
Architecturally, the backbone is a U-Net-style Diffusion Transformer (UDiT) with 16 transformer layers (hidden dimension 768) operating on frequency × time inputs of size 512 × 500. Speaker conditioning is incorporated via cross-attention to an enrollment-utterance embedding (ECAPA-TDNN), fused throughout the UDiT layers. The mixing-ratio predictor is a small MLP acting on concatenated ECAPA embeddings of the mixture and enrollment. Optimization uses AdamW with cosine annealing and mixed precision; gradient clipping ensures numerical stability.
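The mixing-ratio predictor can be pictured as a tiny feed-forward network over the two concatenated embeddings, squashed to (0, 1). The sketch below is purely illustrative: the dimensions, weights, and function name are assumptions, not the paper's implementation:

```python
import math

def predict_mixing_ratio(mix_emb, enr_emb, w1, b1, w2, b2):
    """Tiny MLP sketch: concat embeddings -> ReLU hidden -> sigmoid scalar."""
    x = mix_emb + enr_emb                          # concatenate the embeddings
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(w1, b1)]                # hidden layer with ReLU
    z = sum(w * hi for w, hi in zip(w2, h)) + b2   # scalar logit
    return 1.0 / (1.0 + math.exp(-z))              # tau_hat in (0, 1)

# Toy 2-dim "embeddings" and fixed weights (illustrative only; real ECAPA
# embeddings are much higher-dimensional).
mix_emb, enr_emb = [0.2, -0.1], [0.4, 0.3]
w1 = [[0.5, -0.2, 0.1, 0.3], [0.1, 0.4, -0.3, 0.2]]  # 2 hidden units, 4 inputs
b1 = [0.0, 0.1]
w2, b2 = [0.6, -0.4], 0.05
tau_hat = predict_mixing_ratio(mix_emb, enr_emb, w1, b1, w2, b2)
print(round(tau_hat, 3))
```

The sigmoid output keeps the estimate inside the valid range of the interpolation path, which is all the one-step update requires of it.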
4. Inference, Efficiency, and Complexity
Inference proceeds as follows:
- A short-time Fourier transform (STFT) is applied to the mixture, and the enrollment embedding $c$ is computed.
- The mixing-ratio predictor outputs the estimated mixing ratio $\hat{\tau}$.
- The one-step update computes the target spectrogram estimate $\hat{X}_1 = X_{\hat{\tau}} + (1 - \hat{\tau})\, u_\theta(X_{\hat{\tau}}, \hat{\tau}, 1, c)$.
- Inverse STFT reconstructs the waveform.
The framework requires only NFE=1 (one network evaluation per utterance), yielding real-time factor (RTF) ≈0.018 for 3 s audio on an NVIDIA L40 GPU. The model size is ≈359M parameters, with peak GPU memory ≈1.5 GB. Compared to diffusion and flow-matching baselines (e.g., NFE≥50, RTF≈0.75), computational overhead is negligible at similar or higher quality levels.
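The reported real-time factor implies a per-utterance latency far below the audio duration. A quick arithmetic check using the figures above (RTF ≈ 0.018 for 3 s of audio):

```python
def processing_time(rtf, audio_seconds):
    """Real-time factor: processing time divided by audio duration."""
    return rtf * audio_seconds

latency_s = processing_time(0.018, 3.0)
print(f"{latency_s * 1000:.0f} ms")  # ~54 ms for 3 s of audio
```

Any RTF below 1.0 means the system keeps up with incoming audio; at 0.018 the forward pass consumes under 2% of the utterance duration, leaving ample headroom for streaming buffers.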
5. Empirical Performance and Ablation Studies
Evaluation on Libri2Mix employs intrusive metrics (SI-SDR, PESQ, ESTOI), non-intrusive measures (DNSMOS, OVRL), and speaker similarity (cosine-SIM).
Performance Comparison
| Model | NFE | SI-SDR (dB) | PESQ | ESTOI |
|---|---|---|---|---|
| AD-FlowTSE | 1 | 17.49 | 2.89 | 0.90 |
| MeanFlow-TSE | 1 | 18.80 | 3.26 | 0.93 |
MeanFlow-TSE achieves a +1.31 dB SI-SDR gain and similarly leads on perceptual metrics, both in clean and noisy settings.
Ablation Results
- SI-SDR and PESQ peak at NFE=1; extra steps add only discretization error.
- Removing the α curriculum (fixing α=1) reduces SI-SDR by ~0.7 dB—curriculum is necessary for stability.
- The predicted mixing ratio approaches oracle performance, with <0.2 dB deficit.
6. Relationship to Other MeanFlow and Flow-Matching Methods
MeanFlow-TSE applies the central mean-flow principle: learning the average (not instantaneous) velocity of flow trajectories. This principle aligns with recent advances in one-step generative modeling for both image and audio domains. Comparable frameworks in speech enhancement (MeanFlowSE, MeanSE) show analogous efficiency–quality tradeoffs, requiring only a single network evaluation and yielding strong performance versus ODE/diffusion-based models (Li et al., 18 Sep 2025, Wang et al., 25 Sep 2025). In MeanFlow-TSE, the mixing-ratio-driven trajectory and curriculum-based training are specifically tailored to the TSE setting, directly mapping mixtures to clean target speech in a single pass.
7. Real-Time Applicability and Future Directions
MeanFlow-TSE is state-of-the-art in test-set SI-SDR, PESQ, ESTOI, and real-time factor among generative TSE frameworks. Its design enables deployment scenarios including streaming, hearing aids, and edge devices, due to minimal forward-pass latency and memory requirements. Future research aims to:
- Extend the method to multi-channel and reverberant conditions (e.g. by conditioning flows on beamforming features).
- Integrate metric-based fine-tuning, such as direct SI-SDR optimization.
- Develop lighter-weight model variants for cost-constrained environments.
MeanFlow-TSE thus represents a substantial advance in efficient, high-fidelity, and practical generative target speaker extraction (Shimizu et al., 21 Dec 2025).