
AD-FlowTSE: Adaptive Flow Matching for TSE

Updated 23 October 2025
  • The paper’s main contribution is the introduction of an MR-aware adaptive integration schedule that dynamically aligns the flow from mixture to target speech.
  • It employs a novel neural velocity estimator and optimal transport loss to deterministically model the background-to-target transformation based on estimated mixing ratios.
  • Experimental results show improved PESQ, SI-SDR, and speaker similarity scores, confirming efficiency, accuracy, and robustness over fixed-step methods.

Adaptive Discriminative Flow Matching TSE (AD-FlowTSE) is a generative approach to target speaker extraction that leverages flow matching within a mixing-ratio-adaptive framework. Unlike conventional discriminative methods, or generative methods with fixed reverse schedules and step sizes, AD-FlowTSE models the mixture-to-source trajectory according to the actual composition of the input, yielding efficient, noise-aware, and highly accurate speech extraction. The method's core innovation is an adaptive integration schedule that is dynamically aligned with an estimated mixing ratio (MR) for each mixture, allowing initialization and computation that precisely match the required transformation from noisy input to clean target speech.

1. Formulation and Paradigm

AD-FlowTSE departs from prior target speaker extraction (TSE) approaches by parametrizing the transport path between background and target speech according to the estimated mixing ratio. In conventional FM-based speech enhancement, the process commonly integrates from a generic input (full mixture or a prior) to the clean speech, irrespective of mixture composition. AD-FlowTSE instead constructs the mixture as

x = (1 - \tau)\, b + \tau\, s_1,

where $b$ is the background source, $s_1$ is the target speech, and $\tau \in [0, 1]$ is the mixing ratio (MR) reflecting the proportion of target in the mixture. The conditional probability path is modeled as

p_t(x_t \mid b, s_1) = \mathcal{N}\big(x_t;\, \mu_t(b, s_1),\, \sigma_t^2 I\big) \quad\text{with}\quad \mu_t(b, s_1) = (1 - \tau)\, b + \tau\, s_1,

and $u_t(x_t \mid b, s_1) = s_1 - b$ in the deterministic case ($\sigma_t^2 = 0$).
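The deterministic path and its constant conditional velocity can be sketched as follows. This is a minimal illustration, not the paper's implementation: signals are toy NumPy arrays, and the helper names `mixture_path` and `conditional_velocity` are hypothetical.

```python
import numpy as np

def mixture_path(b, s1, tau):
    """Point on the background-to-target path at mixing ratio tau.

    With sigma_t = 0 the conditional path collapses to the linear
    interpolation mu_t(b, s1) = (1 - tau) * b + tau * s1.
    """
    return (1.0 - tau) * b + tau * s1

def conditional_velocity(b, s1):
    """Constant conditional velocity u_t = s1 - b along the straight path."""
    return s1 - b

# The observed mixture is the path evaluated at the true MR:
b = np.array([0.2, -0.1, 0.4])   # background source (toy signal)
s1 = np.array([0.5, 0.3, -0.2])  # target speech (toy signal)
x = mixture_path(b, s1, 0.75)    # mixture that is 75% target
```

Note that because the velocity field is constant along each conditional path, integrating it from any point on the path reaches $s_1$ exactly at $\tau = 1$.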

2. MR-aware Initialization and Adaptive Step Size

A novel feature in AD-FlowTSE is MR-aware initialization, implemented via a separate predictor network $g_\phi(x, e)$ (with $e$ the enrollment reference) which estimates $\tau$ (or a proxy $\hat{\tau}$) from the mixture and enrollment. The extraction process then begins precisely at $x_{\hat{\tau}} = x$ and integrates only over the residual segment $[\hat{\tau}, 1]$.

The reverse schedule uses an adaptive step size: when the estimated MR is high (the input is close to target speech), the algorithm performs only a few steps, sometimes just one; for low MRs (noisy mixtures), it executes more steps to traverse the required transformation in feature space. This enables efficient allocation of computation and helps avoid over-correction (hallucination of target content in already-clean mixtures) or under-processing (insufficient denoising in noisy conditions).
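The MR-aware initialization and adaptive stepping above can be sketched as a simple Euler integrator. This is an illustrative sketch under the assumption of a plain explicit-Euler solver and a step count proportional to the residual path length; the function name and `max_steps` heuristic are not from the paper.

```python
import numpy as np

def extract_target(x, tau_hat, velocity_fn, max_steps=8):
    """MR-aware extraction: start at x_tau_hat = x, integrate over [tau_hat, 1].

    The number of Euler steps shrinks as tau_hat approaches 1 (the input
    is already close to clean target speech), down to a single step.
    """
    # Adaptive step count: proportional to the residual length 1 - tau_hat.
    n_steps = max(1, int(np.ceil((1.0 - tau_hat) * max_steps)))
    dt = (1.0 - tau_hat) / n_steps
    xt, t = x.copy(), tau_hat
    for _ in range(n_steps):
        xt = xt + dt * velocity_fn(xt, t)  # explicit Euler step
        t += dt
    return xt
```

With the oracle velocity field $s_1 - b$, this recovers $s_1$ exactly from any point on the path, since Euler integration is exact for a constant velocity.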

3. Optimization and Training

Training consists of regressing a neural velocity estimator $v_\theta(x_t, e, \tau)$ toward the "oracle" vector field $(s_1 - b)$, using an optimal transport conditional flow matching loss:

\mathcal{L}_{\text{OT-CFM}}(\theta) = \mathbb{E}_{(b, s_1, e),\, \tau,\, x_t}\big[\,\| v_\theta(x_t, e, \tau) - (s_1 - b) \|^2\,\big].

This objective ensures the model learns a trajectory along the background-target continuum that matches the precise transformation needed for clean speech extraction. The process remains fully deterministic, with the integration path dynamically adapted per mixture.
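A minimal sketch of this objective for a batch of examples, assuming a NumPy setting where `v_theta` is any callable standing in for the neural velocity estimator; the function name and batch layout are illustrative, not the paper's code.

```python
import numpy as np

def ot_cfm_loss(v_theta, b, s1, e, tau):
    """Monte-Carlo estimate of the OT-CFM regression objective.

    v_theta(x_t, e, tau) is regressed toward the constant target
    velocity s1 - b, with x_t placed on the deterministic straight
    path (sigma_t = 0).  b, s1: (batch, dim); tau: (batch,).
    """
    x_t = (1.0 - tau)[:, None] * b + tau[:, None] * s1  # point on the path
    target = s1 - b                                     # oracle vector field
    pred = v_theta(x_t, e, tau)
    return np.mean(np.sum((pred - target) ** 2, axis=-1))
```

By construction the loss vanishes exactly when the estimator reproduces the oracle field $s_1 - b$ everywhere on the path.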

4. Experimental Results and Comparative Advantages

AD-FlowTSE demonstrates strong quantitative performance on standard TSE benchmarks (Hsieh et al., 19 Oct 2025). Results highlight improvements over both discriminative and generative methods with fixed reverse schedules:

  • Perceptual Evaluation of Speech Quality (PESQ): AD-FlowTSE exceeds previous bests, indicating consistently high audio quality.
  • Scale-invariant SDR (SI-SDR): The adaptive MR-aware extraction produces higher distortion resistance, particularly in variable noise conditions.
  • Speaker similarity (SIM): The extracted signals exhibit closer match to the enrolled target than baseline models, confirming robust identity preservation.

Ablation studies confirm the effectiveness of MR-aware initialization: use of an oracle or predicted $\hat{\tau}$ yields stable performance, while a random MR degrades accuracy. Analysis of the number of function evaluations (NFEs) shows AD-FlowTSE achieves optimal results with minimal steps, while excessive steps risk error accumulation.

5. Comparison with Prior Methods

AD-FlowTSE diverges from traditional discriminative mapping (fixed network, direct mixture-to-target mapping), and earlier generative flow matching or diffusion models (fixed reverse schedules). Its key advantages include:

  • Efficiency: Adaptive computation prevents unnecessary transformation in clean inputs, while robustly processing noisy mixtures.
  • Accuracy: MR-aware alignment minimizes risk of model hallucination or under-processing and preserves source fidelity.
  • Deterministic Integration: Avoids randomness inherent in generative sampling, favoring reproducibility in practical systems.

A plausible implication is that model robustness critically depends on MR estimation accuracy; inaccuracies may cause either insufficient or excessive transformation, though results suggest the MR predictor is reliable in practice.

6. Limitations and Future Directions

The efficacy of AD-FlowTSE is bounded by the accuracy of the MR predictor. Enhancements may include incorporation of dynamic mixture analysis (noise/reverberation cues), use of multi-channel or context-aware MR estimation, and more sophisticated variable-step solvers for the ODE integration.

Further development includes extending the framework to multi-channel TSE, reverberant environments, and hybrid training objectives combining discriminative and generative losses. This suggests continued advances in robustness and generalization for adaptive flow matching in speech and related sequential extraction domains.

7. Significance and Impact

AD-FlowTSE establishes a new methodological approach in generative speech extraction, demonstrating that aligning the transport trajectory with mixture composition and implementing adaptive step sizes yields high performance with computational efficiency. The approach’s key principles—optimal transport flow matching, MR-aware initialization, and adaptive integration—serve as a technical foundation for future systems in speech enhancement, separation, and wider time-series extraction contexts.

This method’s success across multiple metrics highlights the potential of mixture-aware generative modeling. By integrating adaptive deterministic flow matching, AD-FlowTSE sets a robust baseline for subsequent research and development in efficient, high-quality target source extraction.
